improvement: Use tracking timestamps instead of data timestamps for stale-entry eviction in ACTIVE_OBJECT_STORE_SYNC_FILES #1634

@coderabbitai

Description

Summary

In src/storage/object_storage.rs, the stale-entry cleanup in collect_upload_results decides whether an entry is stale by calling extract_datetime_from_parquet_path_regex and comparing the parquet file's data timestamp (extracted from the filename) against a 5-minute threshold. The time at which the path was added to ACTIVE_OBJECT_STORE_SYNC_FILES is not considered.

Problem

  • Historical ingestion (with time partition): the data timestamp in the filename is already old, so entries are eligible for cleanup as soon as they are inserted. This can cause a race in which a file that is still uploading is evicted from the tracking set prematurely.
  • Near-real-time ingestion: works correctly since data timestamps closely match wall-clock time.
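
The two cases above can be illustrated with a minimal sketch. It is not the code from object_storage.rs: the function name is_stale_by_data_time is hypothetical, and std::time::SystemTime stands in for the DateTime<Utc> value that extract_datetime_from_parquet_path_regex would yield.

```rust
use std::time::{Duration, SystemTime};

/// Eviction threshold named in the issue: five minutes.
const STALE_AFTER: Duration = Duration::from_secs(5 * 60);

/// Hypothetical model of the current rule: an entry counts as stale when the
/// data timestamp parsed from its filename is at least five minutes older
/// than `now`. When the path was added to the tracking set plays no role.
fn is_stale_by_data_time(data_time: SystemTime, now: SystemTime) -> bool {
    now.duration_since(data_time)
        .map(|age| age >= STALE_AFTER)
        // `duration_since` returns Err when `data_time` is later than `now`;
        // such an entry is treated as not stale in this sketch.
        .unwrap_or(false)
}
```

Under this rule, a file whose data timestamp is a day old is reported stale even if it was added to the set a moment ago, while a file with a 30-second-old data timestamp is not.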

Proposed Improvement

Change the tracking structure from HashSet<PathBuf> to a HashMap<PathBuf, DateTime<Utc>> (or equivalent), recording Utc::now() when each path is inserted. Update the retain logic to evict an entry once now - tracked_at >= Duration::minutes(5), that is, to retain only entries tracked for less than five minutes, instead of parsing the data timestamp from the filename.

This ensures accurate duration-based eviction regardless of whether data is near-real-time or historical.
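
A minimal sketch of the proposed structure follows, under some assumptions. SyncTracker and its methods are hypothetical names, not part of the codebase, and std::time::Instant is swapped in for DateTime<Utc> so the example needs only the standard library. The clock value is passed in as a parameter to keep the logic easy to test.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Eviction threshold named in the issue: five minutes.
const STALE_AFTER: Duration = Duration::from_secs(5 * 60);

/// Hypothetical stand-in for ACTIVE_OBJECT_STORE_SYNC_FILES: each tracked
/// path maps to the moment it was inserted into the set.
#[derive(Default)]
struct SyncTracker {
    tracked: HashMap<PathBuf, Instant>,
}

impl SyncTracker {
    /// Record `path` with its insertion time. If the path is already tracked,
    /// the earlier time is kept (an assumption; refreshing is also possible).
    fn track(&mut self, path: PathBuf, now: Instant) {
        self.tracked.entry(path).or_insert(now);
    }

    /// Keep only entries that have been tracked for less than five minutes.
    fn evict_stale(&mut self, now: Instant) {
        self.tracked
            .retain(|_, tracked_at| now.saturating_duration_since(*tracked_at) < STALE_AFTER);
    }

    /// Report whether `path` is still in the tracking set.
    fn is_tracked(&self, path: &Path) -> bool {
        self.tracked.contains_key(path)
    }
}
```

With this shape, a path whose filename carries a years-old data timestamp stays tracked until five minutes after it was inserted, the same as a near-real-time file. Whether a repeated insert should keep or refresh the recorded time is a design choice the issue leaves open.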

Context
