Sort IMAS file globs to make checksums platform-independent #94

Open
SimonPinches wants to merge 1 commit into iterorganization:develop from SimonPinches:fix/glob-sort-checksum-platform

@SimonPinches SimonPinches commented Jun 18, 2026

Contributor

imas.checksum.checksum() feeds the files returned by imas_files() into a single running SHA1 hash in iteration order. imas_files() built the HDF5 and ASCII file lists from Path.glob(), which returns entries in an arbitrary, filesystem-dependent order. As a result the same byte-identical IMAS data could hash to different checksums on Windows vs Linux.

This sorts the glob results explicitly by file name (key=lambda p: p.name) so the iteration order is deterministic and identical across platforms. Sorting by p.name rather than relying on Path comparison avoids the platform-dependent case-folding of Path ordering (Windows folds case, Linux does not).
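As a minimal sketch of the sort described above (not the actual code in src/simdb/imas/utils.py), the helper below orders glob matches by their name string. The function name `sorted_glob` and the `*.h5` pattern are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def sorted_glob(directory: Path, pattern: str) -> List[Path]:
    """Return the glob matches in `directory`, ordered by file name only.

    Hypothetical helper: the name and signature are illustrative.
    """
    # Path.glob() yields entries in filesystem order. Sorting on the name
    # string (not on the Path object) keeps the comparison case-sensitive
    # on every platform, so the resulting order is the same everywhere.
    return sorted(directory.glob(pattern), key=lambda p: p.name)


if __name__ == "__main__":
    # List the HDF5 files in the current directory in deterministic order.
    for path in sorted_glob(Path("."), "*.h5"):
        print(path.name)
```

Because the key is a plain `str`, upper-case names sort before lower-case ones by code point on both Windows and Linux.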

Adds regression tests for imas_files ordering (tests/test_imas_utils.py).

Reported issue: glob sorting in src/simdb/imas/utils.py is platform dependent.

imas.checksum.checksum() feeds the files returned by imas_files() into a
single running SHA1 hash in iteration order. imas_files() built the HDF5 and
ASCII file lists from Path.glob(), which returns entries in an arbitrary,
filesystem-dependent order. As a result the same byte-identical IMAS data
could hash to different checksums on Windows vs Linux.

Sort the glob results explicitly by file name so the iteration order is
deterministic and identical across platforms. Sort by p.name rather than
relying on Path comparison, which is itself platform-dependent (Windows folds
case, Linux does not).

Add regression tests for imas_files ordering.
@SimonPinches SimonPinches force-pushed the fix/glob-sort-checksum-platform branch from d12c397 to 1fca9cd Compare June 18, 2026 14:24
@olivhoenen

Contributor

Yes, this would be fine moving forward but as discussed earlier we want to come up with a practical solution that may work with existing databases, if possible without having to recalculate all checksums.

@SimonPinches

Contributor Author

> Yes, this would be fine moving forward but as discussed earlier we want to come up with a practical solution that may work with existing databases, if possible without having to recalculate all checksums.

I missed the meeting, but this is reasonable. I guess we need a deterministic approach that recovers the checksums on Linux...

@SimonPinches

Contributor Author

Following up on the point about finding a practical solution for existing databases without recalculating all checksums — unfortunately, investigation suggests that goal isn't achievable, for a fairly fundamental reason.

The old Linux checksum was never a deterministic function of the data. Path.glob() returns files in raw filesystem directory-entry order, not sorted. That ordering is a property of the specific filesystem and directory state at the time the data was written (creation order on some filesystems, hash-tree order on ext4 with dir_index, something else elsewhere). It is not stored anywhere and is not derivable from the file contents. So there is no deterministic way to reproduce the old Linux checksum on Windows — there is simply nothing to replay, because the information that fixed the order (the original on-disk directory-entry order) was never recorded. It isn't even reproducible on a different Linux machine, e.g. after a pull re-writes the files, or on a different filesystem; the legacy values are only reproducible on the original directory, on the original filesystem, while unmodified.
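To make the order dependence concrete, here is a small sketch of a single running SHA-1 fed in iteration order. It is not the simdb implementation; `running_sha1` and the in-memory file contents are hypothetical stand-ins for files read from disk.

```python
import hashlib
from typing import Iterable


def running_sha1(chunks: Iterable[bytes]) -> str:
    """Feed every chunk into one SHA-1 object, in iteration order."""
    sha1 = hashlib.sha1()
    for chunk in chunks:
        sha1.update(chunk)
    return sha1.hexdigest()


# Hypothetical file contents standing in for two IMAS data files.
files = {"a.h5": b"alpha", "b.h5": b"beta"}

# The same two files, visited in the two possible directory orders.
order_one = running_sha1([files["a.h5"], files["b.h5"]])
order_two = running_sha1([files["b.h5"], files["a.h5"]])

# Identical data, different traversal order, different checksum.
print(order_one != order_two)
```

The digest depends only on the concatenated byte stream, so any change in the order that the directory listing returns changes the result even though no file has changed.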

The consequence is that a migration is unavoidable regardless of the fix chosen. Any scheme that makes the checksum deterministic — the filename sort in this PR, or an order-independent hash — will necessarily differ from the legacy order-dependent values, because those values are effectively random with respect to the data. There is no "compatible" deterministic algorithm.

It's worth noting this fragility already exists today: cross-machine validation, re-download, and filesystem migration all already break legacy checksums. This PR exposes and fixes the root cause rather than introducing the problem. Where it currently bites:

  • validate re-reads files, recomputes, and compares to the stored checksum → every pre-existing entry would report a spurious mismatch after the upgrade.
  • Server-side push verification recomputes and compares → a post-fix client pushing data whose server record is pre-fix gets rejected.
  • Dedup keys on the checksum → identical data ingested before vs after stops de-duplicating.

Given a migration can't be avoided, the options for handling it are:

  1. One-time re-checksum migration — recompute and overwrite stored checksums for all existing entries from the data, run on Linux where the data lives. This replaces the old values rather than reproducing them, so it only works where the underlying IMAS data is still accessible (which is also the only place the legacy checksums were ever verifiable).

  2. Checksum-algorithm version tag — add a version field (there is only a single flat checksum value today). Grandfather existing v1 rows (verifiable only on their original Linux filesystem, exactly as today), compute v2 (sorted) for all new ingests, and lazily migrate v1→v2 whenever an entry is re-read. This is the lowest-disruption path — no flag-day recompute, and old entries break only in the scenarios that already break today.

  3. Order-independent checksum — sort the per-file digests, or combine them commutatively (XOR/sum), instead of sorting filenames. Arguably the most robust deterministic design, but it still differs from the legacy values, so it doesn't avoid the migration — it only changes what v2 looks like.
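Option 3 could look roughly like the sketch below, which hashes each file separately and then sorts the per-file digests before combining them. The function name `order_independent_checksum` is an assumption, and the sketch deliberately ignores file names, which a real design might want to include in each per-file digest.

```python
import hashlib
from typing import Iterable


def order_independent_checksum(file_contents: Iterable[bytes]) -> str:
    """Combine per-file SHA-1 digests so that input order does not matter.

    Hypothetical helper for illustration only.
    """
    # Hash every file on its own, then sort the raw digests. After the
    # sort, the traversal order of the inputs no longer has any effect.
    digests = sorted(hashlib.sha1(data).digest() for data in file_contents)

    combined = hashlib.sha1()
    for digest in digests:
        combined.update(digest)
    return combined.hexdigest()


if __name__ == "__main__":
    forward = order_independent_checksum([b"alpha", b"beta"])
    backward = order_independent_checksum([b"beta", b"alpha"])
    print(forward == backward)
```

Sorting the digests is used here rather than XOR because XOR-ing the digests of two identical files cancels to zero, which would hide duplicated files from the checksum.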

A reasonable combination would be to keep a determinism fix (the filename sort in this PR, or option 3) together with the version tag (option 2), so the upgrade is non-breaking and migration can happen lazily or on a schedule rather than as a flag day.
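A rough sketch of how the version tag and lazy migration might fit together is shown below. Everything in it is hypothetical: `StoredChecksum`, `verify_and_migrate`, and the `compute_v2` callback do not exist in simdb, and the choice to accept v1 rows without verification is one possible reading of "grandfathering".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StoredChecksum:
    """Hypothetical database record: a checksum plus its algorithm version."""
    value: str
    version: int  # 1 = legacy order-dependent, 2 = deterministic


def verify_and_migrate(stored: StoredChecksum, compute_v2: Callable[[], str]) -> bool:
    """Verify a record against freshly computed data, upgrading v1 rows.

    `compute_v2` is assumed to return the deterministic checksum of the
    data as it is currently on disk.
    """
    if stored.version == 1:
        # Legacy values cannot be reproduced, so the row is grandfathered:
        # accept it as-is and replace it with the v2 value on this re-read.
        stored.value = compute_v2()
        stored.version = 2
        return True
    # v2 rows are deterministic and can be compared directly.
    return stored.value == compute_v2()


if __name__ == "__main__":
    record = StoredChecksum(value="legacy-value", version=1)
    print(verify_and_migrate(record, lambda: "new-value"), record)
```

With this shape, existing rows are migrated the first time they are read after the upgrade, and only v2 rows can ever report a mismatch.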

2 participants