Sort IMAS file globs to make checksums platform-independent #94
SimonPinches wants to merge 1 commit into
8d8f11d to d12c397
imas.checksum.checksum() feeds the files returned by imas_files() into a single running SHA1 hash in iteration order. imas_files() built the HDF5 and ASCII file lists from Path.glob(), which returns entries in an arbitrary, filesystem-dependent order. As a result, the same byte-identical IMAS data could hash to different checksums on Windows vs Linux.

Sort the glob results explicitly by file name so the iteration order is deterministic and identical across platforms. Sort by p.name rather than relying on Path comparison, which is itself platform-dependent (Windows folds case, Linux does not).

Add regression tests for imas_files ordering.
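The mechanism described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: `sorted_files` and `running_sha1` are hypothetical stand-ins for `imas_files()` and `imas.checksum.checksum()`, and the chunk size is an arbitrary choice.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def sorted_files(directory: Path, pattern: str) -> List[Path]:
    """Hypothetical stand-in for the file listing: glob, then sort by file name."""
    # Path.glob() makes no ordering guarantee, so sort on the bare name (a str),
    # which compares by code point on every platform.
    return sorted(directory.glob(pattern), key=lambda p: p.name)


def running_sha1(files: Iterable[Path]) -> str:
    """Feed every file into one SHA1 object; the digest depends on file order."""
    sha1 = hashlib.sha1()
    for path in files:
        with path.open("rb") as fh:
            # Read in chunks so large data files are not loaded into memory at once.
            for chunk in iter(lambda: fh.read(65536), b""):
                sha1.update(chunk)
    return sha1.hexdigest()
```

Because a single hash object consumes the files back to back, feeding the same files in a different order yields a different digest, which is why the listing has to be sorted before hashing.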
d12c397 to 1fca9cd
Yes, this would be fine moving forward, but as discussed earlier we want a practical solution that also works with existing databases, if possible without having to recalculate all checksums.
I missed the meeting, but this is reasonable. I guess we need a deterministic approach that recovers the checksums on Linux...
Following up on the point about finding a practical solution for existing databases without recalculating all checksums: unfortunately, investigation suggests that goal isn't achievable, for a fairly fundamental reason. The old Linux checksum was never a deterministic function of the data, because it also depended on the order in which the filesystem happened to return the files.

The consequence is that a migration is unavoidable regardless of the fix chosen. Any scheme that makes the checksum deterministic (the filename sort in this PR, or an order-independent hash) cannot reproduce the legacy order-dependent values in general, because those values are effectively random with respect to the data. There is no "compatible" deterministic algorithm.

It's worth noting this fragility already exists today: cross-machine validation, re-download, and filesystem migration all already break legacy checksums. This PR exposes and fixes the root cause rather than introducing the problem. Where it currently bites:
Given a migration can't be avoided, the options for handling it are:
A reasonable combination would be to keep the determinism fix (option 1's algorithm, or option 3) together with the version tag (option 2), so the upgrade is non-breaking and migration can happen lazily or on a schedule rather than as a flag day.
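To make the combination concrete, here is a rough sketch of one way an order-independent hash and a version tag could fit together. Everything in it is an assumption for illustration: the `v1`/`v2` labels, the `version:digest` storage format, and all function names are hypothetical and are not SimDB's API or the scheme discussed in the meeting.

```python
import hashlib
from typing import Iterable, Tuple

LEGACY_VERSION = "v1"   # hypothetical label for untagged, order-dependent checksums
CURRENT_VERSION = "v2"  # hypothetical label for deterministic checksums


def order_independent_sha1(files: Iterable[Tuple[str, bytes]]) -> str:
    """Hash each (name, content) pair on its own, then hash the sorted digests."""
    digests = sorted(
        hashlib.sha1(name.encode("utf-8") + b"\0" + data).digest()
        for name, data in files
    )
    combined = hashlib.sha1()
    for digest in digests:
        combined.update(digest)
    return combined.hexdigest()


def tag_checksum(digest: str, version: str = CURRENT_VERSION) -> str:
    """Prefix the digest with the algorithm version that produced it."""
    return f"{version}:{digest}"


def parse_checksum(stored: str) -> Tuple[str, str]:
    """Split a stored value into (version, digest); untagged values count as legacy."""
    version, sep, digest = stored.partition(":")
    if not sep:
        return LEGACY_VERSION, stored
    return version, digest


def needs_migration(stored: str) -> bool:
    """True when the stored checksum predates the deterministic algorithm."""
    return parse_checksum(stored)[0] != CURRENT_VERSION
```

With a tag like this, a validator could recognise an untagged legacy value and schedule it for recalculation instead of reporting a mismatch, which is what would allow the migration to proceed lazily.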
imas.checksum.checksum() feeds the files returned by imas_files() into a single running SHA1 hash in iteration order. imas_files() built the HDF5 and ASCII file lists from Path.glob(), which returns entries in an arbitrary, filesystem-dependent order. As a result, the same byte-identical IMAS data could hash to different checksums on Windows vs Linux.

This sorts the glob results explicitly by file name (key=lambda p: p.name) so the iteration order is deterministic and identical across platforms. Sorting by p.name rather than relying on Path comparison avoids the platform-dependent case-folding of Path ordering (Windows folds case, Linux does not).

Adds regression tests for imas_files ordering (tests/test_imas_utils.py).

Reported issue: glob sorting in src/simdb/imas/utils.py is platform dependent.
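The case-folding difference can be reproduced on any machine with the pure path classes from pathlib, which emulate each platform's comparison rules without touching the filesystem. The file names below are made up for illustration, and this is not the content of the regression tests in the PR.

```python
from pathlib import PurePosixPath, PureWindowsPath

names = ["a.h5", "B.h5"]  # hypothetical file names

# Sorting Path objects directly uses each flavour's own comparison rules.
posix_order = [p.name for p in sorted(PurePosixPath(n) for n in names)]
windows_order = [p.name for p in sorted(PureWindowsPath(n) for n in names)]

# Sorting on the name string compares code points, regardless of flavour.
posix_by_name = [p.name for p in sorted((PurePosixPath(n) for n in names), key=lambda p: p.name)]
windows_by_name = [p.name for p in sorted((PureWindowsPath(n) for n in names), key=lambda p: p.name)]

print(posix_order)      # ['B.h5', 'a.h5']  (case-sensitive: 'B' sorts before 'a')
print(windows_order)    # ['a.h5', 'B.h5']  (case-folded: 'a' sorts before 'b')
print(posix_by_name)    # ['B.h5', 'a.h5']
print(windows_by_name)  # ['B.h5', 'a.h5']
```

Sorting with key=lambda p: p.name gives the same order for both flavours, whereas sorting the Path objects themselves does not.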