Squash data files by merging repeated sublists#38

Open
ChrisJefferson wants to merge 1 commit into gap-packages:master from ChrisJefferson:squish-data

Conversation

@ChrisJefferson
Member

This is an alternative to #23 , which starts by looking for repeated sublists.

The basic idea is that we first collect every distinct innermost list, assign each one to a variable, and then output P in terms of those variables. For example, Endom16_2-8_5.txt becomes:

local P,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10;
A1:=[1,4,2,1,4,7,4,1,2,7,4,7,2,1,7,2];
A2:=[1,2,16,4,11,10,7,14,13,6,5,15,9,8,12,3];
A3:=[1,1,4,1,1,4,1,1,4,4,1,4,4,1,4,4];
A4:=[1,1,5,1,1,5,1,1,5,5,1,5,5,1,5,5];
A5:=[1,4,8,1,4,14,4,1,8,14,4,14,8,1,14,8];
A6:=[1,2,13,4,11,15,7,14,16,12,5,10,3,8,6,9];
A7:=[1,7,6,4,11,3,2,8,12,16,5,9,15,14,13,10];
A8:=[1,1,14,1,4,14,1,4,14,8,4,14,8,4,8,8];
A9:=[1,4,11,1,1,5,4,4,11,11,1,5,5,4,11,5];
A10:=[1,1,7,1,4,7,1,4,7,2,4,7,2,4,2,2];
P:=[
[A1,A2,A3,A4,A5,A6,A7],
[A1,A2,A3,A4,A8,A6,A7],
[A1,A2,A3,A9,A5,A6,A7],
[A1,A2,A3,A9,A8,A6,A7],
[A10,A2,A3,A4,A5,A6,A7],
[A10,A2,A3,A9,A5,A6,A7]
];
return P;

Now, for smaller instances gzip already does a fairly good job of detecting these repeats, but for larger ones this is very useful -- it squashes Endom down to 8.3 MB.

I made these using a little GAP function, squish.g (in the attached zip file). It takes an input file name and an output file name and writes a 'squished' output file. As part of the function I read the output file back and check that the value of P is unchanged, but we should of course double-check this carefully before merging.

squish.zip
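The actual squish.g is in the attached zip and is written in GAP; the deduplication step it performs can be sketched language-neutrally. A minimal Python illustration (names are mine, not from squish.g):

```python
def squish(P):
    """Deduplicate the innermost lists of a nested list P.

    Returns (atoms, rows): atoms is the list of distinct innermost
    lists in order of first appearance (these become A1, A2, ... in
    the squished file), and rows gives each row of P as 0-based
    indices into atoms.
    """
    atoms = []   # distinct innermost lists, in first-appearance order
    index = {}   # tuple(innermost list) -> position in atoms
    rows = []
    for row in P:
        out = []
        for inner in row:
            key = tuple(inner)
            if key not in index:
                index[key] = len(atoms)
                atoms.append(inner)
            out.append(index[key])
        rows.append(out)
    return atoms, rows
```

Writing the atoms once and the rows as references is what turns the highly repetitive files into something small even before gzip sees them.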

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.85%. Comparing base (c6a04d2) to head (a57629f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #38   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files           5        5           
  Lines         709      709           
=======================================
  Hits          708      708           
  Misses          1        1           

@fingolfin
Member

Nice! With this, Endom is just 92 MB for me without compression (instead of over 2 GB). Which means it could be stored uncompressed in the repo and only be compressed for releases.

I think further savings are possible. The largest file is Endom/32/Endom32_37-16_11.txt. The content is highly structured. For example, lines 190-16573 start with A22; they later repeat, just with the first entry changed to A61.

There are more patterns of this kind. It seems plausible to me that, using this, the file could be compressed quite a lot more (and likewise several of the other largest files).

@ChrisJefferson
Member Author

I had a quick look at that. I could push it harder, but some quick attempts ended up bigger after gzipping.

This helps the data sets linked on the front page even more: for example, Endom128 goes from 520 MB to 12 MB, and Endom243 from 251 MB to 2.1 MB (I'm currently running it on all of them).

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

The script should work on everything, but I can't run the biggest Endom32 files: they are over 3 GB uncompressed, and on my 16 GB laptop I can't even load them into GAP, never mind do anything with them :)

@ChrisJefferson
Member Author

This is all files, still gzipped:

  ┌──────────┬───────┬──────────┬──────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │
  └──────────┴───────┴──────────┴──────────┴───────┘

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

Based on @fingolfin's suggestion, I added some simple run-length encoding:

  ┌──────────┬───────┬──────────┬──────────┬───────┬─────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │  Delta  │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │ 2.8 MB  │ 0.54% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │ 386 KB  │ 0.15% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │ 2.5 MB  │ 1.10% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │ 798 KB  │ 0.52% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │ 916 KB  │ 3.08% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │ 7.3 MB  │ 0.62% │
  └──────────┴───────┴──────────┴──────────┴───────┴─────────┴───────┘
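The delta encoding behind these numbers stores a row either in full or as a list of (position, value) patches against the previous row, which is exactly what the `_R` decoder quoted later in this thread undoes. A hedged Python sketch (in the real files a patch is recognised implicitly because it starts with an integer while a full row starts with a variable reference; this sketch tags entries explicitly instead):

```python
def delta_encode(rows):
    """Encode rows as ("full", row) or ("patch", [p1, v1, p2, v2, ...])
    entries, with 1-based patch positions to match GAP indexing.
    A patch is only used when it is shorter than the full row."""
    out, prev = [], None
    for row in rows:
        if prev is not None and len(row) == len(prev):
            patch = []
            for i, (old, new) in enumerate(zip(prev, row), start=1):
                if old != new:
                    patch.extend([i, new])
            if len(patch) < len(row):  # patch only when it saves space
                out.append(("patch", patch))
                prev = row
                continue
        out.append(("full", row))
        prev = row
    return out

def delta_decode(entries):
    """Inverse of delta_encode; same logic as the _R decoder: copy the
    previous row, then overwrite the patched positions."""
    rows, prev = [], []
    for kind, e in entries:
        if kind == "patch":
            prev = list(prev)  # ShallowCopy in the GAP version
            for i in range(0, len(e), 2):
                prev[e[i] - 1] = e[i + 1]
        else:
            prev = e
        rows.append(prev)
    return rows
```

Consecutive rows in these files often differ in only one or two entries, so most rows collapse to a two- or four-element patch.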

@ChrisJefferson
Member Author

Now 2.1 MB (compressed). I added a function ProduceSHAs, which reads a directory of files loadable with ReadAsFunction and calls HexSHA256 on each of their outputs; I used this to check that I'm generating exactly the same files. (This file, squish.g, should probably be stored somewhere, like this repo, but I'm not sure it's worth making it visible to users; probably not.)

squish.g.gz

@fingolfin
Member

Awesome!

@fingolfin
Member

This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4.

(The result should perhaps then be 1.1.0 and not 1.0.5 ...)

@olexandr-konovalov
Member

This is impressive! I wonder if one could make A a single list; then P would contain positions in A instead of variables A1, A2, etc. You would need n pairs of brackets if A has length n, but if P is sufficiently long you save on not needing to write "A" each time. But then instead of return P you would have

return List(P, t -> List(t, i -> A[i]));
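This suggestion, sketched in Python for illustration (the real files are GAP; note that GAP lists are 1-based while this sketch uses 0-based indices):

```python
def expand(A, P):
    """Rebuild the full nested list from an atom table A and index
    rows P: the analogue of List(P, t -> List(t, i -> A[i]))."""
    return [[A[i] for i in row] for row in P]
```

The trade-off is purely textual: writing "12" instead of "A12" per reference, at the cost of one List/List pass when the file is loaded.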

Also, I opened two files for different orders, and both have the same line

_R:=function(rows) local r,p,e,i; r:=[]; p:=[]; for e in rows do if Length(e)>0 and IsInt(e[1]) then p:=ShallowCopy(p); for i in [1,3..Length(e)-1] do p[e[i]]:=e[i+1]; od; else p:=e; fi; Add(r,p); od; return r; end;

Is there a way to eliminate this duplication (I guess there is more)?
