Squash data files by merging repeated sublists#38

Open
ChrisJefferson wants to merge 1 commit into gap-packages:master from ChrisJefferson:squish-data

Conversation

@ChrisJefferson
Member

This is an alternative to #23 , which starts by looking for repeated sublists.

The basic idea is that we first collect every distinct innermost list, assign each one to a variable, and then output P in terms of those variables. For example, Endom16_2-8_5.txt becomes:

local P,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10;
A1:=[1,4,2,1,4,7,4,1,2,7,4,7,2,1,7,2];
A2:=[1,2,16,4,11,10,7,14,13,6,5,15,9,8,12,3];
A3:=[1,1,4,1,1,4,1,1,4,4,1,4,4,1,4,4];
A4:=[1,1,5,1,1,5,1,1,5,5,1,5,5,1,5,5];
A5:=[1,4,8,1,4,14,4,1,8,14,4,14,8,1,14,8];
A6:=[1,2,13,4,11,15,7,14,16,12,5,10,3,8,6,9];
A7:=[1,7,6,4,11,3,2,8,12,16,5,9,15,14,13,10];
A8:=[1,1,14,1,4,14,1,4,14,8,4,14,8,4,8,8];
A9:=[1,4,11,1,1,5,4,4,11,11,1,5,5,4,11,5];
A10:=[1,1,7,1,4,7,1,4,7,2,4,7,2,4,2,2];
P:=[
[A1,A2,A3,A4,A5,A6,A7],
[A1,A2,A3,A4,A8,A6,A7],
[A1,A2,A3,A9,A5,A6,A7],
[A1,A2,A3,A9,A8,A6,A7],
[A10,A2,A3,A4,A5,A6,A7],
[A10,A2,A3,A9,A5,A6,A7]
];
return P;

Now, for smaller instances gzip already does a fairly good job of detecting these repeats, but for larger ones this is very useful -- it squashes Endom down to 8.3 MB.

I made these using a little GAP function, squish.g (in the attached zip file). It takes an input file name and an output file name and writes a 'squished' output file. As part of the function I read the output file back and check that the value of P is unchanged, but we should of course double-check this carefully before merging.

squish.zip
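The actual squish.g is in the attached zip and is written in GAP; the deduplication step it performs can be sketched language-neutrally. A minimal Python illustration (names are mine, not from squish.g):

```python
def squish(P):
    """Deduplicate the innermost lists of a nested list P.

    Returns (atoms, rows): atoms is the list of distinct innermost
    lists in order of first appearance (these become A1, A2, ... in
    the squished file), and rows gives each row of P as 0-based
    indices into atoms.
    """
    atoms = []   # distinct innermost lists, in first-appearance order
    index = {}   # tuple(innermost list) -> position in atoms
    rows = []
    for row in P:
        out = []
        for inner in row:
            key = tuple(inner)
            if key not in index:
                index[key] = len(atoms)
                atoms.append(inner)
            out.append(index[key])
        rows.append(out)
    return atoms, rows
```

Writing the atoms once and the rows as references is what turns the highly repetitive files into something small even before gzip sees them.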

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.85%. Comparing base (c6a04d2) to head (a57629f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #38   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files           5        5           
  Lines         709      709           
=======================================
  Hits          708      708           
  Misses          1        1           

@fingolfin
Member

Nice! With this, Endom is just 92 MB for me without compression (instead of over 2 GB). Which means it could be stored uncompressed in the repo and only be compressed for releases.

I think further savings are possible. The largest file is Endom/32/Endom32_37-16_11.txt. The content is highly structured. For example, lines 190-16573 start with A22; they later repeat, just with the first entry changed to A61.

There are more patterns of this kind. It seems plausible to me that, using this, the file could be compressed quite a lot more (and likewise several of the other largest files).

@ChrisJefferson
Member Author

I had a quick look at that. I could push it harder, but some quick attempts ended up bigger after gzipping.

This helps the data sets linked on the front page even more: for example, Endom128 goes from 520 MB to 12 MB, and Endom243 from 251 MB to 2.1 MB (I'm currently running it on all of them).

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

The script should work on everything, but I can't run the biggest Endom32 files: they are over 3 GB uncompressed, and on my 16 GB laptop I can't even load them into GAP, never mind do anything with them :)

@ChrisJefferson
Member Author

This is all files, still gzipped:

  ┌──────────┬───────┬──────────┬──────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │
  └──────────┴───────┴──────────┴──────────┴───────┘

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

Based on @fingolfin's suggestion, I added some simple run-length encoding:

  ┌──────────┬───────┬──────────┬──────────┬───────┬─────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │  Delta  │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │ 2.8 MB  │ 0.54% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │ 386 KB  │ 0.15% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │ 2.5 MB  │ 1.10% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │ 798 KB  │ 0.52% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │ 916 KB  │ 3.08% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │ 7.3 MB  │ 0.62% │
  └──────────┴───────┴──────────┴──────────┴───────┴─────────┴───────┘
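The delta encoding behind these numbers stores a row either in full or as a list of (position, value) patches against the previous row, which is exactly what the `_R` decoder quoted later in this thread undoes. A hedged Python sketch (in the real files a patch is recognised implicitly because it starts with an integer while a full row starts with a variable reference; this sketch tags entries explicitly instead):

```python
def delta_encode(rows):
    """Encode rows as ("full", row) or ("patch", [p1, v1, p2, v2, ...])
    entries, with 1-based patch positions to match GAP indexing.
    A patch is only used when it is shorter than the full row."""
    out, prev = [], None
    for row in rows:
        if prev is not None and len(row) == len(prev):
            patch = []
            for i, (old, new) in enumerate(zip(prev, row), start=1):
                if old != new:
                    patch.extend([i, new])
            if len(patch) < len(row):  # patch only when it saves space
                out.append(("patch", patch))
                prev = row
                continue
        out.append(("full", row))
        prev = row
    return out

def delta_decode(entries):
    """Inverse of delta_encode; same logic as the _R decoder: copy the
    previous row, then overwrite the patched positions."""
    rows, prev = [], []
    for kind, e in entries:
        if kind == "patch":
            prev = list(prev)  # ShallowCopy in the GAP version
            for i in range(0, len(e), 2):
                prev[e[i] - 1] = e[i + 1]
        else:
            prev = e
        rows.append(prev)
    return rows
```

Consecutive rows in these files often differ in only one or two entries, so most rows collapse to a two- or four-element patch.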

@ChrisJefferson
Member Author

Now 2.1 MB (compressed). I added a function ProduceSHAs, which reads a directory of files loadable with ReadAsFunction and calls HexSHA256 on each of their outputs; I used this to check that I'm generating exactly the same files. (This file, squish.g, should probably be stored somewhere, like this repo, but I'm not sure it's worth making it visible to users; probably not.)

squish.g.gz

@fingolfin
Member

Awesome!

@fingolfin
Member

This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4.

(The result should perhaps then be 1.1.0 and not 1.0.5 ...)

@olexandr-konovalov
Member

This is impressive! I wonder if one could make A a single list; then P would contain positions in A instead of variables A1, A2, etc. You would need n pairs of brackets if A has length n, but if P is sufficiently long you save on not needing to write "A" each time. But then instead of return P you would have

return List(P, t -> List(t, i -> A[i]));
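This suggestion, sketched in Python for illustration (the real files are GAP; note that GAP lists are 1-based while this sketch uses 0-based indices):

```python
def expand(A, P):
    """Rebuild the full nested list from an atom table A and index
    rows P: the analogue of List(P, t -> List(t, i -> A[i]))."""
    return [[A[i] for i in row] for row in P]
```

The trade-off is purely textual: writing "12" instead of "A12" per reference, at the cost of one List/List pass when the file is loaded.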

Also, I opened two files for different orders, and both have the same line

_R:=function(rows) local r,p,e,i; r:=[]; p:=[]; for e in rows do if Length(e)>0 and IsInt(e[1]) then p:=ShallowCopy(p); for i in [1,3..Length(e)-1] do p[e[i]]:=e[i+1]; od; else p:=e; fi; Add(r,p); od; return r; end;

Is there a way to eliminate this duplication (I guess there is more)?
