Squash data files by merging repeated sublists #38
ChrisJefferson wants to merge 1 commit into gap-packages:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master      #38   +/-   ##
=======================================
  Coverage   99.85%   99.85%
=======================================
  Files           5        5
  Lines         709      709
=======================================
  Hits          708      708
  Misses          1        1
Nice! With this, I think further savings are possible. The largest file is ... There are more patterns of this kind. It seems plausible to me that, using this, the file could be compressed quite a lot more (and likewise several of the other largest files).
I had a quick look at that; I could push it harder, but some quick attempts ended up bigger when gzipped. This helps the data sets linked on the front page even more: for example, Endom128 goes from 520MB to 12MB and Endom243 goes from 251MB to 2.1MB (I'm currently running it on all of them).
The script should work on everything, but I can't run the biggest Endom32 files, as they are bigger than 3GB uncompressed and I'm currently on a 16GB laptop; I can't even load the file into GAP, never mind do anything with it :)
This is all of the files, still gzipped:
Based on @fingolfin's suggestion, I added some simple run-length encoding:
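The encoding itself isn't reproduced in this thread. Purely as a rough sketch (the function names below are made up for illustration and are not the code in this PR), a simple run-length encoder/decoder in GAP could look like this:

```gap
# Replace each maximal run of a repeated entry by a pair [entry, count].
RunLengthEncode := function(list)
    local out, i, j;
    out := [];
    i := 1;
    while i <= Length(list) do
        j := i;
        # extend j to the end of the current run
        while j < Length(list) and list[j + 1] = list[i] do
            j := j + 1;
        od;
        Add(out, [list[i], j - i + 1]);
        i := j + 1;
    od;
    return out;
end;

# Invert the encoding: expand each [entry, count] pair back into a run.
RunLengthDecode := function(encoded)
    return Concatenation(List(encoded, p -> ListWithIdenticalEntries(p[2], p[1])));
end;
```

For instance, RunLengthEncode([1,1,1,2,2,3]) gives [[1,3],[2,2],[3,1]], and RunLengthDecode recovers the original list.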
force-pushed from 07476fa to a57629f
Now 2.1MB (compressed). I added a function
Awesome!
This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4. (The result should perhaps then be 1.1.0 and not 1.0.5 ...)
This is impressive! I wonder if one could make ... Also, I opened two files for different orders, and both have the same line ... Is there a way to eliminate this duplication (I guess there is more)?
This is an alternative to #23, which starts by looking for repeated sublists.
The basic idea is that we first make a list of every list of length one, then output them. For example, Endom16_2-8_5.txt becomes:

Now, for smaller instances, gzip does a fairly good job of detecting these repeats, but for larger ones it's very useful -- this squashes Endom down to 8.3M.
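The exact output format isn't shown above. As an illustrative sketch only (these helper names are hypothetical, and this is not the code in squish.g), deduplicating repeated sublists into a table of distinct lists plus index references might look roughly like this in GAP:

```gap
# Collect each distinct sublist once, and store the main data as
# indices into that table of distinct sublists.
SquashLists := function(data)
    local table, indices, row, pos;
    table := [];     # distinct sublists, each stored once
    indices := [];   # for each row of 'data', its position in 'table'
    for row in data do
        pos := Position(table, row);
        if pos = fail then
            Add(table, row);
            pos := Length(table);
        fi;
        Add(indices, pos);
    od;
    return rec(table := table, indices := indices);
end;

# Reconstruct the original list of lists from the squashed form.
UnsquashLists := function(squashed)
    return List(squashed.indices, i -> squashed.table[i]);
end;
```

gzip already catches many such repeats in small files, but storing each distinct sublist only once makes the saving explicit and independent of the compressor's limited window.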
I made these using a little GAP function, squish.g (in the attached zip file); it takes an input file name and an output file name, and makes a 'squished' output file. As part of the function I read the file back and check that the value of P is the same, but we should of course double-check this carefully before merging this.
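A minimal sketch of that round-trip check, assuming each data file assigns the global variable P when read (the helper name is made up, and the real squish.g may do this differently):

```gap
# Read both files and compare the value of P that each one defines.
CheckSquish := function(origFile, squashedFile)
    local P1, P2;
    Read(origFile);        # binds the global variable P
    P1 := ValueGlobal("P");
    Read(squashedFile);    # rebinds P from the squashed file
    P2 := ValueGlobal("P");
    return P1 = P2;
end;
```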
squish.zip