Save compression: 10-15x+ smaller maps, faster saves, corruption resistant, also makes julienne fries. #78857
base: master
Conversation
I've tried zstd with MA overmaps:
Dictionary trained with
Did not benchmark speed at all.
The advantage of this format is aggregating sub-4K files together to avoid wasted bytes in the allocation. I found the best dictionary was with d=20 on --train-cover, but on Windows that hits pessimal behavior in qsort, so you should train on mac/linux or fix it like I did (rewrite it in C++ and use std::stable_sort with parallel execution).
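For illustration only, here is a rough sketch of training a dictionary programmatically via zstd's zdict API rather than the CLI (this uses the basic ZDICT_trainFromBuffer trainer, not the cover trainer that --train-cover invokes); the function name and buffer handling are made up for this example.

```cpp
// Hypothetical sketch: train a zstd dictionary from in-memory sample files.
#include <zdict.h>

#include <cstdint>
#include <string>
#include <vector>

std::vector<uint8_t> train_dictionary( const std::vector<std::string> &samples,
                                       size_t dict_capacity = 112640 /* ~110 KiB */ )
{
    // zdict wants one flat buffer of all samples plus a parallel array of sizes.
    std::vector<uint8_t> flat;
    std::vector<size_t> sizes;
    for( const std::string &s : samples ) {
        flat.insert( flat.end(), s.begin(), s.end() );
        sizes.push_back( s.size() );
    }

    std::vector<uint8_t> dict( dict_capacity );
    size_t dict_size = ZDICT_trainFromBuffer( dict.data(), dict.size(),
                       flat.data(), sizes.data(),
                       static_cast<unsigned>( sizes.size() ) );
    if( ZDICT_isError( dict_size ) ) {
        return {}; // training failed, e.g. too few samples
    }
    dict.resize( dict_size );
    return dict;
}
```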
An interesting thing happens with json around the -13 level (I saw this in development too). With no dictionary:
With dictionary:
Thing is, compression speed is irrelevant because the mod content is static. Or static enough. Generally, better ratios compress better and decompress faster. Oh and just for fun, as a zzip at compression level 7:
And at 15:
As far as I understand, this will cause two potential issues:
I wouldn't change anything about the bug report upload, just the map files inside the save, which would still be a tarball or whatever. Is GitHub complaining about the archive containing those? Like it doesn't support uploading that because it can't virus scan it or something?
I'd suggest testing whether github is uncooperative by tarring up a save compressed with your format and posting it in a comment.
Well yes, that is the empirical method, but if I could do that right now I would have :)
Yeah, github dgaf.
Read through the code, left some comments.
The only important one is that we might not be particularly corruption-resistant if flexbuffers can segfault when trying to parse garbage data.
The other note that I feel strongly about (and that doesn't fit into github code comments) is please please please don't put the dict files into the root project folder (especially with their current cryptic names); we have enough random junk there as is. data/raw/zstd_dicts or something would be a much nicer place.
My thinking is, because the flexbuffer validates the object is an object/dict when we create it, and it does that from the last byte(s) of the file, if that passes then everything before those bytes is also still valid. The initial validation just checks some tag bytes and doesn't seek anywhere else in the file, so I believe it should work out. I'll move the dicts, that makes sense.
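As a minimal sketch of the kind of check described above, assuming the footer bytes have already been sliced off the end of the zzip (the helper name is illustrative, not the PR's actual code):

```cpp
// Check that the tail of the zzip parses as a flexbuffer map before trusting it.
#include <flatbuffers/flexbuffers.h>

#include <cstdint>
#include <vector>

bool footer_looks_valid( const std::vector<uint8_t> &footer_bytes )
{
    if( footer_bytes.empty() ) {
        return false;
    }
    // Flexbuffers are parsed from the tail: the root type and offset live in
    // the last bytes, so this touches only the end of the buffer.
    flexbuffers::Reference root = flexbuffers::GetRoot( footer_bytes.data(),
                                  footer_bytes.size() );
    // The zzip index is expected to be a map of filename -> entry metadata.
    return root.IsMap();
}
```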
Addressed review comments. Significant refactoring due to inserting a footer checksum frame at the front of the file, DRYing some of the frame reading/writing code into private functions, added delete functionality, more comments - just a nontrivial amount got changed.
Actually, overmaps are easy. They are never deleted and only read/generated in like one place. I can handle those in this PR. Map memory is maybe slightly more complicated, but I haven't looked too closely. If it's not bad then we can just do everything relevant in this PR and not need any followup.
Is it still WIP?
Yep. I have overmaps done in the dev branch but need to refactor some stuff. Map memory is being annoying but I'll get there. |
Summary
Infrastructure "Zstandard compressed save data for disk space savings."
Purpose of change
Long games get big. Very big. I've been given a sample of one which is almost 4GB of just maps/ data, not including map memory. Sky Islands mod type games also generate lots of terrain. Multiple dimensions are coming. All of these incur more disk pressure from save data. Compression is the natural answer for this, because save data is almost never meant to be read or modified by humans anyway.
Describe the solution
The solution is multipart.
First, pick a compression algorithm. Zstandard (zstd) is a modern compression algorithm that generally beats the common alternatives at speed or compression ratio, and sometimes both at once. It seemed like a plausible option to shoot for first. The library source is not large and is straightforward to integrate with the project due to only having internal dependencies, so zstd it is.
Next, pick a format. Part of the problem with saves is the many files which individually end up less than a page large, but because modern hard disks have a minimum allocation size of 4KB, they end up wasting large amounts of disk. So some sort of archive format like zip or tar should be used.
The zzip format used here is described below under Additional context. The logistics of managing compressed saves are relatively straightforward.
zstd supports using dictionaries for best compression ratios of small data. The presence of these dictionaries inside the world folder is used as a proxy for whether compression is enabled or not. This also means that save portability is trivial - the dictionary is encoded per-save, and is independent of the dictionary in the repo itself. Users can substitute their own dictionaries without hassle if desired.
Implementing the format was the hard part. Once it's done, it's trivial to wire up in the appropriate save code paths.
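As a rough sketch of how a per-save dictionary might be used (not the PR's actual code): build a ZSTD_CDict once from the dictionary bytes found in the world folder and reuse it for every file. The helper name and compression level are illustrative.

```cpp
// Compress one file's worth of JSON using a per-save dictionary.
#include <zstd.h>

#include <cstdint>
#include <string>
#include <vector>

std::vector<uint8_t> compress_with_dict( const std::string &json,
                                         const std::vector<uint8_t> &dict_bytes,
                                         int level = 6 )
{
    // In real use the CDict and CCtx would be built once and reused for every file.
    ZSTD_CDict *cdict = ZSTD_createCDict( dict_bytes.data(), dict_bytes.size(), level );
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    if( !cdict || !cctx ) {
        ZSTD_freeCCtx( cctx );
        ZSTD_freeCDict( cdict );
        return {};
    }

    std::vector<uint8_t> out( ZSTD_compressBound( json.size() ) );
    size_t written = ZSTD_compress_usingCDict( cctx, out.data(), out.size(),
                     json.data(), json.size(), cdict );

    ZSTD_freeCCtx( cctx );
    ZSTD_freeCDict( cdict );

    if( ZSTD_isError( written ) ) {
        return {};
    }
    out.resize( written );
    return out;
}
```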
Describe alternatives you've considered
Testing
Create worlds, run around, save, load, it works. Enable and disable compression on these worlds and others. Intentionally hack the zzip implementation to not save a footer, and confirm the data is successfully recovered through the scan and checksumming. Test that compaction works by setting a low compaction threshold and saving after every step in a game, watching sizes bounce up and down. Fix many bugs I hit along the way.
Some performance numbers using the quick timer I added in this stack:
Fresh world saving after teleporting around a bunch, writing data both to zzips and to the original files, on a Windows 11 desktop with a Threadripper PRO 7975WX:
This demonstrates the way the timers can nest. I added a timer at the root of mapbuffer::save, which encapsulates the whole duration. Breaking it down:
After teleporting some more, revisiting previously seen maps, and spending a turn in each location:
For the save above:
Additional context
On the zzip format
zstd has a concept of "frames", which are independent chunks of semantically meaningful 'data' to zstd. I put data in quotes because there are compressed data frames and also 'skippable' frames. Skippable frames are essentially arbitrary non-compressed data that the zstd library user can encode in a manner that zstd can understand and handle using specific APIs.
The zzip format encodes some limited per-file metadata using skippable frames, and a richer index as a flexbuffer footer. The flexbuffer format is convenient in that you do not need to know the length of the flexbuffer, you just need to know where it ends. This means we can always put it at the end of the file and, in constant time, read it to access the zzip index. However, for corruption detection, the front of the zzip contains a skippable frame containing the length of the footer and its hash. We read this first, then hash the footer, before trusting its contents.
For each file, we encode the filename, a hash of the compressed frame, and then finally the compressed frame itself. We store things in this order to help recover from corruption in case of a crash or power loss event.
An example layout, in file order:
- footer checksum frame (skippable): footer length 1234, footer hash 0xABCD
- per-file entry: filename cows/steak.json, content hash 0xDEADBEEF, then the compressed zstd data
- footer: the flexbuffer index
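A simplified, illustrative sketch of appending one entry in roughly this layout, assuming the filename and hash each go in their own skippable frame (the real frame boundaries and hashes may differ). The skippable-frame magic range 0x184D2A50..0x184D2A5F comes from the zstd frame format; everything else here is made up for illustration.

```cpp
// Append one (filename, hash, compressed data) entry to a zzip-like stream.
#include <cstdint>
#include <cstring>
#include <ostream>
#include <string>
#include <vector>

static void write_le32( std::ostream &out, uint32_t v )
{
    char buf[4];
    std::memcpy( buf, &v, sizeof( v ) ); // assumes a little-endian host
    out.write( buf, sizeof( buf ) );
}

// Wrap arbitrary bytes in a zstd skippable frame so zstd tooling skips them.
static void write_skippable_frame( std::ostream &out, const void *data, uint32_t size,
                                   uint32_t magic_variant )
{
    write_le32( out, 0x184D2A50u + ( magic_variant & 0xF ) );
    write_le32( out, size );
    out.write( static_cast<const char *>( data ), size );
}

void append_entry( std::ostream &out, const std::string &filename,
                   uint64_t content_hash, const std::vector<uint8_t> &compressed )
{
    write_skippable_frame( out, filename.data(),
                           static_cast<uint32_t>( filename.size() ), 0 );
    write_skippable_frame( out, &content_hash, sizeof( content_hash ), 1 );
    out.write( reinterpret_cast<const char *>( compressed.data() ),
               static_cast<std::streamsize>( compressed.size() ) );
    // A new flexbuffer footer (and its checksum frame) would be rewritten after this.
}
```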
When loading a zzip, we try to access the footer. If it is not there, or if the mandatory metadata is missing, we can recover the zzip by scanning it from the front (which is slower than just reading a flexbuffer). In sequence, we can read the filename, the hash of the compressed frame, and then verify the compressed frame is intact using that hash. The zstd frame header contains the frame size, so if we can read the header, we can test the rest of the frame. If at any point there is an issue, we assume the rest of the file is corrupted. Then a fresh footer can be written after the last intact entry.
This assumption is safe because zzips are append-only. To update a file in a zzip, we simply stick the new copy at the end, overwriting the footer if it was in the way, and write a new footer afterward. The old version is orphaned inside the zzip until compaction is triggered. Compaction will write only the latest versions of all files into a new zzip and replace it atomically with filesystem operations, the same way save files were overwritten before.
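A rough sketch of such a recovery scan, assuming the whole zzip has been read into memory; the per-entry hash verification and footer rewrite are elided, and only the frame walking uses real zstd calls.

```cpp
// Walk frames from the front of a zzip and report where intact data ends.
#include <zstd.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset just past the last intact frame; a fresh footer can be
// written there. Entry layout as described above: filename frame, hash frame,
// compressed frame, repeated.
size_t scan_for_last_intact_offset( const std::vector<uint8_t> &buf )
{
    size_t offset = 0;
    size_t last_good = 0;
    while( offset < buf.size() ) {
        // Ask zstd how long the frame starting here is (works for both
        // compressed and skippable frames). An error means corruption.
        size_t frame_size = ZSTD_findFrameCompressedSize( buf.data() + offset,
                            buf.size() - offset );
        if( ZSTD_isError( frame_size ) ) {
            break;
        }
        offset += frame_size;
        // The real format would also check that this frame completes a full
        // (filename, hash, data) triple and verify the hash before advancing
        // last_good; that bookkeeping is elided here.
        last_good = offset;
    }
    return last_good;
}
```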
The compaction algorithm is a little dumb right now. We end up triggering compaction more often than necessary because we first extend the zzip with enough space for the worst case compression result, which ends up padding with a lot of wasted bytes. This causes compaction to happen sooner than the 2x it is currently encoded for. But perfect is the enemy of good, and this is pretty good already right now.
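For illustration, the atomic-replace step of compaction might look like this sketch, where compact_into() is a hypothetical helper that copies only the latest version of each file into a fresh zzip.

```cpp
// Rewrite a zzip into a temporary file, then swap it into place.
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: writes only the latest version of each file from src into dst.
bool compact_into( const fs::path &src, const fs::path &dst );

bool compact_zzip( const fs::path &zzip_path )
{
    fs::path tmp = zzip_path;
    tmp += ".compact.tmp";

    if( !compact_into( zzip_path, tmp ) ) {
        fs::remove( tmp );
        return false;
    }
    // fs::rename replaces an existing regular file, so readers only ever see
    // either the old zzip or the fully written new one.
    fs::rename( tmp, zzip_path );
    return true;
}
```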
TODO: