
Save compression: 10-15x+ smaller maps, faster saves, corruption resistant, also makes julienne fries. #78857

Draft · wants to merge 11 commits into master from zsav
Conversation

@akrieger (Member) commented Dec 30, 2024

Summary

Infrastructure "Zstandard compressed save data for disk space savings."

Purpose of change

Long games get big. Very big. I've been given a sample of one which is almost 4GB of just maps/ data, not including map memory. Sky Islands mod type games also generate lots of terrain. Multiple dimensions are coming. All of these incur more disk pressure from save data. Compression is the natural answer for this, because save data is almost never meant to be read or modified by humans anyway.

Describe the solution

The solution is multipart.

  • First, pick a compression algorithm.

    • gzip is terrible and slow.
    • lzma is good but slow.
    • Zstandard (zstd) is a modern compression algorithm that generally beats both on speed or compression ratio, and sometimes on both at once. It seemed like a plausible first option to shoot for. The library source is not large and is straightforward to integrate into the project, since it has only internal dependencies.
    • I chose zstd.
  • Next, pick a format. Part of the problem with saves is the many files which individually end up smaller than a single page; because modern filesystems have a minimum allocation size of 4KB, these waste large amounts of disk (a 600-byte map file still occupies a full 4KB block). So some sort of archive format like zip or tar should be used.

    • tar is not suitable because it does not support random access to compressed entries. Map data is aggregated in folders where only subsets are loaded / accessed at a time. Linearly decompressing from the beginning would be slow and wasteful.
    • zip supports random access, but I could not find a good open source library for it.
    • dar was considered, as it does support random access to entries, but overall it seems like extreme overkill: it is intended for full disk archiving, not just a grab bag of files.
    • So I settled on a custom format. It has many advantages, detailed in Additional context below.

The logistics of managing compressed saves are relatively straightforward. zstd supports dictionaries for the best compression ratios on small data. The presence of these dictionaries inside the world folder is used as a proxy for whether compression is enabled. This also makes save portability trivial - the dictionary is stored per-save, and is independent of the dictionary in the repo itself. Users can substitute their own dictionaries without hassle if desired.
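
To illustrate, compressing one save blob with a per-save dictionary could look roughly like this (a minimal sketch; the `maps.dict` filename and helper names are assumptions for illustration, not the PR's actual layout):

```cpp
#include <zstd.h>

#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Load the per-save dictionary from the world folder. Its presence is what
// signals that compression is enabled for this world.
std::vector<char> load_dictionary( const std::string &world_dir )
{
    std::ifstream in( world_dir + "/maps.dict", std::ios::binary );
    return { std::istreambuf_iterator<char>( in ), std::istreambuf_iterator<char>() };
}

std::string compress_with_dict( const std::string &json, const std::vector<char> &dict )
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    // In real code the CDict would be built once per save and reused.
    ZSTD_CDict *cdict = ZSTD_createCDict( dict.data(), dict.size(), /*level=*/ 7 );
    std::string out( ZSTD_compressBound( json.size() ), '\0' );
    size_t n = ZSTD_compress_usingCDict( cctx, out.data(), out.size(),
                                         json.data(), json.size(), cdict );
    ZSTD_freeCDict( cdict );
    ZSTD_freeCCtx( cctx );
    if( ZSTD_isError( n ) ) {
        throw std::runtime_error( ZSTD_getErrorName( n ) );
    }
    out.resize( n );
    return out;
}
```

Decompression mirrors this with ZSTD_createDDict / ZSTD_decompress_usingDDict.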

Implementing the format was the hard part. Once it's done, it's trivial to wire up in the appropriate save code paths.

Describe alternatives you've considered

  • Using a database: ew. Enormous dependency, and we don't need a query language for what is functionally a key/value store.
  • Saving saves as binary flexbuffers: doesn't really deliver good space savings.

Testing

Create worlds, run around, save, load, it works. Enable and disable compression on these worlds and others. Intentionally hack the zzip implementation to not save a footer, and confirm the data is successfully recovered through the scan and checksumming. Test that compaction works by setting a low compaction threshold and saving after every step in-game, watching sizes bounce up and down. Fix many bugs I hit along the way.

Some performance numbers using the quick timer I added in this stack:
Fresh world saving after teleporting around a bunch, writing data both to zzips and to the original files, on a Windows 11 desktop with a Threadripper PRO 7975WX:

mapbuffer::save: 3375692us (avg: 3375692us) (count: 1)
  mapbuffer::save_quad: 3151013us (avg: 400us) (count: 7875)
    mapbuffer::save_quad serialize JsonOut: 363504us (avg: 554us) (count: 655)
    mapbuffer::save_quad write to file: 1170192us (avg: 1786us) (count: 655)
    mapbuffer::save_quad write to zzip: 1403013us (avg: 2142us) (count: 655)
      mapbuffer::save_quad compact zzip: 957177us (avg: 1461us) (count: 655)

This demonstrates the way the timers can nest. I added a timer at the root of mapbuffer::save which encapsulates the whole duration. Breaking it down -

  • 0.363s is spent serializing the mapbuffers to strings. This cost is shared by both the old and new codepaths.
  • 1.17s is spent writing the mapbuffers to files in the old codepath.
  • 1.4s is spent writing to zzips in the new codepath, however:
    • 0.957s of that is spent 'compacting' the zzips to avoid wasting space, and the compaction heuristic is not well tuned right now.
  • That leaves 0.446s writing zzips exclusive of compaction, which is more than 2x faster than the old codepath.
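
For reference, the nesting behaviour falls out naturally from an RAII scope timer along these lines (a hypothetical sketch; the PR's actual helper also aggregates the avg/count columns per label):

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

struct scoped_timer {
    std::string label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit scoped_timer( std::string l ) : label( std::move( l ) ) {}
    ~scoped_timer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start ).count();
        // The real helper buckets by label to compute the avg/count columns.
        std::printf( "%s: %lldus\n", label.c_str(), static_cast<long long>( us ) );
    }
};

void save_quad()
{
    scoped_timer outer( "mapbuffer::save_quad" ); // covers the whole function
    {
        scoped_timer t( "mapbuffer::save_quad serialize JsonOut" );
        // ... serialize ...
    } // inner timers report as their scopes close, producing the nested output
    {
        scoped_timer t( "mapbuffer::save_quad write to zzip" );
        // ... write ...
    }
}
```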

After teleporting some more, revisiting previously seen maps, and spending a turn in each location:

mapbuffer::save: 5904788us (avg: 2952394us) (count: 2)
  mapbuffer::save_quad: 5469810us (avg: 375us) (count: 14553)
    mapbuffer::save_quad serialize JsonOut: 776581us (avg: 634us) (count: 1223)
    mapbuffer::save_quad write to file: 2203060us (avg: 1801us) (count: 1223)
    mapbuffer::save_quad write to zzip: 2077924us (avg: 1699us) (count: 1223)
      mapbuffer::save_quad compact zzip: 1253596us (avg: 1025us) (count: 1223)

For the save above:

# Compressed, 'raw' after saving.
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
841843  .

# Uncompressed
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
11948820        .

# Ratio = 11948820 / 841843 = 14.2x

# Recompressed through world option
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
600992  .

# Ratio = 11948820 / 600992 = 19.88x

Additional context

On the zzip format

zstd has a concept of "frames", which are independent chunks of semantically meaningful 'data' to zstd. I put data in quotes because there are compressed data frames and also 'skippable' frames. Skippable frames are essentially arbitrary non-compressed data that the zstd library user can encode in a manner that zstd can understand, and handle using specific APIs.

The zzip format encodes some limited per-file metadata using skippable frames, plus a richer index as a flexbuffer footer. The flexbuffer format is convenient in that you do not need to know the length of the flexbuffer, only where it ends. This means we can always put it at the end of the file and, in constant time, read it to access the zzip index. For corruption detection, however, the front of the zzip contains a skippable frame holding the length of the footer and its hash. We read this first, then hash the footer, before trusting its contents.
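
A sketch of emitting that leading frame, following the skippable-frame layout from the zstd format spec - a 4-byte little-endian magic in the 0x184D2A50..0x184D2A5F range, a 4-byte little-endian payload size, then the payload (the magic variant, hash width, and helper name here are assumptions):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint8_t> make_footer_check_frame( uint32_t footer_len, uint64_t footer_hash )
{
    constexpr uint32_t skippable_magic = 0x184D2A50; // variant 0, assumed
    const uint32_t payload_size = sizeof( footer_len ) + sizeof( footer_hash );
    std::vector<uint8_t> frame( 8 + payload_size );
    // NOTE: memcpy of integers assumes a little-endian host, matching zstd's
    // little-endian on-disk format.
    std::memcpy( frame.data() + 0, &skippable_magic, 4 );
    std::memcpy( frame.data() + 4, &payload_size, 4 );
    std::memcpy( frame.data() + 8, &footer_len, sizeof( footer_len ) );
    std::memcpy( frame.data() + 12, &footer_hash, sizeof( footer_hash ) );
    return frame;
}
```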

For each file, we encode the filename, a hash of the compressed frame, and then finally the compressed frame itself. We store things in this order to help recover from corruption in case of a crash or power loss event.

| footer length & hash | file name       | compressed hash | compressed file | .... | preallocated bytes | flexbuffer footer |
|----------------------|-----------------|-----------------|-----------------|------|--------------------|-------------------|
| 1234,0xABCD          | cows/steak.json | 0xDEADBEEF      | zstd data       | ...  |                    | index             |

When loading a zzip, we try to access the footer. If it is not there, or if the mandatory metadata is missing, we can recover the zzip by scanning it from the front (which is slower than just reading a flexbuffer). In sequence, we can read the filename, the hash of the compressed frame, and then verify the compressed frame is intact using that hash. The zstd frame header contains the frame size, so if we can read the header, we can test the rest of the frame. If at any point there is an issue, we assume the rest of the file is corrupted. Then a fresh footer can be written after the last intact entry.
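
The scan can lean on zstd itself to find frame boundaries, since ZSTD_findFrameCompressedSize works on both compressed and skippable frames. A minimal sketch (helper name illustrative; the real recovery also pairs each data frame with its preceding filename/hash frames):

```cpp
#include <zstd.h>

#include <cstddef>
#include <cstdint>

// Returns the length of the intact prefix; a fresh footer can be written
// right after this offset.
size_t scan_intact_prefix( const uint8_t *data, size_t len )
{
    size_t pos = 0;
    while( pos < len ) {
        size_t frame_size = ZSTD_findFrameCompressedSize( data + pos, len - pos );
        if( ZSTD_isError( frame_size ) ) {
            break; // truncated or corrupt frame: everything from pos on is suspect
        }
        // Real code would also verify the stored hash of each content frame here.
        pos += frame_size;
    }
    return pos;
}
```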

This assumption is safe because zzips are append-only. To update a file in a zzip, we simply stick the new copy at the end, overwriting the footer if it was in the way, and write a new footer afterward. The old version is orphaned inside the zzip until compaction is triggered. Compaction will write only the latest versions of all files into a new zzip and replace it atomically with filesystem operations, the same way save files were overwritten before.
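
In sketch form, compaction under those rules is just "copy the live entries, then swap" (the index type and helper names are hypothetical; the real code also re-emits the metadata frames and writes a fresh footer):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// filename -> (offset, length) of the *latest* compressed frame for that file;
// orphaned older copies are simply never referenced by the index.
using zzip_index = std::map<std::string, std::pair<uint64_t, uint64_t>>;

void compact_zzip( const fs::path &zzip_path, const zzip_index &index )
{
    std::ifstream in( zzip_path, std::ios::binary );
    fs::path tmp = zzip_path;
    tmp += ".tmp";
    {
        std::ofstream out( tmp, std::ios::binary | std::ios::trunc );
        std::vector<char> buf;
        for( const auto &entry : index ) {
            buf.resize( entry.second.second );
            in.seekg( static_cast<std::streamoff>( entry.second.first ) );
            in.read( buf.data(), static_cast<std::streamsize>( buf.size() ) );
            out.write( buf.data(), static_cast<std::streamsize>( buf.size() ) );
        }
    }
    // Atomic replace via the filesystem, the same way saves were overwritten before.
    fs::rename( tmp, zzip_path );
}
```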

The compaction algorithm is a little dumb right now. We end up triggering compaction more often than necessary, because we first extend the zzip with enough space for the worst-case compression result, which pads it with a lot of wasted bytes. This causes compaction to trigger sooner than the 2x-growth threshold it is currently coded for. But perfect is the enemy of good, and this is already pretty good.

TODO:

  • Fix Linux build apparently.
  • Debug world / bug report support.
  • Map memory (maybe not this PR).
  • Overmaps.
  • Fix reset world removing compression.
  • Finish the PR summary.
  • Screenshots of compression results.
  • Screenshots of performance benchmarks.
  • More comments, esp. in code documentation of zzip.
  • License notice.
  • Android build.
  • Linux release build (C/CXX flags mixing improperly in Makefile).

@github-actions github-actions bot added <Documentation> Design documents, internal info, guides and help. Translation I18n Code: Build Issues regarding different builds and build environments [C++] Changes (can be) made in C++. Previously named `Code` [Markdown] Markdown issues and PRs Character / World Generation Issues and enhancements concerning stages of creating a character or a world Code: Infrastructure / Style / Static Analysis Code internal infrastructure and style labels Dec 30, 2024

Spell checker encountered unrecognized words in the in-game text added in this pull request. See below for details.

  • Toggle World <C|c>ompression

This alert is automatically generated. You can simply disregard if this is inaccurate, or (optionally) you can also add the new words to tools/spell_checker/dictionary.txt so they will not trigger an alert next time.


@github-actions github-actions bot added astyled astyled PR, label is assigned by github actions json-styled JSON lint passed, label assigned by github actions labels Dec 30, 2024
@ZhilkinSerg (Contributor) commented Dec 30, 2024

I've tried zstd with MA overmaps:

| Method     | Size (bytes) | Disk size (bytes) |
|------------|--------------|-------------------|
| gz         | 8 744 047    | 11 485 184        |
| zstd -1    | 13 407 208   | 16 068 608        |
| zstd -3    | 12 106 198   | 14 794 752        |
| zstd -3 -D | 12 300 415   | 14 962 688        |
| zstd -5    | 10 265 961   | 12 959 744        |
| zstd -7    | 9 598 894    | 12 263 424        |
| zstd -19   | 6 893 311    | 9 363 456         |

Dictionary trained with zstd --train on all uncompressed omap files with output:

Trying 5 different sets of parameters
k=50
d=8
f=20
steps=4
split=75
accel=1
Save dictionary of size 66802 into file dictionary

Did not benchmark speed at all.

@akrieger (Member, Author) commented Dec 30, 2024

The advantage of this format is aggregating sub-4KB files together to avoid wasted bytes in the allocation.

I found the best dictionary was with d=20 on --train-cover, but on Windows that hits pessimal behavior in qsort, so you should train on mac/linux or fix it like I did (rewrite it in C++ and use parallel std::stable_sort).
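
For the curious, the substitution described is roughly the following, assuming C++17 parallel execution policies (illustrative only - the actual trainer sorts its own internal structures):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

// Per the comment above, the Windows qsort hits pessimal behavior on this
// workload; a parallel stable sort sidesteps it and uses all cores.
void sort_suffixes( std::vector<uint32_t> &offsets )
{
    std::stable_sort( std::execution::par, offsets.begin(), offsets.end() );
}
```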

@akrieger akrieger force-pushed the zsav branch 2 times, most recently from 3129354 to 613d683 Compare December 31, 2024 03:06
@akrieger (Member, Author) commented:

> I've tried zstd with MA overmaps:
> ...
> Did not benchmark speed at all.

An interesting thing happens with json around the -13 level (I saw this in development too). With no dictionary:

$ zstd/contrib/VS2005/bin/x64/Release/zstd.exe -b1e15 -r MA_overmap/
 1# 1337 files       : 140286061 ->  13401860 (x10.47),  385.8 MB/s,  688.8 MB/s
 2# 1337 files       : 140286061 ->  12869576 (x10.90),  398.4 MB/s,  733.8 MB/s
 3# 1337 files       : 140286061 ->  12100850 (x11.59),  423.8 MB/s,  839.0 MB/s
 4# 1337 files       : 140286061 ->  11375873 (x12.33),  224.8 MB/s,  812.2 MB/s
 5# 1337 files       : 140286061 ->  10260613 (x13.67),  156.1 MB/s, 1218.4 MB/s
 6# 1337 files       : 140286061 ->   9877264 (x14.20),  109.1 MB/s, 1170.0 MB/s
 7# 1337 files       : 140286061 ->   9593546 (x14.62),   68.7 MB/s,  977.8 MB/s
 8# 1337 files       : 140286061 ->   9339853 (x15.02),   54.5 MB/s, 1091.4 MB/s
 9# 1337 files       : 140286061 ->   8858622 (x15.84),   45.3 MB/s, 1195.7 MB/s
10# 1337 files       : 140286061 ->   8444538 (x16.61),   31.6 MB/s, 1214.4 MB/s
11# 1337 files       : 140286061 ->   7770721 (x18.05),   11.0 MB/s, 1820.7 MB/s
12# 1337 files       : 140286061 ->   7568403 (x18.54),   8.83 MB/s, 1816.9 MB/s
13# 1337 files       : 140286061 ->   8586961 (x16.34),   7.04 MB/s, 1342.8 MB/s
14# 1337 files       : 140286061 ->   8044176 (x17.44),   6.06 MB/s, 1972.9 MB/s
15# 1337 files       : 140286061 ->   7033479 (x19.95),   4.57 MB/s, 1962.8 MB/s

With dictionary zstd.exe --train-cover=steps=512,d=20 -r MA_overmap/ -o ma_overmap.dict -M1000MB --maxdict=100KB -T0 -7 (ignore the compression/decompression speeds - I changed the laptop's performance profile in between; just look at the ratios):

 1# 1337 files       : 140286061 ->  13349825 (x10.51),  664.3 MB/s, 1222.7 MB/s
 2# 1337 files       : 140286061 ->  13690522 (x10.25),  651.0 MB/s, 1210.5 MB/s
 3# 1337 files       : 140286061 ->  12177767 (x11.52),  719.9 MB/s, 1377.6 MB/s
 4# 1337 files       : 140286061 ->  12182817 (x11.52),  618.9 MB/s, 1357.1 MB/s
 5# 1337 files       : 140286061 ->  10714637 (x13.09),  262.3 MB/s  1447.7 MB/s
 6# 1337 files       : 140286061 ->   9801734 (x14.31),  162.1 MB/s, 1901.7 MB/s
 7# 1337 files       : 140286061 ->   9644614 (x14.55),  128.4 MB/s, 1939.1 MB/s
 8# 1337 files       : 140286061 ->   9187718 (x15.27),  106.6 MB/s, 2168.8 MB/s
 9# 1337 files       : 140286061 ->   8702472 (x16.12),   73.8 MB/s, 2345.5 MB/s
10# 1337 files       : 140286061 ->   8274563 (x16.95),   45.5 MB/s, 2595.5 MB/s
11# 1337 files       : 140286061 ->   7556593 (x18.56),   23.8 MB/s, 3356.1 MB/s
12# 1337 files       : 140286061 ->   7322328 (x19.16),   15.9 MB/s, 3476.0 MB/s
13# 1337 files       : 140286061 ->   9136798 (x15.35),   14.8 MB/s, 2419.2 MB/s
14# 1337 files       : 140286061 ->   7879967 (x17.80),   10.8 MB/s, 3081.3 MB/s
15# 1337 files       : 140286061 ->   6781610 (x20.69),   6.44 MB/s, 3444.2 MB/s

Thing is, compression speed is irrelevant because the mod content is static. Or static enough. Generally, better ratios also decompress faster.

Oh and just for fun, as a zzip at compression level 7:

$ du -shb save/Timersbaby/maps/ma_overmap.zzip
9776444 save/Timersbaby/maps/ma_overmap.zzip

And at 15:

$ du -shb save/Timersbaby/maps/ma_overmap.zzip
6913392 save/Timersbaby/maps/ma_overmap.zzip

@PatrikLundell (Contributor) commented:

As far as I understand, this will cause two potential issues:

  1. Compressed debug saves. I assume the logic allows for ripping the unwanted files/folders out of a copy of the archive with a modification to the corresponding code (or just copying selected parts), though.
  2. Fault finding and hacking. It's not that rare to need to examine what's in the save files to understand why things misbehave, or to remove/fix things that cause a save to blow up when loaded. Thus, there's a need for a tool that allows unpacking and repacking the data in a file. Having to do that will raise the threshold for attempting such activities, resulting in poorer support.

@akrieger (Member, Author) commented:

  1. Yes, good point. However, given the size savings, it's unclear whether we need to support trimming, or can just package the dictionary with the subset of bundled maps folders needed. The implementation compresses each subfolder of maps/ separately.
  2. There is an option in the main menu for enabling/disabling compression for the whole world at once. It's so fast on normal-sized worlds it may as well be instant. For the enormous sample world I have, with its ~3.5GB of data, it takes a couple of minutes.

@andrei8l (Contributor) commented:

> However given the size savings it's unclear whether we need to support trimming or just package the dictionary with the subset of bundled maps folders needed

We still need a tar.gz for bug reports
(screenshot: 2024-12-31 15-15-09)
And some people have truly enormous save folders so I expect we'll still need trimming too, and that needs only a minor change to work with compressed saves AFAICT.

@akrieger (Member, Author) commented:

I wouldn't change anything about the bug report upload, just the maps files inside the save, which would still be a tarball or whatever. Is GitHub complaining about the archive containing those? Like it doesn't support uploading that because it can't virus scan it or something?

@PatrikLundell (Contributor) commented:

I'd suggest testing whether github is uncooperative by tarring up a save compressed with your format and posting it in a comment.

@akrieger (Member, Author) commented:

Well yes that is the empirical method, but if I could do that right now I would have :)

@akrieger (Member, Author) commented:

> I'd suggest testing whether github is uncooperative by tarring up a save compressed with your format and posting it in a comment.

Maunawili.zip

Yeah github dgaf.

@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@github-actions github-actions bot removed the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@akrieger akrieger force-pushed the zsav branch 2 times, most recently from 6b463da to 024a9a7 Compare January 2, 2025 19:55
@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@moxian (Contributor) left a comment


Read through the code, left some comments.
The only important one is that we might not be particularly corruption-resistant if flexbuffers can segfault when trying to parse garbage data.

The other note that I feel strongly about (and that doesn't fit into github code comments) is please please please don't put the dict files into the root project folder (especially with their current cryptic names), we have enough random junk there as is. data/raw/zstd_dicts or something would be a much nicer place.

@akrieger (Member, Author) commented:

My thinking is: because the flexbuffer validates that the root object is an object/dict when we create it, and it does that from the last byte(s) of the file, then if that passes, everything before those bytes is also still valid. The initial validation just checks some tag bytes and doesn't seek anywhere else in the file, so I believe it should work out.

I'll move the dicts, that makes sense.
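
A minimal sketch of that cheap root check, assuming the flatbuffers flexbuffers C++ API (helper name hypothetical; a full structural check would be flexbuffers::VerifyBuffer, at the cost of walking the whole footer):

```cpp
#include "flatbuffers/flexbuffers.h"

#include <cstdint>
#include <vector>

bool footer_looks_valid( const std::vector<uint8_t> &footer )
{
    if( footer.size() < 3 ) {
        return false; // too small for root value + type byte + byte width
    }
    // GetRoot only inspects the trailing bytes to locate the root, so this
    // check does not touch the rest of the buffer.
    return flexbuffers::GetRoot( footer.data(), footer.size() ).IsMap();
}
```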

@akrieger (Member, Author) commented:

Addressed review comments. Significant refactoring: inserted a footer checksum frame at the front of the file, DRYed some of the frame reading/writing code into private functions, added delete functionality, and added more comments - a nontrivial amount got changed.

@github-actions github-actions bot removed the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 20, 2025
@akrieger akrieger force-pushed the zsav branch 2 times, most recently from f7e28db to 12c41a3 Compare January 20, 2025 21:32
@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 21, 2025
@github-actions github-actions bot added BasicBuildPassed This PR builds correctly, label assigned by github actions and removed BasicBuildPassed This PR builds correctly, label assigned by github actions labels Jan 21, 2025
@akrieger (Member, Author) commented:

Actually, overmaps are easy. They are never deleted and only read/generated in like one place. I can handle those in this PR.

Map memory, maybe slightly more complicated. But I haven't looked too closely. If it's not bad then we can just do everything relevant in this PR and not need any followup.

@GuardianDll (Member) commented:

Is it still WIP?

@akrieger (Member, Author) commented Jan 29, 2025

Yep. I have overmaps done in the dev branch but need to refactor some stuff. Map memory is being annoying but I'll get there.
