
Save compression: 10-15x+ smaller maps, faster saves, corruption resistant, also makes julienne fries. #78857

Draft · wants to merge 11 commits into master from zsav
Conversation

@akrieger (Member) commented Dec 30, 2024

Summary

Infrastructure "Zstandard compressed save data for disk space savings."

Purpose of change

Long games get big. Very big. I've been given a sample of one which is almost 4GB of just maps/ data, not including map memory. Sky Islands mod type games also generate lots of terrain. Multiple dimensions are coming. All of these incur more disk pressure from save data. Compression is the natural answer for this, because save data is almost never meant to be read or modified by humans anyway.

Describe the solution

The solution is multipart.

  • First, pick a compression algorithm.

    • gzip is terrible and slow.
    • lzma is good but slow.
    • Zstandard (zstd) is a modern compression algorithm that generally beats both on speed or compression ratio, and sometimes on both at once. It seemed like a plausible first option to shoot for. The library source is not large and is straightforward to integrate into the project, since it has only internal dependencies.
    • I chose zstd.
  • Next, pick a format. Part of the problem with saves is the many files which individually end up smaller than a single page; because modern filesystems have a minimum allocation size of 4KB, these waste large amounts of disk (a 600-byte map file still occupies a full 4KB block). So some sort of archive format like zip or tar should be used.

    • tar is not suitable because it does not support random access to compressed entries. Map data is aggregated in folders where only subsets are loaded / accessed at a time. Linearly decompressing from the beginning would be slow and wasteful.
    • zip supports random access, but I could not find a good open source library for it.
    • dar was considered, as it does support random access to entries, but overall it seems like extreme overkill: it is intended for full disk archiving, not just a grab bag of files.
    • So I settled on a custom format. It has many advantages, detailed in Additional context below.

The logistics of managing compressed saves are relatively straightforward. zstd supports dictionaries for the best compression ratios on small data. The presence of these dictionaries inside the world folder is used as a proxy for whether compression is enabled. This also makes save portability trivial - the dictionary is stored per-save, and is independent of the dictionary in the repo itself. Users can substitute their own dictionaries without hassle if desired.
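
To illustrate, compressing one save blob with a per-save dictionary could look roughly like this (a minimal sketch; the `maps.dict` filename and helper names are assumptions for illustration, not the PR's actual layout):

```cpp
#include <zstd.h>

#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Load the per-save dictionary from the world folder. Its presence is what
// signals that compression is enabled for this world.
std::vector<char> load_dictionary( const std::string &world_dir )
{
    std::ifstream in( world_dir + "/maps.dict", std::ios::binary );
    return { std::istreambuf_iterator<char>( in ), std::istreambuf_iterator<char>() };
}

std::string compress_with_dict( const std::string &json, const std::vector<char> &dict )
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    // In real code the CDict would be built once per save and reused.
    ZSTD_CDict *cdict = ZSTD_createCDict( dict.data(), dict.size(), /*level=*/ 7 );
    std::string out( ZSTD_compressBound( json.size() ), '\0' );
    size_t n = ZSTD_compress_usingCDict( cctx, out.data(), out.size(),
                                         json.data(), json.size(), cdict );
    ZSTD_freeCDict( cdict );
    ZSTD_freeCCtx( cctx );
    if( ZSTD_isError( n ) ) {
        throw std::runtime_error( ZSTD_getErrorName( n ) );
    }
    out.resize( n );
    return out;
}
```

Decompression mirrors this with ZSTD_createDDict / ZSTD_decompress_usingDDict.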

Implementing the format was the hard part. Once it's done, it's trivial to wire up in the appropriate save code paths.

Describe alternatives you've considered

  • Using a database: ew. Enormous dependency, and we don't need a query language for what is functionally a key/value store.
  • Saving saves as binary flexbuffers: doesn't really deliver good space savings.

Testing

Create worlds, run around, save, load, it works. Enable and disable compression on these worlds and others. Intentionally hack the zzip implementation to not save a footer, and confirm the data is successfully recovered through the scan and checksumming. Test that compaction works by setting a low compaction threshold and saving after every step in-game, watching sizes bounce up and down. Fix many bugs I hit along the way.

Some performance numbers using the quick timer I added in this stack:
Fresh world saving after teleporting around a bunch, writing data both to zzips and to the original files, on a Windows 11 desktop with a Threadripper PRO 7975WX:

mapbuffer::save: 3375692us (avg: 3375692us) (count: 1)
  mapbuffer::save_quad: 3151013us (avg: 400us) (count: 7875)
    mapbuffer::save_quad serialize JsonOut: 363504us (avg: 554us) (count: 655)
    mapbuffer::save_quad write to file: 1170192us (avg: 1786us) (count: 655)
    mapbuffer::save_quad write to zzip: 1403013us (avg: 2142us) (count: 655)
      mapbuffer::save_quad compact zzip: 957177us (avg: 1461us) (count: 655)

This demonstrates the way the timers can nest. I added a timer at the root of mapbuffer::save which encapsulates the whole duration. Breaking it down -

  • 0.363s is spent serializing the mapbuffers to strings. This cost is shared by both the old and new codepaths.
  • 1.17s is spent writing the mapbuffers to files in the old codepath.
  • 1.4s is spent writing to zzips in the new codepath, however:
    • 0.957s of that is spent 'compacting' the zzips to avoid wasting space, and the compaction heuristic is not well tuned right now.
  • That leaves 0.446s writing zzips exclusive of compaction, which is more than 2x faster than the old codepath.
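
For reference, the nesting behaviour falls out naturally from an RAII scope timer along these lines (a hypothetical sketch; the PR's actual helper also aggregates the avg/count columns per label):

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

struct scoped_timer {
    std::string label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit scoped_timer( std::string l ) : label( std::move( l ) ) {}
    ~scoped_timer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start ).count();
        // The real helper buckets by label to compute the avg/count columns.
        std::printf( "%s: %lldus\n", label.c_str(), static_cast<long long>( us ) );
    }
};

void save_quad()
{
    scoped_timer outer( "mapbuffer::save_quad" ); // covers the whole function
    {
        scoped_timer t( "mapbuffer::save_quad serialize JsonOut" );
        // ... serialize ...
    } // inner timers report as their scopes close, producing the nested output
    {
        scoped_timer t( "mapbuffer::save_quad write to zzip" );
        // ... write ...
    }
}
```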

After teleporting some more, revisiting previously seen maps, and spending a turn in each location:

mapbuffer::save: 5904788us (avg: 2952394us) (count: 2)
  mapbuffer::save_quad: 5469810us (avg: 375us) (count: 14553)
    mapbuffer::save_quad serialize JsonOut: 776581us (avg: 634us) (count: 1223)
    mapbuffer::save_quad write to file: 2203060us (avg: 1801us) (count: 1223)
    mapbuffer::save_quad write to zzip: 2077924us (avg: 1699us) (count: 1223)
      mapbuffer::save_quad compact zzip: 1253596us (avg: 1025us) (count: 1223)

For the save above:

# Compressed, 'raw' after saving.
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
841843  .

# Uncompressed
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
11948820        .

# Ratio = 11948820 / 841843 = 14.2x

# Recompressed through world option
Cataclysm-DDA/save/Schaal/maps
$ du -shb .
600992  .

# Ratio = 11948820 / 600992 = 19.88x

Additional context

On the zzip format

zstd has a concept of "frames", which are independent chunks of semantically meaningful 'data' to zstd. I put data in quotes because there are compressed data frames and also 'skippable' frames. Skippable frames are essentially arbitrary non-compressed data that the zstd library user can encode in a manner that zstd can understand, and handle using specific APIs.

The zzip format encodes some limited per-file metadata using skippable frames, plus a richer index as a flexbuffer footer. The flexbuffer format is convenient in that you do not need to know the length of the flexbuffer, only where it ends. This means we can always put it at the end of the file and, in constant time, read it to access the zzip index. For corruption detection, however, the front of the zzip contains a skippable frame holding the length of the footer and its hash. We read this first, then hash the footer, before trusting its contents.
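
A sketch of emitting that leading frame, following the skippable-frame layout from the zstd format spec - a 4-byte little-endian magic in the 0x184D2A50..0x184D2A5F range, a 4-byte little-endian payload size, then the payload (the magic variant, hash width, and helper name here are assumptions):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint8_t> make_footer_check_frame( uint32_t footer_len, uint64_t footer_hash )
{
    constexpr uint32_t skippable_magic = 0x184D2A50; // variant 0, assumed
    const uint32_t payload_size = sizeof( footer_len ) + sizeof( footer_hash );
    std::vector<uint8_t> frame( 8 + payload_size );
    // NOTE: memcpy of integers assumes a little-endian host, matching zstd's
    // little-endian on-disk format.
    std::memcpy( frame.data() + 0, &skippable_magic, 4 );
    std::memcpy( frame.data() + 4, &payload_size, 4 );
    std::memcpy( frame.data() + 8, &footer_len, sizeof( footer_len ) );
    std::memcpy( frame.data() + 12, &footer_hash, sizeof( footer_hash ) );
    return frame;
}
```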

For each file, we encode the filename, a hash of the compressed frame, and then finally the compressed frame itself. We store things in this order to help recover from corruption in case of a crash or power loss event.

| footer length & hash | file name       | compressed hash | compressed file | .... | preallocated bytes | flexbuffer footer |
|----------------------|-----------------|-----------------|-----------------|------|--------------------|-------------------|
| 1234,0xABCD          | cows/steak.json | 0xDEADBEEF      | zstd data       | ...  |                    | index             |

When loading a zzip, we try to access the footer. If it is not there, or if the mandatory metadata is missing, we can recover the zzip by scanning it from the front (which is slower than just reading a flexbuffer). In sequence, we can read the filename, the hash of the compressed frame, and then verify the compressed frame is intact using that hash. The zstd frame header contains the frame size, so if we can read the header, we can test the rest of the frame. If at any point there is an issue, we assume the rest of the file is corrupted. Then a fresh footer can be written after the last intact entry.
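
The scan can lean on zstd itself to find frame boundaries, since ZSTD_findFrameCompressedSize works on both compressed and skippable frames. A minimal sketch (helper name illustrative; the real recovery also pairs each data frame with its preceding filename/hash frames):

```cpp
#include <zstd.h>

#include <cstddef>
#include <cstdint>

// Returns the length of the intact prefix; a fresh footer can be written
// right after this offset.
size_t scan_intact_prefix( const uint8_t *data, size_t len )
{
    size_t pos = 0;
    while( pos < len ) {
        size_t frame_size = ZSTD_findFrameCompressedSize( data + pos, len - pos );
        if( ZSTD_isError( frame_size ) ) {
            break; // truncated or corrupt frame: everything from pos on is suspect
        }
        // Real code would also verify the stored hash of each content frame here.
        pos += frame_size;
    }
    return pos;
}
```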

This assumption is safe because zzips are append-only. To update a file in a zzip, we simply stick the new copy at the end, overwriting the footer if it was in the way, and write a new footer afterward. The old version is orphaned inside the zzip until compaction is triggered. Compaction will write only the latest versions of all files into a new zzip and replace it atomically with filesystem operations, the same way save files were overwritten before.
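
In sketch form, compaction under those rules is just "copy the live entries, then swap" (the index type and helper names are hypothetical; the real code also re-emits the metadata frames and writes a fresh footer):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// filename -> (offset, length) of the *latest* compressed frame for that file;
// orphaned older copies are simply never referenced by the index.
using zzip_index = std::map<std::string, std::pair<uint64_t, uint64_t>>;

void compact_zzip( const fs::path &zzip_path, const zzip_index &index )
{
    std::ifstream in( zzip_path, std::ios::binary );
    fs::path tmp = zzip_path;
    tmp += ".tmp";
    {
        std::ofstream out( tmp, std::ios::binary | std::ios::trunc );
        std::vector<char> buf;
        for( const auto &entry : index ) {
            buf.resize( entry.second.second );
            in.seekg( static_cast<std::streamoff>( entry.second.first ) );
            in.read( buf.data(), static_cast<std::streamsize>( buf.size() ) );
            out.write( buf.data(), static_cast<std::streamsize>( buf.size() ) );
        }
    }
    // Atomic replace via the filesystem, the same way saves were overwritten before.
    fs::rename( tmp, zzip_path );
}
```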

The compaction algorithm is a little dumb right now. We end up triggering compaction more often than necessary, because we first extend the zzip with enough space for the worst-case compression result, which pads it with a lot of wasted bytes. This causes compaction to trigger sooner than the 2x-growth threshold it is currently coded for. But perfect is the enemy of good, and this is already pretty good.

TODO:

  • Fix Linux build apparently.
  • Debug world / bug report support.
  • Map memory (maybe not this PR).
  • Overmaps.
  • Fix reset world removing compression.
  • Finish the PR summary.
  • Screenshots of compression results.
  • Screenshots of performance benchmarks.
  • More comments, esp. in code documentation of zzip.
  • License notice.
  • Android build.
  • Linux release build (C/CXX flags mixing improperly in Makefile).

@github-actions github-actions bot added <Documentation> Design documents, internal info, guides and help. Translation I18n Code: Build Issues regarding different builds and build environments [C++] Changes (can be) made in C++. Previously named `Code` [Markdown] Markdown issues and PRs Character / World Generation Issues and enhancements concerning stages of creating a character or a world Code: Infrastructure / Style / Static Analysis Code internal infrastructure and style labels Dec 30, 2024

Spell checker encountered unrecognized words in the in-game text added in this pull request. See below for details.

  • Toggle World <C|c>ompression

This alert is automatically generated. You can simply disregard if this is inaccurate, or (optionally) you can also add the new words to tools/spell_checker/dictionary.txt so they will not trigger an alert next time.


@github-actions github-actions bot added astyled astyled PR, label is assigned by github actions json-styled JSON lint passed, label assigned by github actions labels Dec 30, 2024
@ZhilkinSerg (Contributor) commented Dec 30, 2024

I've tried zstd with MA overmaps:

| Method     | Size (bytes) | Disk size (bytes) |
|------------|--------------|-------------------|
| gz         | 8 744 047    | 11 485 184        |
| zstd -1    | 13 407 208   | 16 068 608        |
| zstd -3    | 12 106 198   | 14 794 752        |
| zstd -3 -D | 12 300 415   | 14 962 688        |
| zstd -5    | 10 265 961   | 12 959 744        |
| zstd -7    | 9 598 894    | 12 263 424        |
| zstd -19   | 6 893 311    | 9 363 456         |

Dictionary trained with zstd --train on all uncompressed omap files with output:

Trying 5 different sets of parameters
k=50
d=8
f=20
steps=4
split=75
accel=1
Save dictionary of size 66802 into file dictionary

Did not benchmark speed at all.

@akrieger (Member, Author) commented Dec 30, 2024

The advantage of this format is aggregating sub-4KB files together to avoid wasted bytes in the allocation.

I found the best dictionary was with d=20 on --train-cover, but on Windows that hits pessimal behavior in qsort, so you should train on mac/linux or fix it like I did (rewrite it in C++ and use parallel std::stable_sort).
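
For the curious, the substitution described is roughly the following, assuming C++17 parallel execution policies (illustrative only - the actual trainer sorts its own internal structures):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

// Per the comment above, the Windows qsort hits pessimal behavior on this
// workload; a parallel stable sort sidesteps it and uses all cores.
void sort_suffixes( std::vector<uint32_t> &offsets )
{
    std::stable_sort( std::execution::par, offsets.begin(), offsets.end() );
}
```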

@akrieger akrieger force-pushed the zsav branch 2 times, most recently from 3129354 to 613d683 Compare December 31, 2024 03:06
@akrieger (Member, Author) commented:

> I've tried zstd with MA overmaps:
> ...
> Did not benchmark speed at all.

An interesting thing happens with json around the -13 level (I saw this in development too). With no dictionary:

$ zstd/contrib/VS2005/bin/x64/Release/zstd.exe -b1e15 -r MA_overmap/
 1# 1337 files       : 140286061 ->  13401860 (x10.47),  385.8 MB/s,  688.8 MB/s
 2# 1337 files       : 140286061 ->  12869576 (x10.90),  398.4 MB/s,  733.8 MB/s
 3# 1337 files       : 140286061 ->  12100850 (x11.59),  423.8 MB/s,  839.0 MB/s
 4# 1337 files       : 140286061 ->  11375873 (x12.33),  224.8 MB/s,  812.2 MB/s
 5# 1337 files       : 140286061 ->  10260613 (x13.67),  156.1 MB/s, 1218.4 MB/s
 6# 1337 files       : 140286061 ->   9877264 (x14.20),  109.1 MB/s, 1170.0 MB/s
 7# 1337 files       : 140286061 ->   9593546 (x14.62),   68.7 MB/s,  977.8 MB/s
 8# 1337 files       : 140286061 ->   9339853 (x15.02),   54.5 MB/s, 1091.4 MB/s
 9# 1337 files       : 140286061 ->   8858622 (x15.84),   45.3 MB/s, 1195.7 MB/s
10# 1337 files       : 140286061 ->   8444538 (x16.61),   31.6 MB/s, 1214.4 MB/s
11# 1337 files       : 140286061 ->   7770721 (x18.05),   11.0 MB/s, 1820.7 MB/s
12# 1337 files       : 140286061 ->   7568403 (x18.54),   8.83 MB/s, 1816.9 MB/s
13# 1337 files       : 140286061 ->   8586961 (x16.34),   7.04 MB/s, 1342.8 MB/s
14# 1337 files       : 140286061 ->   8044176 (x17.44),   6.06 MB/s, 1972.9 MB/s
15# 1337 files       : 140286061 ->   7033479 (x19.95),   4.57 MB/s, 1962.8 MB/s

With dictionary zstd.exe --train-cover=steps=512,d=20 -r MA_overmap/ -o ma_overmap.dict -M1000MB --maxdict=100KB -T0 -7 (ignore the compression/decompression speeds - I changed the laptop's performance profile in between; just look at the ratios):

 1# 1337 files       : 140286061 ->  13349825 (x10.51),  664.3 MB/s, 1222.7 MB/s
 2# 1337 files       : 140286061 ->  13690522 (x10.25),  651.0 MB/s, 1210.5 MB/s
 3# 1337 files       : 140286061 ->  12177767 (x11.52),  719.9 MB/s, 1377.6 MB/s
 4# 1337 files       : 140286061 ->  12182817 (x11.52),  618.9 MB/s, 1357.1 MB/s
 5# 1337 files       : 140286061 ->  10714637 (x13.09),  262.3 MB/s  1447.7 MB/s
 6# 1337 files       : 140286061 ->   9801734 (x14.31),  162.1 MB/s, 1901.7 MB/s
 7# 1337 files       : 140286061 ->   9644614 (x14.55),  128.4 MB/s, 1939.1 MB/s
 8# 1337 files       : 140286061 ->   9187718 (x15.27),  106.6 MB/s, 2168.8 MB/s
 9# 1337 files       : 140286061 ->   8702472 (x16.12),   73.8 MB/s, 2345.5 MB/s
10# 1337 files       : 140286061 ->   8274563 (x16.95),   45.5 MB/s, 2595.5 MB/s
11# 1337 files       : 140286061 ->   7556593 (x18.56),   23.8 MB/s, 3356.1 MB/s
12# 1337 files       : 140286061 ->   7322328 (x19.16),   15.9 MB/s, 3476.0 MB/s
13# 1337 files       : 140286061 ->   9136798 (x15.35),   14.8 MB/s, 2419.2 MB/s
14# 1337 files       : 140286061 ->   7879967 (x17.80),   10.8 MB/s, 3081.3 MB/s
15# 1337 files       : 140286061 ->   6781610 (x20.69),   6.44 MB/s, 3444.2 MB/s

Thing is, compression speed is irrelevant because the mod content is static. Or static enough. Generally, better ratios also decompress faster.

Oh and just for fun, as a zzip at compression level 7:

$ du -shb save/Timersbaby/maps/ma_overmap.zzip
9776444 save/Timersbaby/maps/ma_overmap.zzip

And at 15:

$ du -shb save/Timersbaby/maps/ma_overmap.zzip
6913392 save/Timersbaby/maps/ma_overmap.zzip

@PatrikLundell (Contributor) commented:

As far as I understand, this will cause two potential issues:

  1. Compressed debug saves. I assume the logic allows for ripping the unwanted files/folders out of a copy of the archive with a modification to the corresponding code (or just copying selected parts), though.
  2. Fault finding and hacking. It's not that rare to need to examine what's in the save files to understand why things misbehave, or to remove/fix things that cause a save to blow up when loaded. Thus, there's a need for a tool that allows unpacking and repacking the data in a file. Having to do that will raise the threshold for attempting such activities, resulting in poorer support.

@akrieger (Member, Author) commented:

  1. Yes, good point. However, given the size savings, it's unclear whether we need to support trimming, or can just package the dictionary with the subset of bundled maps folders needed. The implementation compresses each subfolder of maps/ separately.
  2. There is an option in the main menu for enabling/disabling compression for the whole world at once. It's so fast on normal-sized worlds it may as well be instant. For the enormous sample world I have, with its ~3.5GB of data, it takes a couple of minutes.

@andrei8l (Contributor) commented:

> However given the size savings it's unclear whether we need to support trimming or just package the dictionary with the subset of bundled maps folders needed

We still need a tar.gz for bug reports
(screenshot: 2024-12-31 15-15-09)
And some people have truly enormous save folders so I expect we'll still need trimming too, and that needs only a minor change to work with compressed saves AFAICT.

@akrieger (Member, Author) commented:

I wouldn't change anything about the bug report upload, just the maps files inside the save, which would still be a tarball or whatever. Is GitHub complaining about the archive containing those? Like it doesn't support uploading that because it can't virus scan it or something?

@PatrikLundell (Contributor) commented:

I'd suggest testing whether github is uncooperative by tarring up a save compressed with your format and posting it in a comment.

@akrieger (Member, Author) commented:

Well yes that is the empirical method, but if I could do that right now I would have :)

@akrieger (Member, Author) commented:

> I'd suggest testing whether github is uncooperative by tarring up a save compressed with your format and posting it in a comment.

Maunawili.zip

Yeah github dgaf.

@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@github-actions github-actions bot removed the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@akrieger akrieger force-pushed the zsav branch 2 times, most recently from 6b463da to 024a9a7 Compare January 2, 2025 19:55
@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 2, 2025
@moxian (Contributor) left a comment


Read through the code, left some comments.
The only important one is that we might not be particularly corruption-resistant if flexbuffers can segfault when trying to parse garbage data.

The other note that I feel strongly about (and that doesn't fit into github code comments) is please please please don't put the dict files into the root project folder (especially with their current cryptic names), we have enough random junk there as is. data/raw/zstd_dicts or something would be a much nicer place.

@akrieger (Member, Author) commented:

My thinking is: because the flexbuffer validates that the root object is an object/dict when we create it, and it does that from the last byte(s) of the file, then if that passes, everything before those bytes is also still valid. The initial validation just checks some tag bytes and doesn't seek anywhere else in the file, so I believe it should work out.

I'll move the dicts, that makes sense.
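
A minimal sketch of that cheap root check, assuming the flatbuffers flexbuffers C++ API (helper name hypothetical; a full structural check would be flexbuffers::VerifyBuffer, at the cost of walking the whole footer):

```cpp
#include "flatbuffers/flexbuffers.h"

#include <cstdint>
#include <vector>

bool footer_looks_valid( const std::vector<uint8_t> &footer )
{
    if( footer.size() < 3 ) {
        return false; // too small for root value + type byte + byte width
    }
    // GetRoot only inspects the trailing bytes to locate the root, so this
    // check does not touch the rest of the buffer.
    return flexbuffers::GetRoot( footer.data(), footer.size() ).IsMap();
}
```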

@akrieger (Member, Author) commented:

Addressed review comments. Significant refactoring: inserted a footer checksum frame at the front of the file, DRYed some of the frame reading/writing code into private functions, added delete functionality, and added more comments - a nontrivial amount got changed.

@github-actions github-actions bot removed the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 20, 2025
@akrieger akrieger force-pushed the zsav branch 2 times, most recently from f7e28db to 12c41a3 Compare January 20, 2025 21:32
@github-actions github-actions bot added the BasicBuildPassed This PR builds correctly, label assigned by github actions label Jan 21, 2025
@github-actions github-actions bot added BasicBuildPassed This PR builds correctly, label assigned by github actions and removed BasicBuildPassed This PR builds correctly, label assigned by github actions labels Jan 21, 2025
@akrieger (Member, Author) commented:

Actually, overmaps are easy. They are never deleted and only read/generated in like one place. I can handle those in this PR.

Map memory, maybe slightly more complicated. But I haven't looked too closely. If it's not bad then we can just do everything relevant in this PR and not need any followup.

@GuardianDll (Member) commented:

Is it still WIP?

@akrieger (Member, Author) commented Jan 29, 2025

Yep. I have overmaps done in the dev branch but need to refactor some stuff. Map memory is being annoying but I'll get there.
