Save space by compressing at least some of the builds #23
Comments
On Friday, 12 August 2016 at 07:51:40 CEST, Aleks-Daniel Jakimenko-Aleksejev wrote: […]
Actually I like the git repo idea very much. With the build files in a git repo […] — have you run git repack after committing the different versions? That should […]
— Stefan
It is hard to tell if it is going to perform better when we put all builds into it. Currently, with just 7 builds in, the 28 MB repo size is equivalent to storing each build separately (≈4 MB per build). Also, I'm not sure if performance is going to be adequate. Bisect has to jump a couple of hundred commits back and forth, which is definitely slower than just unpacking a 4 MB archive (or am I wrong?).
Well, yes, it says there's nothing to repack (perhaps […]).
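If anyone wants to retry Stefan's suggestion: plain git repack only packs loose objects, so a forced full repack is what would actually recompute deltas across the builds. A minimal sketch, with arbitrary window/depth values:

```sh
# Force git to re-delta everything across all commits/builds in the repo.
# The --window/--depth values here are only illustrative, not tuned settings.
git repack -a -d -f --window=250 --depth=250
git count-objects -v   # check the resulting pack size
```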
How about testing LZ4 or LZHAM? I suspect they won't compress as well, but they are supposed to be very fast at decompressing, so the trade-off might be worth it.
@MasterDuke17 I've added lz4 (and a bunch of other stuff) to the main post. LZ4 is actually a very good finding, thank you very much. Indeed, we should probably forget about space savings and think about decompression speed instead. How long does it take to decompress one build compressed with […]? So let's see how fast things decompress:
Almost everything is with default options, so feel free to recommend something specific. As stupid as it sounds, brotli is a clear winner right now (UPDATE: nope. See next comment). It is a bit slow during compression, but I don't mind it at all.
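A timing loop along these lines would reproduce the comparison (this is not the exact script that produced the numbers above; archive names are made up, and the flags assume the current xz/lz4/brotli/7z CLIs plus GNU time):

```sh
# Time decompression-to-/dev/null for each candidate format.
for cmd in 'xz -dc build.tar.xz' \
           'lz4 -dc build.tar.lz4' \
           'brotli -dc build.tar.br' \
           '7z x -so build.7z'; do
    echo "== $cmd"
    /usr/bin/time -f '%es' sh -c "$cmd > /dev/null"
done
```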
We have a new winner: https://github.com/Cyan4973/zstd ≈0m0.130s decompression, ≈4.9 MB size, compression faster than brotli. Basically, it is a winner on all criteria except for file size, and it is only ≈0.4 MB worse. Where is the catch?? We can tweak it a bit by using a different compression level. The numbers above are with the max level (22), but we can make it ≈10 ms faster by sacrificing ≈0.8 MB (level 15). I don't care about either of these.
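In zstd CLI terms, the trade-off above is roughly this (levels above 19 need --ultra; the file names are hypothetical):

```sh
zstd --ultra -22 build.tar -o build-22.tar.zst   # max ratio, slowest compression
zstd -15         build.tar -o build-15.tar.zst   # ≈0.8 MB bigger, ≈10 ms faster to unpack
time zstd -dc build-22.tar.zst > /dev/null       # decompression speed is what matters here
```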
@AlexDaniel: were you using "plain" […]?
@xenu […]
By the way, I found this blog post very interesting: http://fastcompression.blogspot.com/2013/12/finite-state-entropy-new-breed-of.html
Also, bash scripts were replaced with Perl 6 code. zstd was chosen for its fast decompression speed and good compression ratio; see issue #23 for more info. Rakudo is now being installed into “/tmp”. This way it will be easier for others to extract archives on their system if we ever publish them (because /tmp is writable on almost any system). For example, we can let people run git bisect locally by downloading prepared builds. This means that all previous builds have to be rebuilt. That's not a big deal, but it will take up to three days to process. Another side effect of these changes is that bots will not see new commits until they are actually built. Now, when running something on HEAD, you can be sure that the commit HEAD points to is actually ready to roll. No more “whoops, try using some older commit” error messages. Currently it is very unstable and more testing is required.
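Grabbing a prepared build for a local bisect could then look roughly like this (the archive name and directory layout are hypothetical, just to illustrate the /tmp idea):

```sh
# Unpack one prebuilt Rakudo into /tmp and run it from there.
zstd -dc rakudo-moar-1234abc.tar.zst | tar -x -C /tmp
/tmp/rakudo-moar-1234abc/bin/perl6 --version
```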
OK, so this has been implemented some time ago along with other major changes. Given that everything is written in 6lang now, some things tend to segfault sometimes… but otherwise everything is fine. At least, compression is definitely there, so I am closing this.
Another article about Zstandard: https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/
Some news! I never liked the dependency on lrzip […]. I have updated the post, but basically this is my finding:
So […]. However, what I actually care about is decompression speed (because we need fast access to slowly accumulating builds):
So […]. This test is done using 7 builds in a single archive, but I guess the next step is to increase the number of builds in zstd archives until I reach the same decompression speed, then compare the ratio.
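For anyone who wants to reproduce the comparison, the two setups are roughly the following (directory and file names are made up; the flags are the stock mkdwarfs/zstd CLI options, not necessarily the ones used for the numbers above):

```sh
# One dwarfs image over all builds, FUSE-mounted read-only:
mkdwarfs -i builds/ -o builds.dwarfs
dwarfs builds.dwarfs /mnt/builds

# versus one zstd archive with long-distance matching across builds:
tar -cf - builds/ | zstd -19 --long=27 -o builds.tar.zst
```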
(log) Fantastic find! We should definitely bench it one day.
I think using zstd for everything is a good idea. It'd make some code paths more generic, and it'll drop the dependency on lrzip. dwarfs is fine locally, but whateverable is also serving files for Blin and other remote usages, so zstd is still needed.
Closing this in favor of #389.
If the whole architecture was being redone from scratch, using dwarfs for local storage and compressing on the fly with zstd when serving files would be an interesting experiment.
Yes and no 🤷🏼. Depending on how much you're using the mothership remotely, it might be that you'll have lots of archives locally. If they're compressed in long-range mode, your local setup will be as efficient (storage-wise) as the remote one. Compressing on the fly in long-range mode doesn't work (it takes roughly a minute to do that for 20 builds). Of course, nothing stops you from using dwarfs locally, but the current system with archives seems simpler.
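One practical note if you do keep long-range archives locally: an archive created with a window above zstd's 128 MB decompression default (e.g. --long=31) needs the matching flag on the receiving side too, roughly:

```sh
# Decompress a long-range archive; --long must match (or exceed) the window
# used at compression time, otherwise zstd refuses with a window-size error.
zstd -d --long=31 -c builds.tar.zst | tar -x -C /tmp
```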
OK, so here is the math!
Each build uncompressed is ≈28 MB. This does not include unnecessary source files or anything, so it does not look like we can go lower than that (unless we use compression).
So how many builds do we want to keep? Well, one year from now back is about 3000 commits. At ≈28 MB each, this gives roughly 84 GB. And this is just one year of builds for just the MoarVM backend. In about 10 years we will slowly start approaching the 1 TB mark. Multiply that by the number of backends we want to track. Is it a lot? Well, no, but for folks who have an SSD (me) this might be a problem.
Given that people commit stuff at a slightly faster pace than storage gets significantly cheaper, I think that we should compress it anyway (even if it is moved to a server with more space). It is a good idea in the long run. And it will make it easier for us to throw in some extra stuff (JVM builds, or maybe even 32-bit builds or something? You never know).
OK, so what can we do?
Filesystem compression
The most obvious one is to use compression in btrfs. The problem is that it is applied to each file individually, so we are not going to save anything across many builds. Also, it is only viable if you already have btrfs, so it does not look like the best option.
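For completeness, enabling it is just a mount option (zlib and lzo being the algorithms btrfs supports at the time of writing; the device and path below are placeholders):

```sh
# Transparent per-file compression on a btrfs mount.
mount -o compress=zlib /dev/sdXN /srv/whateverable
# Compress data that is already on the filesystem:
btrfs filesystem defragment -r -czlib /srv/whateverable
```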
Compress each build individually
While it may sound like a great idea to compress all builds together, it does not work that well in practice. Well, it does, but keep reading.
The best compression I got is with 7z. Each build is ≈4 MB (≈28 MB uncompressed, therefore a ≈7× space saving!).
Compressing each build individually is also good for things like local bisect. That is, we can make these archives publicly available and then write a script that will pull them for your local git bisect. How cool is that! That will be about 40 MB of stuff to download per git bisect, and you cannot really compress it any further anyway, because you don't know ahead of time which files you will need.
This gives us ≈120 GB per 10 years. Good enough, I like it.
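Roughly what the per-build archives and the bisect helper would boil down to (archive and directory names are made up):

```sh
# Compress one build at maximum 7z settings (≈28 MB -> ≈4 MB) ...
7z a -mx=9 rakudo-1234abc.7z rakudo-1234abc/
# ... and unpack it again for a local bisect run.
7z x rakudo-1234abc.7z -o/tmp
```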
Is there anything that performs better than 7z? Well, yes: xz compresses slightly better, but it is much slower during compression and a tiny bit slower when decompressing, so the win is insignificant. lrz with extreme options is much slower at everything, so forget it.
Let's compress everything together!
For the tests, I took only 7 builds:
Now, there are some funny entries here. Obviously, .tar is the size of all builds together, uncompressed. git-repo is a git repo with every build committed one after another in the right order. Look, it performs better than some of the other things! Amazing! And even better if you compress it afterwards. Who would've thought, right? However, lrz clearly wins this time. And that's with default settings! Wow.
Conclusion
I think that there are ways to fiddle with options to get even better results. Suggestions are welcome!
However, at this point it looks like the best way is to use 7z to compress each build individually. Messing around with one huge archive is probably not worth the savings: it would make decompression significantly slower, and we want decompression to be as fast as possible.