-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for zstd compression #14706
base: develop2
Are you sure you want to change the base?
Conversation
This change adds zstd support to conan in the following ways: 1. The person or build running `conan upload` can set a config value core.upload:compression_format = zstd to upload binaries using zstd instead of gzip. 2. The zstd compression is done entirely in Python using a combination of tarfile and python-zstandard. Then the file is uploaded as normal. 3. When downloading packages, if a .tar.zst file is encountered, the extraction code uses tarfile and python-zstandard to extract. I chose python-zstandard as the library because that is what urllib3 uses.
Because zstd decompression is expected to just work if the server has a .tar.zst file, I am including zstandard in requirements.txt. https://python-zstandard.readthedocs.io/en/latest/projectinfo.html#state-of-project recommends that we "Pin the package version to prevent unwanted breakage when this change occurs!", although I doubt that much will change before an eventual 1.0.
CI is unable to find 0.21.0
I am working through my company's CLA approval process and hope to sign it by end of day today. In the meantime, I wrote a script to test compression and decompression of a test folder using various gzip and zstd compression levels and ran it overnight on a 7.1GB folder with 16000 files. I put the script here in case you all find it useful: https://gist.github.com/grossag/525f3cdaf7d985b625a38df55a7c9087 Run on a VM using shared NAS storage:
My work laptop has widely varying performance right now, where zstd decompression of the same files jumps between 47 and 60 seconds. So here are the first results which I need to rerun:
zstd is interesting because my results are showing that decompression time doesn't change as you increase the compression level, maybe with the exception of the really high levels 20-22. But overall my results are summarized as: on both machines, zstd level 5 shows a size reduction of 30% and a decompression time reduction of 35-40% as compared to gzip level 9. |
Looks like my virus scanner was causing the high variance in hash performance testing. Here are some results, comparing zstd level 9 with gzip level 9 on my laptop: Boost (1.1GB and 15000 files):
Compiler toolset (7.1 GB and 16000 files):
So my tests are still showing 20-50% improvements in decompression time. |
1. Change requirements.txt to allow either zstandard 0.20 or 0.21. That prevents a downgrade for people who already have 0.21 installed, while also allowing CI to find 0.20. 2. Move compressformat parameter earlier in compress_files() function. It made a bit more sense to have it earlier; as long as consumers are correctly using positional kwargs, it shouldn't break anyone.
one way or the other I'll have to implement this + more for my org eventually-- can this be changed to be done in an expandable manner? Something like:
maybe some other variations... bit of a hard problem to make this workable for everyone. |
Hey, thanks for the review! What is your use case that requires this additional customization? I found that using a separate tar process made things difficult to manage and was actually a tiny bit slower than in-proc python-zstandard. |
With respect to native vs non-native (subprocess or python-level), my experience has unfortunately been that the parallel downloads feature is "not actually parallel" because tar extraction is not parallel (partially due to the GIL, partially due to how the code is structured). This was tested on conan 1.5X. On large binary packages and no other core/job restrictions, While at a previous org, monkeypatch-experimenting (because everything is python, yay!) to replace with a native call to This isn't to say I don't want this feature you've written, I do! But with conan's committal to backwards compatibility in 2.0, I would expect that config options need to either have a lot of granularity in order to suffice for future use cases (for example, some binary packaged data that I've had played with over the past year suggests that I'm less so asking for additional customization right now and more so for the config to be structured so that additional customization can be added later. Ex |
This is something I would want direction from the maintainers on if I was to do. In the discussion with @memsharded in #648 , the idea of deferring compression and decompression to native tools was not ideal because of testing and compatibility concerns. That is why I tried to do this in all Python code. I am very happy with the Python zstd decompression performance so far across Windows, Mac, and Linux. The python-zstandard library releases the GiL before calling into the Zstd C library, so you aren’t losing performance there. This change represents the most minimal one I could do while still accomplishing my goals. On the GH issue I referenced, accepting this PR was not guaranteed so I wanted to limit the complexity to show that it is supportable long-term. |
Fair enough; I'm mainly concerned about other package types and how this interacts with compatibility concerns. To be clear I'm not suggesting you implement all of these methods right now, just for the config key chosen to be extensible for the future. |
7d586bc
to
a33394d
Compare
This still ain't looking right. This is under performing by a factor 2-3x compared to what it should look like. There is likely some major overhead as tarfile is serializing an embarrassingly parallel problem by hopping between blocking file system accesses and decompression in a single thread... |
1. Fix bad merge causing uploader.py change to still refer to `self._app.cache.new_config`, when now we are supposed to use `self._global_conf`. 2. Change two output calls in uploader.py to only output the package file basename to be consistent with other existing log lines. 3. Use double quotes instead of single quotes to be more consistent with existing code.
1. Downgrade bufsize to 32KB because that performs well for compression and decompression. The values don't need to be the same, but it happened to be the best value in both compression and decompression tests. 2. Use a context manager for stream_reader as I do for stream_writer. 3. Add some comments about the bufsize value.
@memsharded Are you able to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution, sorry that we haven't been able to have time to review this.
This PR as is, is looking a bit risky, one of the main reasons the addition of the new zstandard
library dependency. It is likely that it might be better added as conditional requirement (and protect the import of it with a try-except
with a clear message).
But I'd say that it is not impossible to move it forward, based on the diff, I think the code changes risk might be controlled. Please check the comments.
Thanks again for your contribution.
if f not in zipped_files: | ||
raise ConanException(f"Corrupted {pref} in '{remote.name}' remote: no {f}") | ||
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME] | ||
package_file = next((f for f in zipped_files if f in accepted_package_files), None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, a package could contain both compressed artifacts, but it will prioritize and only download the zstd one if existing?
Wouldn't it be a bit less confusing to not allow to have both compressed formats artifacts in the same package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A package is only supposed to contain one. Let's say an organization switches to zstd compression on Jan 1 2025. The expectation would be that packages produced before then would have .tgz
extension and packages produced after then would have .tzst
extension. I would like to avoid producing both because it would result in unnecessary storage usage in Artifactory.
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME] | ||
accepted_files = ["conaninfo.txt", "conanmanifest.txt", "metadata/sign"] | ||
for f in accepted_package_files: | ||
if f in server_files: | ||
accepted_files = [f] + accepted_files | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we assumed there can only be 1 compressed artifact in one of the formats, this would be simplified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I think I'm missing what you are saying here. I don't have the context about if/how these accepted files changed over time. But Artifactory would only have .tgz or .tzst, not both. If that means we can simplify this a bit, that's fine with me.
Still need to do some testing though.
Newer Python has this warning: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Squashed version of PR conan-io#14706 as of 03/10/2024.
Changelog: (Feature): Add support for zstd compression
Docs: Will create one if this PR is acceptable
develop
branch, documenting this one.As discussed in issue #648, this change adds zstd support to conan in the following ways:
conan upload
can set a config valuecore.upload:compression_format = zstd
to upload binaries using zstd instead of gzip.tarfile
andpython-zstandard
. Then the file is uploaded as normal.tarfile
andpython-zstandard
to extract.I chose python-zstandard as the library because that is what urllib3 uses. The package has not yet hit 1.0 but urllib3 is a mature package and it says a lot to me that they chose python-zstandard.
I apologize in advance if I'm missing important parts of the developer workflow. If this approach is acceptable, I'll create a docs PR as requested.
Developer docs on all branches say to open pull requests against
develop
but AFAICT that is Conan 1.x. I'm opening this againstdevelop2
instead because that appears to be Conan 2.x; I hope that's the right thing to do.