Add support for zstd compression #14706

Open. grossag wants to merge 15 commits into base branch develop2.

grossag commented Sep 8, 2023

Changelog: (Feature): Add support for zstd compression
Docs: Will create one if this PR is acceptable

  • Refer to the issue that supports this Pull Request.
  • If the issue has missing info, explain the purpose/use case/pain/need that covers this Pull Request.
  • I've read the Contributing guide.
  • I've followed the PEP8 style guides for Python code.
  • I've opened another PR in the Conan docs repo to the develop branch, documenting this one.

As discussed in issue #648, this change adds zstd support to conan in the following ways:

  1. The person or build running conan upload can set a config value core.upload:compression_format = zstd to upload binaries using zstd instead of gzip.
  2. The zstd compression is done entirely in Python using a combination of tarfile and python-zstandard. Then the file is uploaded as normal.
  3. When downloading packages, if a .tar.zst file is encountered, the extraction code uses tarfile and python-zstandard to extract.
  4. Adds a test to cover zstd compression and decompression.
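The compression and extraction steps in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names `compress_tzst` and `extract_tzst` are hypothetical, and it assumes the third-party python-zstandard package is installed (it is imported lazily so the module still loads without it).

```python
import os
import tarfile


def compress_tzst(src_dir, dest_path, level=3):
    """Pack the entries of src_dir into a zstd-compressed tar stream."""
    import zstandard  # third-party python-zstandard, imported lazily

    cctx = zstandard.ZstdCompressor(level=level)
    with open(dest_path, "wb") as raw, cctx.stream_writer(raw) as zst:
        # "w|" writes a non-seekable tar stream straight into the compressor.
        with tarfile.open(fileobj=zst, mode="w|") as tar:
            for name in sorted(os.listdir(src_dir)):
                tar.add(os.path.join(src_dir, name), arcname=name)


def extract_tzst(archive_path, dest_dir):
    """Decompress a .tar.zst / .tzst file and unpack it into dest_dir."""
    import zstandard

    os.makedirs(dest_dir, exist_ok=True)
    dctx = zstandard.ZstdDecompressor()
    # Only pass filter="data" on Pythons whose tarfile knows about filters.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with open(archive_path, "rb") as raw, dctx.stream_reader(raw) as zst:
        with tarfile.open(fileobj=zst, mode="r|") as tar:
            tar.extractall(dest_dir, **extra)
```

Both functions stream through tarfile, so the uncompressed tar never has to exist on disk as a separate file.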

I chose python-zstandard as the library because that is what urllib3 uses. The package has not yet hit 1.0 but urllib3 is a mature package and it says a lot to me that they chose python-zstandard.

I apologize in advance if I'm missing important parts of the developer workflow. If this approach is acceptable, I'll create a docs PR as requested.

Developer docs on all branches say to open pull requests against develop but AFAICT that is Conan 1.x. I'm opening this against develop2 instead because that appears to be Conan 2.x; I hope that's the right thing to do.

This change adds zstd support to conan in the following ways:
1. The person or build running `conan upload` can set a config value
   core.upload:compression_format = zstd
   to upload binaries using zstd instead of gzip.
2. The zstd compression is done entirely in Python using a combination
   of tarfile and python-zstandard. Then the file is uploaded as normal.
3. When downloading packages, if a .tar.zst file is encountered, the
   extraction code uses tarfile and python-zstandard to extract.

I chose python-zstandard as the library because that is what urllib3 uses.
Because zstd decompression is expected to just work if the server has a
.tar.zst file, I am including zstandard in requirements.txt.
https://python-zstandard.readthedocs.io/en/latest/projectinfo.html#state-of-project
recommends that we "Pin the package version to prevent unwanted breakage when this
change occurs!", although I doubt that much will change before an eventual 1.0.
CLAassistant commented Sep 8, 2023

CLA assistant check
All committers have signed the CLA.

CI is unable to find 0.21.0

grossag commented Sep 11, 2023

I am working through my company's CLA approval process and hope to sign it by end of day today. In the meantime, I wrote a script to test compression and decompression of a test folder using various gzip and zstd compression levels and ran it overnight on a 7.1GB folder with 16000 files. I put the script here in case you all find it useful: https://gist.github.com/grossag/525f3cdaf7d985b625a38df55a7c9087
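For readers who do not want to open the gist, the kind of measurement described here can be sketched with the standard library alone. This is not the gist's script: `benchmark_gzip` is a hypothetical helper, it covers only gzip (zstd would need python-zstandard), and it reports the same three quantities as the results below.

```python
import os
import statistics
import tarfile
import tempfile
import time


def benchmark_gzip(src_dir, level, repeats=3):
    """Time tar+gzip compression of src_dir and repeated extraction."""
    # Only pass filter="data" on Pythons whose tarfile knows about filters.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tempfile.TemporaryDirectory() as work:
        archive = os.path.join(work, "bench.tgz")

        start = time.perf_counter()
        with tarfile.open(archive, "w:gz", compresslevel=level) as tar:
            tar.add(src_dir, arcname="payload")
        compress_s = time.perf_counter() - start
        size_bytes = os.path.getsize(archive)

        times = []
        for i in range(repeats):
            out_dir = os.path.join(work, f"out{i}")
            start = time.perf_counter()
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(out_dir, **extra)
            times.append(time.perf_counter() - start)

    return {
        "level": level,
        "compress_s": compress_s,
        "size_bytes": size_bytes,
        "decompress_mean_s": statistics.mean(times),
        "decompress_median_s": statistics.median(times),
        "decompress_stdev_s": statistics.stdev(times) if repeats > 1 else 0.0,
    }
```

Calling it once per level, for example `[benchmark_gzip(path, lvl) for lvl in (7, 8, 9)]`, gives one result row per level.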

Run on a VM using shared NAS storage:

gzip level 7:
	- Compression time: 554.04 seconds
	- Compression size: 1.988 GB
	- Decompression times in seconds: 89.50 mean, 88.05 median, 2.04 stdev
gzip level 8:
	- Compression time: 1079.89 seconds
	- Compression size: 1.978 GB
	- Decompression times in seconds: 87.20 mean, 87.05 median, 1.84 stdev
gzip level 9 (default compression level):
	- Compression time: 2080.31 seconds
	- Compression size: 1.976 GB
	- Decompression times in seconds: 85.34 mean, 86.40 median, 3.10 stdev

zstd level 3 (default compression level):
	- Compression time: 136.69 seconds
	- Compression size: 1.498 GB
	- Decompression times in seconds: 54.33 mean, 54.14 median, 4.04 stdev
zstd level 4:
	- Compression time: 125.80 seconds
	- Compression size: 1.478 GB
	- Decompression times in seconds: 52.21 mean, 52.98 median, 2.19 stdev
zstd level 5:
	- Compression time: 115.56 seconds
	- Compression size: 1.398 GB
	- Decompression times in seconds: 50.87 mean, 51.97 median, 4.49 stdev

My work laptop has widely varying performance right now, where zstd decompression of the same files jumps between 47 and 60 seconds. So here are the first results which I need to rerun:

gzip level 9 (default compression level):
	- Compression time: 1144.69 seconds
	- Compression size: 1.975 GB
	- Decompression times in seconds: 94.60 mean, 97.29 median, 8.86 stdev
zstd level 5:
	- Compression time: 69.69 seconds
	- Compression size: 1.394 GB
	- Decompression times in seconds: 55.63 mean, 59.06 median, 8.00 stdev

zstd is interesting because my results show that decompression time doesn't change as you increase the compression level, with the possible exception of the really high levels 20-22. Overall, my results can be summarized as follows: on both machines, zstd level 5 shows a size reduction of 30% and a decompression time reduction of 35-40% compared to gzip level 9.


grossag commented Sep 11, 2023

Looks like my virus scanner was causing the high variance in hash performance testing. Here are some results, comparing zstd level 9 with gzip level 9 on my laptop:

Boost (1.1GB and 15000 files):

gzip level 9 (default compression level):
	- Compression time: 178.97 seconds
	- Compression size: 192.741 MB
	- Decompression times in seconds: 9.89 mean, 9.95 median, 0.09 stdev
zstd level 9:
	- Compression time: 12.34 seconds
	- Compression size: 130.144 MB
	- Decompression times in seconds: 6.99 mean, 7.01 median, 0.10 stdev

Compiler toolset (7.1 GB and 16000 files):

gzip level 9 (default compression level):
	- Compression time: 1473.80 seconds
	- Compression size: 1.975 GB
	- Decompression times in seconds: 34.64 mean, 34.54 median, 0.23 stdev
zstd level 9:
	- Compression time: 91.72 seconds
	- Compression size: 1.258 GB
	- Decompression times in seconds: 17.33 mean, 16.95 median, 1.01 stdev

So my tests are still showing 20-50% improvements in decompression time.
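The relative improvements can be recomputed directly from the laptop figures above. The snippet below only does that arithmetic; the `reduction` helper and the dictionary layout are illustrative, and sizes are kept in the units reported for each data set (MB for Boost, GB for the compiler toolset).

```python
def reduction(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Figures copied from the gzip level 9 vs zstd level 9 laptop runs above.
results = {
    "Boost": {
        "gzip": {"compress_s": 178.97, "size": 192.741, "decompress_s": 9.89},
        "zstd": {"compress_s": 12.34, "size": 130.144, "decompress_s": 6.99},
    },
    "Compiler toolset": {
        "gzip": {"compress_s": 1473.80, "size": 1.975, "decompress_s": 34.64},
        "zstd": {"compress_s": 91.72, "size": 1.258, "decompress_s": 17.33},
    },
}

for name, run in results.items():
    gz, zs = run["gzip"], run["zstd"]
    print(
        f"{name}: size -{reduction(gz['size'], zs['size']):.1f}%, "
        f"decompression -{reduction(gz['decompress_s'], zs['decompress_s']):.1f}%, "
        f"compression {gz['compress_s'] / zs['compress_s']:.1f}x faster"
    )
```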

1. Change requirements.txt to allow either zstandard 0.20 or 0.21. That
   prevents a downgrade for people who already have 0.21 installed, while
   also allowing CI to find 0.20.
2. Move compressformat parameter earlier in compress_files() function.
   It made a bit more sense to have it earlier; as long as callers are
   passing these parameters as keyword arguments, it shouldn't break anyone.
13steinj commented

one way or the other I'll have to implement this + more for my org eventually--

can this be changed to be done in an expandable manner? Something like:

core.packager.binaries.compressor.native = false # one of true, false; true uses a command
core.packager.binaries.compressor = gzip  # one of pigz, gzip, bzip2, xz, lzip, lzma, lzop, zstd; if native, also arbitrary
core.packager.binaries.compressor.suffix = auto  # one of auto, or if compressor.native AND unknown compressor, custom defaulting to first word of compressor (program)
core.packager.binaries.archiver = python, native # one of python, tar. Defaults to decompressor, native defaults to tar. 
core.packager.binaries.decompressor = python # one of python, tar (auto detect default), or native (based off of suffix)
core.packager.binaries.dearchiver = python  # one of python, native (tar). Defaults to decompressor (tar and native defaults to tar). 

maybe some other variations... bit of a hard problem to make this workable for everyone.


grossag commented Sep 27, 2023

Hey, thanks for the review! What is your use case that requires this additional customization? I found that using a separate tar process made things difficult to manage and was actually a tiny bit slower than in-proc python-zstandard.

13steinj commented

With respect to native vs non-native (subprocess or python-level), my experience has unfortunately been that the parallel downloads feature is "not actually parallel" because tar extraction is not parallel (partially due to the GIL, partially due to how the code is structured). This was tested on conan 1.5X. On large binary packages and with no other core/job restrictions, parallel_downloads was fastest when set to 2 rather than 16, with a large amount of time wasted in tarfile 😢 .

While at a previous org, monkeypatch-experimenting (because everything is python, yay!) to replace with a native call to tar + pigz (so called "fake" parallelism for decompression) was faster, and I expect pugz to be even faster.

This isn't to say I don't want this feature you've written, I do! But with conan's commitment to backwards compatibility in 2.0, I would expect that config options need to have a lot of granularity in order to suffice for future use cases (for example, some binary packaged data that I've played with over the past year suggests that bz2 is optimal instead).

I'm asking less for additional customization right now and more for the config to be structured so that additional customization can be added later. For example, core.upload may be a poor choice, and there is already, unfortunately, core.gzip:compresslevel, which I assume would be better off under some sub-namespace but now has to work for the foreseeable future.


grossag commented Oct 10, 2023

This is something I would want direction from the maintainers on if I were to do it. In the discussion with @memsharded in #648, the idea of deferring compression and decompression to native tools was not ideal because of testing and compatibility concerns. That is why I tried to do this entirely in Python code. I am very happy with the Python zstd decompression performance so far across Windows, Mac, and Linux. The python-zstandard library releases the GIL before calling into the Zstd C library, so you aren't losing performance there.

This change represents the most minimal one I could do while still accomplishing my goals. On the GH issue I referenced, accepting this PR was not guaranteed so I wanted to limit the complexity to show that it is supportable long-term.
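To make the GIL discussion concrete, here is a minimal sketch of extracting several archives concurrently with threads. It is not code from this PR or from Conan: `extract_one` and `extract_many` are hypothetical names, and it uses the formats tarfile supports natively so it needs only the standard library. Whether threads actually help depends on how much time is spent in C code that releases the GIL versus Python-level tarfile bookkeeping, which is exactly the point under debate here.

```python
import os
import tarfile
from concurrent.futures import ThreadPoolExecutor


def extract_one(archive_path, dest_dir):
    """Extract a single tar archive (any compression tarfile can detect)."""
    os.makedirs(dest_dir, exist_ok=True)
    # Only pass filter="data" on Pythons whose tarfile knows about filters.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir, **extra)
    return dest_dir


def extract_many(jobs, max_workers=4):
    """Extract (archive_path, dest_dir) pairs using a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_one, a, d) for a, d in jobs]
        # result() re-raises any exception from the worker thread.
        return [f.result() for f in futures]
```

Timing `extract_many` with `max_workers=1` against larger values on real packages would show how much parallelism a given archive format actually gets.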

13steinj commented

Fair enough; I'm mainly concerned about other package types and how this interacts with compatibility concerns.

To be clear I'm not suggesting you implement all of these methods right now, just for the config key chosen to be extensible for the future.

exjam force-pushed the topic/grossag/zstd3 branch from 7d586bc to a33394d on October 17, 2023 22:56

Ext3h commented Oct 31, 2023

  • Decompression times in seconds: 17.33 mean, 16.95 median, 1.01 stdev

This still ain't looking right. This is underperforming by a factor of 2-3x compared to what it should look like.

There is likely some major overhead as tarfile is serializing an embarrassingly parallel problem by hopping between blocking file system accesses and decompression in a single thread...

grossag added 6 commits July 22, 2024 10:46
1. Fix bad merge causing uploader.py change to still refer to `self._app.cache.new_config`, when now we are supposed to use `self._global_conf`.
2. Change two output calls in uploader.py to only output the package file basename to be consistent with other existing log lines.
3. Use double quotes instead of single quotes to be more consistent with existing code.
1. Downgrade bufsize to 32KB because that performs well for compression and
   decompression. The values don't need to be the same, but it happened to be
   the best value in both compression and decompression tests.
2. Use a context manager for stream_reader as I do for stream_writer.
3. Add some comments about the bufsize value.

grossag commented Aug 26, 2024

@memsharded Are you able to review this PR?

memsharded (Member) left a comment

Thanks for your contribution, and sorry that we haven't had time to review this.

This PR, as is, looks a bit risky. One of the main reasons is the addition of the new zstandard library dependency. It would likely be better to add it as a conditional requirement (and protect its import with a try-except that gives a clear message).

But I'd say it is not impossible to move it forward. Based on the diff, I think the risk of the code changes can be controlled. Please check the comments.

Thanks again for your contribution.
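A guarded import along the lines suggested in this review could look roughly like the sketch below. It is only an illustration: `load_zstandard` and `MissingCompressionLibrary` are hypothetical names (Conan itself would presumably raise its own `ConanException`), and the `module_name` parameter exists only so the helper can be exercised without the real package.

```python
import importlib


class MissingCompressionLibrary(RuntimeError):
    """Hypothetical error type standing in for Conan's own exception class."""


def load_zstandard(module_name="zstandard"):
    """Import the optional zstd binding, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingCompressionLibrary(
            f"zstd support needs the optional '{module_name}' package; "
            f"install it with 'pip install {module_name}' or keep using gzip"
        ) from exc
```

Calling the helper only at the point where a zstd archive is about to be written or read keeps gzip-only installations working without the extra dependency.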

if f not in zipped_files:
    raise ConanException(f"Corrupted {pref} in '{remote.name}' remote: no {f}")
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME]
package_file = next((f for f in zipped_files if f in accepted_package_files), None)
Member

Basically, a package could contain both compressed artifacts, but it will prioritize and only download the zstd one if it exists?

Wouldn't it be a bit less confusing to disallow having artifacts in both compressed formats in the same package?

Author

A package is only supposed to contain one. Let's say an organization switches to zstd compression on Jan 1 2025. The expectation would be that packages produced before then would have .tgz extension and packages produced after then would have .tzst extension. I would like to avoid producing both because it would result in unnecessary storage usage in Artifactory.
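The selection behaviour discussed in this thread can be sketched as follows. This is not the PR's code: `pick_package_file` and its `allow_both` flag are hypothetical, and the value of `PACKAGE_TZSTD_NAME` is an assumption based on the `.tzst` extension mentioned above. Iterating over the accepted names, rather than over the server's file list, is what makes the preference for zstd independent of the order in which the server lists files.

```python
PACKAGE_TGZ_NAME = "conan_package.tgz"
PACKAGE_TZSTD_NAME = "conan_package.tzst"  # assumed name, not confirmed here

# Order expresses preference: zstd first, then gzip.
ACCEPTED_PACKAGE_FILES = (PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME)


def pick_package_file(server_files, allow_both=True):
    """Return the compressed package artifact to download, or None."""
    present = [name for name in ACCEPTED_PACKAGE_FILES if name in server_files]
    if not present:
        return None
    if len(present) > 1 and not allow_both:
        # Stricter variant raised in review: one format per package only.
        raise ValueError(f"Package has multiple compressed artifacts: {present}")
    return present[0]
```

With `allow_both=False` the helper enforces the "only one compressed artifact" assumption instead of silently preferring one format.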

Comment on lines +84 to +89
accepted_package_files = [PACKAGE_TZSTD_NAME, PACKAGE_TGZ_NAME]
accepted_files = ["conaninfo.txt", "conanmanifest.txt", "metadata/sign"]
for f in accepted_package_files:
    if f in server_files:
        accepted_files = [f] + accepted_files
        break
Member

If we assumed there can only be 1 compressed artifact in one of the formats, this would be simplified?

Author

Sorry, I think I'm missing what you are saying here. I don't have the context about if/how these accepted files changed over time. But Artifactory would only have .tgz or .tzst, not both. If that means we can simplify this a bit, that's fine with me.

Still need to do some testing though.
Newer Python has this warning:

DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives
and reject files or modify their metadata. Use the filter argument to control
this behavior
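One way to address this warning is to pass the `filter` argument explicitly wherever the interpreter supports it. The sketch below is an assumption about how that could be done, not the change made in this PR; `safe_extract` is a hypothetical helper.

```python
import os
import tarfile


def safe_extract(archive_path, dest_dir):
    """Extract a tar archive, opting in to the 'data' filter when available."""
    os.makedirs(dest_dir, exist_ok=True)
    # tarfile.data_filter only exists on Pythons that accept filter=...;
    # older interpreters would raise TypeError on the unknown keyword.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir, **extra)
```

With the "data" filter active, members that would be written outside `dest_dir` are rejected with a `tarfile.FilterError` subclass rather than extracted.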
bentonj-omnissa added commits to omnissa-oss-forks/conan that referenced this pull request on Oct 6, Oct 7, Nov 20, Nov 25, Dec 3, Dec 16, Dec 17, and Dec 20, 2024, each described as: Squashed version of PR conan-io#14706 as of 03/10/2024.