Store schemas compressed on disk. #2365

Open
adamchainz opened this issue Apr 27, 2021 · 19 comments
Labels
feature-request This issue requests a feature. needs-discussion p2 This is a standard priority issue

Comments

@adamchainz
Contributor

Is your feature request related to a problem? Please describe.

The data directory of a botocore install is over 50MB. The JSON inside compresses really well - we can see this from the fact that the PyPI packages are just 7MB.

Describe the solution you'd like

It would be good to keep the schemas compressed on disk and only decompress them when reading into memory. This would save disk space, and probably a little time too, since decompressing is likely to be faster than reading all the uncompressed bytes from disk.

Python's zlib or zipfile modules in the standard library could be used.
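
For instance, reading a model back would be as simple as this (a minimal sketch; the path is illustrative and assumes the file was gzipped in place):

import gzip
import json

with gzip.open('data/ec2/2016-11-15/service-2.json.gz', 'rb') as f:
    model = json.load(f)
print(sorted(model)[:5])  # top-level keys of the decompressed model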

For an example of a library shipping data in a zip file, see my heroicons package: https://github.com/adamchainz/heroicons/blob/main/src/heroicons/__init__.py

@adamchainz adamchainz added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Apr 27, 2021
@stobrien89

Hi @adamchainz,

Thanks for the feature request! I'll review this with the team, although I can't make any guarantees as to when/if this will be implemented.

@stobrien89 stobrien89 removed the needs-triage This issue or PR still needs to be triaged. label Apr 29, 2021
@kdaily
Member

kdaily commented May 1, 2021

@adamchainz,

This is an interesting idea. This has been noted previously in a similar scenario with the AWS CLI as well:

aws/aws-cli#5725

The AWS SDKs consume the API models from upstream. Changing the way that they are stored and accessed would be a significant feature. One drawback would be the lack of direct human readability of the API models that are currently available in the Python SDK. It would be difficult to see where API changes were introduced between versions of the SDK. For example, removing the documentation strings from the models would cut 20MB off of the size, which might be useful in a CI/CD environment.
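
(For context: the service models keep their doc strings under "documentation" keys, so stripping them is a small recursive transform. A rough sketch, not an official tool:)

import json

def strip_docs(obj):
    # Drop every "documentation" key, recursing through dicts and lists.
    if isinstance(obj, dict):
        return {k: strip_docs(v) for k, v in obj.items() if k != 'documentation'}
    if isinstance(obj, list):
        return [strip_docs(v) for v in obj]
    return obj

with open('service-2.json') as f:  # illustrative path
    slim = strip_docs(json.load(f))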

Do you have specific scenarios of your own that a slimmed-down version would help with?

@adamchainz
Contributor Author

It would be difficult to see where API changes were introduced between versions of the SDK.

One can use the textconv git attribute in the repo to have git decompress the files before comparing them.
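
For example (a minimal sketch, assuming the models are stored as .json.gz), in .gitattributes:

*.json.gz diff=gzip

and in .git/config or ~/.gitconfig:

[diff "gzip"]
    textconv = gzip -dc

git diff and git log -p would then show changes against the decompressed contents.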

Do you have specific scenarios of your own that a slimmed-down version would help with?

This affects me in a couple ways:

  1. I bundle boto3 into my lambda functions so I can pin an exact version. The occasional API change can break code. Bundling botocore takes a function over the 50MB limit, which requires an upload to S3 rather than directly to Lambda, and prevents the console code editor from working.

  2. I have maybe 30 projects using boto3/botocore, each with their own virtual environment. This means I have 1.5GB of botocore, which isn't a great use of disk space.

@benkehoe

I'm in favor of this feature as well. They could stay uncompressed in the source code here, but be bundled into a zip for the released wheel. They'd stay programmatically available in botocore exactly as they are today; it would be the Loader that would change, to read them out of the zip file rather than directly off disk.

The benefits to install time, artifact size, and Lambda in-console editing would be well worth the effort, imo.

@joguSD
Contributor

joguSD commented Jan 14, 2022

Hey all, just wanted to chime in real quick to mention that I took some time today to play around with the ideas here.

I think @benkehoe's suggestion makes a lot of sense, and I took a crack at implementing support for building wheels that include compressed models instead of the plaintext versions. However, rather than modifying the loader to include an additional possible location that checks within a zip, I decided to update the JSONFileLoader to look for either a plaintext .json file or a gzip-compressed .json.gz file. This means that a compressed model can be present in any location the Loader class might look (e.g. ~/.aws/models).
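
Roughly, the lookup becomes something like this (a simplified sketch, not the exact botocore code):

import gzip
import json
import os

def load_file(file_path):
    # file_path arrives without an extension, e.g. 'data/ec2/2016-11-15/service-2'
    json_path = file_path + '.json'
    if os.path.isfile(json_path):
        with open(json_path, 'rb') as f:
            return json.loads(f.read())
    gz_path = file_path + '.json.gz'
    if os.path.isfile(gz_path):
        with gzip.open(gz_path, 'rb') as f:
            return json.loads(f.read())
    return None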

In addition to support for loading gzip-compressed models, I've added a script to the scripts folder that will modify a botocore wheel in-place, replacing all .json files in the data directory with a gzip-compressed version. You can take a look at the branch on my fork here.
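
The core of that transformation boils down to something like this (an illustrative sketch; the real script works on the wheel archive itself):

import gzip
import os

def compress_models(data_dir):
    # Replace every .json model under data_dir with a gzip-compressed copy.
    for dirpath, _, filenames in os.walk(data_dir):
        for name in filenames:
            if name.endswith('.json'):
                src = os.path.join(dirpath, name)
                with open(src, 'rb') as f:
                    payload = gzip.compress(f.read(), compresslevel=9)
                with open(src + '.gz', 'wb') as f:
                    f.write(payload)
                os.remove(src)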

Using my branch, you should be able to generate and then modify a wheel that includes the compressed models instead.

$ python setup.py bdist_wheel
$ ./scripts/compress-wheel-data dist/botocore-*-none-any.whl

It'd be great if some of you could test the compressed wheels out, as I do have some concerns around compatibility / performance if we were ever to begin publishing wheels like this instead of the uncompressed versions.

As for my testing (on an M1 macbook pro) I saw the following:

Install times were marginally in favor of the wheel with compressed models, but the difference wasn't significant and might have just been margin of error.

Comparing the unzipped wheels, I saw about a 5x reduction in disk space, going from 66M to 13M:

$ du -h -d 0 gzip/botocore-1.23.32
 13M    gzip/botocore-1.23.32

$ du -h -d 0 normal/botocore-1.23.32
 66M    normal/botocore-1.23.32

I also tried creating a new Session object and creating a client (the largest model is ec2 and the smallest is sagemaker-edge) to see how this would impact load times. These results are the average of 100 runs:

Compressed (gzip) wheel:
ec2 Avg: 0.05411987456999998, Min: 0.03956283299999974, Max: 0.083342042
sagemaker-edge Avg: 0.02530930502999995, Min: 0.0206438750000002, Max: 0.05621566599999994

Uncompressed wheel:
ec2 Avg: 0.048753524610000036, Min: 0.034418124999999966, Max: 0.08430220900000002
sagemaker-edge Avg: 0.02403186916999993, Min: 0.01891829100000031, Max: 0.057971249999999586

Unfortunately, loading the compressed models is about 10% slower. I'm sure there are different compression algorithms that might produce better results here, but I'm concerned about compatibility if we were to use a less ubiquitous algorithm than gzip.

@adamchainz
Contributor Author

Do you know what gzip level you used? Python's gzip module defaults to 9, which is the slowest, because it applies the most compression. The gzip CLI uses 6 by default.

Even level 1 would probably provide significant gains given the repetition in JSON.
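
A quick way to compare levels on a real model file (illustrative path; sizes will vary):

import gzip

data = open('data/ec2/2016-11-15/service-2.json', 'rb').read()
for level in (1, 6, 9):
    print(level, len(gzip.compress(data, compresslevel=level)))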

@benkehoe

Thanks for taking a look at this! Can we get a comparison on wheel size and performance between compressing the files individually versus all together? I get the benefit of allowing non-default locations to have them individually, but if there's a big difference for the primary package it could make sense to special-case that as a single zip.

@joguSD
Contributor

joguSD commented Jan 14, 2022

@benkehoe The wheel size wasn't significantly impacted by a single zip vs individual models.
For the particular botocore version I used, I got the following:

Size of the .whl:
Uncompressed model data dir: 8.6M
Individual model file compressed data dir: 8.6M
Single zip for data dir: 8.3M

As for the decompressed package I got:
Uncompressed model data dir: 66M
Individual model file compressed data dir: 13M
Single zip for data dir: 11M

So a slight improvement in favor of a single zip. Getting data on how that affects botocore client load times isn't something I've tested since I haven't implemented it. I do have concerns around the monolithic nature of a single zip and the performance characteristics of random access in the zip.

@adamchainz My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this, higher compression level => slower compression times, smaller files, marginally faster decompress times.

@adamchainz
Contributor Author

My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this, higher compression level => slower compression times, smaller files, marginally faster decompress times.

Ah, you are right. My bad.

@joguSD
Contributor

joguSD commented Jan 14, 2022

@benkehoe

I ran a sanity check comparing all 3 by doing a minimal open of a model directly from the data dir or data.zip:

Loading ec2/2016-11-15/service-2.json
normal_open Avg: 0.009063401630000006, Min: 0.008451374999999997, Max: 0.010835124999999945, Sum: 0.9063401630000006
gzip_open Avg: 0.013103255060000008, Min: 0.012516417000000057, Max: 0.015194916000000003, Sum: 1.3103255060000008
nested_zip_open Avg: 0.016699820469999987, Min: 0.015805040999999687, Max: 0.020624874999999765, Sum: 1.6699820469999986


Loading sagemaker-edge/2020-09-23/service-2.json
normal_open Avg: 4.132742999999974e-05, Min: 3.9582999999999285e-05, Max: 8.28330000000009e-05, Sum: 0.0041327429999999735
gzip_open Avg: 6.306003999999984e-05, Min: 6.0208000000002565e-05, Max: 0.00011729200000000148, Sum: 0.006306003999999983
nested_zip_open Avg: 0.0048496287200000005, Min: 0.00480520799999995, Max: 0.005483624999999992, Sum: 0.48496287200000004
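
For context, the three access patterns being timed look roughly like this (my reconstruction, not the exact benchmark code; paths are illustrative):

import gzip
import json
import timeit
import zipfile

REL_PATH = 'ec2/2016-11-15/service-2.json'

def normal_open():
    with open('data/' + REL_PATH, 'rb') as f:
        return json.loads(f.read())

def gzip_open():
    with gzip.open('data/' + REL_PATH + '.gz', 'rb') as f:
        return json.loads(f.read())

def nested_zip_open():
    # Re-opening the archive on every call models a cold lookup into a monolithic zip.
    with zipfile.ZipFile('data.zip') as z:
        with z.open(REL_PATH) as f:
            return json.loads(f.read())

for fn in (normal_open, gzip_open, nested_zip_open):
    print(fn.__name__, timeit.timeit(fn, number=100) / 100)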

The nested zip is the slowest and impacts smaller models pretty significantly. This is only considering loading the .json contents, because we already knew the path. I think when you start to consider the nature of the Loader class, the overhead of going into the zip file will be even more significant. The Loader traverses sub-directories and lists files to discover available API versions / models, which doesn't really make sense in the context of a zip file. A ZipFileLoader class would likely need to deviate significantly from the existing one to mitigate the performance overhead, and my hunch is that it would still be slower overall.

@benkehoe

Awesome, this all makes sense. The small difference in size (that surprised me a bit), combined with individual zips being better on both performance and code simplicity, makes it no contest. Thanks for humoring me and validating it though!

@gricey432

Feels like this is trying to fix similar symptoms as #1543, but in a different way. I don't think the two ideas are mutually exclusive, though - just linking.

@joguSD
Contributor

joguSD commented Feb 4, 2022

@gricey432 You're absolutely correct that the two approaches aren't mutually exclusive. When I was doing the initial proof of concept script on my branch I was tempted to add a services filter that could allow the built wheel to only include a subset of services but didn't quite have time.

@whardier
Contributor

This could save roughly 50 megs in Lambda installs. As it stands, installing botocore/boto3 + telemetry tools + something like pandas usually breaks the bank when deploying to Lambda (even after removing pyc files and stripping shared objects).

@RyanFitzSimmonsAK RyanFitzSimmonsAK added p1 This is a high priority issue p2 This is a standard priority issue and removed p1 This is a high priority issue labels Nov 10, 2022
tgbugs added a commit to tgbugs/tgbugs-overlay that referenced this issue Mar 20, 2023
in the context of boto/botocore#2365
.json.gz loading has been merged boto/botocore#2628
so we can take advantage of it by gzipping all the json in the data folder
all the relevant tests pass so it seems that we are good to go
@nateprewitt
Contributor

Hey everyone, wanted to provide a quick status update.

Starting in 1.32.0, we began compressing select service models (Amazon EC2, Amazon SageMaker, and Amazon QuickSight) in our .whl files distributed on PyPI. With this change, we were able to reduce the size of botocore by 9.4 MB (11%) to a total of 76.1 MB on disk. This was the final step in a series of changes we've made over the last year to validate and enable today's release.

With 1.32.1, we've rolled this change out to all service models in our .whl files. This allows us to shrink botocore from 85.5 MB in our last 1.31.x release to 19.9 MB, for a total savings of 77%. We hope this will be an impactful first step towards making Botocore less difficult to use in space-constrained environments.

Going forward, we have additional areas we're looking to improve and will provide updates as we have them. We'd welcome any feedback you might have in the meantime.

@armenak-baburyan

Nice work! This is an update I've been waiting for a long time.

for VERSION in 1.31.83 1.31.84 1.31.85 1.32.0 1.32.1; do echo -n "$VERSION  --> " && docker run --rm python:3.11-slim bash -c "pip install --disable-pip-version-check --quiet --root-user-action=ignore botocore==$VERSION && du -h -s /usr/local/lib/python3.11/site-packages/botocore"; done
1.31.83  --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.84  --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.85  --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.32.0  --> 77M	/usr/local/lib/python3.11/site-packages/botocore
1.32.1  --> 24M	/usr/local/lib/python3.11/site-packages/botocore

@benkehoe

This is great news! Will this change end up in the CLI as well?

@bbayles

bbayles commented Nov 16, 2023

Just a note for people who are excited about the possibilities of smaller Lambda deploy packages: this probably won't help you get under 50 MB, because what you upload to Lambda is typically compressed already.

That is, botocore on its own is now smaller because it's compressed, but your package that includes botocore isn't - you were already compressing botocore yourself. Compressing it twice doesn't help!
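
You can see the effect in a toy demonstration:

import gzip

data = b'{"shape": "StringType", "type": "string"}' * 50000
once = gzip.compress(data)
twice = gzip.compress(once)
# The second pass saves essentially nothing: compressed data has
# little redundancy left for deflate to exploit.
print(len(data), len(once), len(twice))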

@benkehoe

I'd also like to drop a plug here for boto/boto3#2702: you tell us botocore version 1.32.1 has this change, and then it's work for us to figure out what boto3 version that is (it's 1.29.1), when they should just be the same.
