Unzip while downloading #856
Update: I just added the following line to the Cargo.toml of my ad-hoc script pasted above:
Initial results on flyte.txt:
I'll focus on evaluating main with zlib-ng vs main without it, and come back to this branch later.
Results after merging from main:
What command(s) did you use to get those numbers?
Btw I'm using
I think we should check in the resolved requirements too so we get stable benchmarks. @charliermarsh, what do you think?
Did you use
You're not gonna believe this :S I forgot to use release (ty for checking!), so here's an updated batch of results with release:
Summary:
This makes sense. In theory, the only case where this should be slower than main is when there's more network throughput than we have unzipping throughput on 1 thread. Worth checking if that's the case for "only boto3". If that's a real scenario, then we can combine the two approaches (sketched below): add a separate download task that writes to a buffer, and if it gets too far ahead, stop the async unzipping and process the rest of the files in rayon. But that's complicated; hopefully there's no need for it, or there's another workaround. Anyway, next steps:
With that said, I'm only here until Friday, so if y'all want me to do things in a different order (or do something else), let me know :)
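
A minimal sketch of that combined approach, assuming tokio for the async side and a bounded channel as the buffer; fetch_wheel and unzip_wheel are hypothetical placeholders, not functions from this codebase:

use tokio::sync::mpsc;

// Placeholder for a real HTTP download of one wheel.
async fn fetch_wheel(i: usize) -> Vec<u8> {
    vec![i as u8; 1024]
}

// Placeholder for real zip extraction.
fn unzip_wheel(bytes: &[u8]) {
    let _ = bytes.len();
}

#[tokio::main]
async fn main() {
    // Bounded channel: send() waits when the buffer is full, so the
    // downloader can't race arbitrarily far ahead of the unzipper.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(8);

    tokio::spawn(async move {
        for i in 0..32 {
            let bytes = fetch_wheel(i).await;
            if tx.send(bytes).await.is_err() {
                break; // receiver gone; stop downloading
            }
        }
    });

    // Single-consumer unzip loop; the CPU-bound work runs on a blocking
    // thread so it doesn't stall the async runtime.
    while let Some(bytes) = rx.recv().await {
        tokio::task::spawn_blocking(move || unzip_wheel(&bytes))
            .await
            .expect("unzip task panicked");
    }
    // If this loop consistently can't keep up (the channel stays full),
    // the fallback described above would hand the buffered payloads to
    // rayon's thread pool instead.
}

The bounded channel gives the backpressure for free; the open design question is only the fallback policy when it saturates.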
Seems like a good idea
Testing all 4 versions ...
I'm only listing tests that run a long time because the short ones are noisy if not repeated.
So this PR is a 0-10% improvement (on my machine, with low statistical confidence). The zlib-ng PR is noise, and maybe worth reverting because it adds a dependency on cmake.
Oh no! Well the good news is that this looks like a nice speedup over main? :)
From a single value I find it hard to decide whether the changes are significant or noise (sorry - my bioinformatics background compels me to be pedantic here). Could you run with hyperfine or a criterion benchmark or some other way that gives us an average and a standard error (or a range)? The short benchmarks might pair well with criterion, the longer ones with hyperfine, but that would need an extra entrypoint I guess.
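
Not this repo's actual harness - just a sketch of what a criterion entrypoint could look like (a benches/ file wired up in Cargo.toml); install_cold_black and the bench name are made up for the example:

use criterion::{criterion_group, criterion_main, Criterion};

fn install_cold_black(c: &mut Criterion) {
    c.bench_function("install-cold black.txt", |b| {
        b.iter(|| {
            // Placeholder: invoke the installer against a cold cache here.
            // A real setup would reset the cache per iteration, e.g. with
            // Bencher::iter_batched and a setup closure.
        });
    });
}

criterion_group!(benches, install_cold_black);
criterion_main!(benches);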
Some early results (the rest of the tests will take a while to run):

black.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/black.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 132.4 ms ± 24.7 ms [User: 48.8 ms, System: 55.8 ms]
Range (min … max): 109.8 ms … 316.7 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 126.2 ms ± 16.3 ms [User: 47.9 ms, System: 45.3 ms]
Range (min … max): 100.0 ms … 174.9 ms 100 runs
Summary
./target/release/puffin-async (install-cold) ran
1.05 ± 0.24 times faster than ./target/release/puffin-main (install-cold)

boto3.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/boto3.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 1.251 s ± 0.092 s [User: 0.538 s, System: 0.723 s]
Range (min … max): 1.172 s … 1.733 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 1.144 s ± 0.076 s [User: 0.583 s, System: 0.581 s]
Range (min … max): 1.076 s … 1.532 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Summary
./target/release/puffin-async (install-cold) ran
1.09 ± 0.11 times faster than ./target/release/puffin-main (install-cold)

This bench script is pretty nice
Finished running all tests. Summary, sorted by biggest to smallest difference:
Detailed results

black.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/black.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 132.4 ms ± 24.7 ms [User: 48.8 ms, System: 55.8 ms]
Range (min … max): 109.8 ms … 316.7 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 126.2 ms ± 16.3 ms [User: 47.9 ms, System: 45.3 ms]
Range (min … max): 100.0 ms … 174.9 ms 100 runs
Summary
./target/release/puffin-async (install-cold) ran
1.05 ± 0.24 times faster than ./target/release/puffin-main (install-cold)

boto3.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/boto3.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 1.251 s ± 0.092 s [User: 0.538 s, System: 0.723 s]
Range (min … max): 1.172 s … 1.733 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 1.144 s ± 0.076 s [User: 0.583 s, System: 0.581 s]
Range (min … max): 1.076 s … 1.532 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Summary
./target/release/puffin-async (install-cold) ran
1.09 ± 0.11 times faster than ./target/release/puffin-main (install-cold)

dtlssocket.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/dtlssocket.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 4.524 s ± 0.061 s [User: 3.620 s, System: 0.924 s]
Range (min … max): 4.439 s … 4.793 s 100 runs
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 4.528 s ± 0.058 s [User: 3.631 s, System: 0.896 s]
Range (min … max): 4.438 s … 4.780 s 100 runs
Summary
./target/release/puffin-main (install-cold) ran
1.00 ± 0.02 times faster than ./target/release/puffin-async (install-cold)

flyte.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 10 \
scripts/requirements/compiled/flyte.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 41.163 s ± 0.453 s [User: 11.127 s, System: 12.223 s]
Range (min … max): 40.576 s … 41.852 s 10 runs
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 38.649 s ± 0.622 s [User: 16.275 s, System: 11.752 s]
Range (min … max): 37.950 s … 39.588 s 10 runs
Summary
./target/release/puffin-async (install-cold) ran
1.07 ± 0.02 times faster than ./target/release/puffin-main (install-cold)

pdm_2193.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/pdm_2193.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 1.033 s ± 0.053 s [User: 0.632 s, System: 0.778 s]
Range (min … max): 0.979 s … 1.265 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 1.000 s ± 0.063 s [User: 0.613 s, System: 0.578 s]
Range (min … max): 0.942 s … 1.220 s 100 runs
Summary
./target/release/puffin-async (install-cold) ran
1.03 ± 0.08 times faster than ./target/release/puffin-main (install-cold)

scispacy.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/scispacy.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 6.206 s ± 0.411 s [User: 1.960 s, System: 3.014 s]
Range (min … max): 6.051 s … 9.314 s 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 6.148 s ± 0.254 s [User: 2.224 s, System: 2.303 s]
Range (min … max): 6.004 s … 7.747 s 100 runs
Warning: The first benchmarking run for this command was significantly slower than the rest (7.402 s). This could be caused by (filesystem) caches that were not filled until after the first run. You are already using both the '--warmup' option as well as the '--prepare' option. Consider re-running the benchmark on a quiet system. Maybe it was a random outlier. Alternatively, consider increasing the warmup count.
Summary
./target/release/puffin-async (install-cold) ran
1.01 ± 0.08 times faster than ./target/release/puffin-main (install-cold)

trio.txt

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
scripts/requirements/compiled/trio.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 889.4 ms ± 74.7 ms [User: 380.0 ms, System: 824.7 ms]
Range (min … max): 809.8 ms … 1282.2 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 806.1 ms ± 41.2 ms [User: 360.4 ms, System: 505.1 ms]
Range (min … max): 756.9 ms … 988.0 ms 100 runs
Summary
./target/release/puffin-async (install-cold) ran
1.10 ± 0.11 times faster than ./target/release/puffin-main (install-cold)

boto3==1.34.14

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 100 \
debug/boto3.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 101.0 ms ± 11.4 ms [User: 33.5 ms, System: 28.4 ms]
Range (min … max): 82.6 ms … 144.3 ms 100 runs
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 105.4 ms ± 11.2 ms [User: 30.0 ms, System: 27.6 ms]
Range (min … max): 85.4 ms … 141.7 ms 100 runs
Summary
./target/release/puffin-main (install-cold) ran
1.04 ± 0.16 times faster than ./target/release/puffin-async (install-cold)

torch==1.12.1

python -m scripts.bench \
--puffin-path ./target/release/puffin-main \
--puffin-path ./target/release/puffin-async \
--benchmark install-cold \
--min-runs 20 \
debug/torch.txt
Benchmark 1: ./target/release/puffin-main (install-cold)
Time (mean ± σ): 23.736 s ± 0.347 s [User: 5.555 s, System: 5.496 s]
Range (min … max): 23.483 s … 24.739 s 20 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./target/release/puffin-async (install-cold)
Time (mean ± σ): 20.996 s ± 0.413 s [User: 8.371 s, System: 5.076 s]
Range (min … max): 20.695 s … 22.545 s 20 runs
Summary
./target/release/puffin-async (install-cold) ran
1.13 ± 0.03 times faster than ./target/release/puffin-main (install-cold)
Do you think this is worth productionizing based on the above benchmarks? I only see one regression (
The old solution would be faster if the wheel is a zip bomb. It's possible that there exist very compressible packages that take more time to decompress than to download. In that case the "best of both worlds" solution would be to have separate downloader and unzipper tasks and a buffer between them that can grow and spill onto disk if necessary. If the downloader gets too far ahead and the unzipper is not keeping up, we stop the async unzipper and unzip the remaining files using rayon. Though this is a complicated solution, and I wouldn't code it without evidence that it matters. IMO right now the more important thing is testing infrastructure:
Before that infrastructure exists we shouldn't get too attached to any particular implementation. The +10% speedup from this PR is not huge, but productionizing isn't difficult so I'll finish it up. I'll leave the current unzipper there so that we can switch to it if needed, and benchmark again with better test infra in the future.
Nice work! Great patch.
    target,
    filename: wheel_filename,
}));
}
What is the thinking behind leaving the dead code below? Is it so the code can be changed back to the old strategy easily? I think I'd prefer just deleting the dead code. It can always be restored from source history if necessary.
An alternative would be to make a runtime option to toggle between the two strategies if we wanted to keep both around, but I'm not sure how valuable that is.
There's a good chance we'd want this code just because we'll get more tests and it would make sense to compare both versions. Current tests are done on my machine only.
We could dig this code out of history later, but there would be merge conflicts with the changes that pile up in the meantime.
If we want to delete this now I'd still do it in a separate "refactor only" PR. There would be too much noise in this PR if I do it now.
Lmk what you prefer. I can send a PR deleting this on top of this one. There will be a cascade of related deletes, which is why I think bringing it back would involve merge conflicts.
IMO best practice would be to either: (1) remove, since we can always restore from source control; or (2) retain, but thread through a CLI flag (could be user-hidden) to let us toggle, and ensure there's test coverage. Otherwise, it's very likely to grow stale while still requiring some amount of maintenance, which is kind of the worst of both worlds.
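
As a rough illustration of option (2), assuming clap v4's derive API; the flag name is made up for the example, not anything in the codebase:

use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Escape hatch: use the old download-then-unzip strategy.
    /// hide = true keeps the flag out of --help output.
    #[arg(long, hide = true)]
    sync_unzip: bool,
}

fn main() {
    let cli = Cli::parse();
    if cli.sync_unzip {
        // dispatch to the old strategy
    } else {
        // dispatch to the streaming unzip-while-downloading strategy
    }
}

Both paths stay compiled and testable (the flag can be flipped in tests), at the cost of maintaining two implementations.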
Ok I'll remove it
Sounds good, you can do it in a separate PR with this branch marked as upstream if you like.
@@ -170,7 +198,7 @@ impl<'a, Context: BuildContext + Send + Sync> DistributionDatabase<'a, Context>
     CacheBucket::Wheels,
     WheelCache::Index(&wheel.index)
         .remote_wheel_dir(wheel_filename.name.as_ref()),
-    filename,
+    filename, // TODO should this be filename.stem() to match the other branch?
Oh hmm... I would say they should both use .stem(), yeah. Can you try this in a separate PR?
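
For readers following along, the question is just whether the cache directory is keyed by the full wheel filename or by the filename minus its extension. A std-only illustration (the codebase's own stem() may be computed differently, e.g. on a WheelFilename type):

use std::path::Path;

fn main() {
    let filename = "black-23.12.1-py3-none-any.whl";
    let stem = Path::new(filename)
        .file_stem() // drops the final ".whl" extension
        .and_then(|s| s.to_str())
        .unwrap();
    assert_eq!(stem, "black-23.12.1-py3-none-any");
}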
This code will be deleted
Awesome. I ran a few of these locally and the differences held up on my machine too -- I actually saw a 14% improvement for boto3.txt and a 9% improvement for trio.txt!
@bojanserafimov - I recommend merging this, then rebasing #879 on
LGTM! @bojanserafimov - LMK if you want an additional follow-up review, otherwise feel free to merge.
Results: