Optimize ingest_zarr_archive task #1387
Merged
mvandenburgh approved these changes on Dec 8, 2022
Very cool, some minor suggestions but this LGTM
Co-authored-by: Mike VanDenburgh <37340715+mvandenburgh@users.noreply.github.com>
…rchive into zarr-checksum-optimization
jjnesbitt force-pushed the zarr-checksum-optimization branch from 77819c7 to 9848539 on December 8, 2022 17:20
FYI, it seems that django-stubs was updated last night, which caused a small mypy error that 9848539 addresses. It's a pretty minor change, so I've included it here.
🚀 PR was released in |
This optimization is achieved in two main ways:
Using the ZarrChecksumModificationQueue class instead of a normal queue.
The main benefit this provides is being able to update the checksum tree strictly from the bottom up, removing duplicate file updates.
Previously, a batch of files was fetched and immediately processed, then the next batch was fetched, and so on. This resulted in all parent checksum files being updated (up to the root checksum file) in every batch. The optimized method only updates each checksum file once.
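To illustrate the bottom-up idea described above, here is a minimal sketch. It is not the actual ZarrChecksumModificationQueue implementation: the class BottomUpUpdateQueue, the functions digest_listing and ingest, and the SHA-256 placeholder digest are all hypothetical names and choices made for this example. The sketch queues every file update first, then repeatedly processes the deepest pending directory, so each directory checksum is computed and written exactly once.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


class BottomUpUpdateQueue:
    """Hypothetical stand-in for a checksum modification queue (not the real class)."""

    def __init__(self):
        # Maps a directory path to the {child name: checksum} updates pending for it.
        self._pending = defaultdict(dict)

    def queue_update(self, directory, name, checksum):
        self._pending[directory][name] = checksum

    def empty(self):
        return not self._pending

    def pop_deepest(self):
        # The deepest directory goes first, so every child directory is
        # finished before its parent is processed.
        deepest = max(self._pending, key=lambda d: len(d.parts))
        return deepest, self._pending.pop(deepest)


def digest_listing(children):
    # Placeholder digest over a sorted listing; the real checksum format differs.
    joined = ",".join(f"{name}:{checksum}" for name, checksum in sorted(children.items()))
    return hashlib.sha256(joined.encode()).hexdigest()


def ingest(file_checksums):
    """Return ({directory: checksum}, number of directory checksum writes)."""
    queue = BottomUpUpdateQueue()
    # Queue every file up front instead of processing batch by batch.
    for path, checksum in file_checksums.items():
        p = PurePosixPath(path)
        queue.queue_update(p.parent, p.name, checksum)

    directory_checksums = {}
    writes = 0
    while not queue.empty():
        directory, children = queue.pop_deepest()
        checksum = digest_listing(children)
        directory_checksums[str(directory)] = checksum
        writes += 1
        # Propagate the result to the parent; the root "." is its own parent.
        if directory != directory.parent:
            queue.queue_update(directory.parent, directory.name, checksum)
    return directory_checksums, writes
```

With four files spread over the directories 0/0, 0/1, 0, and the root, ingest performs four directory checksum writes in total, one per directory, no matter how many files each directory contains.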
Results
The outcome of this optimization is an almost 2x speed improvement (roughly 1.8x). For a zarr archive with ~37k files (52.5GB), I observed an average runtime of ~83s, compared to a previous runtime of ~147s.
Regarding memory usage, the max usage appears to be ~411kb for the same zarr with ~37k files, decreasing as the ingestion goes on. This was the largest zarr I was able to test on, since it is the largest zarr in staging. To clarify, this testing was done using a local shell pointed at the staging database and S3 bucket, so these times should be representative of what we would observe in production.