Optimize ingest_zarr_archive task #1387

Merged
merged 6 commits into from
Dec 8, 2022

Conversation

jjnesbitt
Member

This optimization is achieved in two main ways:

  1. Using a priority queue (heap) in the ZarrChecksumModificationQueue class, instead of a normal queue
  2. Storing all S3 files in the queue before doing any processing (see below for memory usage).

The main benefit this provides is being able to update the checksum tree strictly from the bottom up, removing duplicate file updates.

Previously, a batch of files was fetched and immediately processed, then the next batch was fetched, and so on. As a result, every batch updated all parent checksum files, up to the root checksum file. The optimized method updates each checksum file only once.
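To illustrate the bottom-up idea, here is a minimal sketch. It is not the actual `ZarrChecksumModificationQueue` code: the `BottomUpQueue` class, the `ingest` and `digest_of` helpers, and the use of SHA-256 over sorted child entries are all assumptions made for the example. Every file is queued first, then directories are popped deepest-first from a heap, so each directory's checksum is written exactly once.

```python
import hashlib
import heapq


class BottomUpQueue:
    """Hypothetical stand-in for a checksum modification queue."""

    def __init__(self):
        self._heap = []     # entries are (-depth, directory), so the deepest pops first
        self._pending = {}  # directory -> {child name: child digest}

    def queue_update(self, directory, name, digest):
        if directory not in self._pending:
            self._pending[directory] = {}
            depth = 0 if directory == "" else directory.count("/") + 1
            heapq.heappush(self._heap, (-depth, directory))
        self._pending[directory][name] = digest

    def pop_deepest(self):
        _, directory = heapq.heappop(self._heap)
        return directory, self._pending.pop(directory)

    def __bool__(self):
        return bool(self._heap)


def digest_of(children):
    # Illustrative aggregate only; the real checksum format is not shown here.
    joined = ",".join(f"{name}:{digest}" for name, digest in sorted(children.items()))
    return hashlib.sha256(joined.encode()).hexdigest()


def ingest(files):
    """files maps a path to its digest. Returns (root digest, writes per directory)."""
    queue = BottomUpQueue()
    # Step 1: queue every file before doing any processing.
    for path, digest in files.items():
        parent, _, name = path.rpartition("/")
        queue.queue_update(parent, name, digest)

    # Step 2: process strictly from the deepest directory up to the root ("").
    writes = {}
    root = None
    while queue:
        directory, children = queue.pop_deepest()
        checksum = digest_of(children)
        writes[directory] = writes.get(directory, 0) + 1
        if directory == "":
            root = checksum
        else:
            parent, _, name = directory.rpartition("/")
            queue.queue_update(parent, name, checksum)
    return root, writes
```

Because a popped directory only ever queues an update for a strictly shallower parent, no directory can re-enter the heap after it has been processed, which is what removes the repeated parent updates seen with per-batch processing.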

Results

The outcome of this optimization is an almost 2x speed improvement (roughly 1.8x). For a zarr archive with ~37k files (52.5GB), I observed an average runtime of ~83s, compared to a previous runtime of ~147s.

Regarding memory usage, the peak appeared to be ~411kb for the same zarr with ~37k files, decreasing as the ingestion goes on. This was the largest zarr I was able to test on, since it is the largest zarr in staging. To clarify, this testing was done using a local shell pointed at the staging database and S3 bucket, so these times should be representative of what we would observe in production.

@jjnesbitt jjnesbitt marked this pull request as ready for review November 30, 2022 22:01
Member

@mvandenburgh mvandenburgh left a comment

Very cool, some minor suggestions but this LGTM

dandiapi/zarr/checksums.py (outdated, resolved)
dandiapi/zarr/tasks/__init__.py (outdated, resolved)
@jjnesbitt
Member Author

FYI, it seems that django-stubs was updated last night, which caused a small mypy error that 9848539 addresses. It's a pretty minor change, so I've included it here.

@jjnesbitt jjnesbitt merged commit 00b93c4 into master Dec 8, 2022
@jjnesbitt jjnesbitt deleted the zarr-checksum-optimization branch December 8, 2022 17:56
@dandibot
Member

🚀 PR was released in v0.3.8 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Dec 19, 2022