Given the long run times, make ol-dumps easier to test #5909

Merged: 5 commits, Dec 3, 2021

@cclauss cclauss commented Nov 26, 2021

Closes #5893

Testing (stats as of November 2021):
The oldump cron job on ol-home0 takes 18+ hours to process 192,000,000+ records in 29GB of data!!

These changes let scripts/oldump.sh honor OLDUMP_TESTING=true, which makes the script and openlibrary/data/dump.py process only the first 1 million lines of the file instead of all 192+ million. The first step of the script still takes about 110 minutes to extract the 29GB of data from our database, so it is highly recommended to save a copy of data.txt.gz to skip that step and speed up testing of the subsequent job steps.
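
For illustration only, here is a minimal sketch of the line cap described above. It is not the code merged in this PR. It assumes the flag is read from the environment and compared against the string `true`; the names `is_testing`, `read_dump_lines`, and `TESTING_LINE_LIMIT` are hypothetical.

```python
import gzip
import itertools
import os
from typing import Iterator

# Assumption: the cap mirrors the description above (first 1 million lines).
TESTING_LINE_LIMIT = 1_000_000


def is_testing() -> bool:
    """Return True when OLDUMP_TESTING is set to "true" (case-insensitive)."""
    return os.environ.get("OLDUMP_TESTING", "").strip().lower() == "true"


def read_dump_lines(path: str, limit: int = TESTING_LINE_LIMIT) -> Iterator[str]:
    """Yield lines from a gzipped dump; stop after `limit` lines in testing mode."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops reading early, so the rest of the file is never decompressed.
        lines = itertools.islice(f, limit) if is_testing() else f
        yield from lines
```

With a saved copy of data.txt.gz, something like `OLDUMP_TESTING=true python my_test.py` (script name hypothetical) would iterate over at most the first million lines. Without the variable, the whole file is read.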

Technical

Call flow:
docker-compose.production.yml defines `cron-jobs` Docker container.
--> docker/ol-cron-start.sh sets up the cron tasks.
    --> olsystem: /etc/cron.d/openlibrary.ol_home0 defines the actual job
        --> scripts/oldump.sh
            --> scripts/oldump.py
                --> openlibrary/data/dump.py


@cclauss cclauss added Affects: Admin/Maintenance Issues relating to support scripts, bots, cron jobs and admin web pages. [managed] Module: Data dumps labels Nov 26, 2021
@mekarpeles mekarpeles self-assigned this Nov 29, 2021

cclauss commented Dec 1, 2021

Use Linux logger to write start and finish messages into the system log.
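
The PR does this from the shell script. As a rough Python illustration of the same idea (not the merged code), the sketch below wraps a job with start and finish messages sent through the `logger` command. The tag `oldump` and the names `logger_command` and `logged_job` are hypothetical.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def logger_command(message: str, tag: str = "oldump") -> List[str]:
    """Build the argv for the `logger` CLI; -t sets the tag on the log entry."""
    return ["logger", "-t", tag, message]


@contextmanager
def logged_job(
    name: str, run: Callable[[List[str]], object] = subprocess.run
) -> Iterator[None]:
    """Write a start message before the job body and a finish message after it."""
    run(logger_command(f"{name} started"))
    try:
        yield
    finally:
        # Runs on failure too, so "finished" here does not imply success.
        run(logger_command(f"{name} finished"))
```

Passing a different `run` callable makes the wrapper testable without writing to the real system log.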


@mekarpeles mekarpeles left a comment

changes lgtm, I committed a trivial fix to a lint/whitespace issue
Error: The process '/opt/hostedtoolcache/Python/3.9.9/x64/bin/pre-commit' failed with exit code 1

ty @BharatKalluri for helping w/ this code review!

@cclauss cclauss merged commit 7ef68c0 into internetarchive:master Dec 3, 2021
@cclauss cclauss deleted the make-oldumps-easier-to-test branch December 3, 2021 18:00
Linked issue: Add TESTING flag for scripts/oldumps.sh given the long runtimes