Data Dumps: Fix remaining archival and sitemaps bugs #6638

Merged

Conversation

cclauss
Contributor

@cclauss cclauss commented Jun 8, 2022

Related to #5402

good riddance 🤣
closes #5892
closes #6253
closes #6358
closes #6643

  • Remove the $TMPDIR/sitemaps dir as part of cleanup
  • Unify logging in scripts/sitemaps/sitemap.py with the other Open Library data dumps jobs
  • Fix the scripts/oldump.sh archive step, which was failing on a syntax issue: the test used [ ] instead of [[ ]]
  • Fix the unbalanced quote in scripts/oldump.sh
  • Enable Sentry in scripts.oldump.__main__ instead of openlibrary.data.dump.__main__ because the latter is never called in the dumps process.
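A minimal sketch of what a shared logger for the dump jobs could look like, using only the standard library. The logger name and line layout are inferred from the log output in this PR (`2022-06-07 19:12:01 [openlibrary.dump] ...`). The helper name `setup_dump_logger` is hypothetical and this is not the code from `scripts/sitemaps/sitemap.py`.

```python
import logging
import sys


def setup_dump_logger(name: str = "openlibrary.dump") -> logging.Logger:
    """Return a logger that writes 'YYYY-MM-DD HH:MM:SS [name] message' lines.

    Hypothetical helper: the format is an assumption based on the dump logs.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s [%(name)s] %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep lines from being emitted twice via the root logger
    return logger


logger = setup_dump_logger()
logger.info("* === Step 1 ===")
```

Every script that calls the same helper gets the same prefix, which is what makes the combined job log easy to scan.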

To make the Open Library data dump logs easier to read, add commas to large numbers:

log(f"read_data_file() processed {i} records in {minutes} minutes.")
# -->
log(f"read_data_file() processed {i:,} records in {minutes:,} minutes.")

Results:
2022-06-08 04:21:36 [openlibrary.dump] read_data_file() processed 202670667 records in 435 minutes.
2022-06-08 04:21:36 [openlibrary.dump] print_dump() processed 202670667 records in 435 minutes.
# -->
2022-06-08 04:21:36 [openlibrary.dump] read_data_file() processed 202,670,667 records in 435 minutes.
2022-06-08 04:21:36 [openlibrary.dump]     print_dump() processed 202,670,667 records in 435 minutes.
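The change relies on Python's `,` format specifier, which inserts a thousands separator. A small self-contained sketch follows; the `summary` helper is hypothetical and only mirrors the message shape shown above.

```python
def summary(name: str, records: int, minutes: int) -> str:
    """Build a summary line with thousands separators (hypothetical helper)."""
    return f"{name}() processed {records:,} records in {minutes:,} minutes."


print(summary("read_data_file", 202670667, 435))
# read_data_file() processed 202,670,667 records in 435 minutes.
```

Values below 1,000, such as the 435 minutes here, are printed unchanged.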

Technical

Testing

https://archive.org/details/ol_exports?sort=-publicdate

Screenshot

Stakeholders

@cclauss cclauss added Affects: Admin/Maintenance Issues relating to support scripts, bots, cron jobs and admin web pages. [managed] Module: Data dumps labels Jun 8, 2022
@cclauss cclauss self-assigned this Jun 8, 2022
@cclauss cclauss requested a review from mekarpeles June 8, 2022 11:06
@cclauss
Contributor Author

cclauss commented Jun 8, 2022

2022-06-07 19:12:01 [openlibrary.dump] * 2022-06-06 --archive --overwrite
2022-06-07 19:12:01 [openlibrary.dump] * <host:ol-home0.us.archive.org> <user:openlibrary> <dir:/1/var/tmp>
2022-06-07 19:12:01 [openlibrary.dump] * <cdump:ol_cdump_2022-06-06> <dump:ol_dump_2022-06-06>
2022-06-07 19:12:01 [openlibrary.dump] * Cleaning Up: Found --overwrite, removing old files

2022-06-07 19:12:03 [openlibrary.dump] * === Step 1 ===
2022-06-07 19:12:03 [openlibrary.dump] * generating reading log table: ol_dump_reading-log_2022-06-06.txt.gz

2022-06-07 19:12:18 [openlibrary.dump] * === Step 2 ===
2022-06-07 19:12:18 [openlibrary.dump] * generating ratings table: ol_dump_ratings_2022-06-06.txt.gz

2022-06-07 19:12:19 [openlibrary.dump] * === Step 3 ===
2022-06-07 19:12:19 [openlibrary.dump] * generating the data table: data.txt.gz -- takes approx. 110 minutes...

2022-06-07 21:05:47 [openlibrary.dump] * === Step 4 ===
2022-06-07 21:05:47 [openlibrary.dump] * generating ol_cdump_2022-06-06.txt.gz
2022-06-07 21:05:47 [openlibrary.dump] ['/openlibrary/scripts/oldump.py', 'cdump', 'data.txt.gz', '2022-06-06'] on Python 3.9.4
2022-06-07 21:05:49 [openlibrary.dump] read_data_file(data.txt.gz, max_lines=all)
2022-06-07 21:05:49 [openlibrary.dump] print_dump 0
2022-06-07 21:07:28 [openlibrary.dump] print_dump 1,000,000
2022-06-07 21:09:08 [openlibrary.dump] print_dump 2,000,000
# [...]
2022-06-08 04:16:01 [openlibrary.dump] print_dump 200,000,000
2022-06-08 04:18:07 [openlibrary.dump] print_dump 201,000,000
2022-06-08 04:20:17 [openlibrary.dump] print_dump 202,000,000
2022-06-08 04:21:36 [openlibrary.dump] read_data_file() processed 202670667 records in 435 minutes.
2022-06-08 04:21:36 [openlibrary.dump] print_dump() processed 202670667 records in 435 minutes.
2022-06-08 04:21:36 [openlibrary.dump] * generated ol_cdump_2022-06-06.txt.gz

2022-06-08 04:21:36 [openlibrary.dump] * === Step 5 ===
2022-06-08 04:21:36 [openlibrary.dump] ['/openlibrary/scripts/oldump.py', 'sort', '--tmpdir', '/1/var/tmp'] on Python 3.9.4
2022-06-08 04:21:36 [openlibrary.dump] ['/openlibrary/scripts/oldump.py', 'dump'] on Python 3.9.4
2022-06-08 04:21:37 [openlibrary.dump] sort_dump stdin
2022-06-08 04:21:37 [openlibrary.dump] sort_dump 0
2022-06-08 04:21:38 [openlibrary.dump] read_tsv(<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>)
2022-06-08 04:23:31 [openlibrary.dump] sort_dump 1,000,000
2022-06-08 04:25:29 [openlibrary.dump] sort_dump 2,000,000
2022-06-08 04:27:28 [openlibrary.dump] sort_dump 3,000,000
# [ ... ]
2022-06-08 11:52:29 [openlibrary.dump] sort_dump 179,000,000
2022-06-08 11:54:32 [openlibrary.dump] sort_dump 180,000,000
2022-06-08 11:56:44 [openlibrary.dump] sort_dump 181,000,000
2022-06-08 11:57:01 [openlibrary.dump] sort_dump /1/var/tmp/oldumpsort/00.txt.gz
2022-06-08 11:57:07 [openlibrary.dump] read_tsv 0
2022-06-08 11:57:16 [openlibrary.dump] sort_dump /1/var/tmp/oldumpsort/01.txt.gz
2022-06-08 11:57:26 [openlibrary.dump] read_tsv 1,000,000
2022-06-08 11:57:32 [openlibrary.dump] sort_dump /1/var/tmp/oldumpsort/02.txt.gz
2022-06-08 11:57:46 [openlibrary.dump] read_tsv 2,000,000
# [ ... ]
2022-06-08 13:03:47 [openlibrary.dump] sort_dump /1/var/tmp/oldumpsort/fe.txt.gz
2022-06-08 13:03:57 [openlibrary.dump] read_tsv 180,000,000
2022-06-08 13:04:02 [openlibrary.dump] sort_dump /1/var/tmp/oldumpsort/ff.txt.gz
2022-06-08 13:04:17 [openlibrary.dump] read_tsv 181,000,000
2022-06-08 13:04:18 [openlibrary.dump] sort_dump() processed 181100657 records in 522 minutes.
2022-06-08 13:04:18 [openlibrary.dump] read_tsv() processed 181100657 records in 522 minutes.
2022-06-08 13:04:18 [openlibrary.dump] generate_dump(None) ran in 522 minutes.

2022-06-08 13:04:18 [openlibrary.dump] * === Step 6 ===
2022-06-08 13:04:18 [openlibrary.dump] ['/openlibrary/scripts/oldump.py', 'split', '--format', 'ol_dump_%s_2022-06-06.txt.gz'] on Python 3.9.4
2022-06-08 13:04:20 [openlibrary.dump] split_dump 0
2022-06-08 13:05:33 [openlibrary.dump] split_dump 1,000,000
2022-06-08 13:06:42 [openlibrary.dump] split_dump 2,000,000
2022-06-08 13:07:55 [openlibrary.dump] split_dump 3,000,000
# [ ... ]
2022-06-08 14:27:57 [openlibrary.dump] split_dump 71,000,000
2022-06-08 14:29:04 [openlibrary.dump] split_dump 72,000,000
2022-06-08 14:30:03 [openlibrary.dump] split_dump() processed 72819551 records in 85 minutes.
2022-06-08 14:30:05 [openlibrary.dump] * dumps are generated at /1/var/tmp/dumps
2022-06-08 14:30:05 [openlibrary.dump] * Skipping sitemaps
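The minute counts in the summary lines can be cross-checked against the timestamps in the log. The sketch below assumes the scripts report whole minutes rounded down, which is an inference from this log and not something confirmed from the source. The `elapsed_minutes` helper is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp layout used in the log lines above


def elapsed_minutes(start: str, end: str) -> int:
    """Whole minutes between two log timestamps, rounded down (assumption)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Step 4: read_data_file() start line to its summary line
print(elapsed_minutes("2022-06-07 21:05:49", "2022-06-08 04:21:36"))  # 435
# Step 6: first split_dump line to its summary line
print(elapsed_minutes("2022-06-08 13:04:20", "2022-06-08 14:30:03"))  # 85
```

Both results match the figures reported in the log for those steps.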

Resulting files

ol-home0% ls -lhR /1/var/tmp/dumps

/1/var/tmp/dumps:
total 30G
-rw-r--r-- 1 systemd-coredump systemd-coredump  30G Jun  7 21:05 data.txt.gz
drwxr-xr-x 2 systemd-coredump systemd-coredump 4.0K Jun  8 14:30 ol_cdump_2022-06-06
drwxr-xr-x 2 systemd-coredump systemd-coredump 4.0K Jun  8 14:30 ol_dump_2022-06-06

/1/var/tmp/dumps/ol_cdump_2022-06-06:
total 26G
-rw-r--r-- 1 systemd-coredump systemd-coredump 26G Jun  8 04:21 ol_cdump_2022-06-06.txt.gz

/1/var/tmp/dumps/ol_dump_2022-06-06:
total 21G
-rw-r--r-- 1 systemd-coredump systemd-coredump  11G Jun  8 13:04 ol_dump_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump 416M Jun  8 14:30 ol_dump_authors_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump 7.4G Jun  8 14:30 ol_dump_editions_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump 2.5M Jun  7 19:12 ol_dump_ratings_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump  34M Jun  7 19:12 ol_dump_reading-log_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump  34M Jun  8 14:30 ol_dump_redirects_2022-06-06.txt.gz
-rw-r--r-- 1 systemd-coredump systemd-coredump 2.3G Jun  8 14:30 ol_dump_works_2022-06-06.txt.gz

@mekarpeles mekarpeles assigned mekarpeles and unassigned cclauss Jun 8, 2022
@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] Priority: 2 Important, as time permits. [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jun 8, 2022
@cclauss cclauss force-pushed the dumps-add-commas-to-large-numbers branch from 9654865 to 6717e2a Compare June 9, 2022 04:53
@cclauss cclauss added Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] and removed Priority: 2 Important, as time permits. [managed] labels Jun 9, 2022
@cclauss cclauss changed the title Data Dumps: Add commas to large numbers for readability Data Dumps: Fix remaining archival and sitemaps bugs Jun 9, 2022
@mekarpeles
Member

🤞

@mekarpeles mekarpeles merged commit 337ccab into internetarchive:master Jun 9, 2022
@cclauss cclauss deleted the dumps-add-commas-to-large-numbers branch June 9, 2022 15:55
Labels
Affects: Admin/Maintenance Issues relating to support scripts, bots, cron jobs and admin web pages. [managed] Module: Data dumps Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed]

2 participants