Fix wrong keys data dump + internal ID field in dumps #6349

Merged
cdrini merged 2 commits into internetarchive:master from fix/wrong-keys-data-dump on Mar 31, 2022

@cdrini (Collaborator) commented Mar 30, 2022

Closes #6348

  • Partially revert 9403742 so that processed records are serialized with json.dumps and printed.
  • Place the filter from 9403742 in the right order.
  • Add unit tests!
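
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `normalize_key`, `process_record`, and `dump_lines` are hypothetical names, the `/a/` to `/authors/` rewrite is inferred from the testing command further down, and running the filter on the processed record is an assumption about what "the right order" means here.

```python
import json


def normalize_key(key: str) -> str:
    """Hypothetical helper: rewrite the legacy '/a/' prefix to '/authors/'."""
    if key.startswith("/a/"):
        return "/authors/" + key[len("/a/"):]
    return key


def process_record(record: dict) -> dict:
    """Hypothetical processing step: drop the internal id and fix the key."""
    processed = {k: v for k, v in record.items() if k != "id"}
    processed["key"] = normalize_key(processed["key"])
    return processed


def dump_lines(raw_json_records, keep=None):
    """Yield tab-separated dump lines built from the processed records."""
    for raw in raw_json_records:
        record = process_record(json.loads(raw))
        # Assumption: the filter runs after processing, so it sees the
        # corrected record rather than the raw one.
        if keep is not None and not keep(record):
            continue
        yield "\t".join([
            record["type"]["key"],
            record["key"],
            str(record["revision"]),
            record["last_modified"]["value"],
            # Serialize the processed record, not the raw input string.
            json.dumps(record),
        ])
```

The point of the sketch is only that the JSON column is produced from the processed record, so the key column and the JSON `key` cannot disagree.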

Technical

Testing

  • Kick off full data dump
curl -L 'https://archive.org/download/ol_dump_2022-03-29/ol_dump_authors_2022-03-29.txt.gz' | zcat | grep -F '/a/' | head -n1
/type/author      /authors/OL6725672A     2       2009-09-08T12:14:54.827520     {"bio": {"type": "/type/text", "value": "Translator to English of [Astrid Lindgren](/a/OL24950A)'s <i>Pippi Longstocking</i> (from Swedish) and works by [Knut Hamsun](/a/OL41785A) (from Norwegian)."}, "name": "Gerry Bothmer", "created": {"type": "/type/datetime", "value": "2009-09-08T12:12:46.444934"}, "last_modified": {"type": "/type/datetime", "value": "2009-09-08T12:14:54.827520"}, "latest_revision": 2, "key": "/authors/OL6725672A", "type": {"key": "/type/author"}, "revision": 2}

One match in the description -- all good!
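
Because `grep -F '/a/'` matches anywhere in the line, a hit inside free text such as a bio is harmless. A stricter check looks only at the key column and the JSON `key`. The sketch below assumes the dump is tab-separated with five columns (type, key, revision, last modified, JSON); `legacy_key_problems` is a hypothetical helper, not part of the repository.

```python
import json


def legacy_key_problems(line: str, legacy_prefix: str = "/a/") -> list[str]:
    """Return a list of problems found in one dump line (empty if none)."""
    # Assumed column layout: type, key, revision, last_modified, JSON.
    type_key, key, revision, last_modified, json_data = (
        line.rstrip("\n").split("\t", 4)
    )
    problems = []
    if key.startswith(legacy_prefix):
        problems.append(f"key column: {key}")
    record = json.loads(json_data)
    if record.get("key", "").startswith(legacy_prefix):
        problems.append(f"JSON key: {record['key']}")
    # The PR title also mentions an internal ID field leaking into dumps.
    if "id" in record:
        problems.append("internal id field present")
    return problems
```

Feeding it the decompressed dump line by line (for example from `sys.stdin`) and printing any non-empty result would flag only real key problems and ignore links inside descriptions.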

Screenshot

Stakeholders

@cclauss @mekarpeles

cdrini added 2 commits March 30, 2022 16:14
* Partially revert 9403742 so that processed records are serialized with json.dumps and printed.
* Place the filter from 9403742 in the right order.
* Add unit tests!

@cclauss (Contributor) left a comment

LGTM... Great to see the tests!

@cdrini (Collaborator, Author) commented Mar 30, 2022

Going to use this in a full data dump before merging to confirm everything is correct 👍

@cdrini (Collaborator, Author) commented Mar 31, 2022

All looks good!

@cdrini cdrini merged commit 3765615 into internetarchive:master Mar 31, 2022
@cdrini cdrini deleted the fix/wrong-keys-data-dump branch March 31, 2022 15:04
Development

Successfully merging this pull request may close these issues.

Data Dump keys sometimes have wrong format