Added logic for page dump and commented out test line #9127
Conversation
@merwhite11 Very excited for this, and pleasantly surprised by how simple the solution is :) Hope Jim can review it soon! You may want to update the docstring here: openlibrary/openlibrary/data/dump.py, line 212 (as of dda2d57).
@cdrini, blocking, please see #8401 (comment)
@cdrini is there any way we can do a test run that doesn't upload to any items? We want to avoid preferences, anything store-related (which already shouldn't be there), PII (personally identifiable information), etc.
Taking a look at this now; checking the latest dumps to see what types of records we would get. This is currently running. (Written with some help from ChatGPT!)

```sh
python3 <<'EOF' | gzip > ol_dump_other.txt.gz
import gzip

import requests

url = 'https://openlibrary.org/data/ol_dump_latest.txt.gz'

# Types that already have their own dump files; everything else is "other".
exclude_types = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
}

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    # The dump is itself gzip-compressed; decompress the raw byte stream.
    with gzip.GzipFile(fileobj=r.raw) as f:
        for line in f:
            line = line.decode('utf-8')
            # The first tab-separated column is the record's type key.
            if line.split('\t', 1)[0] not in exclude_types:
                print(line, end='')
EOF
```
As I mentioned two months ago (#8401 (comment)), all this does is split the complete dump, which is already filtered. If there's a question about the contents, the complete dump creation is where it should be reviewed/fixed.
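To make that concrete: the split is a single pass over the complete dump that routes each row by its first (type) column. A rough sketch, assuming the tab-separated dump format above (file names hypothetical; this is not the actual oldump.sh/dump.py code):

```python
import gzip

# Types that are already broken out into their own dump files.
BROKEN_OUT = {
    "/type/edition": "editions",
    "/type/author": "authors",
    "/type/work": "works",
    "/type/redirect": "redirects",
    "/type/list": "lists",
}

out_files = {}
with gzip.open("ol_dump_latest.txt.gz", "rt") as dump:
    for line in dump:
        # Route by the type column; unknown types fall through to "other".
        name = BROKEN_OUT.get(line.split("\t", 1)[0], "other")
        if name not in out_files:
            out_files[name] = gzip.open(f"ol_dump_{name}.txt.gz", "wt")
        out_files[name].write(line)

for f in out_files.values():
    f.close()
```

Since the complete dump is already filtered, a pass like this can't introduce anything new; it only partitions what's already there.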
Yes, this is mostly about getting transparency into what's there. Here's the breakdown by type:

```console
$ zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c
      1 /type/about
      7 /type/backreference
      3 /type/collection
2583653 /type/delete
      5 /type/doc
     11 /type/home
    966 /type/i18n
     34 /type/i18n_page
    467 /type/language
    324 /type/library
     14 /type/local_id
    126 /type/macro
      5 /type/object
    439 /type/page
     12 /type/permission
      1 /type/place
     47 /type/rawtext
      1 /type/scan_location
      2 /type/scan_record
      3 /type/series
  91400 /type/subject
     14 /type/tag
    300 /type/template
     48 /type/type
      1 /type/uri
      2 /type/user
     19 /type/usergroup
    107 /type/volume
```

This seems fine. Lots of ancient legacy stuff here :P
78 MB for "other" doesn't seem excessive compared to the other file sizes. It's certainly a lot better than the 13.6 GB currently required to get any of the data that isn't broken out separately. One might even argue that editions, works, authors, and "other" would be an adequate breakdown: reading log, redirects, lists, ratings, and everything else would still total less than 200 MB, which is less than half the size of the authors file.
I have no strong opinion on splitting; I'll just be happy once it's easier to get access to these smaller sections of the dump. It seems fine to put stuff in this "other" dump even if it could be moved out to another dump years down the line. Maybe it's slightly better for consumers to have just one file to grab with all this stuff rather than many small ones?
Oh agreed; I mean we could decide down the line to split some of these out, e.g.
Lgtm! Didn't test running this since you gave it a test already; just tested what the slice would contain using the above snippet.
Errr, actually I think we should split out the deletes now, since we already have redirects in a separate dump, and the "other" dump will be like 95% deletes otherwise (2,583,653 of the ~2.68M lines in the breakdown above, about 96%), which I think makes it less useful.
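A quick sketch of what splitting deletes out of the "other" slice could look like, on the same tab-separated format (file names hypothetical; not the actual dump.py change):

```python
import gzip

# Route /type/delete rows to their own file; keep the rest in "other".
with gzip.open("ol_dump_other.txt.gz", "rt") as src, \
     gzip.open("ol_dump_deletes.txt.gz", "wt") as deletes, \
     gzip.open("ol_dump_other_no_deletes.txt.gz", "wt") as other:
    for line in src:
        is_delete = line.split("\t", 1)[0] == "/type/delete"
        (deletes if is_delete else other).write(line)
```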
@cdrini I don't know how to make a deletes dump. However, I updated the docs with placeholders :) Also, I've gotten confused about this before, but how can I find old ratings dumps? I don't see them at https://archive.org/details/ol_exports?tab=collection&query=ratings and I tried searching IA for the file name, but no luck.
Ah, they're inside the dump, e.g. https://archive.org/download/ol_dump_2024-03-31 . Is that what you're looking for?
@RayBB I updated the endpoint to include the short link for
Ok, this looks good to me! I've sent it to testing to test the new endpoints. @merwhite11, would you mind giving it another test to make sure my last changes didn't break anything? :P Then it should be good to merge!
Confirmed new endpoints work on testing 👍 |
Docs are updated with the forthcoming dump links |
@merwhite11 tested and it correctly generated all the files 👍 Lgtm!
Closes #8401

This is a refactor that allows all dump file types that are NOT already broken out (editions, works, authors, redirects, lists) to be sorted into a misc ("other") category. This category catches all /type/page dump records, in addition to all other types not in the above list. These misc files should help provide a comprehensive inventory of the pages in the dump that is used to generate the sitemap.
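A minimal sketch of the categorization described here (names hypothetical; the actual logic lives in openlibrary/data/dump.py):

```python
# Types that get their own dump files; everything else falls into "other".
TYPE_TO_DUMP = {
    "/type/edition": "editions",
    "/type/author": "authors",
    "/type/work": "works",
    "/type/redirect": "redirects",
    "/type/list": "lists",
    "/type/delete": "deletes",  # split out per the review discussion above
}

def dump_name(type_key: str) -> str:
    """Return the dump file a record belongs to, falling back to "other".

    E.g. dump_name("/type/page") == "other".
    """
    return TYPE_TO_DUMP.get(type_key, "other")
```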
Technical
I only tested these changes with a subset of the full data (commented in via the test line at line 38 of /scripts/oldump.sh). With line 38 commented in, I also had to change the -z in line 133 of /scripts/oldump.sh to -n (-z tests for an empty string, -n for a non-empty one) to avoid an error in /data/dump.py.

Testing
Screenshot

Ran docker compose run --rm home make test
Stakeholders
@jimchamp @RayBB