Import Wikisource trusted book provider data #9671
I have a few questions about this feature.
Knowing whether the main value of this feature is to:
would possibly help focus effort. Some Wikisource texts appear to come from Project Gutenberg texts, and that makes me worry about some of the lack-of-provenance issues such PD texts might have. I'm not 100% sure how we handle Project Gutenberg texts on OL; are they their own editions, do they change over time? That's probably a different topic though.
I don't love the
Wikisource/Wikidata not explicitly differentiating what counts as a "book" has been a real thorn in my side. For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it from what would be considered a book.
This was my understanding of the main purpose here when Drini and I were first discussing the project. I've been writing the import record script with the understanding that we'd like to import items from more Wikisource language bases than just English in the future.
I share @hornc 's concerns and would like to see this much more tightly specified.
is a pretty terse description of a request which could take a variety of different forms. Wikisource is mostly made up of transcriptions of specific editions (not works), although, as @hornc points out, PG editions are a bit of a wild card because they are editions without any provenance information which are intentionally unassociated with existing editions. Is the intention to create new digital editions for the transcriptions which are derived from the original edition? Or is the intention just to make the transcription some type of digital proxy for the original edition? Wikisource, as with most things wiki*, seems a bit ambiguous, but seems to lean towards the latter model (i.e. they include a link to the Wikidata entity for the transcribed edition, but don't model the transcription separately). Complicating this is the fact that Wikidata is generally poor at modeling book metadata. It's not a huge deal because it doesn't have much in it, but some of the logical conflicts you'll see include:
Using or linking to one of these conflated entities extends the mess because the new connection usually requires (or implies) either an edition or a work, but not both. I would suggest that Wikisource transcriptions should actually be modeled independently from the editions that they transcribe, but that would require the buy-in/support of both the Wikisource and Wikidata communities. Certainly, if OL considers exact digital facsimiles from CreateSpace, etc., to be separate editions, a transcription would definitely be considered a separate edition (but OpenLibrary's data model isn't rich enough to connect the two derived editions together, as far as I know). Has anyone looked at how many of the transcribed editions are NOT already in OpenLibrary? My assumption is that the vast majority of them are, so perhaps focusing on @hornc's suggestion of closing the loop on IA/OL editions would be a good place to start.
I would consider it a transcribed derivative of OL23268596M / ia:addresstomaryade00scot, which was authored by Q16944048 (no associated OL ID in Wikidata, but appears to be OL6627737A). Given that OL & IA each have (separate) catalog records with the metadata, and IA has scanned page images as well as OCR'd text, which is expected to be derived from Wikisource? Just a link, or an alternative text version, or some set of metadata, or ...? It might be tempting to infer equivalence of author IDs, but that seems risky absent evidence other than co-occurrence.
I will leave this open as an epic targeting this work, to be closed when we run a bulk import step. Currently the code has been merged and is undergoing some final verification/checks before we decide on next steps. I will read through some of the comments/concerns raised here later today and respond; @pidgezero-one and I have largely discussed many of these concerns already :)
On the identifier form,

On "What is a book on WikiSource": This is a big concern; +1 @pidgezero-one's response. Her extensive work in #9674, specifically targeting this problem, is, we believe, sufficient for filtering out non-book items like letters, press releases, decrees, etc., but the approach can always be improved. Regardless, the next step of this process (added a checkbox above) is to go through a random sample of the extracted books and manually verify the error rate.

On new editions/works/publishers: These were concerns that others had raised as well, and I raised them during our community call a few months ago to get more voices in the discussion. There we landed on creating separate editions for them, with the publisher set to "WikiSource" as well as the original publisher. This is in line with how we treat Project Gutenberg and Standard Ebooks. WikiSource is a subtly greyer area than Project Gutenberg and Standard Ebooks, but I think the work required to create a WikiSource book, coupled with the difference between the original and the WikiSource book, is sufficient to warrant its own edition record. +1 @tfmorris's suggestion that in some future we would be able to link these editions as "derived".

On the motivations of this feature: The motivation is twofold: (1) have more good-quality books in Open Library that people can read in a wide variety of formats. WikiSource books have highly accurate EPUBs, PDFs, etc., which are great for readers looking to read on their phones or ereaders, and which are better than e.g. IA's auto-OCR'd books in the same formats. (2) Support a mission-aligned website doing great work in the book space. I love being able to drive traffic to WikiSource since they're a great project :) And it's open, so Open Library contributors can also go contribute on WikiSource if they so wish!
That wasn't specified in the feature/issue description, so I wasn't particularly reviewing with that idea in mind, and it's not clear to me that the implementation that was merged does that either. It's also not clear that that is the best provider of value for the intended feature. Where and when was the appropriate time to comment on that?

I think the development showed that it's a bit more complicated than that: not only was the decision not documented, it was never really clarified, so it's still unclear what the implementation needs to do in all the cases it is likely to encounter. It seems like whether a new edition is created will depend on the details of the existing matching process and the input data provided; most relevantly, on what publication date is supplied. In the absence of any concrete examples of what was expected on an import, or of what happens on an import with the current script, it is hard to evaluate even at a purely technical level. Without a clear use case that delivers a specific value to an Open Library patron leveraging Open Library and Wikisource, it is hard to evaluate whether the code satisfies the feature. I don't think this is particularly fair to developers or reviewers.

Another potential comment I might have made on #9674 is that the feature looks like it could be implemented as an external script which makes use of existing import endpoints (JSON, or the new /bulk/ submission endpoint). The current implementation adds external modules to requirements.txt, which will be installed in production and in every development Docker container. Where and when should I raise those as potential things to think about (module version maintenance overhead, security footprint concerns (recently a relevant issue!), and container bloat)? I don't have a huge problem with this specific case, but I don't think it's sustainable for every OL-related script stored in /scripts/ to expand the production requirements, whether or not those scripts run in production. Given the lack of a high-level description of the feature, maybe I'm missing some context and there is only one way to do this.
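To illustrate the external-script alternative mentioned above, here is a minimal sketch of building a Wikisource import record and POSTing it to an import endpoint. The field names, the `wikisource:` source_records prefix, and the endpoint URL are assumptions for illustration only; the actual schema and authentication are defined by OL's import code.

```python
import json
from urllib.request import Request, urlopen


def build_import_record(lang, title, authors, publish_year):
    """Assemble a minimal import record for one Wikisource transcription.

    The exact field names and the "wikisource:" source_records prefix are
    assumptions here, not a confirmed schema.
    """
    page_title = title.replace(" ", "_")
    return {
        "title": title,
        "authors": [{"name": a} for a in authors],
        "publish_date": str(publish_year),
        "publishers": ["Wikisource"],
        "source_records": [f"wikisource:{lang}:{page_title}"],
    }


def submit(record, endpoint="https://openlibrary.org/api/import", cookie=None):
    """POST one record as JSON; requires an authenticated session cookie."""
    req = Request(
        endpoint,
        data=json.dumps(record).encode(),
        headers={
            "Content-Type": "application/json",
            **({"Cookie": cookie} if cookie else {}),
        },
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

Running such a script outside the main codebase would keep its dependencies out of requirements.txt, at the cost of not sharing the in-repo import pipeline's matching logic.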
Problem
Followup to #8545. Currently there are 60 books in OL that have Wikisource IDs. IDs are formatted as `langcode:title` (e.g. `en:George_Bernard_Shaw`). Import Wikisource works into Open Library.

See the Developer's Guide to Data Importing: https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing
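As a sketch of the `langcode:title` convention, the hypothetical helpers below (not part of the merged implementation) build an ID from a language code and page title, and turn an ID back into a page URL with the title URL-encoded:

```python
from urllib.parse import quote


def make_wikisource_id(lang: str, title: str) -> str:
    """Build an OL-style Wikisource ID (langcode:title), spaces as underscores."""
    return f"{lang}:{title.replace(' ', '_')}"


def wikisource_url(ws_id: str) -> str:
    """Turn a langcode:title ID back into a page URL, URL-encoding the title."""
    lang, _, title = ws_id.partition(":")
    return f"https://{lang}.wikisource.org/wiki/{quote(title)}"
```

Note that titles containing quotes or other reserved characters only need encoding in the URL, not in the stored ID itself.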
Breakdown
Proposal & Constraints
Hit English Wikisource's API and paginate through result sets of hits that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
The response includes documents that aren't books, and books are not flagged with a distinct category. We may also have to browse Wikisource's API to manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
The response includes most of the work's metadata (author, year, etc) as wiki markup of the page's infobox. Consider using a library like wptools to parse it.
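The pagination and category-filtering steps above can be sketched with the stdlib alone. This is a minimal illustration, not the merged script: the excluded-category names are placeholders from the examples above, and `fetch` is injectable so the continuation logic can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikisource.org/w/api.php"


def default_fetch(params):
    """Perform a real HTTP request against the English Wikisource API."""
    with urlopen(API + "?" + urlencode(params)) as resp:
        return json.load(resp)


def iter_validated_texts(fetch=default_fetch, limit=500):
    """Yield page objects from Category:Validated_texts, following the
    MediaWiki 'continue' token until the result set is exhausted."""
    params = {
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": "Category:Validated_texts",
        "gcmlimit": str(limit),
        "prop": "categories|info|revisions",
        "rvprop": "content",
        "rvslots": "main",
        "cllimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("pages", {}).values()
        if "continue" not in data:
            break
        # Merge the continuation token(s) (e.g. gcmcontinue) into the next call.
        params = {**params, **data["continue"]}


def looks_like_book(page, excluded=("Subpages", "Posters", "Songs")):
    """Heuristic filter: drop pages tagged with any known non-book category."""
    cats = {c.get("title", "") for c in page.get("categories", [])}
    return not any(f"Category:{name}" in cats for name in excluded)
```

Keeping the fetch and filter steps separate should make it easier to extend to other language subdomains and categories later.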
In the future, we will want to expand this import to support other languages besides en.wikisource.org, and perhaps expand beyond the `Validated texts` category, so the solution to this should be extensible.

A potential downside to how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID, which leaves us at the mercy of Wikisource's works being moved or renamed. This will likely be a rare (if ever) occurrence, but if we ever decide we want to use canonical IDs instead, any Wikisource item can be retrieved with `curid`. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)

Leads
Stakeholders
@cdrini @pidgezero-one