Import Wikisource trusted book provider data #9671

pidgezero-one · 2024-08-01T01:09:51Z

Problem

Follow-up to #8545. Currently there are 60 books in OL that have Wikisource IDs. IDs are formatted as langcode:title (e.g. en:George_Bernard_Shaw). Import Wikisource works into Open Library.

https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing

Breakdown

@pidgezero-one Create a script which implements the proposal below to get Wikisource data and coerce the data into Open Library's import format
@cdrini Verify a sample of ~10 of the resulting books
Both: Run bulk import

Proposal & Constraints

Hit English Wikisource's API and paginate through result sets of hits that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
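A minimal sketch of the pagination loop, assuming the standard MediaWiki behaviour where each response may carry a `continue` object whose keys are merged into the next request. The names `fetch_json` and `iter_pages` and the User-Agent string are hypothetical, not taken from the PR. Because `prop` results can also be continued, the same page may appear in more than one batch with partial data, so a real script would need to merge entries by page ID.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_URL = "https://en.wikisource.org/w/api.php"

# Same parameters as the query URL in the proposal.
BASE_PARAMS = {
    "action": "query",
    "generator": "categorymembers",
    "gcmtitle": "Category:Validated_texts",
    "gcmlimit": "500",
    "prop": "categories|info|revisions",
    "rvprop": "content",
    "rvslots": "main",
    "format": "json",
    "cllimit": "max",
}


def fetch_json(params: dict) -> dict:
    """Perform one API request (hypothetical helper; does network I/O)."""
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    # Placeholder User-Agent; replace with a real contact string.
    req = urllib.request.Request(url, headers={"User-Agent": "ol-wikisource-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_pages(fetch: Callable[[dict], dict] = fetch_json) -> Iterator[dict]:
    """Yield page objects from every batch until no `continue` is returned."""
    params = dict(BASE_PARAMS)
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("pages", {}).values()
        cont = data.get("continue")
        if not cont:
            break
        # Merge the continuation tokens into a fresh copy of the base query.
        params = {**BASE_PARAMS, **cont}
```

Passing `fetch` as a parameter keeps the loop testable without hitting the live API.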

The response includes documents that aren't books. Books are not flagged with a distinct category. We may also have to browse Wikisource's API to manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
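The ignore-list idea could look roughly like the sketch below. It assumes each page object carries a `categories` list of `{"title": "Category:..."}` entries, as the `prop=categories` part of the query returns. The three category names are only the examples mentioned above, not a vetted list, and `is_candidate_book` is a hypothetical name.

```python
# Illustrative ignore list; a real one would be drafted by browsing Wikisource.
EXCLUDED_CATEGORIES = frozenset({"Subpages", "Posters", "Songs"})


def category_names(page: dict) -> set[str]:
    """Return the page's category names without the 'Category:' prefix."""
    return {c["title"].removeprefix("Category:") for c in page.get("categories", [])}


def is_candidate_book(page: dict, excluded: frozenset = EXCLUDED_CATEGORIES) -> bool:
    """Keep a page only if it belongs to none of the excluded categories."""
    return category_names(page).isdisjoint(excluded)
```

A page with no category data is kept by this check, so it only narrows the result set; it does not positively identify books.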

The response includes most of the work's metadata (author, year, etc) as wiki markup of the page's infobox. Consider using a library like wptools to parse it.
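To illustrate what parsing the markup involves, here is a deliberately naive sketch using only the standard library rather than wptools. It assumes the metadata sits in a `{{header | key = value | ...}}` template and that values contain no nested templates or piped links, which will not hold for every page. `parse_header_template` is a hypothetical name.

```python
import re


def parse_header_template(wikitext: str, template: str = "header") -> dict[str, str]:
    """Extract `key = value` fields from the first matching template.

    Naive: stops at the first '}}' and splits on every '|', so nested
    templates or [[link|label]] values will be parsed incorrectly.
    """
    pattern = r"\{\{\s*" + re.escape(template) + r"\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, re.DOTALL | re.IGNORECASE)
    if not match:
        return {}
    fields: dict[str, str] = {}
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep:  # skip positional parameters that have no '='
            fields[key.strip()] = value.strip()
    return fields
```

A proper wikitext parser is the safer choice for a bulk import; this only shows the shape of the data being extracted.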

In the future, we will want to expand this import to support other languages besides en.wikisource.org, and perhaps expand beyond the Validated texts category, so the solution to this should be extensible.

A potential downside to how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID, and that leaves us at the mercy of Wikisource's works being moved or renamed. This will likely be a pretty rare occurrence (if it happens at all), but if we ever decide we want to use canonical IDs instead, any Wikisource item can be obtained with curid. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)
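The two addressing schemes in that example can be sketched as follows, assuming the `https://{lang}.wikisource.org` host pattern shown above. The function names are hypothetical.

```python
from urllib.parse import quote


def title_url(lang: str, title: str) -> str:
    """Build a title-based link, percent-encoding characters such as quotes."""
    return f"https://{lang}.wikisource.org/wiki/{quote(title.replace(' ', '_'))}"


def curid_url(lang: str, page_id: int) -> str:
    """Build a link from the numeric page ID, which survives page renames."""
    return f"https://{lang}.wikisource.org/?curid={page_id}"


print(title_url("en", '"Red" Fed. Memoirs'))
print(curid_url("en", 4496925))
```

Note that the numeric ID is only meaningful together with the language code, since each language's Wikisource numbers its pages independently.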

Leads

Stakeholders

@cdrini @pidgezero-one


hornc · 2024-11-27T23:35:00Z

I have a few questions about this feature,

it's not completely clear to me whether the en:George_Bernard_Shaw id is really a portable identifier or really a URL equivalent (wiki + page title, which can change), or how it can be used to compare with other data sources that might list a 'Wikisource identifier'. The numeric ids look more like identifiers, but they are also specific to each language's Wikisource, so there really isn't a single 'Wikisource identifier': 112842 is the 'George Bernard Shaw' book on en-wikisource, but it's something completely different on Ukrainian Wikisource.
Determining what is a 'book' on Wikisource does seem complicated, and it's not stated clearly. Pages on Wikisource appear to represent 'Works', but are generally expected to have a source published Edition -- I don't know if the edition can be changed in principle? I think that means Wikisource is not a publisher, so Wikisource will not be the only source for these books.
From the examples I have seen, the Wikisource scans appear to originate on archive.org, which should imply there are already Open Library records. The example https://en.wikisource.org/wiki/George_Bernard_Shaw has an archive.org id of https://archive.org/details/cu31924013547645 , which is in the metadata. A Ukrainian example https://uk.wikisource.org/wiki/%D0%A4%D0%B0%D0%B2%D1%81%D1%82 doesn't appear to mention or link to archive.org at all, but the scan that is on Wikisource seems to be this one from IA: https://archive.org/details/favsttragediia01goet . If there is already a built-in workflow relationship between Wikisource and archive.org, there might be a more direct way to close the loop and associate the identifiers?

Knowing whether the main value of this feature is to:

get more books into Open Library that OL does not have
close the loop on associating Wikisource pages with already existing records in OL
support some other Wikisource-related workflow

would possibly help focus effort.

Some Wikisource texts appear to come from Project Gutenberg texts, and that makes me worry about some of the lack-of-provenance issues such PD texts might have. I'm not 100% sure how we handle Project Gutenberg texts on OL: are they their own editions, and do they change over time? That's probably a different topic though.

pidgezero-one · 2024-11-28T00:10:37Z

it's not completely clear to me whether the en:George_Bernard_Shaw id is really a portable identifier or really a URL equivalent (wiki + page title, which can change), or how it can be used to compare with other data sources that might list a 'Wikisource identifier'. The numeric ids look more like identifiers, but they are also specific to each language's Wikisource, so there really isn't a single 'Wikisource identifier': 112842 is the 'George Bernard Shaw' book on en-wikisource, but it's something completely different on Ukrainian Wikisource.

I don't love the lang:title identifier format, personally. In the script in my open PR, I originally tried to use the numeric ID like the one you've identified. I stuck with lang:title here for two reasons. The minor one is that I couldn't get the numeric identifier to resolve to the outbound links in the download options section for Wikisource books. The main one is that it's already the identifier format that the small selection of existing Wikisource books in OL use (same example).

Determining what is a 'book' on Wikisource does seem complicated, and it's not stated clearly. Pages on Wikisource appear to represent 'Works', but are generally expected to have a source published Edition -- I don't know if the edition can be changed in principle? I think that means Wikisource is not a publisher, so Wikisource will not be the only source for these books.

Wikisource/Wikidata not explicitly differentiating what counts as a "book" has been a real thorn in my side. For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it apart from what would be considered a book.

get more books into Open Library that OL does not have

This was my understanding of the main purpose here when Drini and I were first discussing the project. I've been writing the import record script with the understanding that we'd like to import items from more Wikisource language bases than just English in the future.

tfmorris · 2024-11-28T23:10:28Z

I share @hornc 's concerns and would like to see this much more tightly specified.

Import Wikisource works into Open Library.

is a pretty terse description of a request which could take a variety of different forms.

Wikisource is mostly made up of transcriptions of specific editions (not works), although, as @hornc points out, PG editions are a bit of a wild card because they are editions without any provenance information which are intentionally unassociated with existing editions.

Is the intention to create new digital editions for the transcriptions which are derived from the original edition? Or is the intention just to make the transcription some type of digital proxy for the original edition? Wikisource, as with most things wiki*, seems a bit ambiguous, but seems to lean towards the latter model (ie they include a link to the Wikidata entity for the transcribed edition, but don't model the transcription separately).

Complicating this is the fact that Wikidata is generally poor at modeling book metadata. It's not a huge deal because it doesn't have much in it, but some of the logical conflicts you'll see include:

works with ISBNs (which can only be associated with editions)
works with OL edition identifiers (often added by bots based on the ISBNs above)
works/editions with both OL work and edition identifiers (which is never legal)

Using or linking to one of these conflated entities extends the mess because the new connection usually requires (or implies) either an edition or a work, but not both.

I would suggest that Wikisource transcriptions should actually be modeled independently from the editions that they transcribe, but that would require the buy-in/support of both the Wikisource and Wikidata communities. Certainly if OL considers exact digital facsimiles from CreateSpace, etc, to be separate editions, then a transcription would definitely be considered a separate edition (but Open Library's data model isn't rich enough to connect the two derived editions together, as far as I know).

Has anyone looked at how many of the transcribed editions are NOT already in OpenLibrary? My assumption is that the vast majority of them are, so perhaps focusing on @hornc 's suggestion of closing the loop on IA/OL editions would be a good place to start.

For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it apart from what would be considered a book.

I would consider it a transcribed derivative of OL23268596M / ia:addresstomaryade00scot which was authored by Q16944048 (no associated OL ID in Wikidata, but appears to be OL6627737A). Given that OL & IA each have (separate) catalog records with the metadata and IA has scanned page images as well as OCR'd text, what is expected to be derived from Wikisource? Just a link or an alternative text version or some set of metadata or ... ? It might be tempting to infer equivalence of author IDs, but that seems risky absent other evidence than co-occurrence.

cdrini · 2024-12-05T17:32:09Z

I will leave this open as an epic targeting this work, to be closed when we do run a bulk import step. Currently the code has been merged and is undergoing some final verification/checks before deciding on next steps.

I will read through some of the comments/concerns raised here later today and respond; @pidgezero-one and I have largely discussed many of these concerns already :)

cdrini · 2024-12-05T18:59:47Z

On the identifier form, en:George_Bernard_Shaw: I understand the concerns, but I think you've reached the same conclusion that I reached, which is that this is the only identifier-like format that works across language-specific wikisource pages. It is also officially supported by their URL schemes: eg https://wikisource.org/wiki/en:The_Annotated_Strange_Case_of_Dr_Jekyll_and_Mr_Hyde . This decision was made in #8545 .
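As a rough illustration of the lang:title form, the sketch below splits an identifier on its first colon only (titles can themselves contain colons) and rebuilds the multilingual URL shown above. `WikisourceId`, `parse_wikisource_id` and `multilingual_url` are hypothetical names, not the ones used in the merged code.

```python
from typing import NamedTuple
from urllib.parse import quote


class WikisourceId(NamedTuple):
    lang: str
    title: str


def parse_wikisource_id(value: str) -> WikisourceId:
    """Split 'lang:title' on the first colon; the title may contain more colons."""
    lang, sep, title = value.partition(":")
    if not sep or not lang or not title:
        raise ValueError(f"not a lang:title identifier: {value!r}")
    return WikisourceId(lang, title)


def multilingual_url(ws_id: WikisourceId) -> str:
    """Build the wikisource.org/wiki/lang:title form of the link."""
    return f"https://wikisource.org/wiki/{ws_id.lang}:{quote(ws_id.title)}"
```

For example, `multilingual_url(parse_wikisource_id("en:George_Bernard_Shaw"))` yields a link on the language-neutral wikisource.org host.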

On "What is a book on WikiSource": This is a big concern; +1 @pidgezero-one 's response. Her extensive work in #9674 specifically targeting this problem is, we believe, sufficient for filtering out non-book items, like letters, press releases, decrees, etc, but the approach can always be improved. Regardless, the next step of this process (added a checkbox above) is to go through a random sample of the extracted books and manually verify the error rate.

On new editions/works/publishers: These were concerns that others had raised as well, and I raised them during our community call a few months ago to get more voices in the discussion. There we landed on creating separate editions for them, with the publisher set to "WikiSource" as well as the original publisher. This is in line with how we treat Project Gutenberg and Standard Ebooks. WikiSource is subtly more of a grey area than Project Gutenberg and Standard Ebooks, but I think the work required to create a WikiSource book, coupled with the difference between the original and the WikiSource book, is sufficient to warrant its own edition record. +1 to @tfmorris 's suggestion that in some future we would be able to link these editions as "derived".
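A rough sketch of what such a separate edition record might look like, listing both Wikisource and the original publisher. The field names (`title`, `authors`, `publishers`, `publish_date`, `source_records`, `identifiers`) follow my understanding of Open Library's import format, and the `wikisource:` source-record prefix is an assumption; neither is confirmed by this thread. `build_import_record` is a hypothetical helper.

```python
from typing import Optional


def build_import_record(
    lang: str,
    title: str,
    authors: list[str],
    original_publisher: Optional[str] = None,
    publish_date: Optional[str] = None,
) -> dict:
    """Assemble an edition record with Wikisource plus the original publisher."""
    ws_id = f"{lang}:{title.replace(' ', '_')}"
    publishers = ["Wikisource"]
    if original_publisher:
        publishers.append(original_publisher)
    record = {
        "title": title,
        "authors": [{"name": name} for name in authors],
        "publishers": publishers,
        # Assumed prefix; the merged script may use a different one.
        "source_records": [f"wikisource:{ws_id}"],
        "identifiers": {"wikisource": [ws_id]},
    }
    if publish_date:
        record["publish_date"] = publish_date
    return record
```

Which publication date is supplied here (the original edition's or the transcription's) is left open, since it affects how the record matches existing editions.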

On motivations of this feature: The motivation is twofold: (1) Have more good-quality books in Open Library that people can read in a wide variety of formats. WikiSource books have highly accurate EPUBs, PDFs, etc, which are great for readers looking to read on their phones or ereaders, and which are better than e.g. IA's auto-OCR'd books in the same formats. (2) Support a mission-aligned website doing great work in the book space. I love being able to drive traffic to WikiSource since they're a great project :) And it's open, so Open Library contributors can also go contribute on WikiSource if they so wish!

hornc · 2024-12-05T23:41:35Z

" we landed on creating separate editions for them"

That wasn't specified in the feature / issue description, so I wasn't particularly reviewing with that idea in mind, and it's not clear to me that the implementation that was merged does that either. It's also not clear that that is the best provider of value for the intended feature. Where and when was the appropriate time to comment on that? I think the development showed that it's a bit more complicated than that, so not only was the decision not documented, it was not really clarified, so it's still unclear what the implementation needs to do in all the likely encountered cases.

It seems like whether a new edition is created will depend on the details of the existing matching process and the input data provided, most relevantly on what publication date is supplied. In the absence of any concrete examples of what was expected on an import, or what happens on an import with the current script, it is hard to evaluate on a purely technical, mechanical level.

Without a clear use case that deals with a specific value to an Open Library patron leveraging Open Library and Wikisource, it is hard to evaluate whether the code satisfies the feature.

I don't think this is particularly fair to developers or reviewers.

Another potential comment I might have made on #9674 is that the feature looks like it could be implemented as an external script which makes use of existing import endpoints (JSON, or the new /bulk/ submission endpoint) -- The current implementation adds external modules to requirements.txt which will be installed on production and every development Docker container. Where and when should I raise those as potential things to think about? (module version maintenance overhead, security footprint concerns (recently a relevant issue!), and container bloat) I don't have a huge problem with this specific case, but I don't think it's sustainable to have every OL related script stored in /scripts/ expanding the production requirements for scripts that may or may not run in production. Given the lack of high level description on the feature, maybe I'm missing some context and there is only one way to do this.

pidgezero-one added the Needs: Lead, Needs: Triage, and Type: Feature Request labels Aug 1, 2024

cdrini assigned pidgezero-one Aug 1, 2024

This was referenced Aug 1, 2024

feat: import books from Wikisource #9674

Merged

Automatic Wikisource import pipeline #9683

Open

mekarpeles added this to the Sprint 2024-08 milestone Aug 2, 2024

mekarpeles modified the milestones: Sprint 2024-08, Sprint 2024-09, Sprint 2024-10 Aug 30, 2024

mekarpeles modified the milestones: Sprint 2024-10, 2024-11 Oct 25, 2024

github-actions bot added the Needs: Response label Nov 28, 2024

mekarpeles modified the milestones: Sprint 2024-11, Sprint 2024-12, 2024 (provisional, requires discussion) Dec 2, 2024

hornc mentioned this issue Dec 3, 2024

feat: consolidate author remote_ids and wikidata identifiers #10092

Draft

cdrini closed this as completed in #9674 Dec 5, 2024

cdrini reopened this Dec 5, 2024
