Import endpoint should allow for any (known) author identifiers #9448
Comments
The vast majority of strong identifiers on import come from MARC records. #7724 covers that use case.
@hornc Here's a concrete example (continuing discussion from PR). Let's say a volunteer wants to do a bulk import from LibriVox into Open Library. Their ImportRecords would likely look like:
{
"title": "A Christmas Carol",
"authors": [{ "name": "Charles Dickens" }],
"identifiers": {"librivox": [140]},
...
}
This is well and good, but we're giving the system very little info to work with for author resolution. It just has the name, "Charles Dickens". And LibriVox has more information: it has a unique identifier for every author as well! The expansion proposed here would be to allow specifying those identifiers, e.g.:
{
"title": "A Christmas Carol",
"authors": [{ "name": "Charles Dickens", "remote_ids": {"librivox": 91} }],
"identifiers": {"librivox": [140]},
...
}
This removes almost all ambiguity and makes it nearly impossible to accidentally create a new author record when one is not needed. That's the motivation for this feature: to reduce the number of accidentally created duplicate authors (an issue librarians have repeatedly reported), and to make our import pipeline more robust.
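To make the idea concrete, here is a minimal sketch of identifier-first author resolution. It is not Open Library's actual resolution code: the `resolve_author` function, the in-memory `EXISTING_AUTHORS` list, and the author keys are all hypothetical, and the only thing taken from the example above is the `remote_ids` shape on the author entry.

```python
from typing import Optional

# Hypothetical stand-in for existing Open Library author records.
# Keys and data are made up for illustration.
EXISTING_AUTHORS = [
    {"key": "/authors/EXAMPLE1A", "name": "Charles Dickens", "remote_ids": {"librivox": "91"}},
    {"key": "/authors/EXAMPLE2A", "name": "Charles Dickens", "remote_ids": {}},
]


def resolve_author(author: dict, existing: list) -> Optional[str]:
    """Return the key of the single matching author, or None if there is no
    unambiguous match (a real importer would then decide what to do next)."""
    # 1. Try each supplied identifier first; values are compared as strings
    #    so that 91 and "91" are treated as the same identifier.
    for source, value in author.get("remote_ids", {}).items():
        matches = [
            a for a in existing
            if source in a.get("remote_ids", {})
            and str(a["remote_ids"][source]) == str(value)
        ]
        if len(matches) == 1:
            return matches[0]["key"]
    # 2. Fall back to an exact name match, accepted only if it is unique.
    by_name = [a for a in existing if a["name"] == author["name"]]
    return by_name[0]["key"] if len(by_name) == 1 else None


with_id = {"name": "Charles Dickens", "remote_ids": {"librivox": 91}}
name_only = {"name": "Charles Dickens"}
print(resolve_author(with_id, EXISTING_AUTHORS))    # /authors/EXAMPLE1A
print(resolve_author(name_only, EXISTING_AUTHORS))  # None (two candidates share the name)
```

With only a name, the two same-named records cannot be told apart; with the LibriVox id, exactly one record matches.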
@cdrini The concrete example I requested on the PR was for one from the Wikisource context that appeared to be driving that specific development. I can imagine a synthetic example too, but I think in your example it is important for clear naming in the import schema for author corrected example:
Although now I'm looking at the description's proposed JSON schema, it seems like we need to clarify whether author That's why a clear description of value backed up by some useful real world examples will help clarify the feature, and establish the range of situations where author identifiers will be useful. I'm not trying to be pedantic, but clear examples of utility will help highlight the correct datatypes without much effort, otherwise we'll be arguing about them from different hypothetical backgrounds and assumptions. I'm not against movement in principle, I just like efficient movement in the right direction, and I think we have the resources and ability to be doing that.
I don’t see why this is important in an example demonstrating the value of this feature. This seems like something that would be important when reviewing a PR (e.g., like my comment on #9674). You clearly understood the example.
When I first wrote the issue, I didn’t realise that Authors’
I feel like @cdrini’s example was very clear. Part of my motivation for a lot of my work for pushing identifiers comes from wanting to run an import of LibriVox into Open Library. What about that example was not clear to you?
Oh, and also, internetarchive/openlibrary-client#419 already exists for implementing this issue in the import JSON schema, and @pidgezero-one did a much better job at fleshing the schema out than I did in my simple and hasty example.
Well, I don't see how JSON that had obvious errors demonstrates "the value of this feature". It's just a hypothetical implementation that skips any discussion or clarification of the underlying requirements. I didn't want the incorrect JSON to lie around and mislead people, which is why I "corrected" it. It should be deleted. I agree that specifying the implementation JSON is an implementation detail that belongs in a PR. I keep finding myself forced to comment on random implementation details to query why, because none of it is backed up with examples that demonstrate how they are useful to someone who is trying to do something valuable.
An actual example would be a real LibriVox record that someone really wants to import into Open Library (ideally backed up with why that adds value to anyone: why can't someone get an audiobook directly from LibriVox, or from archive.org where they are hosted?). I have strong opinions on LibriVox imports. I would hope a properly thought out LibriVox "strategic partnership" would think about what Open Library users expect in terms of LibriVox records, and consider how many LibriVox items are already in OL, how they got there, and how often and how LibriVox adds new works. I'd consider the ability to import random snapshots of LibriVox from data dumps and their website to already be in place. Figuring out what users of Open Library expect from LibriVox and using that as a motivator for a feature would be good. It'd hopefully uncover timeliness expectations. Considering a member of the LibriVox team as a 'user' and thinking in terms of what they might want as part of their publishing process would be nice too.
In the recent past I have generated MARC records for LibriVox items hosted on archive.org (i.e. all of them) and shared them with archive.org library partners. Storing those MARC records, and perhaps making them available via OL, is one possibility (LibriVox has had an aspirational goal to provide MARC records since the beginning, but it hasn't moved, since no one has really needed them. I did, so I made some). I also fixed metadata issues in the archive.org items where language codes were missing, which had been a bug raised 7 or so years ago. As well as correcting the data, I submitted a PR to the LibriVox project to ensure the metadata stayed in sync. I think Open Library could be brought closer into that IA/LV data relationship as part of the LibriVox publication process.
A properly planned LibriVox feature / epic that explored the actual state of LibriVox and where OL could provide value would be great to drive this, not a single par-baked, assumption-drenched JSON example of a pre-decided implementation that misses the bulk of potential value. "As an OL developer, in order to import some data that exists, I want another partner-specific endpoint to be able to say I imported the data up to an arbitrary time stamp, without reference to which data and what happens with it once it is in." is not the best user story.
Sorry, I am feeling a bit frustrated, but
Is just doubling down on running straight to a solution without specifying requirements. It makes it very hard to evaluate whether it is fit for purpose. A single JSON example does not demonstrate utility. I have to infer something about why it exists and what it could be used for. The other PR was 67 comments into an implementation of something before I started raising what the feature was for. There are probably still various opinions on what that conversation was about. My takeaway was the original problem was along the lines of : I see the hypothetical argument that author identifiers would resolve this, but counter with the real fact that these were coming from Wikidata where dates and identifiers were available, and for the basic use case of importing, an author's dates are actually more immediately useful to an OL patron / librarian for disambiguating authors than a Wikidata or VIAF id. When all are present, sure, let's import them all. I do think that's the real value of importing identifiers for this feature: "As an OL patron/librarian, in order to assist with disambiguating authors, in addition to their dates, I would like to be able to reference (clickable) authority identifiers, imported if they are available from reliable sources." Expanding "Open Library's Author Matching Pipeline" (a thing that doesn't exist) is not a good reason for its own sake.
Howdy folks; please let's keep the tone productive; this is getting a little heated! On API design: I think here or on a PR are both ok times to bring up thoughts on the API design.
On motivations: I think the rest of the concerns centre around motivations of the feature. Note some definitions: Author Resolution is the process of determining which Open Library author record/key best matches the author data present in the import record (namely this fn). I think you're looking for an example of the nature "Here is a case where the current author resolution algorithm failed with just author name/dates, and here is that same example now working with an author resolution algorithm that supports author identifiers". I think it would be wonderful if we had that sort of infrastructure set up to do that kind of testing, but alas we don't! The method doesn't even have a unit test :') . If this were a big, complicated project with lots of concerns about the cost of implementation vs the value, I would 100% agree with you that we need more evidence of the effectiveness. But this is a pretty trivial change to this method, so the implementation cost is practically 0. And the value, although not measured in the type of example I believe you're asking of me (which I agree would be more powerful), is pretty clear based on the LibriVox example here, and on Stef's experiences while working on the Wikisource imports. Using dates would be useful, but isn't as unambiguous as matching identifiers directly, and has extra complications. But I will see if I can spot such an example while validating Stef's Wikisource JSONL.
On bulk LibriVox import approach: I've run out of time to respond to this, and since it's kind of adjacent to the core of this issue, I'm punting it for the moment.
Problem
A clear and concise description of what you want to happen
Being able to include any (known to OL) identifiers with authors when importing into Open Library. (Note: This is a superset of #9411 which is only concerned with Open Library identifiers. Making a separate issue due to additional considerations (see later section).)
Expected behaviour / screenshots (ex: Figma design screenshots for UI feature)
When generating a JSON blurb for import into OL, it should be possible to provide known author identifiers in it.
Additional Context
When importing into OL you might have a variety of identifiers available that might assist in pinpointing the correct Author (if they exist in OL). E.g., if you import from Amazon, you will have the Amazon author id in addition to the edition ASIN. If you import from LibriVox, you will have the LibriVox author id. Right now it is not possible to provide these to the import pipeline to help with identifying authors, but it could be a great help.
Proposal & Constraints
What is the proposed solution / implementation?
Changing the JSON schema to allow for identifiers for authors, perhaps something like,
and then of course have the importer pipeline actually recognise the author objects’ identifier(s) and use it for matching against existing OL authors.
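As a rough illustration of the schema side of this, the sketch below validates one author entry from an import record. It assumes the `remote_ids` field name used in the comments above; the `KNOWN_AUTHOR_IDENTIFIERS` set and the `validate_author` helper are hypothetical and do not reflect the actual import schema or Open Library's identifier configuration.

```python
# Hypothetical set of author identifier types "known to OL"; the real list
# would come from Open Library's own configuration.
KNOWN_AUTHOR_IDENTIFIERS = {"librivox", "wikidata", "viaf", "amazon"}


def validate_author(author: dict) -> list:
    """Return a list of problems found in one author entry (empty if valid)."""
    problems = []
    name = author.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("author needs a non-empty 'name'")
    # 'remote_ids' is optional, so name-only authors stay valid.
    remote_ids = author.get("remote_ids", {})
    if not isinstance(remote_ids, dict):
        return problems + ["'remote_ids' must be an object"]
    for source, value in remote_ids.items():
        if source not in KNOWN_AUTHOR_IDENTIFIERS:
            problems.append(f"unknown identifier type: {source!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, (str, int)):
            problems.append(f"identifier {source!r} must be a string or integer")
    return problems


print(validate_author({"name": "Charles Dickens", "remote_ids": {"librivox": 91}}))
# []
print(validate_author({"name": "Charles Dickens", "remote_ids": {"myspace": "cd"}}))
# ["unknown identifier type: 'myspace'"]
```

Rejecting unknown identifier types up front keeps the "(known)" part of the issue title explicit, rather than silently storing identifiers the importer cannot match against.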
Is there a precedent of this approach succeeding elsewhere?
Several MusicBrainz importer scripts use identifiers from import sources to match up identifiers in MusicBrainz. E.g., a-tisket cross-references artist identifiers from iTunes, Deezer, and Spotify with ones known in MusicBrainz to ease the import into MusicBrainz by assigning artists to already existing ones. The Discogs importer userscript does the same, but also does this for Release Groups and Labels.
Granted, the import flow for MusicBrainz is quite different from Open Library, but I think it still shows how being able to look up an import source’s own identifiers can greatly help in matching against the target dataset.
Which suggestions or requirements should be considered for how feature needs to appear or be implemented?
Some considerations:
And I’m sure there are plenty of other edge cases, but this is clearly more involved than just allowing OL ids (#9411)