Import endpoint should allow for any (known) author identifiers #9448

Open
Freso opened this issue Jun 18, 2024 · 7 comments · May be fixed by internetarchive/openlibrary-client#419 or #10110
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Response Issues which require feedback from lead python Pull requests that update Python code Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@Freso
Contributor

Freso commented Jun 18, 2024

Problem

A clear and concise description of what you want to happen

Being able to include any (known to OL) identifiers with authors when importing into Open Library. (Note: This is a superset of #9411 which is only concerned with Open Library identifiers. Making a separate issue due to additional considerations (see later section).)

Expected behaviour / screenshots (ex: Figma design screenshots for UI feature)

When generating a JSON blurb for import into OL, it should be possible to provide known author identifiers in it.

Additional Context

When importing into OL you might have a variety of identifiers available that might assist in pinpointing the correct Author (if they exist in OL). E.g., if you import from Amazon, you will have the Amazon author id in addition to the edition ASIN. If you import from LibriVox, you will have the LibriVox author id. Right now it is not possible to provide these to the import pipeline to help with identifying authors, but it could be a great help.

Proposal & Constraints

What is the proposed solution / implementation?

Changing the JSON schema to allow for identifiers for authors, perhaps something like,

diff --git a/olclient/schemata/import.schema.json b/olclient/schemata/import.schema.json
index 3f00e90..8603f02 100644
--- a/olclient/schemata/import.schema.json
+++ b/olclient/schemata/import.schema.json
@@ -163,6 +163,17 @@
 	"title": {
 	  "type": "string",
 	  "examples": ["duc d'Otrante"]
+	},
+	"identifiers": {
+	  "type": "object",
+	  "examples": [
+            {
+	       "librivox": "2278"
+	    },
+	    {
+	       "project_runeberg": "nexo"
+	    }
+	  ]
 	}
       }
     },

and then of course have the importer pipeline actually recognise the author objects’ identifier(s) and use it for matching against existing OL authors.
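
To make the shape in the diff above concrete, here is a minimal Python sketch that hand-checks a single author entry. It is not the import endpoint's validator: the function name `check_author_entry` is hypothetical, and it assumes identifier values are plain strings, as in the diff's examples.

```python
from typing import Any


def check_author_entry(author: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one import-record author entry.

    Hypothetical helper mirroring the proposed schema fragment, where
    "identifiers" is an optional object mapping an identifier type to a value.
    Assumption: every value is a non-empty string.
    """
    problems: list[str] = []
    if not isinstance(author.get("name"), str) or not author["name"].strip():
        problems.append("name must be a non-empty string")
    identifiers = author.get("identifiers", {})
    if not isinstance(identifiers, dict):
        return problems + ["identifiers must be an object"]
    for id_type, value in identifiers.items():
        if not isinstance(value, str) or not value:
            problems.append(f"identifiers[{id_type!r}] must be a non-empty string")
    return problems


# The author name is a placeholder; the identifier comes from the diff above.
print(check_author_entry({"name": "Example Author", "identifiers": {"librivox": "2278"}}))
```

An empty list means the entry has the proposed shape; it says nothing about whether the identifier is actually known to Open Library.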

Is there a precedent of this approach succeeding elsewhere?

Several MusicBrainz importer scripts use identifiers from import sources to match up identifiers in MusicBrainz. E.g., a-tisket cross-references artist identifiers from iTunes, Deezer, and Spotify with ones known in MusicBrainz to ease the import into MusicBrainz by assigning artists to already existing ones. The Discogs importer userscript does the same, but also does this for Release Groups and Labels.

Granted, the import flow for MusicBrainz is quite different from Open Library, but I think it still shows how being able to look up an import source’s own identifiers can greatly help in matching against the target dataset.

Which suggestions or requirements should be considered for how feature needs to appear or be implemented?

Some considerations:

  • What happens when an author otherwise perfectly matches an existing author, but…
    • the provided identifier isn’t/identifiers aren’t already known?
      • author gets matched regardless?
        • and the identifier is/identifiers are discarded
        • and the identifier is added to the matched author
      • a new author gets created with the identifier(s) attached?
      • import fails
    • provided identifier(s) not known but author already has identifier(s) of the same type(s) (e.g., LibriVox id provided, but matched author already has a different LibriVox id)
    • provided identifier(s) match(es) a different author
    • provided identifier(s) match(es) several different authors (!)
  • Identifier(s) match(es) existing Author but any other provided data (name, birth/death dates, …) do not

And I’m sure there are plenty of other edge cases, but this is clearly more involved than just allowing OL ids (#9411)
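
The identifier-related cases in the list above can be told apart mechanically, even though what to do in each case is still an open policy question. The following Python sketch only classifies; the names `Outcome` and `classify` and the author keys are hypothetical. It does not cover the last bullet, where the identifier matches but the name or dates do not.

```python
from enum import Enum


class Outcome(Enum):
    NO_IDENTIFIER_KNOWN = "identifier not attached to any author"
    CONFLICTING_VALUE = "candidate has a different value for the same identifier type"
    MATCHES_OTHER_AUTHOR = "identifier belongs to a different author"
    MATCHES_SEVERAL_AUTHORS = "identifier belongs to more than one author"
    CONSISTENT = "identifier belongs to the candidate"


def classify(provided: dict[str, str], candidate_key: str,
             authors: dict[str, dict[str, str]]) -> Outcome:
    """Classify how provided identifiers relate to a name-matched candidate.

    `authors` maps an author key to that author's identifiers (type -> value).
    """
    # Every author that already carries one of the provided identifiers.
    owners = {
        key
        for key, ids in authors.items()
        for id_type, value in provided.items()
        if ids.get(id_type) == value
    }
    if len(owners) > 1:
        return Outcome.MATCHES_SEVERAL_AUTHORS
    if owners == {candidate_key}:
        return Outcome.CONSISTENT
    if owners:
        return Outcome.MATCHES_OTHER_AUTHOR
    # Nobody owns the value; check whether the candidate has the same type.
    candidate_ids = authors.get(candidate_key, {})
    if any(id_type in candidate_ids for id_type in provided):
        return Outcome.CONFLICTING_VALUE
    return Outcome.NO_IDENTIFIER_KNOWN
```

Each outcome would then map to one of the behaviours listed above (match anyway, attach the identifier, create a new author, or fail the import).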


@Freso Freso added Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels Jun 18, 2024
@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Jun 24, 2024
@scottbarnes scottbarnes added Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] python Pull requests that update Python code and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Jun 28, 2024
@tfmorris
Contributor

The vast majority of strong identifiers on import come from MARC records. #7724 covers that use case.

@cdrini
Collaborator

cdrini commented Dec 5, 2024

@hornc Here's a concrete example (continuing discussion from PR). Let's say a volunteer wants to do a bulk import from LibriVox into Open Library. Their ImportRecords would likely look like:

{
    "title": "A Christmas Carol",
    "authors": [{ "name": "Charles Dickens" }],
    "identifiers": {"librivox": [140]},
    ...
}

This is well and good, but we're giving the system very little info to work with for author resolution. It just has the name, "Charles Dickens". And LibriVox has more information: it has a unique identifier for every author as well! The expansion proposed here would be to allow specifying those identifiers, e.g.:

{
    "title": "A Christmas Carol",
    "authors": [{ "name": "Charles Dickens", "remote_ids": {"librivox": 91} }],
    "identifiers": {"librivox": [140]},
    ...
}

This removes almost all ambiguity and makes it nearly impossible to accidentally create a new author record when one is not needed. That's the motivation for this feature. To reduce the number of accidentally created duplicate authors (which is an issue librarians have repeatedly reported), and to make our import pipeline more robust.
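
As a rough illustration of that motivation, the Python sketch below resolves an author by identifier first and falls back to the name only when no identifier matches. It is not the existing Open Library resolution code: `resolve_author`, the in-memory `known_authors` list, and the author keys are all hypothetical, and the fallback is just one of the possible policies.

```python
from typing import Optional


def resolve_author(record_author: dict, known_authors: list[dict]) -> Optional[str]:
    """Return the key of the single matching author, or None if unknown/ambiguous.

    Assumption: an identifier match outranks a name match.
    """
    wanted = {(t, str(v)) for t, v in record_author.get("remote_ids", {}).items()}
    by_id = [
        a["key"] for a in known_authors
        if wanted & {(t, str(v)) for t, v in a.get("remote_ids", {}).items()}
    ]
    if len(by_id) == 1:
        return by_id[0]
    if by_id:
        return None  # the identifiers point at more than one author
    by_name = [a["key"] for a in known_authors if a["name"] == record_author["name"]]
    return by_name[0] if len(by_name) == 1 else None


# Two made-up records sharing a name; only one carries the LibriVox id.
known_authors = [
    {"key": "/authors/EXAMPLE1", "name": "Charles Dickens", "remote_ids": {"librivox": "91"}},
    {"key": "/authors/EXAMPLE2", "name": "Charles Dickens"},
]
print(resolve_author({"name": "Charles Dickens"}, known_authors))
print(resolve_author({"name": "Charles Dickens", "remote_ids": {"librivox": 91}}, known_authors))
```

With the name alone the two made-up records are indistinguishable, so the function declines to pick one; with the LibriVox identifier it returns the record that carries it.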

@hornc
Collaborator

hornc commented Dec 6, 2024

@cdrini The concrete example I requested on the PR was for one from the Wikisource context that appeared to be driving that specific development. I can imagine a synthetic example too, but I think in your example it is important for clear naming in the import schema for author remote_ids to be identifiers, and I think that has already been documented in the JSON schema in the original issue description above, and a few other locations where this has come up.

corrected example:

{
    "title": "A Christmas Carol",
    "authors": [{ "name": "Charles Dickens", "identifiers": {"librivox":  "91"} }],
    "identifiers": {"librivox": ["140"]},
    ...
}

Although now I'm looking at the description's proposed JSON schema, it seems like we need to clarify whether author identifiers is expected to be a list or an object.... Identifiers should be a string to be applicable to all kinds of ids that exist, and there's probably an open question on whether Author identifiers should be strings or list of strings (book / edition identifiers are lists of strings -- I don't know why that is, it might make sense).

That's why a clear description of value backed up by some useful real world examples will help clarify the feature, and establish the range of situations where author identifiers will be useful.

I'm not trying to be pedantic, but clear examples of utility will help highlight the correct datatypes without much effort, otherwise we'll be arguing about them from different hypothetical backgrounds and assumptions. I'm not against movement in principle, I just like efficient movement in the right direction, and I think we have the resources and ability to be doing that.

@Freso
Contributor Author

Freso commented Dec 6, 2024

but I think in your example it is important for clear naming in the import schema for author remote_ids to be identifiers

I don’t see why this is important in an example demonstrating the value of this feature. This seems like something that would be important when reviewing a PR (e.g., like my comment on #9674). You clearly understood the example.

Although now I'm looking at the description's proposed JSON schema, it seems like we need to clarify whether author identifiers is expected to be a list or an object

When I first wrote the issue, I didn’t realise that Authors’ remote_ids could only have a single value (string) per identifier, where Editions and Works have an array of strings per identifier, so the proposed JSON is not applicable—but this is, again, an implementation detail that should be handled in a PR, and not something that has a bearing on the issue’s validity, IMHO.
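
The difference described here (one string per identifier type on authors, a list of strings on editions and works) can be sketched as two small coercion helpers in Python. The names `as_id_string` and `as_edition_ids` are hypothetical, and accepting integers such as `91` is an assumption based on the examples earlier in the thread.

```python
def as_id_string(value) -> str:
    """Coerce one identifier value to a string, e.g. 91 -> "91".

    A one-element list is unwrapped; longer lists are rejected because an
    author holds a single value per identifier type.
    """
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError("expected exactly one value per identifier type")
        value = value[0]
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int)):
        raise TypeError(f"unsupported identifier value: {value!r}")
    return str(value).strip()


def as_edition_ids(value) -> list[str]:
    """Coerce an edition/work identifier value to a list of strings."""
    values = value if isinstance(value, (list, tuple)) else [value]
    return [as_id_string(v) for v in values]


print(as_id_string(91), as_edition_ids([140]))
```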

clear examples of utility

I feel like @cdrini’s example was very clear. Part of my motivation for a lot of my work for pushing identifiers comes from wanting to run an import of LibriVox into Open Library. What about that example was not clear to you?

@Freso
Contributor Author

Freso commented Dec 6, 2024

Oh, and also, internetarchive/openlibrary-client#419 already exists for implementing this issue in the import JSON schema, and @pidgezero-one did a much better job at fleshing the schema out than I did in my simple and hasty example.

@hornc
Collaborator

hornc commented Dec 7, 2024

@Freso

I don’t see why this is important in an example demonstrating the value of this feature. This seems like something that would be important when reviewing a PR (e.g., like #9674 (comment)). You clearly understood the example.

Well, I don't see how that JSON that had obvious errors demonstrates "the value of this feature". It's just a hypothetical implementation that skips any discussion or clarification about the underlying requirements. I didn't want the incorrect JSON to lie around and mislead people, which is why I "corrected" it. It should be deleted. I agree that specifying the implementation JSON is an implementation detail that belongs in a PR. I keep finding myself forced to comment on random implementation details to query why, because none of it is backed up with examples that demonstrate how they are useful to someone who is trying to do something valuable.

An actual example would be a real LibriVox record that someone really wants to import into Open Library (ideally backed up with why that adds value to anyone -- why can't someone get an audio book directly from LibriVox, or archive.org where they are hosted?).

I have strong opinions on LibriVox imports -- I would hope a properly thought out LibriVox "strategic partnership" would think about what Open Library users expect in terms of LibriVox records, and consider how many LibriVox items are already in OL, how they got there, and how often and how LibriVox adds new works.

I'd consider the ability to import random snapshots of LibriVox from datadumps and their website to already be in place.

Figuring out what users of Open Library expect from LibriVox and using that as a motivator for a feature would be good. It'd hopefully uncover timeliness expectations. Considering a member of the LibriVox team as a 'user' and thinking in terms of what they might want as part of their publishing process would be nice too.

In the recent past I have generated MARC records for LibriVox items hosted on archive.org (i.e. all of them) and shared them with archive.org library partners. Storing those MARC records, and perhaps making them available via OL, is one possibility (LibriVox has had an aspirational goal to provide MARC records since the beginning, but it hasn't moved, since no one has really needed them? I did, so I made some). I also fixed metadata issues in the archive.org items where language codes were missing, which had been a bug raised 7 or so years ago. As well as correcting the data, I submitted a PR to the LibriVox project to ensure the metadata stayed in sync. I think Open Library could be brought closer into that IA/LV data relationship as part of the LibriVox publication process.

A properly planned LibriVox feature / epic that explored the actual state of LibriVox and where OL could provide value would be great to drive this, not a single part-baked, assumption-drenched JSON example of a pre-decided implementation that misses the bulk of potential value.

"As an OL developer, in order to import some data that exists, I want another partner specific endpoint to be able to say I imported the data up to arbitrary time stamp, without reference to which data and what happens with it once it is in." is not the best user story.

Sorry, I am feeling a bit frustrated, but

clear examples of utility
I feel like @cdrini’s example was very clear.

is just doubling down on running straight to a solution without specifying requirements. It makes it very hard to evaluate whether it is fit for purpose. A single JSON example does not demonstrate utility. I have to infer something about why it exists and what it could be used for. The other PR was 67 comments into an implementation of something before I started raising what the feature was for. There are probably still various opinions on what that conversation was about.

My takeaway was the original problem was along the lines of:
"Lewis Caroll; Raging Monster Trucks of Ohio Annual 2005"
wasn't being matched to the author Lewis Carroll 1832 - 1898, which I would argue is by design (equiv. to importing Hunting of the Snark in the OL dev environment).

I see the hypothetical argument that author identifiers would resolve this, but counter with the real fact that these were coming from Wikidata, where dates and identifiers were available, and for the basic use case of importing, an author's dates are actually more immediately useful to an OL patron / librarian for disambiguating authors than a Wikidata or VIAF id. When all are present, sure, let's import them all.

I do think that's the real value of importing identifiers for this feature:

"As an OL patron/librarian, in order to assist with disambiguating authors, in addition to their dates, I would like to be able to reference (clickable) authority identifiers, imported if they are available from reliable sources."

Expanding "Open Library's Author Matching Pipeline" (a thing that doesn't exist) for its own sake is not a good reason.

@cdrini
Collaborator

cdrini commented Dec 10, 2024

Howdy folks; please let's keep the tone productive; this is getting a little heated!

On API design: I think here or on a PR are both ok times to bring up thoughts on the API design.

  • remote_ids vs identifiers: I think we have to support both actually, since our author records use remote_ids. This was an oversight when the author data model was expanded to house identifiers. But we have to support it now to avoid backward incompatible changes. In the interest of consistency, I think remote_ids will be less confusing as a starting step. I would support adding identifiers as an "alias" in a future issue.
  • type, string vs list[string]: For the same reasons, for consistency I would say start with string to match remote_ids as a first stab? But I do think this is confusing, and we should support lists as well in a future issue. (Or in this issue if implementer thinks it easy)
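
A possible reading of the first bullet, sketched in Python: take `remote_ids` as the primary field and accept `identifiers` as an alias, with a single string per type as the starting point. The function name `author_remote_ids` is hypothetical, and rejecting disagreeing values is an assumption rather than anything decided in this thread.

```python
def author_remote_ids(author: dict) -> dict[str, str]:
    """Collect an import author's identifiers from `remote_ids` or its alias.

    Hypothetical sketch: `identifiers` is treated as an alias of `remote_ids`,
    values are single strings (integers are converted), and the two fields
    must not disagree about the same identifier type.
    """
    merged: dict[str, str] = {}
    for field in ("remote_ids", "identifiers"):
        for id_type, value in (author.get(field) or {}).items():
            if isinstance(value, bool) or not isinstance(value, (str, int)):
                raise TypeError(f"{id_type}: expected a single string value")
            value = str(value)
            # setdefault keeps the first value seen and returns it.
            if merged.setdefault(id_type, value) != value:
                raise ValueError(f"{id_type}: remote_ids and identifiers disagree")
    return merged


print(author_remote_ids({"name": "Charles Dickens", "identifiers": {"librivox": "91"}}))
```

Supporting lists later would mean relaxing the type check here without changing callers that already pass a single string.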

On motivations: I think the rest of the concerns centre around motivations of the feature. Note some definitions: Author Resolution is the process of determining which Open Library author record/key best matches the author data present in the import record (namely this fn). I think you're looking for an example of the nature "Here is a case where the current author resolution algorithm failed with just author name/dates, and here is that same example now working with an author resolution algorithm that supports author identifiers". I think it would be wonderful if we had that sort of infrastructure set up to do that kind of testing, but alas we don't! The method doesn't even have a unit test :') . If this was a big, complicated project with lots of concerns about the cost of implementation vs the value, I would 100% agree with you that we need more evidence of the effectiveness. But this is a pretty trivial change to this method, so the implementation cost is practically 0. And the value, although not measured in the type of example I believe you're asking of me (which I agree would be more powerful), is pretty clear based on the LibriVox example here, and on Stef's experiences while working on the WikiSource imports. Using dates would be useful, but isn't as unambiguous as matching identifiers directly, and has extra complications. But I will see if I can spot such an example while validating Stef's WikiSource JSONL.

On bulk LibriVox import approach: I've run out of time to respond to this, and since it's kind of adjacent to the core of this issue, I'm punting it for the moment.

6 participants