-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve author name resolution #9003
Improve author name resolution #9003
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! This is going to make such a big difference in avoiding duplicate authors being created!
We should add a grafana metric every time an author is created, cause I want to see a drop in a graph so I can tell people about it. We talked and can't find a simple hook for this infogami (we didn't look that hard), so I'll create a temporary script that walks back through ImportBot's edits. That should suffice for this.
d1ceaf3
to
3a56c1f
Compare
Related: internetarchive#7349 Honorifics (e.g. the French `M.`) can throw off the importer, and librarians will remove them manually. Instead, this commit removes them at the point of import, at least for some titles in English, French, Spanish, and German. There is also a way to create exceptions, such as "Dr. Seuss".
This commit changes the author name resolution order to be: 1. match on name + birth date + death date; 2. match on alternate_names + birth date + death date; 3. match on surname + birth date + death date. It does this by relying on an Infobase change to change the `op` associated with `~` to use `ILIKE` rather than `LIKE`. internetarchive/infogami#217 It also updates `mock_site()` to more accurately approximate the `LIKE` and (and now`ILIKE`) query done by PostgreSQL.
3a56c1f
to
9c87553
Compare
2135012
to
467b795
Compare
467b795
to
30c6dc5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Once the infogami version bump is included we can merge 👍 I didn't test this code; the unit tests seem pretty thorough.
Partially closes #7349
Related: internetarchive/infogami#217
Feature.
Technical
This PR does three things:
This PR would change the author name resolution order as follows:
alternate_names
+ birth date + death date; andIt does this in part by relying on an Infobase change to change the
op
associated with
~
to useILIKE
rather thanLIKE
. See internetarchive/infogami#217It also updates
mock_site()
to more accurately approximate theLIKE
(and now
ILIKE
) query done by PostgreSQL.I should note this does not drop honorifics at the end of an author name, which was part of the problem in #7349. My rationale was this case would get picked up by
alternate_names
(once an alternate name has been added..), and only operating at the start would of a name would be less likely to make an unintended removal. However, it would be easy to remove honorifics at the end if that is also desired.This PR also does not address punctuation removal, which was called for in #7349. That's partially because I simply wasn't sure if I should, as this could result in overhead or required changes to the database structure, depending on the manner in which we do this.
However, one strategy that could use indexing and require no changes to the DB would be to use PostgreSQL
s
_single character wildcard with
LIKE/ILIKE
. That would allow a search such asWilliam H_ Brewer
to match an existingWilliam H. Brewer
, though it would not work the other way around.I don't think it would add that much complexity to enable this using a new operator (e.g.
~_
or something, instead of~
), but I wanted to check before pursing this at all.Testing
NOTE: testing, or at least the case-insensitive aspect of it, relies on the one line change found in internetarchive/infogami#217. Without that merged, tests (against PostgreSQL) relating to case-insensitivity will fail. The unit tests, however, will pass, as they rely on an updated
mock_infobase
, as frustrating as that may be.The tests should lay out a fairly comprehensive listing of the cases covered. However, those do rely on
mock_site()
, which, try as I might, may not fully replicateILIKE
.The last commit adds a
/api/author
endpoint solely for testing purposes. It is NOT intended for merging.The idea there is just to make it easier to test the author-specific part of
build_query
without having to actually import things -- one can simply see what the result would be.Sample use:
Stakeholders
@tfmorris
@seabelis
@cdrini