Allow bot changes to be indexed #5617

Merged
merged 1 commit into from
Feb 2, 2022

hornc
Collaborator

@hornc hornc commented Sep 1, 2021

Closes #5700

Currently bot edits are being excluded from the regular ongoing Solr index updates. This PR removes the deliberate bot exclusion.
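To illustrate what the exclusion amounts to, here is a minimal sketch under stated assumptions. It is not the actual solr-updater code: the record shape (`author`, `keys`), the "account key ends in `Bot`" check, and the `exclude_bots` parameter are all hypothetical names chosen for the example.

```python
from typing import Iterable, Iterator


def is_bot_edit(record: dict) -> bool:
    """Hypothetical check: treat accounts whose key ends in 'Bot' as bots."""
    author = record.get("author") or ""
    return author.endswith("Bot")


def keys_to_reindex(records: Iterable[dict], exclude_bots: bool = False) -> Iterator[str]:
    """Yield the document keys touched by each edit record.

    With exclude_bots=True, bot edits are dropped and never reach the index.
    """
    for record in records:
        if exclude_bots and is_bot_edit(record):
            continue
        yield from record.get("keys", [])


# Illustrative edit log; the account names are made up for the example.
log = [
    {"author": "/people/ExampleImportBot", "keys": ["/works/OL24955945W", "/authors/OL9405192A"]},
    {"author": "/people/some_librarian", "keys": ["/works/OL16753796W"]},
]

with_flag = list(keys_to_reindex(log, exclude_bots=True))
without_flag = list(keys_to_reindex(log, exclude_bots=False))
```

With the flag set, only the librarian's work key is queued, so the imported author and work never get indexed; without it, all three keys are queued.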

Bots are by far the main source of new imports into Open Library. Excluding their edits from the index means that newly added works and editions are not discoverable, and every newly added author gets an empty page listing 0 works.

e.g. https://openlibrary.org/authors/OL9405192A/Edward_L._Parker (which my bot recently added from an archive.org record)

It should list the work https://openlibrary.org/works/OL24955945W; instead, the author page looks like a junk, unlinked record. This appears to be the default behaviour for newly imported authors, and it won't be resolved without manual reindexing. I thought issues had been raised about 0-work authors in the past, but I can't locate one right now.

This is just one example of many, and it appears to be the standard behaviour while the exclude-bots flag is set.

Technical

Since the Solr 8 update I have noticed manual edits being reflected pretty promptly (in less than the previously stated 15 minutes, although I wasn't timing accurately) -- it seems noticeably faster and better than before, so Solr 8 has been a very good improvement. I hope the load from bot edits won't cause a performance problem. OL needs to keep current with more books though, so it needs to be able to index its content, which the bots are providing.

There are also multiple clean up and fix tasks performed by bots that aren't being reflected in the search index (I noticed this problem after merging 10K editions and works from a librarian request bot task) -- the changes were written but there was no obvious effect.

As far as performance goes, I manually submitted the 10K records for reindexing through the admin interface in successive batches of 1000 (copying and pasting through the web UI). The whole lot was picked up successfully and the expected index update occurred promptly, so I think Solr can handle large batches like this as they happen.
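The batching itself is simple to reproduce. The sketch below only shows how 10K keys split into batches of 1000; the `batched` helper and the generated keys are illustrative, and how each batch is actually submitted (here it was pasted into the admin web UI) is left out.

```python
from typing import Iterator, List, Sequence


def batched(keys: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Split keys into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


# Placeholder keys standing in for the 10K merged works and editions.
keys = [f"/works/OL{n}W" for n in range(1, 10_001)]
batches = list(batched(keys))
```

Each element of `batches` can then be joined with newlines and submitted as one reindex request.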

Testing

Screenshot

Stakeholders

@cdrini
@mekarpeles

@hornc hornc requested a review from cdrini September 1, 2021 22:28
@tfmorris
Contributor

tfmorris commented Sep 2, 2021

Independent of the indexing, why is this bot creating duplicate author records and duplicate work records instead of using the ones that exist? That doesn't seem like a "Clean Up."

https://openlibrary.org/authors/OL2420469A
https://openlibrary.org/works/OL16753796W

@hornc
Collaborator Author

hornc commented Sep 2, 2021

@tfmorris I believe the technical answer to the question is that the author wasn't matched because the existing author had dates, and the matching code does not assume that an undated name from a source is the same as one with dates.
The titles possibly didn't match because the original has the subtitle unsplit, and according to the algorithm the authors did not match, which may be enough to treat the records as different in the absence of any identical bibliographic ids in the source data.
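A minimal sketch of the date rule described above, to make the behaviour concrete. This is not the actual Open Library matching code: the `Author` class, the `same_author` function, and the name normalisation are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None

    @property
    def has_dates(self) -> bool:
        return bool(self.birth_date or self.death_date)


def same_author(incoming: Author, existing: Author) -> bool:
    """Conservative match: names must agree, and so must the dates.

    An undated name is never assumed to be the same person as a dated one.
    """
    if incoming.name.strip().lower() != existing.name.strip().lower():
        return False
    if incoming.has_dates != existing.has_dates:
        # One side dated, the other not: err on the side of not conflating.
        return False
    return (incoming.birth_date, incoming.death_date) == (
        existing.birth_date,
        existing.death_date,
    )


# Made-up records: the same name, with and without dates.
undated = Author("Jane Example")
dated = Author("Jane Example", birth_date="1900", death_date="1980")
```

Here `same_author(undated, dated)` is `False`, so an import carrying only the undated name would create a new author record rather than attach to the dated one.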

An earlier edit of the bot on https://openlibrary.org/works/OL16753796W did match on the existing archive.org id, which is why it was associated with the existing records (author/edition/work).

The author date matching behaviour is how the current code has worked since the earliest days of Open Library. It's possibly not ideal, but it errs on the side of not conflating undated names with dated ones... and is why there are always authors to merge.

Most of the subtleties of how imports actually behave have been obscured by out-of-date and gappy Solr indexes and plenty of other issues. Solr 8 should improve the situation so we can review and adjust specific issues with matching etc. Deliberately not indexing isn't going to help the data or our ability to audit it.

I was worried that this lack of indexing may hamper any attempts at matching on existing author and title fields.

@jimchamp jimchamp added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Jan 24, 2022
@cdrini cdrini added this to the Active Sprint milestone Jan 24, 2022
@mekarpeles mekarpeles removed this from the Active Sprint milestone Jan 24, 2022
@cdrini cdrini added the Patch Deployed This PR has been deployed to production independently, outside of the regular deploy cycle. label Jan 31, 2022
@cdrini
Collaborator

cdrini commented Jan 31, 2022

This is patch deployed; if things look good I'll merge and make it permanent! Thanks @hornc !

I'm monitoring now for sluggishness, but let me know if you notice anything! Also if you have a big batch of clean up bot stuff to do, it might be good to run that as a test. But give me a heads up before you kick it off :)

@cdrini
Collaborator

cdrini commented Feb 2, 2022

Merging this now; it seems to be going well. If we notice any issues, it's a simple fix to revert this PR and add back the CLI param to block bot edits.

Labels
Patch Deployed This PR has been deployed to production independently, outside of the regular deploy cycle. Priority: 1 Do this week, receiving emails, time sensitive, . [managed]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bot edits should be analyzed by solrupdater
5 participants