Allow bot changes to be indexed #5617
Independent of the indexing, why is this bot creating duplicate author records and duplicate work records instead of using the ones that exist? That doesn't seem like a "Clean Up." https://openlibrary.org/authors/OL2420469A
@tfmorris I believe the technical answer to the question is that the author wasn't matched because the existing author had dates, and the matching code does not assume undated names from sources are the same as ones with dates. An earlier edit of the bot on https://openlibrary.org/works/OL16753796W did match on the existing archive.org id, which is why it was associated with the existing records (author/edition/work). The author date matching behaviour is how the current code has worked since the earliest days of Open Library. It's possibly not ideal, but it errs on the side of not conflating undated names with dated ones... and is why there are always authors to merge. Most of the subtleties of how imports actually behave have been obscured by out-of-date and gappy Solr indexes and plenty of other issues. Solr 8 should improve the situation so we can review and adjust specific issues with matching etc. Deliberately not indexing isn't going to help the data or our ability to audit it. I was worried that this lack of indexing might hamper any attempts at matching on existing author and title fields.
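The date-matching rule described above could be sketched roughly as follows. This is only an illustration of the stated behaviour, not the actual Open Library import code: the `Author` class, `is_same_author`, and `find_or_create` are hypothetical names, and real matching likely involves more fields and normalisation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Author:
    """Hypothetical author record; field names are assumptions."""
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


def has_dates(author: Author) -> bool:
    return author.birth_date is not None or author.death_date is not None


def is_same_author(incoming: Author, existing: Author) -> bool:
    """Conservative match: an undated name never matches a dated one."""
    if incoming.name.strip().lower() != existing.name.strip().lower():
        return False
    if has_dates(incoming) != has_dates(existing):
        # One side has dates and the other does not: do not conflate them.
        return False
    return (incoming.birth_date == existing.birth_date
            and incoming.death_date == existing.death_date)


def find_or_create(incoming: Author,
                   existing_authors: List[Author]) -> Tuple[Author, bool]:
    """Return (author, created). A new record is created when nothing matches."""
    for candidate in existing_authors:
        if is_same_author(incoming, candidate):
            return candidate, False
    return incoming, True
```

Under this sketch, an undated incoming `Author("Jane Doe")` would not match an existing `Author("Jane Doe", "1900", "1980")`, so a second record is created and has to be merged later, which is the trade-off described above.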
This is patch deployed; if things look good I'll merge and make it permanent! Thanks @hornc ! I'm monitoring now for sluggishness, but let me know if you notice anything! Also if you have a big batch of clean up bot stuff to do, it might be good to run that as a test. But give me a heads up before you kick it off :)
Merging this now; it seems to be going well. If we notice any issues, it's a simple fix to reverse this PR and add back the cli param to block
Closes #5700
Currently bot edits are being excluded from the regular ongoing Solr index updates. This PR removes the deliberate bot exclusion.
Bots are the main source of new imports into Open Library by a long way. Being excluded from the index means that any newly added works or editions are not discoverable, and every new author added is created with an empty 0-works page,
e.g. https://openlibrary.org/authors/OL9405192A/Edward_L._Parker (which my bot recently added from an archive.org record).
It should list the work https://openlibrary.org/works/OL24955945W ; instead, the author page looks like a junk, unlinked record. This appears to be the default behaviour for newly imported authors, and won't be resolved without manual reindexing. I thought there had been issues raised about 0-work authors in the past, but I can't locate one right now.
This is just one example of many, and it appears to be the current standard behaviour with the exclude bots flag set.
Technical
Since the Solr 8 update I have noticed manual edits being reflected pretty promptly (less than the previously stated 15 mins, although I wasn't timing accurately) -- it seems noticeably faster and better than in the past, so Solr 8 has been a very good improvement. I hope the load from bot edits won't cause a performance problem. OL needs to keep current with more books though, so it needs to be able to index its content, which the bots are providing.
There are also multiple clean up and fix tasks performed by bots that aren't being reflected in the search index (I noticed this problem after merging 10K editions and works from a librarian request bot task) -- the changes were written but there was no obvious effect.
As far as performance goes, I manually added the 10k to reindex in successive batches of 1000 through the admin interface (copying and pasting through the web UI). The whole lot were picked up successfully and the expected index update occurred promptly, so I think Solr can handle large batches like this as they happen.
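For anyone repeating that manual test, splitting a long list of keys into batches of 1000 can be sketched like this. It is a minimal helper under assumptions: the `batches` function and the example keys are made up for illustration, and submitting each batch (here, by pasting into the admin interface) is left out.

```python
from typing import Iterator, List, Sequence


def batches(keys: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield successive slices of `keys`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield list(keys[start:start + size])


# Hypothetical keys standing in for the 10k merged records.
example_keys = [f"/works/OL{i}W" for i in range(1, 10001)]
example_batches = list(batches(example_keys))
```

Each element of `example_batches` could then be joined with newlines and submitted as one reindex request.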
Testing
Screenshot
Stakeholders
@cdrini
@mekarpeles