Allow bot changes to be indexed #5617

hornc · 2021-09-01T22:27:55Z

Currently bot edits are being excluded from the regular ongoing Solr index updates. This PR removes the deliberate bot exclusion.

Bots are by far the main source of new imports into Open Library. Because their edits are excluded from the index, newly added works and editions are not discoverable, and every new author is created with an empty, 0-works page.

e.g. https://openlibrary.org/authors/OL9405192A/Edward_L._Parker (which my bot recently added from an archive.org record)

It should list the work https://openlibrary.org/works/OL24955945W; instead, the author page looks like a junk, unlinked record. This appears to be the default behaviour for newly imported authors, and it won't be resolved without manual reindexing. I thought issues had been raised about 0-work authors in the past, but I can't locate one right now.

This is just one example of many, and it appears to be the current standard behaviour with the exclude bots flag set.
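To make the effect of such a flag concrete, here is a minimal sketch of an updater that filters on it. The names `Change`, `is_bot`, `keys_to_reindex`, and `exclude_bots`, and the record keys, are hypothetical and are not taken from the Open Library code base. Only the idea of an exclude-bots switch comes from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Change:
    """Hypothetical record of one edit: who made it and which keys it touched."""
    author: str
    is_bot: bool
    keys: List[str]


def keys_to_reindex(changes: List[Change], exclude_bots: bool) -> List[str]:
    """Collect the keys to send to the search index, optionally skipping bot edits."""
    keys: List[str] = []
    for change in changes:
        if exclude_bots and change.is_bot:
            continue  # with the flag set, keys touched by bots are never collected
        keys.extend(change.keys)
    return keys


# Illustrative data only; the keys are placeholders, not real records.
changes = [
    Change("human_editor", False, ["/works/OL1W"]),
    Change("ImportBot", True, ["/authors/OL2A", "/works/OL3W"]),
]
```

With `exclude_bots=True` only the human edit's key is collected, so the bot-created author and work would stay out of the index until someone reindexes them by hand. With `exclude_bots=False` all three keys are collected.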

Technical

Since the Solr 8 update I have noticed manual edits being reflected pretty promptly (in less than the previously stated 15 minutes, although I wasn't timing accurately). Indexing seems noticeably faster and better than in the past, so Solr 8 has been a very good improvement. I hope the load from bot edits won't cause a performance problem. OL needs to keep current with more books, though, so it needs to be able to index its content, which the bots are providing.

There are also multiple clean-up and fix tasks performed by bots that aren't being reflected in the search index. I noticed this problem after merging 10K editions and works in a librarian-request bot task: the changes were written, but there was no obvious effect.

As far as performance goes, I manually submitted the 10K records for reindexing through the admin interface in successive batches of 1,000 (copying and pasting through the web UI). The whole lot was picked up successfully and the expected index update occurred promptly, so I think Solr can handle large batches like this as they happen.
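For anyone repeating this kind of manual test, the batching step can be sketched as below. `in_batches` is a hypothetical helper and the keys are placeholders. Submitting each batch through the admin interface is left out, since this thread does not describe that interface.

```python
from typing import Iterator, List


def in_batches(keys: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each at most `size` items long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


# Illustrative keys only; a real run would use the keys touched by the bot task.
keys = [f"/works/OL{n}W" for n in range(1, 10_001)]
batches = list(in_batches(keys))
```

Each element of `batches` can then be pasted in as one submission, which keeps every request to a size that was observed to work here.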

Testing

Screenshot

Stakeholders

@cdrini
@mekarpeles

tfmorris · 2021-09-02T00:01:07Z

Independent of the indexing, why is this bot creating duplicate author records and duplicate work records instead of using the ones that exist? That doesn't seem like a "Clean Up."

https://openlibrary.org/authors/OL2420469A
https://openlibrary.org/works/OL16753796W

hornc · 2021-09-02T01:50:38Z

@tfmorris I believe the technical answer to the question is that the author wasn't matched because the existing author had dates, and the matching code does not assume that an undated name from a source is the same as one with dates.
The titles possibly didn't match because the original has the subtitle unsplit, and according to the algorithm the authors did not match, which may be enough to treat them as different, in the absence of any identical bibliographic ids in the source data.

An earlier edit of the bot on https://openlibrary.org/works/OL16753796W did match on the existing archive.org id, which is why it was associated with the existing records (author/edition/work).

The author date matching behaviour is how the current code has worked since the earliest days of Open Library. It's possibly not ideal, but it errs on the side of not conflating undated names with dated ones, which is why there are always authors to merge.
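As a rough illustration of that conservative rule (not the actual import code), a matcher along these lines would refuse to pair an undated incoming name with a dated existing author. `Author` and `same_author` are hypothetical names. Two parts of the sketch are assumptions rather than anything stated in this thread: the case-insensitive name comparison, and treating a dated incoming name against an undated existing record the same way.

```python
from typing import NamedTuple, Optional


class Author(NamedTuple):
    """Hypothetical, simplified author record."""
    name: str
    birth_date: Optional[str] = None
    death_date: Optional[str] = None


def same_author(existing: Author, incoming: Author) -> bool:
    """Conservative match: the names must agree, and so must the dates.

    An undated incoming name is NOT assumed to be the dated existing author.
    Applying the same rule in the reverse direction is an assumption here.
    """
    if existing.name.strip().lower() != incoming.name.strip().lower():
        return False
    return (existing.birth_date, existing.death_date) == (
        incoming.birth_date,
        incoming.death_date,
    )


# Illustrative records only; the name and dates are made up.
dated = Author("Jane Example", "1850", "1920")
undated = Author("Jane Example")
```

Here `same_author(dated, undated)` is `False`, so an importer following this rule would create a second author record, which later has to be merged by hand.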

Most of the subtleties of how imports actually behave have been obscured by out of date and gappy Solr indexes and plenty of other issues. Solr 8 should improve the situation so we can review and adjust specific issues with matching etc. Deliberately not indexing isn't going to help the data or our ability to audit it.

I was worried that this lack of indexing may hamper any attempts at matching on existing author and title fields.

cdrini · 2022-01-31T19:50:03Z

This is patch deployed; if things look good I'll merge and make it permanent! Thanks @hornc !

I'm monitoring now for sluggishness, but let me know if you notice anything! Also if you have a big batch of clean up bot stuff to do, it might be good to run that as a test. But give me a heads up before you kick it off :)

cdrini · 2022-02-02T20:07:03Z

Merging this now; it seems to be going well. If we notice any issues, it's a simple fix to reverse this PR and add back the cli param to block Bot edits.

allow bot changes to be indexed

f2d3c55

hornc requested a review from cdrini September 1, 2021 22:28

mekarpeles assigned cdrini Sep 7, 2021

jimchamp added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Jan 24, 2022

cdrini added this to the Active Sprint milestone Jan 24, 2022

mekarpeles removed this from the Active Sprint milestone Jan 24, 2022

cdrini added the Patch Deployed This PR has been deployed to production independently, outside of the regular deploy cycle. label Jan 31, 2022

hornc mentioned this pull request Feb 2, 2022

ImportBot duplicating Author creation #756

Closed

cdrini merged commit 3b48215 into internetarchive:master Feb 2, 2022

This was referenced Aug 19, 2022

[Snyk] Fix for 20 vulnerabilities meonBot/openlibrary#1

Open

[Snyk] Fix for 20 vulnerabilities MarcelRaschke/openlibrary#15

Open

MarcelRaschke mentioned this pull request Aug 19, 2022

[Snyk] Fix for 20 vulnerabilities devcode1981/openlibrary#1

Open

snyk-bot mentioned this pull request Aug 19, 2022

[Snyk] Fix for 20 vulnerabilities 47-studio-org/openlibrary#1

Open
