Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solr: don't delete docs that will just change #10579

Merged

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented May 22, 2024

What this PR does / why we need it: The existing code made a list of all files in all versions of a dataset (using a list so with repeats) and then added all of the file docs actually in solr, and tried to delete them all. And then tried to delete them again per card. The PR changes that to find which docs exist in solr (same query as before) and then looks to find any that are not in the one/two versions of the dataset that will be indexed and sends a delete request only for the orphaned docs.

It also adds one minor improvement - skipping the file indexing doc creation loop when indexableDataset.isFilesShouldBeIndexed() is false (deaccessioned datasets) instead of going through doc creation and then not submitting it.

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Suggestions on how to test this: FWIW: I've been testing with FINE logging to see what docs exist and which ones are deleted for various cases (creating, editing, publishing, deaccessioning while adding/deleting files). Nominally, just checking api/admin/index/status and verifying that there are no unexpected orphans at the end should be sufficient (there shouldn't be any orphan content docs. The PR improves detecting permission orphans and you may see datafile_id_draft_permission ophan docs being created (a file in a published version who's metadata isn't changed in a new draft). The PR isn't creating them, just making them visible. Using the feature flag to turn the new code on/off might be helpful to confirm that, e.g. make the changes and reindex using the old code and verify that with the flag the /status call sees the orphan.).

I'm not sure how to assess performance changes. Nominally doing a clear-orphans to get rid of the permissions doc orphans that weren't visible before could speed things up. Avoiding the extra deletes should help too, but it looks like the reindex all api call avoids doing the 'normalSolrCleanup', which is deleting the old docs, so the performance/load reduction may only be visible when reindexing is triggered by dataset edits, etc., versus it being visible as a reduction in a reindex all in place run.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: some thought over in #10469 (comment), mostly about possible additions now that old docs aren't automatically deleted.

@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label May 30, 2024
@coveralls
Copy link

Coverage Status

coverage: 20.567% (-0.007%) from 20.574%
when pulling 5dc7b8f on GlobalDataverseCommunityConsortium:change_solr_doc_deletes
into 5bf6b6d on IQSS:develop.

This is only used in determining the most recent version a dataset is in
on the file page, e.g. for https://demo.dataverse.org/file.xhtml
?persistentId=doi:10.70122/FK2/FO0MPQ/KNG6PA&version=3.0

I confirmed that demo shows version 1 in this example whereas it should
show version 2 (which this commit fixes).
@coveralls
Copy link

Coverage Status

coverage: 20.565% (-0.009%) from 20.574%
when pulling c16fdd5 on GlobalDataverseCommunityConsortium:change_solr_doc_deletes
into 5bf6b6d on IQSS:develop.

@coveralls
Copy link

Coverage Status

coverage: 20.572% (-0.002%) from 20.574%
when pulling 57b7ed9 on GlobalDataverseCommunityConsortium:change_solr_doc_deletes
into 5bf6b6d on IQSS:develop.

@coveralls
Copy link

Coverage Status

coverage: 20.628% (+0.05%) from 20.574%
when pulling 058c28b on GlobalDataverseCommunityConsortium:change_solr_doc_deletes
into 5bf6b6d on IQSS:develop.

@qqmyers qqmyers marked this pull request as ready for review June 13, 2024 20:31
@coveralls
Copy link

Coverage Status

coverage: 20.628% (+0.05%) from 20.574%
when pulling 1b0d3a1 on GlobalDataverseCommunityConsortium:change_solr_doc_deletes
into 5bf6b6d on IQSS:develop.

@landreev landreev added this to the 6.3 milestone Jun 20, 2024
@landreev landreev self-requested a review June 20, 2024 19:39
@landreev landreev self-assigned this Jun 20, 2024
Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!
(One somewhat unexpected thing is that you didn't have to touch SolrIndexServiceBean at all in order to achieve this. Cool!)

@landreev
Copy link
Contributor

An interesting performance result: As a definite positive, it really looks like a reindex of a large db over an existing, fully-populated index is no longer taking longer than the first, from-scratch indexing. (i.e., run a full reindex starting w/ an empty solr instance; then run a reindex again; compare the times at a few set increments - the numbers are virtually identical).
On the other hand, the first initial index appears to be slower now, than it was with just the #10547/"soft commit". It is still a bit faster than it was before 10547, but less so than the almost 2X speedup observed then.

I'll take this tradeoff any day. But does this actually make sense/should this be expected, for the first indexing - when there are no existing documents to delete - to take longer w/ the code in this PR?

@landreev
Copy link
Contributor

Also, could you please sync the branch w/ develop. (what I'm testing is my own checkout of the branch that I synced w/ develop locally)

@qqmyers
Copy link
Member Author

qqmyers commented Jun 25, 2024

I wouldn't have expected much of a slow-down for an initial index, since as far as I can tell, the bulk reindex is done with doNormalSolrCleanup = false, which skips deleting docs entirely. Just in case, I added a commit to skip trying to remove file docs from the delete list if it is empty to start with. It's possible I've picked up some query that is expensive.

There is a writeDebugInfo method that is called when FINE logging is enabled - that could be slow.

The one thing I expect is more expensive would be the /status and /clear-orphan calls where more work is being done now to find/remove permission docs where there isn't a corresponding content doc.

@landreev
Copy link
Contributor

I guess there is a possibility that when I synced my local checkout with upstream develop something went sideways, and I ended up with the "soft commit" parts different from what's now in develop.
OK, I'll try/retest with the latest.

@landreev landreev added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed Size: 10 A percentage of a sprint. 7 hours. labels Jun 26, 2024
@landreev landreev merged commit dea145a into IQSS:develop Jun 26, 2024
12 checks passed
@landreev landreev removed their assignment Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 30 A percentage of a sprint. 21 hours. (formerly size:33)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants