Refactor reindexing of harvested datasets #10734

Open
landreev opened this issue Jul 31, 2024 · 0 comments · May be fixed by #10836
Labels: Feature: Harvesting; FY25 Sprint 5 (FY25 sprint 5); GREI 3 (Search and Browse); Size: 30 (A percentage of a sprint. 21 hours. Formerly size:33)

Comments

Contributor

landreev commented Jul 31, 2024

Among the indexing improvements in 6.3, we added logic that avoids deleting Solr documents when existing, already-indexed datasets are updated. Unfortunately, we still cannot take advantage of this improvement when reindexing (re-)harvested datasets, since the harvesting framework relies on completely deleting an existing harvested dataset and then re-creating it from scratch. So we still end up going to the trouble of deleting all the existing Solr documents associated with it and then rebuilding them, even for a minor metadata update (documents, plural, because harvested datasets can have files).
The most obvious way to solve this is to add some straightforward modifications to the destroy dataset framework so that it spares existing Solr documents during a re-harvesting workflow. An alternative would be to modify the harvesting framework itself and figure out how to avoid destroying the dataset while still not creating multiple versions, but that may be more difficult (?).
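
To make the difference concrete, here is a toy sketch in Python. It is not Dataverse or Solr code: `ToyIndex`, the document id scheme, and both `reharvest_*` functions are hypothetical names made up for illustration. It only counts index operations for a minor metadata update on a dataset with two files, assuming the first option above (the destroy step leaves the index alone and the re-created dataset overwrites its documents in place).

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """In-memory stand-in for a search index; counts writes and deletes."""
    docs: dict = field(default_factory=dict)
    writes: int = 0
    deletes: int = 0

    def add(self, doc_id, fields):
        # Adding a document with an existing id overwrites it in place.
        self.docs[doc_id] = fields
        self.writes += 1

    def delete(self, doc_id):
        if self.docs.pop(doc_id, None) is not None:
            self.deletes += 1


def docs_for(dataset_id, title, files):
    """One document for the dataset plus one per file (hypothetical id scheme)."""
    docs = {f"dataset_{dataset_id}": {"title": title}}
    for name in files:
        docs[f"dataset_{dataset_id}_file_{name}"] = {"name": name}
    return docs


def reharvest_delete_then_rebuild(index, old_docs, new_docs):
    """Behaviour described in this issue: delete every old document, then re-add all."""
    for doc_id in list(old_docs):
        index.delete(doc_id)
    for doc_id, fields in new_docs.items():
        index.add(doc_id, fields)


def reharvest_sparing_documents(index, old_docs, new_docs):
    """Sketch of the proposal: overwrite in place, delete only stale documents."""
    for doc_id, fields in new_docs.items():
        index.add(doc_id, fields)
    # Only documents that no longer exist after the re-harvest are removed.
    for doc_id in old_docs.keys() - new_docs.keys():
        index.delete(doc_id)


old = docs_for(42, "Old title", ["a.csv", "b.csv"])
new = docs_for(42, "New title", ["a.csv", "b.csv"])

baseline = ToyIndex(dict(old))
reharvest_delete_then_rebuild(baseline, old, new)

spared = ToyIndex(dict(old))
reharvest_sparing_documents(spared, old, new)

print("delete-then-rebuild:", baseline.deletes, "deletes,", baseline.writes, "writes")
print("sparing documents:  ", spared.deletes, "deletes,", spared.writes, "writes")
```

Both variants end with the same index contents. The sparing variant issues no deletes here, because no file was removed from the dataset.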

By its nature, harvesting often involves modifying a large number of datasets in quick succession, so this can still be a serious performance cost in production. It would be great to address this before we restart serious-scale harvesting at HDV.

@landreev landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Aug 14, 2024
@cmbz cmbz added the GREI 3 Search and Browse label Aug 14, 2024
@landreev landreev self-assigned this Aug 28, 2024
@cmbz cmbz added the FY25 Sprint 5 FY25 sprint 5 label Aug 29, 2024
landreev added a commit that referenced this issue Sep 11, 2024
landreev added a commit that referenced this issue Sep 11, 2024
landreev added a commit that referenced this issue Sep 11, 2024
landreev added a commit that referenced this issue Sep 11, 2024
landreev added a commit that referenced this issue Sep 19, 2024
Projects
Status: 🔍 Interest

2 participants