Refactor reindexing of harvested datasets #10734

landreev · 2024-07-31T14:59:00Z

Among the indexing improvements in 6.3, we added logic that avoids deleting Solr documents when existing, already indexed datasets are updated. Unfortunately, we still cannot take advantage of this improvement when reindexing (re-)harvested datasets, since the harvesting framework relies on completely deleting an existing harvested dataset, then re-creating it from scratch. So we still end up going to the trouble of deleting all the existing Solr documents associated with it, then rebuilding them, even if it was a minor metadata update (documents, plural, because harvested datasets can have files).
The most obvious way to solve this is to add some straightforward modifications to the destroy dataset framework, so that it spares existing Solr documents during a re-harvesting workflow. An alternative would be to modify the harvesting framework itself, and figure out how to avoid having to destroy the dataset while still avoiding the creation of multiple versions... but that may be more difficult (?).

By its nature, harvesting often involves modifying a large number of datasets in quick succession, so this can still be a serious performance penalty in production. It would be great to address this before we restart serious-scale harvesting at HDV.
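
To make the first option more concrete, here is a rough sketch of the idea of sparing existing Solr documents during a re-harvest. It is not the Dataverse implementation: `ReindexPlan`, `plan_reindex`, and the document id strings are hypothetical, and it assumes the index documents can be keyed by ids that stay stable across a re-harvest. Instead of deleting everything and rebuilding, only the documents that no longer correspond to anything in the re-harvested dataset are deleted, and the rest are overwritten in place.

```python
from dataclasses import dataclass, field


@dataclass
class ReindexPlan:
    """Hypothetical result type: which index documents to write and which to drop."""
    upsert: list[str] = field(default_factory=list)
    delete: list[str] = field(default_factory=list)


def plan_reindex(indexed_ids: set[str], harvested_ids: set[str]) -> ReindexPlan:
    """Plan a reindex that spares documents still present after a re-harvest.

    indexed_ids:   ids of the documents currently in the index for this dataset
    harvested_ids: ids of the documents the re-harvested dataset should have
                   (the dataset itself plus one per file)

    Assumption: ids are stable across re-harvests, so a document with the same
    id can simply be overwritten rather than deleted and re-created.
    """
    return ReindexPlan(
        # Every current document is (re)written in place; no prior delete needed.
        upsert=sorted(harvested_ids),
        # Only documents with no counterpart in the new version are removed.
        delete=sorted(indexed_ids - harvested_ids),
    )


# Example: one file was dropped and one was added by the remote archive.
plan = plan_reindex(
    indexed_ids={"dataset:1", "file:1", "file:2"},
    harvested_ids={"dataset:1", "file:1", "file:3"},
)
print(plan.delete)  # only the stale file document
print(plan.upsert)  # the dataset and its current files
```

For a minor metadata update, where the set of files is unchanged, the delete list is empty and every document is simply overwritten.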

landreev added the Feature: Harvesting label Jul 31, 2024

landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Aug 14, 2024

cmbz added the GREI 3 Search and Browse label Aug 14, 2024

cmbz mentioned this issue Aug 14, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

34 tasks

landreev self-assigned this Aug 28, 2024

cmbz added the FY25 Sprint 5 FY25 sprint 5 label Aug 29, 2024

landreev added a commit that referenced this issue Sep 9, 2024

Some work-in-progress modifications of the Harvested imports and inde…

03debc1

…xing process. #10734

landreev added a commit that referenced this issue Sep 11, 2024

intermediate/unfinished state - work in progress #10734

e62fc55

landreev added a commit that referenced this issue Sep 11, 2024

incremental. #10734

3fab4c8

landreev added a commit that referenced this issue Sep 11, 2024

Largely finalized versions of the refactored harvesting classes. #10734

bd08b69

landreev linked a pull request Sep 11, 2024 that will close this issue

10734 Refactoring and optimization of importing and reindexing of harvested content #10836

Open

landreev added a commit that referenced this issue Sep 11, 2024

a typo + some unused imports #10734

e02a584

landreev added a commit that referenced this issue Sep 11, 2024

comment language #10734

30266dd

landreev added a commit that referenced this issue Sep 11, 2024

minor/cosmetic #10734

9291554

landreev added a commit that referenced this issue Sep 19, 2024

Some cleanup/streamlining #10734

7568cd0
