Investigate/implement non-redundant ancestry processing #91

alexdunnjpl · 2024-01-02T22:19:53Z

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that ECS costs are significantly reduced

📖 Additional Details

No response

Acceptance Criteria

Ancestry sweeper only processes data which is new, modified, or was processed with an out-of-date version of the ancestry sweeper, or which references a bundle or collection which has been modified.
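A minimal sketch of the selection rule above, written as a pure Python predicate. The field names (`last_modified`, `ancestry_version`, `ancestry_processed_at`, `parent_refs`) and the `SWEEPER_VERSION` constant are hypothetical and chosen only for illustration; they are not taken from the registry schema or the sweeper code.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

SWEEPER_VERSION = 2  # hypothetical: version of the ancestry sweeper now running


@dataclass
class Doc:
    lidvid: str
    last_modified: int                           # hypothetical: epoch seconds of last write
    ancestry_version: Optional[int] = None       # None means never processed
    ancestry_processed_at: Optional[int] = None  # epoch seconds of last sweep
    parent_refs: Set[str] = field(default_factory=set)  # referenced bundles/collections


def needs_processing(doc: Doc, modified_aggregates: Set[str]) -> bool:
    """Return True if the doc matches any of the four acceptance criteria."""
    if doc.ancestry_version is None or doc.ancestry_processed_at is None:
        return True  # new: never swept
    if doc.last_modified > doc.ancestry_processed_at:
        return True  # modified since the last sweep
    if doc.ancestry_version < SWEEPER_VERSION:
        return True  # swept by an out-of-date sweeper version
    # references a bundle or collection that has itself been modified
    return bool(doc.parent_refs & modified_aggregates)
```

Under these assumptions, a document that is up to date and references no modified aggregate is skipped, which is where the cost reduction would come from.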

⚙️ Engineering Details

No response

alexdunnjpl · 2024-01-05T19:48:46Z

Key question for whether this will work: are registry-refs docs guaranteed to be written to OpenSearch after the non-aggregate products they refer to?

@jordanpadams @al-niessner do you know, off the top of your head?

al-niessner · 2024-01-08T18:01:20Z

@alexdunnjpl

I was just looking at a related item in harvest a week or two ago. While I cannot say for certain because I have not tested it, harvest processes bundles then collections then products (non-aggs) and writes them in that order via batching and a List. So, just the opposite of the order you want.

I think the primary reason it is in this order is that it simplifies the testing and checking that harvest has to do. Since the bundle is already loaded, it knows if the collection is part of it. Ditto on the next layer down. It makes the harvest code much simpler.

The order in which it is written is not as important. Writes can be batched or done as found; the default is batch. If done as found, then the order is obvious. I remember the batch using a list and sending to the registry from first to last index. It would probably be easy, but no promises, to send the array in reverse. However, this would not help you if the user is not using batch mode.

alexdunnjpl · 2024-01-08T18:04:45Z

Thanks Al, much appreciated!

Will need to have a think about whether to follow this (harvest) up or rely on detection/cleanup of such cases. Given that all it would take to break something is for someone to use an out-of-date harvest even if we did fix it, seems like maybe the latter is the only option.

jordanpadams · 2024-01-09T21:31:04Z

@alexdunnjpl @al-niessner one catch here is that this probably only applies when someone actually points harvest at a bundle. Harvest can be pointed at any directory.

alexdunnjpl · 2024-01-18T23:19:23Z

Suggest (accepted, per breakout): implement naively, ignoring the "ingestion while sweeping" test case, and monitor the quantity of orphaned documents, or just check them in a few weeks/months. If there is an unmanageable quantity of orphans, we'll need to rethink; otherwise, implement a secondary cleanup sweeper process.
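One way the orphan monitoring could be sketched, assuming orphans are identifiable as documents that lack an ancestry-version field. The field name, the helper names, and the 1% threshold below are all hypothetical placeholders, not values from the sweeper code or the breakout.

```python
from typing import Any, Dict

# Hypothetical field name; the real metadata key may differ.
ANCESTRY_FIELD = "ancestry_version_hypothetical"


def orphan_count_query(field: str = ANCESTRY_FIELD) -> Dict[str, Any]:
    """Build a query body matching documents that lack the ancestry field."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def orphans_manageable(orphan_count: int, total_count: int,
                       max_fraction: float = 0.01) -> bool:
    """Return True if orphans are at or below the (placeholder) fraction."""
    if total_count == 0:
        return True  # empty index: nothing to clean up
    return orphan_count / total_count <= max_fraction
```

The query body has the shape accepted by the OpenSearch `_count` API, so the resulting count could be fed to `orphans_manageable` on whatever schedule the monitoring ends up using.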

alexdunnjpl added the needs:triage and requirement labels Jan 2, 2024

alexdunnjpl assigned jordanpadams Jan 2, 2024

alexdunnjpl added this to EN Portfolio Backlog Jan 2, 2024

github-project-automation bot moved this to Backlog in EN Portfolio Backlog Jan 2, 2024

alexdunnjpl added the sprint-backlog label Jan 9, 2024

tloubrieu-jpl removed the needs:triage label Jan 18, 2024

This was referenced Jan 26, 2024

Non redundant ancestry #100

Merged

Investigate/implement non-redundant provenance processing #92

Closed

alexdunnjpl closed this as completed in #100 Jan 29, 2024

github-project-automation bot moved this from Backlog to 🏁 Done in EN Portfolio Backlog Jan 29, 2024

This was referenced Jan 29, 2024

update README.md with sweepers optimisation info #102

Merged

add ancestry/provenance sweeper version to logs #103

Merged

91 - improve performance when logging ancestry orphaned documents #105

Merged

jordanpadams removed the sprint-backlog label Aug 6, 2024
