Investigate/implement non-redundant ancestry processing #91

Closed
alexdunnjpl opened this issue Jan 2, 2024 · 5 comments · Fixed by #100
Assignees
Labels
requirement the current issue is a requirement

Comments

@alexdunnjpl
Contributor

Checked for duplicates

No - I haven't checked

πŸ§‘β€πŸ”¬ User Persona(s)

No response

💪 Motivation

...so that ECS costs are significantly reduced

📖 Additional Details

No response

Acceptance Criteria

The ancestry sweeper only processes data that is new, has been modified, was processed with an out-of-date version of the ancestry sweeper, or references a bundle or collection that has been modified.
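As a rough illustration of the criterion above, the selection rule could be sketched as a single predicate. This is a minimal sketch only: `ProductDoc`, its field names, and `SWEEPER_VERSION` are hypothetical and do not reflect the actual registry schema or sweeper code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical version number of the currently deployed ancestry sweeper.
SWEEPER_VERSION = 3


@dataclass
class ProductDoc:
    """Hypothetical view of a registry document; all field names are assumptions."""
    lidvid: str
    modified_at: float                # when the document was last written or modified
    swept_version: Optional[int]      # sweeper version that last processed it, None if never
    swept_at: Optional[float]         # when it was last processed, None if never
    parent_modified_at: float = 0.0   # latest modification time of any referenced bundle/collection


def needs_ancestry_processing(doc: ProductDoc) -> bool:
    """Return True if the document falls under any of the acceptance criteria."""
    if doc.swept_version is None or doc.swept_at is None:
        return True   # new: never processed
    if doc.swept_version < SWEEPER_VERSION:
        return True   # processed by an out-of-date sweeper
    if doc.modified_at > doc.swept_at:
        return True   # modified since it was last processed
    # a referenced bundle or collection changed since it was last processed
    return doc.parent_modified_at > doc.swept_at
```

Anything for which the predicate returns False would be skipped, which is where the intended cost reduction would come from.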

βš™οΈ Engineering Details

No response

@alexdunnjpl
Contributor Author

Key question for whether this will work - are registry-refs docs guaranteed to be written to OpenSearch after the non-aggregate products they refer to?

@jordanpadams @al-niessner do you know, off the top of your head?

@al-niessner
Contributor

@alexdunnjpl

I was just looking at a related item in harvest a week or two ago. While I cannot say for certain because I have not tested it, harvest processes bundles then collections then products (non-aggs) and writes them in that order via batching and a List. So, just the opposite of the order you want.

I think the primary reason it is in this order is that it simplifies the testing and checking that harvest has to do. Since the bundle is already loaded, it knows if the collection is part of it. Ditto for the next layer down. It makes the harvest code much simpler.

The order in which it is written is not as important. It can be batched or done as found; the default is batch. If done as found, then the order is obvious. I remember the batch using a list and sending to the registry from first to last index. It would probably be easy, but no promises, to do the array in reverse. However, this would not help you if the user is not using batch mode.

@alexdunnjpl
Contributor Author

Thanks Al, much appreciated!

Will need to have a think about whether to follow this (harvest) up or rely on detection/cleanup of such cases. Given that all it would take to break something is for someone to use an out-of-date harvest even if we did fix it, seems like maybe the latter is the only option.

@jordanpadams
Member

@alexdunnjpl @al-niessner one catch here is that this probably only applies when someone actually points harvest at a bundle. Harvest can be pointed at any directory.

@alexdunnjpl
Contributor Author

Suggest (accepted, per breakout): implement naively, ignoring the "ingestion while sweeping" test case, and monitor the quantity of orphaned documents or just check them in a few weeks/months. If there is an unmanageable quantity of orphans, we'll need to rethink; otherwise, implement a secondary cleanup sweeper process.
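For the monitoring step, a periodic check could be sketched along these lines. It assumes that an "orphan" means a product referenced by an already-swept registry-refs doc that never received ancestry metadata itself; that reading, the `product_lidvids` key, and the `find_orphans` helper are all hypothetical and not taken from the sweeper code.

```python
from typing import Dict, Iterable, List, Set


def find_orphans(swept_refs_docs: Iterable[Dict], stamped_lidvids: Set[str]) -> List[str]:
    """Return products referenced by already-swept refs docs that lack ancestry metadata.

    swept_refs_docs: refs docs the sweeper has already marked as processed (hypothetical shape).
    stamped_lidvids: lidvids of products that currently carry ancestry metadata.
    """
    orphans: Set[str] = set()
    for refs_doc in swept_refs_docs:
        for lidvid in refs_doc["product_lidvids"]:  # hypothetical key
            if lidvid not in stamped_lidvids:
                orphans.add(lidvid)
    return sorted(orphans)
```

Tracking `len(find_orphans(...))` over a few weeks would show whether the count stays manageable or keeps growing.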

Projects
Status: 🏁 Done