Ingestion deltas for non-UUID sources #2975

Merged
merged 7 commits into from
Mar 21, 2023

jsbrittain
Contributor

Add support for generating and ingesting source differences for non-UUID sources

  • Differences ('deltas') are generated between the current fetched source and the last successfully processed source.
  • To speed up ingestion considerably, deltas are generated between source files upon retrieval, before parsing.
  • Deltas are split into Addition and Deletion files. A Suspected case converting to Confirmed would therefore generate both a Deletion file and an Addition file as source deltas.
  • For Deletion deltas, the first matching record is marked for exclusion from the line list. Any cases marked for exclusion at the end of processing are removed from the database. If marking fails, any marked records are reverted prior to pruning.
  • If a delta ingestion fails (either an Addition or a Deletion), the next scheduled retrieval falls back to a bulk ingest in which all records are replaced. This is required because Addition and Deletion deltas are currently assigned separate upload IDs for processing, so the database could become desynchronised if one succeeds without the other.
  • UUID sources remain unaffected by these changes.
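
The delta and deletion-marking behaviour described in the list above can be illustrated with a minimal sketch. This is not the code from this PR: the function names (`compute_deltas`, `mark_deletions`, `prune`) and the record keys (`"raw"`, `"excluded"`) are hypothetical, records are held in memory rather than in the database, and source files are assumed to be line-oriented text compared before parsing.

```python
from collections import Counter


def compute_deltas(previous_lines, current_lines):
    """Return (additions, deletions) as lists of raw, unparsed lines.

    Lines are compared as a multiset, so duplicate rows are counted rather
    than collapsed. A row whose content changed shows up in both lists:
    the old version as a deletion, the new version as an addition.
    """
    prev, curr = Counter(previous_lines), Counter(current_lines)
    additions = list((curr - prev).elements())
    deletions = list((prev - curr).elements())
    return additions, deletions


def mark_deletions(records, deletions):
    """Mark the first unmarked record matching each deletion line.

    `records` is a list of dicts with the hypothetical keys "raw" and
    "excluded". If any deletion has no match, every mark made in this call
    is reverted and False is returned, so the caller can fall back to a
    bulk ingest on the next retrieval.
    """
    marked = []
    for line in deletions:
        match = next(
            (r for r in records if r["raw"] == line and not r["excluded"]),
            None,
        )
        if match is None:
            for r in marked:  # revert partial marking before giving up
                r["excluded"] = False
            return False
        match["excluded"] = True
        marked.append(match)
    return True


def prune(records):
    """Drop the records that were marked for exclusion."""
    return [r for r in records if not r["excluded"]]
```

In this sketch, a Suspected case that becomes Confirmed yields one deletion line and one addition line. A `False` return from `mark_deletions` stands in for the failure case in which the next scheduled retrieval replaces all records.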

@jsbrittain marked this pull request as ready for review March 17, 2023 11:02
@jsbrittain requested a review from abhidg March 17, 2023 11:03
Contributor

@abhidg left a comment

Looks good, other than a few comments and clarifications!

ingestion/functions/retrieval/retrieval.py (outdated, resolved)
ingestion/functions/retrieval/retrieval.py (resolved)
ingestion/functions/retrieval/retrieval.py (outdated, resolved)
ingestion/functions/run_deltas_test_e2e.py (resolved)
data-serving/scripts/prune-uploads/prune_uploads.py (outdated, resolved)
data-serving/scripts/prune-uploads/prune_uploads.py (outdated, resolved)
data-serving/scripts/prune-uploads/prune_uploads.py (outdated, resolved)
data-serving/scripts/prune-uploads/prune_uploads.py (outdated, resolved)
ingestion/functions/retrieval/retrieval.py (outdated, resolved)
@jsbrittain
Contributor Author

@abhidg Comments addressed and new commit pushed. It includes prune_db.py, which was mentioned in but missing from the original PR.

@abhidg self-requested a review March 21, 2023 15:11
Contributor

@abhidg left a comment

Thanks!

@abhidg merged commit dda3640 into main Mar 21, 2023
@abhidg deleted the ingestion_deltas branch March 21, 2023 15:12
2 participants