-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/rki tsv csv #409
Fix/rki tsv csv #409
Conversation
* RKI seems to have changed from CSV to TSV * in addition, lineages are part of the metadata and the the file we previously used seems absent.
Looks like there were additional changes to the headers of the file as well. I'll be pushing up more changes shortly. |
In the reorg of the RKI repo, the metadata file now includes the lineages so we no longer need to download the additional lineage file.
This typo prevented Snakemake from finding the correct input file so the workflow ran the `fetch_rki_ndjson_from_s3` rule in the previous test run¹, which is why it was able to complete successfully. The following commits will fix the errors raised by the workflow when actually running through the `transform_rki_data_to_ndjson` rule. ¹ https://github.com/nextstrain/ncov-ingest/actions/runs/5458351864/jobs/9933357136
In the reorg of the RKI repo, the headers of the metadata file has been updated along with the file format. This commit updates to the equivalent index column that contains sequence IDs.
NA types are technically not JSON compliant and will result in type errors when processing the NDJSON file later in `bin/transform-rki`. This was not obvious in the previous RKI CSV metadata file, but the new RKI TSV metadata file does include NA values.
The reorg of the RKI repo included updating the headers and columns of the metadata file. This commit updates the `COLUMN_MAP` to the new column headers. It also drops two columns that are no longer available in the new metadata TSV: `date_released` and `strain_gisaid`. These two columns are not used in the transform pipeline and are not included in the output TSV so this change should not affect the final output metadata TSV of the open ncov-ingest pipeline.
Triggered a test run with the latest changes. I'll compare the final metadata output from then test run to the current file on S3 when the test run is complete to make sure there are not weird changes. |
The `date_submitted` (`SEQUENCE. PUSHED_TO_DWH`) field is expected to have a new date format according to the RKI docs¹ ¹ https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#variablen-und-werte
The test run completed successfully and uploaded files to the staging S3 bucket. Comparing the RKI entries in
Summary of changes:
Not much we can do about the removed records, but I took a look at the first 1000 changed records Most changes were just the There are some changes to |
"SEQUENCE.DATE_OF_SAMPLING": "date", | ||
"SEQUENCE.PUSHED_TO_DWH": "date_submitted", | ||
"DL.ID": "originating_lab", | ||
"SL.ID": "submitting_lab", | ||
"PANGOLIN.LINEAGE_LATEST": "pango_lineage", | ||
"SEQUENCE.SEQUENCING_REASON": "sampling_strategy", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full disclosure: I made these column changes using Google Translate to translate the RKI docs of the metadata fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are correct :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for checking 🙏
I opened an issue in the RKI repo to let them know that most submission dates have gone missing with the new format: robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#50 If this doesn't get fixed upstream we could patch the submission dates back using dates from the old format, but let's wait for a bit to save us unnecessary work. |
No description provided.