Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/rki tsv csv #409

Merged
merged 9 commits into from
Jul 11, 2023
Merged

Fix/rki tsv csv #409

merged 9 commits into from
Jul 11, 2023

Conversation

rneher
Copy link
Member

@rneher rneher commented Jul 4, 2023

No description provided.

rneher added 2 commits July 4, 2023 22:54
 * RKI seems to have changed from CSV to TSV
 * in addition, lineages are part of the metadata and the
   the file we previously used seems absent.
@joverlee521
Copy link
Contributor

Looks like there were additional changes to the headers of the file as well. I'll be pushing up more changes shortly.

In the reorg of the RKI repo, the metadata file now includes the
lineages so we no longer need to download the additional lineage file.
This typo prevented Snakemake from finding the correct input file
so the workflow ran the `fetch_rki_ndjson_from_s3` rule in the previous
test run¹, which is why it was able to complete successfully.

The following commits will fix the errors raised by the workflow when
actually running through the `transform_rki_data_to_ndjson` rule.

¹ https://github.com/nextstrain/ncov-ingest/actions/runs/5458351864/jobs/9933357136
In the reorg of the RKI repo, the headers of the metadata file has
been updated along with the file format. This commit updates to the
equivalent index column that contains sequence IDs.
NA types are technically not JSON compliant and will result in type
errors when processing the NDJSON file later in `bin/transform-rki`.

This was not obvious in the previous RKI CSV metadata file, but the new
RKI TSV metadata file does include NA values.
The reorg of the RKI repo included updating the headers and columns of
the metadata file.

This commit updates the `COLUMN_MAP` to the new column headers.
It also drops two columns that are no longer available in the new
metadata TSV: `date_released` and `strain_gisaid`. These two columns are
not used in the transform pipeline and are not included in the output
TSV so this change should not affect the final output metadata TSV of
the open ncov-ingest pipeline.
@joverlee521
Copy link
Contributor

Triggered a test run with the latest changes.

I'll compare the final metadata output from then test run to the current file on S3 when the test run is complete to make sure there are not weird changes.

The `date_submitted` (`SEQUENCE. PUSHED_TO_DWH`) field is expected to
have a new date format according to the RKI docs¹

¹ https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#variablen-und-werte
@joverlee521 joverlee521 requested a review from a team July 6, 2023 22:17
@joverlee521
Copy link
Contributor

The test run completed successfully and uploaded files to the staging S3 bucket.

Comparing the RKI entries in s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst with RKI entries in the current file at s3://nextstrain-data/files/ncov/open/metadata.tsv.zst

aws s3 cp s3://nextstrain-staging/files/ncov/open/branch/fix/rki-tsv-csv/metadata.tsv.zst data/rki-test-metadata.tsv.zst
aws s3 cp s3://nextstrain-data/files/ncov/open/metadata.tsv.zst` data/open-metadata.tsv.zst

csvtk filter2 -t -f '$database=="rki"' data/rki-test-metadata.tsv.zst > data/rki-test-rki-only.tsv 
csvtk filter2 -t -f '$database=="rki"' data/open-metadata.tsv.zst > data/open-rki-only.tsv 

csv-diff data/open-rki-only.tsv data/rki-test-rki-only.tsv --key strain > data/rki-only.diff 

Summary of changes:

633970 rows changed, 7689 rows added, 52473 rows removed

Not much we can do about the removed records, but I took a look at the first 1000 changed records

Most changes were just the date_submitted and sampling_strategy fields. There are now a lot of empty date_submitted fields but the diff did allow me to catch a new expected date format that I've added in 3a7f3d4. The sampling_strategy field changes are just simplifications of the codes that RKI use for sequencing reasons.

There are some changes to pango_lineage but these values come directly from RKI metadata. There are a few changes to clock_deviation but I believe that's expected.

Comment on lines +32 to +37
"SEQUENCE.DATE_OF_SAMPLING": "date",
"SEQUENCE.PUSHED_TO_DWH": "date_submitted",
"DL.ID": "originating_lab",
"SL.ID": "submitting_lab",
"PANGOLIN.LINEAGE_LATEST": "pango_lineage",
"SEQUENCE.SEQUENCING_REASON": "sampling_strategy",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full disclosure: I made these column changes using Google Translate to translate the RKI docs of the metadata fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are correct :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking 🙏

@corneliusroemer corneliusroemer self-assigned this Jul 11, 2023
@corneliusroemer corneliusroemer self-requested a review July 11, 2023 09:20
@corneliusroemer
Copy link
Member

corneliusroemer commented Jul 11, 2023

I opened an issue in the RKI repo to let them know that most submission dates have gone missing with the new format: robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#50

If this doesn't get fixed upstream we could patch the submission dates back using dates from the old format, but let's wait for a bit to save us unnecessary work.

@corneliusroemer corneliusroemer merged commit 847729a into master Jul 11, 2023
@corneliusroemer corneliusroemer deleted the fix/rki-tsv-csv branch July 11, 2023 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants