Pull data directly from COG-UK Data #329

joverlee521 · 2022-07-25T19:53:24Z

Context

There has been a significant drop-off in sequences from the UK in the NCBI data since ~April 2022 (the issue was originally raised in Slack).

Description

We can update the pipeline to pull metadata and sequences directly from COG-UK Data instead of waiting on them to submit to NCBI.

We would have to use the ena_sample.secondary_accession column in their accessions TSV to drop duplicates from GenBank via the BioSample accession.
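A minimal sketch of that de-duplication step, assuming the GenBank metadata carries the BioSample accession in a column named `biosample_accession` (a hypothetical name; only `ena_sample.secondary_accession` comes from the COG-UK accessions TSV described above). Both function names are illustrative and not part of the existing pipeline.

```python
import csv
import io
from typing import Iterable, Iterator

COG_UK_COLUMN = "ena_sample.secondary_accession"


def cog_uk_biosamples(accessions_tsv: Iterable[str]) -> set[str]:
    """Collect the non-empty BioSample accessions listed in the COG-UK accessions TSV."""
    biosamples = set()
    for row in csv.DictReader(accessions_tsv, delimiter="\t"):
        accession = (row.get(COG_UK_COLUMN) or "").strip()
        if accession:
            biosamples.add(accession)
    return biosamples


def drop_cog_uk_duplicates(
    genbank_rows: Iterable[dict],
    biosamples: set[str],
    column: str = "biosample_accession",  # assumed GenBank column name
) -> Iterator[dict]:
    """Yield only the GenBank rows whose BioSample accession is not in the COG-UK set."""
    for row in genbank_rows:
        if (row.get(column) or "").strip() not in biosamples:
            yield row


# Small in-memory demonstration with made-up accessions.
accessions = io.StringIO(
    "central_sample_id\tena_sample.secondary_accession\nX1\tSAMEA1\nX2\t\n"
)
genbank = [
    {"strain": "a", "biosample_accession": "SAMEA1"},
    {"strain": "b", "biosample_accession": "SAMN2"},
]
kept = list(drop_cog_uk_duplicates(genbank, cog_uk_biosamples(accessions)))
```

GenBank rows without a BioSample accession are kept, because an empty value can never match the COG-UK set.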


huddlej · 2022-07-26T19:22:09Z

We discussed a couple of options to address this during triage:

Reach out to the COG-UK group via Slack to see if there are plans to resume submitting to NCBI more regularly.
Add COG-UK to ingest, which will require a way to convert metadata and sequences to NDJSON before applying transforms.
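The second option could look roughly like the sketch below, which joins a metadata CSV with a FASTA file and writes one JSON object per line. It assumes the metadata identifies each record in a `sequence_name` column that matches the FASTA header, and it holds all sequences in memory, which would likely not scale to the full COG-UK dataset. The function names and the `sequence` field are hypothetical.

```python
import csv
import io
import json
from typing import Iterable, Iterator, TextIO


def read_fasta(handle: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_ndjson(
    metadata_csv: Iterable[str],
    fasta: Iterable[str],
    out: TextIO,
    id_column: str = "sequence_name",  # assumed join column
) -> int:
    """Write one NDJSON record per metadata row that has a matching sequence."""
    sequences = dict(read_fasta(fasta))
    written = 0
    for row in csv.DictReader(metadata_csv):
        sequence = sequences.get(row[id_column])
        if sequence is None:
            continue  # skip metadata rows without a sequence
        out.write(json.dumps({**row, "sequence": sequence}) + "\n")
        written += 1
    return written


# Small in-memory demonstration with made-up records.
demo_out = io.StringIO()
demo_count = to_ndjson(
    io.StringIO("sequence_name,sample_date\nEngland/A/2022,2022-04-01\n"),
    io.StringIO(">England/A/2022\nACGT\nTTAA\n"),
    demo_out,
)
```

Records produced this way could then flow through the same transform steps as the other NDJSON inputs.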

@joverlee521 will continue work on the latter scripts and then revisit this issue.

joverlee521 · 2022-10-03T22:48:50Z

Prompted by @corneliusroemer, this is my general idea of how to switch to directly pulling data from COG-UK instead of relying on their submissions to GenBank:

Update the current patch of COG-UK data to remove all COG-UK records from the GenBank data. It will be less confusing if we make sure all COG-UK data comes from a single source instead of a mix of sources. I also think this is the best way to ensure that we do not have duplicate COG-UK records.
Add a rule to fetch the COG-UK sequences. I think this should be the All sequence FASTA since we do our own alignment and masking. (We already fetch the COG-UK metadata CSV.)
The COG-UK metadata CSV is formatted differently from the GenBank data, so I think we can run it through its own transform pipeline with some combination of tsv-utils, csvtk, and/or the upcoming augur curate command. The resulting TSV + FASTA can then be appended to the GenBank files before upload to S3.
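The last step (transform, then append) can be illustrated with a short sketch. It uses Python's `csv` module in place of tsv-utils, csvtk, or augur curate purely for illustration. The column mapping is hypothetical: neither the COG-UK column names nor the GenBank TSV column names below are confirmed by this thread.

```python
import csv
import io
from typing import Iterable, TextIO

# Hypothetical mapping from COG-UK CSV columns to the GenBank TSV columns.
COLUMN_MAP = {
    "sequence_name": "strain",
    "sample_date": "date",
    "lineage": "pango_lineage",
}


def transform_row(row: dict, column_map: dict = COLUMN_MAP) -> dict:
    """Rename the mapped COG-UK columns and drop everything else."""
    return {new: (row.get(old) or "") for old, new in column_map.items()}


def append_rows(genbank_tsv: Iterable[str], cog_uk_csv: Iterable[str], out: TextIO) -> None:
    """Copy the GenBank TSV to `out`, then append the transformed COG-UK rows."""
    reader = csv.DictReader(genbank_tsv, delimiter="\t")
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames or [],
        delimiter="\t",
        restval="",  # GenBank columns with no COG-UK equivalent stay empty
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    for row in csv.DictReader(cog_uk_csv):
        writer.writerow(transform_row(row))


# Small in-memory demonstration with made-up records.
demo_out = io.StringIO()
append_rows(
    io.StringIO("strain\tdate\tpango_lineage\nUSA/1/2022\t2022-03-01\tBA.2\n"),
    io.StringIO("sequence_name,sample_date,lineage\nEngland/A/2022,2022-04-01,BA.5\n"),
    demo_out,
)
```

Appending only makes sense after the COG-UK records have been removed from the GenBank data, as in the first step, otherwise the same sample could appear twice.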

joverlee521 added the "enhancement" (New feature or request) label Jul 25, 2022

nextstrain-bot added this to Nextstrain planning (archived) Jul 26, 2022

nextstrain-bot moved this to New in Nextstrain planning (archived) Jul 26, 2022

huddlej moved this from New to Backlog in Nextstrain planning (archived) Jul 26, 2022

joverlee521 mentioned this issue Jul 29, 2022

Pull data from "SARS-CoV-2 Sequence Data from Germany" #331

Closed
