Ingest sequencing accession IDs #387

davereinhart · 2024-08-07T15:32:50Z

The clinical data module is being updated to parse sequencing accessioning data for upload into receiving.clinical, and the clinical ETL is being updated to ingest that data into warehouse tables. The data comes from tracking sheets maintained on GitHub for the Seattle Flu Study.

Running this ETL on receiving.clinical records whose document contains `gisaid_accession` or `genbank_accession` triggers custom processing for this type of data. After a record is matched to an existing sample, a minimal `consensus_genome` and `genomic_sequence` record is generated for each COVID-19, RSV-A, RSV-B, influenza A, and influenza B sequence record.
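As a rough illustration of the routing described above, the check could look something like the sketch below. The function name and the treatment of the document as a plain dict are assumptions for illustration, not the code in this PR.

```python
# Hypothetical sketch: decide whether a receiving.clinical document should
# take the sequencing-accession code path. Only the two key names come from
# the PR description; everything else is assumed.
ACCESSION_KEYS = ("gisaid_accession", "genbank_accession")


def is_sequencing_accession_record(document: dict) -> bool:
    """Return True if the document contains either accession key."""
    return any(key in document for key in ACCESSION_KEYS)


accession_doc = {"strain_name": "example-strain", "gisaid_accession": "example-id"}
ordinary_doc = {"barcode": "example-barcode", "age": 42}

print(is_sequencing_accession_record(accession_doc))
print(is_sequencing_accession_record(ordinary_doc))
```

Checking for key presence (rather than a non-empty value) mirrors the wording "document containing"; whether empty values should also qualify is left open here.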

Adding a function to the clinical ETL module to parse sequencing accession ID tracking sheets that contain NWGC IDs, BBI-assigned strain names, and GISAID and GenBank accessions. This function parses, filters, and formats the data for ingestion into ID3C.
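A minimal sketch of the parse, filter, and format steps, assuming the tracking sheet is CSV with columns named `nwgc_id`, `strain_name`, `gisaid_accession`, and `genbank_accession`. The column names, the filtering rules, and the function name are all assumptions for illustration; the real sheet layout may differ.

```python
import csv
import io

# Hypothetical column names; the actual tracking sheets may use others.
ACCESSION_COLUMNS = ("gisaid_accession", "genbank_accession")


def parse_tracking_sheet(text: str) -> list[dict]:
    """Parse CSV text, drop unusable rows, and return one dict per record."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Normalize whitespace; short rows yield None values, extra cells
        # land under the None key and are ignored.
        cleaned = {k: (v or "").strip() for k, v in row.items() if k is not None}

        # Filter: a record needs an NWGC ID, a strain name,
        # and at least one accession to be worth ingesting.
        if not cleaned.get("nwgc_id") or not cleaned.get("strain_name"):
            continue
        if not any(cleaned.get(col) for col in ACCESSION_COLUMNS):
            continue

        # Format: keep only the fields of interest, with None for blanks.
        records.append({
            "nwgc_id": cleaned["nwgc_id"],
            "strain_name": cleaned["strain_name"],
            "gisaid_accession": cleaned.get("gisaid_accession") or None,
            "genbank_accession": cleaned.get("genbank_accession") or None,
        })
    return records


sample_sheet = (
    "nwgc_id,strain_name,gisaid_accession,genbank_accession\n"
    "1001,strain-a,gisaid-1,\n"
    "1002,strain-b,,\n"
    ",strain-c,gisaid-3,genbank-3\n"
)
parsed = parse_tracking_sheet(sample_sheet)
print(len(parsed))
```

Only the first sample row survives: the second has no accession and the third has no NWGC ID.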


Calculates a sequence identifier by hashing the `strain_name` and appending the pathogen code (RSVA, RSVB, or HCOV19), for use as the identifier in the warehouse.genomic_sequence table.
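One way such an identifier could be built is sketched below. The commit message does not say which hash, how many characters are kept, or what separator is used, so SHA-256, an 8-character prefix, and a hyphen are assumptions here, as is the function name.

```python
import hashlib

# Pathogen codes listed in the commit message.
PATHOGEN_CODES = {"RSVA", "RSVB", "HCOV19"}


def sequence_identifier(strain_name: str, pathogen_code: str, digest_chars: int = 8) -> str:
    """Hash the strain name and append the pathogen code.

    Assumed details: SHA-256, truncated hex digest, hyphen separator.
    """
    if pathogen_code not in PATHOGEN_CODES:
        raise ValueError(f"unknown pathogen code: {pathogen_code!r}")
    digest = hashlib.sha256(strain_name.strip().encode("utf-8")).hexdigest()
    return f"{digest[:digest_chars]}-{pathogen_code}"


example_id = sequence_identifier("example-strain-name", "RSVA")
print(example_id)
```

Because the identifier depends only on the strain name and pathogen code, re-running the parser on the same sheet produces the same identifiers, which is what makes it usable as a stable key.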

…eceiving table

The clinical ETL is being updated to ingest sequencing accessioning data from receiving.clinical. The data comes from tracking sheets maintained on GitHub for the Seattle Flu Study. Running this ETL on receiving.clinical records whose document contains `gisaid_accession` or `genbank_accession` triggers custom processing for this type of data. After a record is matched to an existing sample, a minimal `consensus_genome` and `genomic_sequence` record is generated for each COVID-19, RSV-A, and RSV-B sequence record.


Updating the ETL to process SFS sequencing accession identifier data from the clinical receiving table. This data is processed through the clinical ETL because its format differs from previous sequencing data and it is likely to be needed only once, at project close.

The upsert_genome function was only performing inserts. Adding ON CONFLICT clause to perform a "non-updating" update and return the existing record id.
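The "non-updating" update pattern can be sketched as follows. In PostgreSQL, `ON CONFLICT DO NOTHING` combined with `RETURNING` yields no row when the conflict fires, so a common workaround is `DO UPDATE` that sets a column to the value it already has. The sketch below runs the same shape of statement against an in-memory SQLite database (which needs SQLite 3.35 or later for `RETURNING`) so it is self-contained. The table layout, the conflict target, and this version of `upsert_genome` are assumptions for illustration, not the ID3C schema or the code in this PR.

```python
import sqlite3

# Hypothetical, simplified table; the real warehouse.consensus_genome differs.
CREATE_SQL = """
    CREATE TABLE consensus_genome (
        consensus_genome_id INTEGER PRIMARY KEY,
        sample_id INTEGER NOT NULL,
        organism_id INTEGER NOT NULL,
        UNIQUE (sample_id, organism_id)
    )
"""

# On conflict, "update" a column to the value it already holds so that
# RETURNING still produces the existing row's id.
UPSERT_SQL = """
    INSERT INTO consensus_genome (sample_id, organism_id)
    VALUES (?, ?)
    ON CONFLICT (sample_id, organism_id)
    DO UPDATE SET sample_id = excluded.sample_id
    RETURNING consensus_genome_id
"""


def upsert_genome(conn: sqlite3.Connection, sample_id: int, organism_id: int) -> int:
    """Insert the record, or return the id of the existing matching record."""
    row = conn.execute(UPSERT_SQL, (sample_id, organism_id)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)

first = upsert_genome(conn, 1, 10)   # inserts a new row
second = upsert_genome(conn, 1, 10)  # conflict: returns the existing id
third = upsert_genome(conn, 2, 10)   # different sample: inserts a new row
row_count = conn.execute("SELECT count(*) FROM consensus_genome").fetchone()[0]
print(first == second, row_count)
```

The trade-off of this approach is that every conflicting call still performs a write to the existing row; it is chosen here because it returns the id in a single statement.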

davereinhart added 4 commits August 5, 2024 16:08

Update clinical sequencing accession file parsing function to handle RSV tracking sheet format.

521db2b

Add sequence identifier when parsing sequencing accessioning file

daff921


davereinhart requested a review from a team as a code owner August 7, 2024 15:32

davereinhart marked this pull request as draft August 7, 2024 15:33

Add processing for flu A and B records in clinical parse-sequencing command.

6fa2d1d

davereinhart force-pushed the ingest-seq-accession-ids branch from 25c298e to 3929359 Compare September 10, 2024 23:33

davereinhart added 2 commits September 16, 2024 08:57

Enable upsert of consensus_genome records with ON CONFLICT clause

e306ec1


davereinhart force-pushed the ingest-seq-accession-ids branch from 3929359 to e306ec1 Compare September 16, 2024 15:58

davereinhart marked this pull request as ready for review September 16, 2024 17:22

jstone-dev approved these changes Oct 2, 2024


davereinhart merged commit 932e855 into master Oct 7, 2024
4 checks passed

davereinhart deleted the ingest-seq-accession-ids branch October 7, 2024 17:45
