The goal is to understand the amount of heterogeneity in individual COVID-19 isolates.
As of writing (2/13/2020) there were just three Illumina datasets from COVID-19 patients:
- sra-study: SRP242226
  bioproject: PRJNA601736
  biosample: SAMN13872787
  sra-sample: SRS6007144
  sra-experiment: SRX7571571
  sra-run: SRR10903401
- sra-study: SRP242226
  bioproject: PRJNA601736
  biosample: SAMN13872786
  sra-sample: SRS6007143
  sra-experiment: SRX7571570
  sra-run: SRR10903402
- sra-study: SRP245409
  bioproject: PRJNA603194
  biosample: SAMN13922059
  sra-sample: SRS6067521
  sra-experiment: SRX7636886
  sra-run: SRR10971381
To understand the extent of sequence variation within these samples we performed the following analysis. First, we used a Galaxy workflow to perform the following steps:
- Mapped all reads against the COVID-19 reference NC_045512.2 using `bwa mem`
- Retained only reads with a mapping quality of at least 20 that were mapped as proper pairs
- Performed realignments using `lofreq viterbi`
- Called variants using `lofreq call`
- Annotated variants using `snpeff` against a database created from the NC_045512.2 GenBank file
- Converted VCFs into tab-delimited datasets
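For readers who want to see the mapping, filtering, realignment, and calling steps outside Galaxy, the sketch below assembles roughly equivalent command lines. It is an approximation, not the published workflow: the file names and the `build_pipeline_commands` helper are hypothetical, the reference is assumed to be indexed already (for example with `bwa index`), and the exact tool versions and parameters are the ones recorded in the Galaxy workflow itself. The annotation and VCF-to-table steps are left out.

```python
import shlex


def build_pipeline_commands(reference, reads_1, reads_2, sample, min_mapq=20):
    """Return shell command strings for one paired-end sample.

    Nothing is executed here; the commands are only assembled so they can be
    inspected or handed to a shell. All file names are hypothetical.
    """
    q = shlex.quote
    filtered = f"{sample}.filtered.bam"
    realigned = f"{sample}.realigned.bam"
    vcf = f"{sample}.vcf"
    return [
        # Map reads, then keep proper pairs (flag 0x2) with MAPQ >= min_mapq.
        f"bwa mem {q(reference)} {q(reads_1)} {q(reads_2)}"
        f" | samtools view -b -q {min_mapq} -f 0x2 -"
        f" | samtools sort -o {q(filtered)} -",
        # Realign reads; the output is re-sorted before variant calling.
        f"lofreq viterbi -f {q(reference)} {q(filtered)}"
        f" | samtools sort -o {q(realigned)} -",
        f"samtools index {q(realigned)}",
        # Call variants on the realigned reads.
        f"lofreq call -f {q(reference)} -o {q(vcf)} {q(realigned)}",
    ]


commands = build_pipeline_commands(
    "NC_045512.2.fasta", "reads_1.fastq", "reads_2.fastq", "SRR10903401"
)
for command in commands:
    print(command)
```

Building the commands as strings instead of running them keeps the sketch safe to execute anywhere, including on machines where the tools are not installed.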
Next, we analyzed this tab delimited data in a Jupyter notebook.
The workflow takes two inputs:
1. GenBank file for the reference COVID-19 genome. The GenBank record is used by `snpeff` to generate a database for variant annotation.
2. Set of Illumina reads (in this case a collection of unfiltered reads from `SRR10903401`, `SRR10903402`, and `SRR10971381`)
The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.
The workflow produces a table of variants that looks like this:
 | Sample | CHROM | POS | REF | ALT | DP | AF | SB | DP4 | IMPACT | FUNCLASS | EFFECT | GENE | CODON |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SRR10903401 | NC_045512 | 1409 | C | T | 124 | 0.040323 | 1 | 66,53,2,3 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | Cat/Tat |
1 | SRR10903401 | NC_045512 | 1821 | G | A | 95 | 0.094737 | 0 | 49,37,5,4 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | gGt/gAt |
2 | SRR10903401 | NC_045512 | 1895 | G | A | 107 | 0.037383 | 0 | 51,52,2,2 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | Gta/Ata |
3 | SRR10903401 | NC_045512 | 2407 | G | T | 122 | 0.024590 | 0 | 57,62,1,2 | MODERATE | MISSENSE | NON_SYNONYMOUS_CODING | orf1ab | aaG/aaT |
4 | SRR10903401 | NC_045512 | 3379 | A | G | 121 | 0.024793 | 0 | 56,62,1,2 | LOW | SILENT | SYNONYMOUS_CODING | orf1ab | gtA/gtG |
Here, most field names are self-explanatory. SB = the Phred-scaled probability of strand bias as calculated by `lofreq` (0 = no strand bias); DP4 = strand-specific depth for reference and alternate allele observations (forward reference, reverse reference, forward alternate, reverse alternate).
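As a small worked example of how DP4 relates to the other columns, the sketch below parses a DP4 string and computes the fraction of reads supporting the alternate allele. For the rows shown above, the four DP4 counts sum to DP and the alternate fraction reproduces AF. This agreement is an observation about these rows only; the values a variant caller reports for DP and AF need not always match the DP4 sums. The helper name `alt_fraction_from_dp4` is hypothetical.

```python
def alt_fraction_from_dp4(dp4):
    """Return (depth, alt_fraction) from 'ref_fwd,ref_rev,alt_fwd,alt_rev'."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    if depth == 0:
        raise ValueError("DP4 has no supporting reads")
    return depth, (alt_fwd + alt_rev) / depth


# First row of the table above: DP = 124, AF = 0.040323, DP4 = 66,53,2,3
depth, fraction = alt_fraction_from_dp4("66,53,2,3")
print(depth, round(fraction, 6))  # 124 0.040323
```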
The variants we identified were distributed across the SARS-CoV-2 genome in the following way:
The following table describes variants with frequencies above 10%:
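A frequency filter of this kind can be sketched with the Python standard library. The snippet below is an illustration only, not the notebook's actual code: it assumes a tab-delimited file with an `AF` column like the variant table shown earlier, and the `variants_above` helper is hypothetical. The two example rows are copied from that table with a subset of its columns.

```python
import csv
import io

# Two rows copied from the variant table shown earlier (subset of columns).
EXAMPLE_TSV = (
    "Sample\tCHROM\tPOS\tREF\tALT\tDP\tAF\n"
    "SRR10903401\tNC_045512\t1409\tC\tT\t124\t0.040323\n"
    "SRR10903401\tNC_045512\t1821\tG\tA\t95\t0.094737\n"
)


def variants_above(handle, min_af):
    """Yield rows (as dicts) whose allele frequency AF exceeds min_af."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["AF"]) > min_af:
            yield row


high = list(variants_above(io.StringIO(EXAMPLE_TSV), 0.10))
moderate = list(variants_above(io.StringIO(EXAMPLE_TSV), 0.05))
print(len(high), [row["POS"] for row in moderate])  # 0 ['1821']
```

Neither example row exceeds 10%, so the stricter filter returns nothing here; lowering the threshold to 5% keeps only the variant at position 1821.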
A Galaxy workspace (history) containing the most current analysis can be imported from here.
The publicly accessible workflow can be downloaded and installed on any Galaxy instance. It contains version information for all tools used in this analysis.
Tools used in this analysis are also available from BioConda:
Name | Link |
---|---|
bwa | |
samtools | |
lofreq | |
snpeff | |
snpsift | |