This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

usegalaxy.org      usegalaxy.eu
Galaxy workflow    Galaxy workflow
Galaxy history     Galaxy history
Jupyter Notebook   Jupyter Notebook

Analysis of variation within individual COVID-19 samples

What's the point?

To estimate how much sequence heterogeneity exists within individual COVID-19 isolates.

Outline

At the time of writing (2/13/2020), only three Illumina datasets from COVID-19 patients were available:

- sra-study: SRP242226
  bioproject: PRJNA601736
  biosample: SAMN13872787
  sra-sample: SRS6007144
  sra-experiment: SRX7571571
  sra-run: SRR10903401

- sra-study: SRP242226
  bioproject: PRJNA601736
  biosample: SAMN13872786
  sra-sample: SRS6007143
  sra-experiment: SRX7571570
  sra-run: SRR10903402

- sra-study: SRP245409
  bioproject: PRJNA603194
  biosample: SAMN13922059
  sra-sample: SRS6067521
  sra-experiment: SRX7636886
  sra-run: SRR10971381

To understand the extent of sequence variation within these samples, we ran a two-part analysis. First, we used a Galaxy workflow that:

  1. Mapped all reads against COVID-19 reference NC_045512.2 using bwa mem
  2. Retained only reads with mapping quality of at least 20 that were mapped as proper pairs
  3. Performed realignments using lofreq viterbi
  4. Called variants using lofreq call
  5. Annotated variants using snpeff against database created from NC_045512.2 GenBank file
  6. Converted VCFs into tab delimited datasets
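
The read filter in step 2 can be illustrated with a minimal Python sketch. It assumes the standard SAM conventions, where MAPQ is an integer and bit 0x2 of the FLAG field marks a read mapped in a proper pair. The function name `keep_read` and the example (FLAG, MAPQ) pairs are hypothetical, and the exact tool settings used in the Galaxy workflow may differ.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: read mapped in a proper pair


def keep_read(flag: int, mapq: int, min_mapq: int = 20) -> bool:
    """Return True if a read has MAPQ >= min_mapq and is properly paired."""
    return mapq >= min_mapq and bool(flag & PROPER_PAIR)


# Hypothetical (FLAG, MAPQ) pairs, for illustration only
examples = [(99, 60), (99, 10), (73, 60), (147, 20)]
kept = [pair for pair in examples if keep_read(*pair)]
print(kept)
```

On the command line, a comparable selection is commonly expressed as `samtools view -q 20 -f 2`.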

Next, we analyzed this tab delimited data in a Jupyter notebook.
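
As a rough illustration of this kind of downstream analysis, the sketch below parses tab-delimited variant records with Python's standard `csv` module and keeps those whose allele frequency (AF) exceeds a threshold. The column names follow the output table shown under Outputs, and the inline rows are copied from it. The helper name `variants_above` is hypothetical, and the actual notebook may work quite differently.

```python
import csv
import io

# A few records copied from the workflow output table (subset of columns)
TABLE = (
    "Sample\tCHROM\tPOS\tREF\tALT\tDP\tAF\n"
    "SRR10903401\tNC_045512\t1409\tC\tT\t124\t0.040323\n"
    "SRR10903401\tNC_045512\t1821\tG\tA\t95\t0.094737\n"
    "SRR10903401\tNC_045512\t1895\tG\tA\t107\t0.037383\n"
)


def variants_above(tsv_text: str, min_af: float) -> list:
    """Return the rows (as dicts) whose AF is strictly greater than min_af."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["AF"]) > min_af]


selected = variants_above(TABLE, 0.05)
print([(row["Sample"], int(row["POS"]), float(row["AF"])) for row in selected])
```

Raising the threshold to 0.10 would select none of these three example rows.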

Inputs

Workflow

  1. GenBank file for the reference COVID-19 genome.

    The GenBank record is used by snpeff to generate a database for variant annotation.

  2. Set of Illumina reads (in this case a collection of unfiltered reads from SRR10903401, SRR10903402, and SRR10971381)

Jupyter notebook

The Jupyter notebook requires the GenBank file (#1 from above) and the output of the workflow described below.

Outputs

The workflow produces a table of variants that looks like this:

   Sample       CHROM      POS   REF  ALT  DP   AF        SB  DP4        IMPACT    FUNCLASS  EFFECT                 GENE    CODON
0  SRR10903401  NC_045512  1409  C    T    124  0.040323  1   66,53,2,3  MODERATE  MISSENSE  NON_SYNONYMOUS_CODING  orf1ab  Cat/Tat
1  SRR10903401  NC_045512  1821  G    A    95   0.094737  0   49,37,5,4  MODERATE  MISSENSE  NON_SYNONYMOUS_CODING  orf1ab  gGt/gAt
2  SRR10903401  NC_045512  1895  G    A    107  0.037383  0   51,52,2,2  MODERATE  MISSENSE  NON_SYNONYMOUS_CODING  orf1ab  Gta/Ata
3  SRR10903401  NC_045512  2407  G    T    122  0.024590  0   57,62,1,2  MODERATE  MISSENSE  NON_SYNONYMOUS_CODING  orf1ab  aaG/aaT
4  SRR10903401  NC_045512  3379  A    G    121  0.024793  0   56,62,1,2  LOW       SILENT    SYNONYMOUS_CODING      orf1ab  gtA/gtG

Most field names are self-explanatory. Two need a definition: SB is the Phred-scaled probability of strand bias as calculated by lofreq (0 = no strand bias), and DP4 gives the strand-specific depth for reference and alternate allele observations (forward reference, reverse reference, forward alternate, reverse alternate).
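
To make the relationship between DP4 and AF concrete, the sketch below recomputes the alternate allele fraction from the four DP4 counts for the rows in the table above. The function name `alt_fraction_from_dp4` is hypothetical. The sketch assumes that AF equals alternate observations divided by the sum of all four DP4 counts, which holds for the rows shown here but is not guaranteed for every lofreq record.

```python
def alt_fraction_from_dp4(dp4: str) -> float:
    """Compute alt / (ref + alt) from a 'ref_fwd,ref_rev,alt_fwd,alt_rev' string."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        raise ValueError("DP4 counts sum to zero")
    return (alt_fwd + alt_rev) / total


# (POS, DP4, reported AF) copied from the table above
rows = [
    (1409, "66,53,2,3", 0.040323),
    (1821, "49,37,5,4", 0.094737),
    (1895, "51,52,2,2", 0.037383),
    (2407, "57,62,1,2", 0.024590),
    (3379, "56,62,1,2", 0.024793),
]

for pos, dp4, reported_af in rows:
    computed = alt_fraction_from_dp4(dp4)
    print(pos, round(computed, 6), reported_af)
```

For the first row, for example, 2 + 3 = 5 alternate observations out of 124 total gives 5/124 ≈ 0.040323, matching the reported AF.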

The variants we identified were distributed across the SARS-CoV-2 genome in the following way:

The following table describes variants with frequencies above 10%:

History and workflow

A Galaxy workspace (history) containing the most current analysis can be imported from here.

The publicly accessible workflow can be downloaded and installed on any Galaxy instance. It contains version information for all tools used in this analysis.

BioConda

Tools used in this analysis are also available from BioConda:

- bwa
- samtools
- lofreq
- snpeff
- snpsift