QC Solutions for SARS-CoV-2 Genomic Analysis

PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J

Document Changelog
  • 2022-06-23:
    • First draft published
  • 2023-03-09:
    • Added changelog

Overview

Next-generation sequencing (NGS) has broadened the use of genomic analysis in pathogen surveillance systems. Demand for NGS continues to grow, along with the need for higher throughput, lower costs, and better data quality.

However, the quality of NGS data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, sequencing biases due to base composition, and suboptimal library fragment sizes and indexes. Such factors can reduce the quality of raw sequencing data and compromise downstream analyses.

To assist with quality control (QC) measures, the Bioinformatics Pipelines & Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and to suggest QC solutions that address them.

Please note that the QC guidelines in this document are simply an attempt to highlight the most accessible solutions, in the opinion of our working group, and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.

Process Control For Bioinformatics QC Checkpoints

The focus of this document is the quality control (QC) of tiled amplicon sequencing--through the ARTIC V3 protocol, for example--a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As discussed in this working group's Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document, assembling a contiguous SC2 genome from these amplicon reads is a critical step in gaining insight from sequenced samples.

Throughout this process, quality control checkpoints should be conducted at different stages of bioinformatics analysis, including QC of raw read data, pre-processing stages (trimming and filtering), and alignment/assembly.

In this context, raw read data refers to the fastq read files generated by the NGS platform, and processed reads are read files from which adapter sequences have been removed, that have been trimmed based on size and quality, and that have been dehosted (had host reads removed). Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process, and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.

Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.

QC Acceptance Criteria

When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds to determine how and when data will be reported and used to inform public health decision-making. Below are this working group's suggested QC thresholds for SARS-CoV-2 genomic data, as well as various resources and metric definitions to assist public health laboratories in implementing SARS-CoV-2 sequencing and analysis protocols.

PHA4GE Suggested Thresholds

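| Category | Metric | Suggested threshold |
| --- | --- | --- |
| Read QC Metrics | Number of Reads | Protocol dependent (e.g. 100,000 reads from Artic Amplicons sequenced on Illumina MiSeq) |
| Read QC Metrics | Percent Human Reads | <20% |
| Alignment QC Metrics | Average Read Depth | ≥100x |
| Alignment QC Metrics | Percent mapped reads to Wuhan reference genome | ≥65% |
| Alignment QC Metrics | Coverage at a Single Base to Make a Base Call | ≥50x |
| Alignment QC Metrics | Percent Agreement | 80% |
| Alignment QC Metrics | Average base quality of aligned reads | >15 |
| Assembly QC Metrics | Percent reference coverage | >83% |
| Assembly QC Metrics | Number of Ns | <5,000bp |
| Assembly QC Metrics | Assembly length unambiguous | >24,000bp |
| Assembly QC Metrics | NTC percent coverage | <10% |
| Assembly QC Metrics | Lineage defining mutations | ≥60% |
| Assembly QC Metrics | S-gene coverage | ≥99% |
| Assembly QC Metrics | S-gene frameshifts sequence | 0 |
| Assembly QC Metrics | S-gene ambiguous bases | <10% |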
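
As a rough illustration of how the suggested thresholds above might be applied programmatically, the sketch below checks a dictionary of per-sample metrics against a subset of them. The metric key names and the functions `check_sample` and `failed_metrics` are hypothetical and not part of any PHA4GE tool; only the numeric limits come from the thresholds listed above.

```python
import operator

# Subset of the suggested thresholds listed above.
# The key names are hypothetical; adapt them to your pipeline's output.
THRESHOLDS = {
    "percent_human_reads": ("<", 20.0),
    "average_read_depth": (">=", 100.0),
    "percent_mapped_reads": (">=", 65.0),
    "average_base_quality": (">", 15.0),
    "percent_reference_coverage": (">", 83.0),
    "number_of_ns": ("<", 5000),
    "assembly_length_unambiguous": (">", 24000),
    "s_gene_coverage": (">=", 99.0),
}

_OPS = {"<": operator.lt, ">": operator.gt, ">=": operator.ge}


def check_sample(metrics):
    """Return {metric: passed?} for every thresholded metric present in `metrics`."""
    results = {}
    for name, (op, limit) in THRESHOLDS.items():
        if name in metrics:
            results[name] = _OPS[op](metrics[name], limit)
    return results


def failed_metrics(metrics):
    """Return the sorted names of metrics that did not meet their threshold."""
    return sorted(name for name, ok in check_sample(metrics).items() if not ok)
```

For example, a sample with an average read depth of 250x, 5% human reads, and 6,000 Ns would fail only on `number_of_ns`. Metrics missing from the input are skipped rather than treated as failures, which is a design choice a laboratory may want to reverse.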

QC Metric Definitions

Read QC Metrics

Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material they process, but all of these technologies converge on the fastq file format. For example, Illumina uses a sequencing-by-synthesis approach in which copies of each fragment are synthesized from fluorescently tagged nucleotides and high-resolution images are taken as each nucleotide is incorporated. The base calls derived from these images are stored in binary base call (BCL) files, which are converted into fastq files using the bcl2fastq program. Oxford Nanopore Technologies sequencing platforms, on the other hand, pass single strands of nucleic acid through nano-scale protein pores. An electric current is applied across the pore, and changes in the current are detected as each nucleotide passes through. The raw electrical signal is captured in the fast5 file format and converted into fastq format using the basecalling program guppy. Because these platforms work so differently, there are different considerations when assessing the quality of the raw sequence data (the fastq files) from each.

Term Definition

Reads

Nucleotide sequences of DNA fragments generated during sequencing; also referred to as the raw data generated from a sequencing platform

Number of Reads

Count of reads generated in an NGS run

BCL Files

Binary base call files produced by Illumina instruments from flowcell images, converted to fastq via the bcl2fastq program

FAST5 Files

Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard)

Basecalling

The computational process of translating raw electrical signal (stored in FAST5 files) or flowcell images (yielding BCL files) into nucleotide sequence
Performance of neural network basecalling tools for Oxford Nanopore sequencing

FASTQ Files

The common “raw” sequence files containing nucleotide sequences and their associated quality scores
• The quality scores contained within a fastq file are encoded as ASCII characters, one character (one byte) per score, so that the string of nucleotides and the string of quality scores are equal in length
• The quality score (Q Score) represents the estimated probability that the base call at the associated nucleotide position is incorrect; the higher the score, the more confident the base call
• Q Scores typically range from 0 to 40 and are defined as:

 Q = -10 × log10(P), where P is the estimated probability of an incorrect base call (e.g. Q30 corresponds to a 1 in 1,000 chance of error)
Quality Scores for Next-Generation Sequencing - illumina
Measuring sequencing accuracy - illumina
• Q Scores for Illumina and ONT sequencing will differ dramatically
     • An excellent Illumina run will have an average Q Score of 27-30
     • An excellent Nanopore run will have an average Q Score of 12-15
• Low Q Scores indicate poor sequencing quality, which will impact all downstream analyses
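
The relationship between Q Scores, error probabilities, and the ASCII-encoded quality string can be sketched in a few lines of Python. This is an illustrative example only: the function names are hypothetical, and the ASCII offset of 33 (the "Phred+33" encoding) is an assumption that holds for current Illumina and ONT fastq output but not for some legacy files.

```python
import math


def error_probability(q):
    """Convert a Q Score to the estimated probability of an incorrect base call."""
    return 10 ** (-q / 10)


def q_score(p):
    """Convert an error probability to a Q Score (Q = -10 * log10(P))."""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Decode a fastq quality string into a list of integer Q Scores.

    Assumes Phred+33 encoding: each character's ASCII code minus 33.
    """
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality, offset=33):
    """Arithmetic mean of the Q Scores encoded in a fastq quality string."""
    scores = decode_quality_string(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, `decode_quality_string("I5+")` returns `[40, 20, 10]`, and a Q Score of 20 corresponds to an error probability of 0.01.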

Ambiguity / Mixed Sites

The percentage of bases in each read that are called as ambiguous
IUPAC Codes

Sequence GC Content

The GC content of reads should be normally distributed
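
A minimal sketch of a per-read GC calculation is shown below; collecting this value across all reads gives the distribution referred to above. The function name is hypothetical, and excluding ambiguous bases from the denominator is an assumption rather than a fixed convention.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the unambiguous bases of a read.

    Ambiguous bases (anything other than A, C, G, T) are ignored.
    Returns 0.0 if the read contains no unambiguous bases.
    """
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / unambiguous
```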

Raw vs Processed Reads

It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed.

Percent Human Reads

Percentage of human read data sequenced in an NGS run.

Alignment QC Metrics

Consensus-genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome--usually [Wuhan-1 (MN908947.3)](https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3)--and each position in the alignment is assessed to determine the consensus basecall supported by the read data at that position. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. Additionally, VCF files can be produced from an alignment to call variant positions relative to the reference genome, and these VCF files can be inspected to assess the quality of the identified variant positions.

Term Definition

Sequence Alignment

A method of arranging nucleic acid (DNA/RNA) or protein sequences to identify regions of similarity or conservation that may reflect functional, structural, or evolutionary relationships. Pairwise sequence alignment involves two sequences, whereas multiple sequence alignment involves three or more sequences

Sequencing Depth

The number of reads that cover a particular nucleotide, section/amplicon of the genome, or average across the reference sequence
• Ideally, a minimum depth of 10X for Illumina or 20X for Nanopore is reached
• Uniform depth of coverage across the genome is preferable
• Nonuniform depth may be indicative of differential amplification of amplicons, or amplicon dropout
    • This can be assessed using bedtools
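
To illustrate the idea of checking depth uniformity across amplicons, the sketch below works on a plain list of per-position depths and a list of amplicon intervals. It assumes the depths have already been extracted from the BAM file by some other tool and that amplicon coordinates are 0-based and half-open, as in a BED file. The function names and the 20% cut-off are hypothetical choices for illustration, not recommendations from this working group.

```python
def mean_depth(depths):
    """Average depth over a list of per-position depths."""
    return sum(depths) / len(depths) if depths else 0.0


def amplicon_mean_depths(depths, amplicons):
    """Mean depth per amplicon.

    `amplicons` is a list of (name, start, end) tuples with 0-based,
    half-open coordinates (BED convention).
    """
    return {name: mean_depth(depths[start:end]) for name, start, end in amplicons}


def flag_low_amplicons(depths, amplicons, fraction=0.2):
    """Names of amplicons whose mean depth is below `fraction` of the genome-wide mean."""
    overall = mean_depth(depths)
    per_amplicon = amplicon_mean_depths(depths, amplicons)
    return [name for name, depth in per_amplicon.items() if depth < fraction * overall]
```

With three 10-base amplicons at depths of 100, 5, and 100, only the middle amplicon is flagged, which is the pattern expected from an amplicon dropout.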

Percent Agreement

Percentage of base call concordance in reads mapped at a designated position in the reference genome
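
As a worked example, percent agreement at a single position can be computed from the bases observed in the reads covering it. The sketch below is illustrative only: the function names are hypothetical, the defaults of 50x depth and 80% agreement are taken from the suggested thresholds in this document, and treating the 80% figure as a minimum (≥) is an assumption.

```python
from collections import Counter


def percent_agreement(bases):
    """Return (most common base, percent of reads supporting it) at one position."""
    counts = Counter(base.upper() for base in bases)
    if not counts:
        return None, 0.0
    base, support = counts.most_common(1)[0]
    return base, 100.0 * support / sum(counts.values())


def call_base(bases, min_depth=50, min_agreement=80.0):
    """Call the consensus base, or 'N' if depth or agreement is insufficient."""
    base, agreement = percent_agreement(bases)
    if len(bases) < min_depth or agreement < min_agreement:
        return "N"
    return base
```

A position covered by 45 reads with A and 5 reads with G has 90% agreement at 50x depth and would be called as A, whereas a position covered by only 10 reads would be called as N regardless of agreement.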

Coverage

The percentage of the reference sequence that is covered by the sequenced reads
• This metric is typically used in conjunction with depth
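
Because coverage is usually reported together with a depth cut-off, a small sketch may help show how the two combine. It assumes a list holding one depth value per reference position; the function name and the default minimum depth of 1 are hypothetical.

```python
def percent_coverage(depths, min_depth=1):
    """Percentage of reference positions with depth of at least `min_depth`."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)
```

For the depths `[0, 10, 60, 100]`, coverage is 75% at a minimum depth of 1 but only 50% at a minimum depth of 50, which shows why the depth cut-off should always be stated alongside the coverage figure.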

Percent Mapped Reads

Percentage of read data mapped to a specified reference genome

Average Base Quality of Aligned Reads

Mean Phred score of read data mapped to a reference genome

Consensus Assembly QC Metrics

An examination of the resulting assembly quality is also critical, as these assemblies often inform important downstream analyses such as lineage and clade assignments and genomic epidemiology investigations.

Term Definition

Length of the Assembly

Should be similar to that of the reference genome. If it is not, investigate why: for example, there may have been large insertions/deletions, gene duplications, etc.

Total Number of N’s

The total number of ambiguous basecalls in the assembly

Length of Strings of N’s

While the total number of N’s is important, the length of the strings of N’s can indicate issues with upstream laboratory workflows. If a string of N’s is consistently reported over a specific region of the genome, one can cross-reference the primer binding loci in the BED file to see whether one amplicon is dropping out or amplifying at a lower rate than the other amplicons. This could be due to amplification bias resulting from a large difference in GC content between the amplicons. It may also indicate a mixed population, with a subpopulation carrying a different sequence in the ambiguous region.
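
The cross-referencing step described above can be sketched as follows. This is an illustration under stated assumptions: the consensus sequence is assumed to share the reference coordinate system (no insertions or deletions shifting positions), amplicon intervals are assumed to be 0-based and half-open as in a BED file, and the function names are hypothetical.

```python
import re


def n_runs(consensus, min_length=1):
    """Return (start, end) intervals (0-based, half-open) of consecutive Ns."""
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", consensus)
        if match.end() - match.start() >= min_length
    ]


def overlapping_amplicons(run, amplicons):
    """Names of amplicons, given as (name, start, end), that overlap a run of Ns."""
    run_start, run_end = run
    return [
        name
        for name, amp_start, amp_end in amplicons
        if amp_start < run_end and run_start < amp_end
    ]
```

Running `n_runs` with a `min_length` close to the amplicon size helps separate likely amplicon dropouts from isolated ambiguous positions. If the same amplicon names are returned across many samples in a run, a primer or amplification issue is a plausible explanation.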

Percent Reference Coverage

Percentage of the Wuhan-1 reference genome represented in the consensus assembly

Number of Ns

Number of ambiguous base calls (Ns) incorporated into the consensus assembly

Assembly Length Unambiguous

Number of unambiguous base calls (ATCGs) incorporated into the consensus assembly
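
The assembly metrics defined above can be derived directly from the consensus sequence. The sketch below is one possible interpretation rather than a reference implementation: the function name and dictionary keys are hypothetical, and percent reference coverage is assumed to be the number of unambiguous bases divided by the length of the Wuhan-1 reference (29,903 bp for MN908947.3).

```python
REFERENCE_LENGTH = 29903  # length of Wuhan-1 (MN908947.3) in base pairs


def assembly_metrics(consensus, reference_length=REFERENCE_LENGTH):
    """Compute basic consensus assembly QC metrics from a sequence string."""
    seq = consensus.upper()
    number_of_ns = seq.count("N")
    unambiguous = sum(seq.count(base) for base in "ACGT")
    return {
        "number_of_ns": number_of_ns,
        "assembly_length_unambiguous": unambiguous,
        # Assumption: coverage = unambiguous bases / reference length
        "percent_reference_coverage": 100.0 * unambiguous / reference_length,
    }
```

Other IUPAC ambiguity codes (such as R or Y) are counted neither as Ns nor as unambiguous bases here; a laboratory may prefer to report them separately.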

NTC Percent Coverage

Percentage of the Wuhan-1 reference genome represented in the consensus assembly of a non-template control (NTC; i.e. negative control)

Lineage Defining Mutations

Percentage of lineage-specific mutations represented in the consensus assembly

S-gene Coverage

Percentage of the SARS-CoV-2 S-gene represented in the consensus assembly

S-gene Frameshifts

S-gene insertion or deletion events represented in the consensus assembly

S-gene Ambiguous Bases

Number of ambiguous base calls (Ns) incorporated into the S-gene of the consensus assembly

Additional QC Resources and Materials