QC Solutions for SARS-CoV-2 Genomic Analysis

PHA4GE Bioinformatics Pipelines & Visualization Working Group
Libuit KG, Lunn S, Carleton H, Khan W, Kanwar S, van Heusden P, Amrosio F, Lemmer D, Mboowa G, Macori G, Southgate J

Document Changelog

2022-06-23:
- First draft published
2023-03-09:
- Added changelog

Overview

Next-generation sequencing (NGS) has expanded the scope of genomic analysis in pathogen surveillance systems. Demand for NGS continues to grow, along with the need for higher throughput, lower costs, and better data quality.

However, the quality of NGS data can be affected by library preparation and sequencing processes, systematic variation in quality scores across sequence reads, sequencing biases due to base composition, and less-than-optimal library fragment sizes and indexes. Such factors can degrade the raw sequencing data used in downstream analyses.

To assist with quality control (QC) measures, the Bioinformatics Pipelines & Visualization Working Group of the Public Health Alliance for Genomic Epidemiology (PHA4GE) has drafted this living document to help define the QC challenges for SARS-CoV-2 (SC2) genomic analysis and to suggest QC solutions that address them.

Please note that the QC guidelines in this document are simply an attempt to highlight the most accessible solutions in the opinion of our working group and in no way represent a comprehensive system for QC guidance and bioinformatic solutions. If this document fails to include a valuable public health resource or in some way mischaracterizes a resource mentioned, we encourage community collaboration through pull requests and/or GitHub issues.

Process Control For Bioinformatics QC Checkpoints

The focus of this document is the quality control (QC) of tiled amplicon sequencing (for example, through the Artic V3 protocol), a common method for generating SC2 sequencing data. These sequencing experiments generate thousands of amplicon reads that represent fragments of the original SC2 genome present in a sample. As discussed in this working group's Bioinformatics Solutions for SARS-CoV-2 Genomic Analysis Guidance Document, assembling a contiguous SC2 genome from these amplicon read data is a critical step in gaining insight from sequenced samples.

Throughout this process, quality control checkpoints should be conducted at different stages of bioinformatics analysis, including QC of raw read data, pre-processing stages (trimming and filtering), and alignment/assembly.

In this context, raw read data refers to the fastq read files generated by the NGS platform, and processed reads are read files that have had adapter sequences removed, have been trimmed based on size and quality, and have been dehosted (had human reads removed). Alignment QC refers to the examination of the BAM or VCF files generated during the consensus genome assembly process, and Consensus Assembly QC refers to an assessment of the fasta assembly file itself.

Future updates to this document will include QC guidance for SARS-CoV-2 genomic epidemiology analysis and wastewater sequencing data.

QC Acceptance Criteria

When performing QC checks on SARS-CoV-2 genomic data, it can be helpful to establish acceptance thresholds to determine how and when data will be reported and used to inform public health decision-making. Below are this working group's suggested QC thresholds for SARS-CoV-2 genomic data, as well as various resources and metric definitions to assist public health laboratories in implementing SARS-CoV-2 sequencing and analysis protocols.

PHA4GE Suggested Thresholds

Read QC Metrics
Number of Reads	Protocol dependent (e.g. 100,000 reads from Artic amplicons sequenced on Illumina MiSeq)
Percent Human Reads	<20%
Alignment QC Metrics
Average Read Depth	≥100x
Percent mapped reads to Wuhan reference genome	≥65%
Coverage at a Single Base to Make a Base Call	≥50x
Percent Agreement	80%
Average base quality of aligned reads	>15
Assembly QC Metrics
Percent reference coverage	>83%
Number of Ns	<5,000bp
Assembly length unambiguous	>24,000bp
NTC percent coverage	<10%
Lineage defining mutations	≥60%
S-gene coverage	≥99%
S-gene frameshifts sequence	0
S-gene ambiguous bases	<10%
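
The thresholds above can be applied programmatically once the metrics for a sample have been computed. The following is a minimal sketch, not part of any PHA4GE tool: the metric key names, the `THRESHOLDS` table layout, and the `check_sample` helper are all hypothetical, and only a subset of the suggested thresholds is encoded.

```python
import operator

# Comparison operators keyed by the symbols used in the threshold table.
OPS = {
    "<": operator.lt,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}

# A subset of the suggested thresholds above. Key names are hypothetical.
THRESHOLDS = {
    "percent_human_reads": ("<", 20.0),
    "average_read_depth": (">=", 100.0),
    "percent_mapped_reads": (">=", 65.0),
    "percent_reference_coverage": (">", 83.0),
    "number_of_ns": ("<", 5000),
    "assembly_length_unambiguous": (">", 24000),
    "s_gene_coverage": (">=", 99.0),
    "s_gene_frameshifts": ("==", 0),
}


def check_sample(metrics):
    """Return {metric_name: passed} for each threshold present in `metrics`."""
    results = {}
    for name, (symbol, limit) in THRESHOLDS.items():
        if name in metrics:
            results[name] = OPS[symbol](metrics[name], limit)
    return results


# Illustrative values only; these are not real sample results.
sample = {
    "percent_human_reads": 3.2,
    "average_read_depth": 850.0,
    "percent_reference_coverage": 81.5,
    "number_of_ns": 5530,
}
report = check_sample(sample)
failed = sorted(name for name, ok in report.items() if not ok)
print(failed)
```

Metrics that are missing from the input are skipped rather than failed, so a laboratory would still need to decide how to treat samples with incomplete QC data.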

QC Metric Definitions

Read QC Metrics

Different sequencing platforms use different technologies to determine the nucleotide sequence of the genetic material that they are processing, but all of these technologies converge on the fastq file format. For example, Illumina uses a sequencing-by-synthesis approach which involves assembling copies of each read using fluorescently tagged nucleotides and taking high resolution pictures of each read as each nucleotide is added to the read. The base calls derived from these images are stored in binary base call (BCL) files, and BCL files are converted into fastq files using the bcl2fastq program. On the other hand, Oxford Nanopore Technologies sequencing platforms run single strands of nucleic acids through nano-scale protein pores. An electric current is run across the pore, and the changes in current are detected as each nucleotide passes through the pore. The raw electric signal is captured in the fast5 file format and converted into fastq file format using the basecalling program guppy. Because these sequencing platforms work so differently, there are different considerations when assessing the quality of the raw sequence data (the fastq files).

Term	Definition
Reads	Fragments of sequence DNA base pairs that are generated during sequencing; also referred to as the raw data generated from a sequencing platform
Number of Reads	Count of reads generated in an NGS run
BCL Files	Binary base call files produced by Illumina instruments from flow cell images, converted to fastq via the bcl2fastq program
FAST5 Files	Raw electrical signal files produced by Oxford Nanopore Technologies sequencing equipment, converted to fastq via basecalling software (guppy is the current industry standard)
Basecalling	The computational process of translating raw electrical signal files (FAST5) or flow cell images (Illumina) into nucleotide sequence • See: Performance of neural network basecalling tools for Oxford Nanopore sequencing
FASTQ Files	The common “raw” sequence files containing nucleotide sequences and their associated quality scores • The quality scores contained within a fastq file are encoded as ASCII characters so that each score requires one character, making the string of nucleotide sequences and the string of quality scores equal in length • The quality score (Q Score) encodes the estimated probability (P) that the base call at the associated nucleotide position is incorrect • Q scores typically range from 0 to 40 and are defined as: Q = -10log₁₀P • Quality Scores for Next-Generation Sequencing - illumina • Measuring sequencing accuracy - illumina • Q Scores for Illumina and ONT sequencing will differ dramatically • An excellent Illumina run will have an average Q Score of 27-30 • An excellent Nanopore run will have an average Q Score of 12-15 • Low Q Scores indicate poor sequencing quality which will impact all downstream analyses
Ambiguity / Mixed Sites	The percent of each read where the base called is ambiguous • See: IUPAC Codes
Sequence GC Content	The GC content of reads should be normally distributed
Raw vs Processed Reads	It is typical for some reads to be removed during quality filtering. Based on the known characteristics of the sample, one should be able to predict a reasonable proportion of the reads to be removed.
Percent Human Reads	Percentage of human read data sequenced in an NGS run.
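
The relationship between the ASCII-encoded quality string and Q scores described in the FASTQ Files entry can be illustrated with a short sketch. It assumes the Phred+33 encoding (ASCII offset of 33) used by current Illumina and ONT fastq files; the function names are hypothetical and are not taken from any specific tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a fastq quality string into a list of integer Q scores.

    Assumes Phred+33 encoding: each character's ASCII code minus 33.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(P) to get the base-call error probability P."""
    return 10 ** (-q / 10)


def mean_q(quality_string, offset=33):
    """Arithmetic mean of the Q scores in one read's quality string."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


# One quality character per base, so sequence and quality strings match in length.
sequence = "ACGT"
quality = "I5+!"
assert len(sequence) == len(quality)

for base, q in zip(sequence, phred_scores(quality)):
    print(base, q, error_probability(q))
```

A Q score of 20 corresponds to an error probability of 0.01 (1 in 100) and a Q score of 30 to 0.001 (1 in 1,000), which is why the expected averages for Illumina and Nanopore runs listed above represent very different error rates.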

Alignment QC Metrics

Consensus-genome assembly approaches have been widely adopted for SARS-CoV-2 genomic analysis. In this approach, read data are aligned to a reference genome, usually Wuhan-1 (MN908947.3, https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3), and each position in the alignment is assessed to determine the consensus basecall supported by the read data at that position. The alignments are captured in a BAM file that can be used to assess critical quality control metrics. Additionally, VCF files can be produced from an alignment to call variant positions relative to the reference genome; these VCF files can also be inspected to assess the quality of identified variant positions.

Term	Definition
Sequence Alignment	A method of arranging nucleic acid (DNA/RNA) or protein sequences to identify regions of similarity or conservation that may reflect functional, structural, or evolutionary relationships. Pairwise sequence alignment involves two sequences, whereas multiple sequence alignment involves three or more sequences
Sequencing Depth	The number of reads that cover a particular nucleotide, section/amplicon of the genome, or average across the reference sequence • Ideally a min depth of 10X for Illumina or 20X for Nanopore would be reached • Uniform depth of coverage is better • Nonuniform depth may be indicative of differential amplification of amplicons, or amplicon dropout • This can be assessed using bedtools
Percent Agreement	Percentage of base call concordance in reads mapped at a designated position in the reference genome
Coverage	The percentage of the reference sequence that is covered by the reads that have been produced • This metric is typically used in conjunction with depth
Percent Mapped Reads	Percentage of read data mapped to a specified reference genome
Average Base Quality of Aligned Reads	Mean phred score of read data mapped to a reference genome
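
As a rough illustration of how depth, coverage, and percent agreement relate to one another, the sketch below computes them from plain Python values. It assumes the per-position depths have already been extracted from a BAM file by some other tool; the function names and the input layout are hypothetical, and the default minimum depth of 50 simply mirrors the suggested base-call threshold above.

```python
def depth_summary(depths, min_depth=50):
    """Summarize per-position read depths across a reference.

    `depths` holds one integer per reference position.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "mean_depth": sum(depths) / len(depths),
        "percent_at_min_depth": 100.0 * covered / len(depths),
    }


def percent_agreement(base_counts):
    """Percentage of reads supporting the most common base at one position.

    `base_counts` maps each observed base to its read count.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * max(base_counts.values()) / total


# Illustrative values only: four reference positions and one pileup column.
print(depth_summary([0, 40, 120, 240]))
print(percent_agreement({"A": 90, "G": 10}))
```

A high mean depth can hide positions with little or no coverage, which is why the two depth-related numbers are reported together.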

Consensus Assembly QC Metrics

An examination of the resulting assembly quality is also critical, as these assemblies often inform downstream analyses such as lineage and clade assignments and genomic epidemiology investigations.

Term	Definition
Length of the Assembly	Should be similar to that of the reference. If it is not, why? Have there been large insertions/deletions, gene duplications, etc.?
Total Number of N’s	The total number of ambiguous basecalls in the assembly
Length of Strings of N’s	While the total number of N’s is important, the length of the strings of N’s can indicate issues with upstream laboratory workflows. If a string of N’s is consistently reported over a specific region of the genome, then one can cross reference the primer binding loci in the bed file to see if one amplicon is dropping out or amplifying at a lower rate than the other amplicons. This could be due to amplification bias, resulting from a large differential in the GC content between the amplicons. This may also indicate that you have a mixed population and there may be a subpopulation with a different sequence in the ambiguous region.
Percent Reference Coverage	Percentage of the Wuhan-1 reference genome represented in the consensus assembly
Number of Ns	Number of ambiguous base calls (Ns) incorporated into the consensus assembly
Assembly Length Unambiguous	Number of unambiguous base calls (ATCGs) incorporated into the consensus assembly
NTC Percent Coverage	Percentage of the Wuhan-1 reference genome represented in the consensus assembly of a non-template control (NTC; i.e. negative control)
Lineage Defining Mutations	Percentage of lineage-specific mutations represented in the consensus assembly
S-gene Coverage	Percentage of the SARS-CoV-2 S-gene represented in the consensus assembly
S-gene Frameshifts	S-gene insertion or deletion events represented in the consensus assembly
S-gene Ambiguous Bases	Number of ambiguous base calls (Ns) incorporated into the S-gene of the consensus assembly
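
Several of the assembly metrics defined above can be derived directly from the consensus sequence. The sketch below is one possible way to do so and makes a few assumptions that are not stated in this document: the function name is hypothetical, the reference length of 29,903 bp is that of MN908947.3, and percent reference coverage is approximated as unambiguous bases divided by reference length, which may differ from how a given pipeline calculates it.

```python
import re


def assembly_metrics(sequence, reference_length=29903):
    """Compute basic QC metrics from a consensus sequence string.

    Returns the number of Ns, the number of unambiguous bases (A, C, G, T),
    the runs of consecutive Ns as (0-based start, length) pairs, and an
    approximate percent reference coverage (an assumption of this sketch).
    """
    seq = sequence.upper()
    number_of_ns = seq.count("N")
    unambiguous = sum(seq.count(base) for base in "ACGT")
    n_runs = [(m.start(), m.end() - m.start()) for m in re.finditer(r"N+", seq)]
    return {
        "number_of_ns": number_of_ns,
        "assembly_length_unambiguous": unambiguous,
        "n_runs": n_runs,
        "percent_reference_coverage": 100.0 * unambiguous / reference_length,
    }


# Toy sequence for illustration; a real consensus would be read from a fasta file.
toy = "ACGTNNNNACGTNNAC"
metrics = assembly_metrics(toy, reference_length=len(toy))
print(metrics["number_of_ns"], metrics["assembly_length_unambiguous"])
print(metrics["n_runs"])
```

The start positions of long runs of Ns can then be compared against the amplicon coordinates in the primer bed file to check for amplicon dropout, as described in the Length of Strings of N’s entry.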

Additional QC Resources and Materials

ncov-tools - Tools and plots for performing quality control on coronavirus sequencing results.
Quality Management Systems Tools & Resources - Process Management - US CDC Quality Management Systems for SARS-CoV-2 NGS Data
TheiaCoV QC output Video - Video tutorial for assessing SARS-CoV-2 genomic characterization with Theiagen's TheiaCoV workflows
StaPH-B Glossary - US State Public Health Bioinformatics (StaPH-B) working group's bioinformatics glossary of terms
PHA4GE Bioinformatics Solutions - This working group's list of bioinformatics solutions for SARS-CoV-2 genomic analysis
ECDC: Guidance for representative and targeted genomic SARS-CoV-2 monitoring - European CDC Guidance Document for SARS-CoV-2 genomic analysis
