Record Variants, isolates, and mutations

The DRDB database stores tables of variants, isolates, and isolate mutations. This wiki page provides general guidance and rules for maintaining these tables.

Definitions

A variant is a set of sequences whose Spike amino acid mutations match a predefined list of amino acid mutations. For example, the variant "Alpha" is any sequence that contains the Spike mutations Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H, with a few exceptions allowed.

An isolate, or precise mutation pattern, is a set of sequences that are identical at the amino acid level. An isolate is a subclass unit of a variant. The variants table and the isolates table have a one-to-many relationship.

A mutation is defined as an amino acid difference from the Wuhan-Hu-1 reference sequence, an insertion, or a deletion.

An isolate mutation is a combination of isolate name, gene, and mutation. The isolates table and the isolate_mutations table have a one-to-many relationship.

Mutation format

An amino acid mutation always contains the following four aspects:

Gene (gene)
Position in the gene (position)
Reference amino acid (refAA)
Mutant amino acid / indel / stop (mutAA)
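
The four aspects map naturally onto a small record type. The following is a minimal sketch only; the class name `Mutation` and the Python field names are hypothetical and are not taken from the DRDB code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record holding the four aspects of a mutation."""
    gene: str       # e.g. "S", "RdRP", "ORF3a"
    position: int   # position in the gene
    ref_aa: str     # reference amino acid (Wuhan-Hu-1)
    mut_aa: str     # mutant amino acid, or "ins" / "del" / "stop"


d614g = Mutation(gene="S", position=614, ref_aa="D", mut_aa="G")
```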

Two formats are used when a mutation needs to be represented as text:

Non-display format

This format is used mainly by isolates.isolate_name. It is the "internal" format of the database and is more machine-friendly.

A mutation should start with the gene, followed by a colon and the position, and end with the mutAA. E.g. nsp2:106L, RdRP:323L, ORF3a:257del.
For insertion mutations, "ins" should be used in place of mutAA.
For deletion mutations, "del" should be used in place of mutAA.
For stop codons, "stop" should be used in place of mutAA.

To represent a list of mutations, first sort the mutations by their location in the genome, then join them with the plus symbol "+". The gene should be omitted unless the mutation is the first one listed for that gene. For example, RdRP:323L+S:69del+70del+144del+145del+501Y+570D+614G+681H+716I+982A+1118H.
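
The list rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: the function name `non_display_list` and the tuple layout are hypothetical, and the gene order is taken from the list of valid gene values given later on this page, which follows genome order.

```python
# Gene order along the genome (assumed from the valid gene values on this page).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


def non_display_list(mutations):
    """Join (gene, position, mut_aa) tuples into the non-display format."""
    # Sort by genome location: gene order first, then position in the gene.
    ordered = sorted(mutations, key=lambda m: (GENE_ORDER.index(m[0]), m[1]))
    parts = []
    prev_gene = None
    for gene, position, mut_aa in ordered:
        text = f"{position}{mut_aa}"
        # Only the first mutation of each gene carries the gene prefix.
        if gene != prev_gene:
            text = f"{gene}:{text}"
        parts.append(text)
        prev_gene = gene
    return "+".join(parts)


example = non_display_list(
    [("S", 501, "Y"), ("RdRP", 323, "L"), ("S", 69, "del"), ("S", 70, "del")]
)
print(example)  # RdRP:323L+S:69del+70del+501Y
```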

Display format

This format is used when the text will be shown on a website or in a program. It is more friendly to human readers.

A mutation should start with the gene, followed by a colon, the refAA, and the position, and end with the mutAA. E.g. nsp2:P106L, RdRP:P323L, ORF3a:N257del, with the following two exceptions:
1. For mutations in the Spike gene, the gene should be omitted: A222V, N501Y, D614G, etc.
2. For deletions, the refAA and mutAA should be omitted and the Greek letter "Δ" should be placed before the position: Δ144, ORF3a:Δ257.
Neighboring deletions should be represented as one, with a dash connecting the first and last positions: nsp6:Δ107-109, Δ69-70.
For insertion mutations, "ins" should be used in place of mutAA.
For stop codons, "*" should be used in place of mutAA.

To represent a list of mutations, first sort the mutations by their location in the genome, then join them with a plus symbol surrounded by single spaces, " + ". The gene should be omitted unless the mutation is the first one listed for that gene. For example, RdRP:P323L + Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H.
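
The display rules, including the merging of neighboring deletions, can be sketched as follows. This is a hedged illustration: the function name `display_list` and the tuple layout are hypothetical, the gene order is assumed from the list of valid gene values on this page, and the sketch covers substitutions and deletions only (insertions and stop codons are rejected rather than guessed at).

```python
# Gene order along the genome (assumed from the valid gene values on this page).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


def display_list(mutations):
    """Format (gene, position, ref_aa, mut_aa) tuples in the display format."""
    ordered = sorted(mutations, key=lambda m: (GENE_ORDER.index(m[0]), m[1]))
    # Merge neighboring deletions of the same gene into one start-end run.
    items = []
    for gene, pos, ref_aa, mut_aa in ordered:
        if mut_aa in ("ins", "stop"):
            raise NotImplementedError("insertions and stop codons are not covered")
        last = items[-1] if items else None
        if (mut_aa == "del" and last is not None and last["mut"] == "del"
                and last["gene"] == gene and last["end"] + 1 == pos):
            last["end"] = pos
        else:
            items.append({"gene": gene, "start": pos, "end": pos,
                          "ref": ref_aa, "mut": mut_aa})
    parts, prev_gene = [], None
    for item in items:
        if item["mut"] == "del":
            span = str(item["start"])
            if item["end"] != item["start"]:
                span += f"-{item['end']}"
            text = "Δ" + span
        else:
            text = f"{item['ref']}{item['start']}{item['mut']}"
        # Spike never carries a gene prefix; other genes only on first mention.
        if item["gene"] != "S" and item["gene"] != prev_gene:
            text = f"{item['gene']}:{text}"
        parts.append(text)
        prev_gene = item["gene"]
    return " + ".join(parts)


example = display_list([
    ("S", 614, "D", "G"), ("RdRP", 323, "P", "L"), ("S", 69, "H", "del"),
    ("S", 70, "V", "del"), ("S", 144, "Y", "del"), ("S", 501, "N", "Y"),
])
print(example)  # RdRP:P323L + Δ69-70 + Δ144 + N501Y + D614G
```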

Variants

Naming rules

In general, the most widely used name should be used as the primary name of a variant/lineage. In addition, a modifier can be added after the variant's main name.

If a variant is a WHO VOC or VOI, the WHO name (Alpha, Beta, etc.) should be used. The PANGO lineage name should be listed as the first synonym in variant_synonyms.csv.
If a variant is neither a WHO VOC nor a VOI, the PANGO lineage name should be used.
If a variant is a known sub-lineage of another lineage, the sub-lineage name should be used as the variant name. E.g. Q.1, a sub-lineage of Alpha.
A modifier can be added to a variant name in the following formats:
1. To indicate an additional mutation, use a slash followed by the mutation, including its refAA. E.g. Alpha/E484K.
2. To indicate a missing mutation, use "w/o" followed by the mutation, including its refAA. E.g. Iota w/o E484K.
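
As a small illustration of the naming rules, the helpers below pick a primary name and append a single modifier. The function names `primary_name` and `add_modifier` are hypothetical; how several modifiers should be combined is not specified on this page, so the sketch handles one modifier at a time.

```python
def primary_name(pango_lineage, who_name=None):
    """Prefer the WHO name (VOC/VOI); otherwise fall back to the PANGO lineage."""
    return who_name if who_name else pango_lineage


def add_modifier(variant_name, mutation, missing=False):
    """Append one modifier; `mutation` is expected in display format (e.g. E484K)."""
    if missing:
        return f"{variant_name} w/o {mutation}"
    return f"{variant_name}/{mutation}"


alpha_e484k = add_modifier(primary_name("B.1.1.7", who_name="Alpha"), "E484K")
iota_without = add_modifier("Iota", "E484K", missing=True)
```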

Variant modifiers: the rationale and when to use

The variant is a key aggregation factor used by our data summary program, and it has the potential to be used by others too. Modifiers help distinguish neutralization results for variants with important mutations from those without.

Therefore, only important mutations should be added as modifiers. Here is an incomplete list of rules for identifying important mutations:

The mutation must be a known resistance mutation;
The mutation must be located in an important region (e.g. RBD/RBM); or
The mutation must be the major topic of a study that is related to the variant.

Variant modifiers should use the mutation display format.

Synonyms

A variant can have multiple synonyms. In our program, the synonyms are displayed with the variant's primary name when space allows. The following can be added as synonyms:

If a variant includes only one or two Spike mutations, the display format can be added as a synonym. For example, "D614G" can be added as a synonym of "B.1".
If a variant uses the WHO name, the PANGO lineage should be added as a synonym. For example, "Alpha" as the primary name and "B.1.1.7" as the synonym.
A name suggested by the author can also be added as a synonym. For example, "A.27/A227V" as the primary name and "A.27.RN" as the synonym.

Isolates

Naming rules

The isolate name is mostly used internally and is never shown to our website users. The following are several (non-enforced) good practices:

Use the GISAID virus name / GenBank isolate name when possible: if a GISAID number and/or GenBank accession is provided, use the full name from the source and save the GISAID number under gisaid_id and the GenBank accession under genbank_accn. AVOID using the GISAID number/GenBank accession as the isolate name, since it is not readable.
Use the non-display format mutation list when GISAID/GenBank names are not available. AVOID using "<Variant> Spike" or "<Variant> full genome", since this makes it harder to tell the minor differences between isolates of the same variant.
Use a combination of ref_name, var_name and genomic region. For example, Truffot21 B.1.1.7 spike, Wang22 BA.2 full genome.
For selection data, include patient characteristics and collection day. For example, Truffot21 72/M D10.
For an extremely long mutation list, e.g. SARS-CoV or WIV1, using "SARS-CoV" or "WIV1" is acceptable.

Variant classification

Assigning a variant to an isolate can be somewhat tricky. The PANGOLIN program is not always reliable, and it does not weigh important mutations. The following steps are considered good practice:

1. If the sequence is available, use the Sierra program to determine the PANGO lineage and the mutation list. Go to step 4.
2. If the sequence is not available, check whether the PANGO lineage (or an equivalent classification) and the mutation list are provided by the author. Go to step 4.
3. If no PANGO lineage (or equivalent classification) is found, go to step 7.
4. Use Outbreak.info to find the consensus mutations of the PANGO lineage. A query URL can be constructed like this: https://outbreak.info/situation-reports?pango=B.1.1.7.
5. Compare the isolate's mutation list with Outbreak.info's consensus mutations, especially the Spike mutations. Determine the total number of Spike mutations (numTotal), the number added/removed (numDiff), and whether important mutations are added/removed.
6. If numDiff ≤ 3 or numDiff divided by numTotal ≤ 50%:
   1. If no important mutations are added/removed, the isolate's variant should be the PANGO lineage.
   2. If important mutations are added/removed, the isolate's variant should be the PANGO lineage concatenated with the modifier.
7. Otherwise, the isolate should not be linked to any variant.
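
The comparison and threshold steps above can be sketched as a small function. This is a sketch under stated assumptions, not the curators' tooling: the function name `classify_isolate` is hypothetical, numTotal is assumed to be the number of Spike mutations in the isolate (the page does not say which list it counts), and the caller supplies the set of mutations considered important. The function returns the lineage plus the important added and removed mutations, leaving the final modifier wording to the curator.

```python
def classify_isolate(pango, isolate_spike, consensus_spike, important):
    """Return (lineage or None, important added, important removed).

    All mutation arguments are iterables of display-format Spike mutations.
    """
    isolate, consensus = set(isolate_spike), set(consensus_spike)
    added = isolate - consensus
    removed = consensus - isolate
    num_diff = len(added) + len(removed)
    num_total = len(isolate)  # assumption: total counts the isolate's mutations
    close_enough = num_diff <= 3 or (
        num_total > 0 and num_diff / num_total <= 0.5
    )
    if not close_enough:
        # Too different: do not link the isolate to any variant.
        return None, [], []
    important = set(important)
    return pango, sorted(added & important), sorted(removed & important)


alpha = {"Δ69-70", "Δ144-145", "N501Y", "A570D", "D614G",
         "P681H", "T716I", "S982A", "D1118H"}
result = classify_isolate("B.1.1.7", alpha | {"E484K"}, alpha, {"E484K"})
print(result)  # ('B.1.1.7', ['E484K'], [])
```

In the example, one important mutation is added, so the isolate would be recorded under the lineage's variant name with the E484K modifier.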

Isolate mutations

Isolate mutations should be added to the isolate_mutations table when a new isolate is added.

If the sequence is available, you can find the mutation list using the Sierra program.
If the sequence is not available, the author might provide the mutation list somewhere in the publication.
If the author only provides the PANGO lineage or equivalent, use the consensus mutations from Outbreak.info.

Format

The mutation format is tightly constrained by the database. The following are the valid values (case-sensitive):

Field gene: nsp1, nsp2, PLpro, nsp4, _3CLpro, nsp6, nsp7, nsp8, nsp9, nsp10, RdRP, nsp13, nsp14, nsp15, nsp16, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10.
Field amino_acid: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, X (out-of-frame deletion), stop, del, and ins.

Note that nsp3 is PLpro, nsp5 is _3CLpro (an underscore is added before the "3" due to a program naming restriction), and nsp11/nsp12/nsp12b are (partly) RdRP.
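
A pre-check of CSV values against these lists might look like the sketch below. It is illustrative only: the names `VALID_GENES`, `VALID_AMINO_ACIDS`, `GENE_ALIASES`, `normalize_gene`, and `is_valid_amino_acid` are hypothetical, and the real constraints live in the database itself. Only the nsp3 and nsp5 aliases are mapped, because nsp11/nsp12/nsp12b also require a position conversion.

```python
VALID_GENES = frozenset([
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
])
VALID_AMINO_ACIDS = frozenset(
    list("ACDEFGHIKLMNPQRSTVWY") + ["X", "stop", "del", "ins"]
)
# Aliases that need no position change (per the note above).
GENE_ALIASES = {"nsp3": "PLpro", "nsp5": "_3CLpro"}


def normalize_gene(gene):
    """Map a known alias to its accepted name; reject anything else (case-sensitive)."""
    gene = GENE_ALIASES.get(gene, gene)
    if gene not in VALID_GENES:
        raise ValueError(f"invalid gene: {gene!r}")
    return gene


def is_valid_amino_acid(value):
    """Case-sensitive membership test for the amino_acid field."""
    return value in VALID_AMINO_ACIDS
```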

SARS-CoV-2 genome / gene position conversion table

Apr 5, 2022 Update: The recent update in hivdb/covid-drdb@854f1d6 added automatic conversion of gene and position in isolate_mutations CSVs via the autofill command. The synonym genes listed below are all supported, and manual conversion is no longer necessary.

The reference is Wuhan-Hu-1. Pay extra attention to unmatched synonyms/refAA. Check the NA position if necessary. Open an issue if you have questions or experience difficulties.

Genome NA position	Synonyms	Alt AA position	Acceptable gene	AA position
266-805	ORF1a / ORF1ab	1-180	nsp1	1-180
806-2719	ORF1a / ORF1ab	181-818	nsp2	1-638
2720-8554	ORF1a / ORF1ab	819-2763	PLpro	1-1945
8555-10054	ORF1a / ORF1ab	2764-3263	nsp4	1-500
10055-10972	ORF1a / ORF1ab	3264-3569	_3CLpro	1-306
10055-10972	MPro / MainPro		_3CLpro	1-306
10973-11842	ORF1a / ORF1ab	3570-3859	nsp6	1-290
11843-12091	ORF1a / ORF1ab	3860-3942	nsp7	1-83
12092-12685	ORF1a / ORF1ab	3943-4140	nsp8	1-198
12686-13024	ORF1a / ORF1ab	4141-4253	nsp9	1-113
13025-13441	ORF1a / ORF1ab	4254-4392	nsp10	1-139
13442-13468	ORF1a / ORF1ab	4393-4401	RdRP	1-9
13442-13468	nsp11		RdRP	1-9
13468-16236	ORF1b / nsp12 / nsp12b	1-923	RdRP	10-932
16237-18039	ORF1b	924-1524	nsp13	1-601
18040-19620	ORF1b	1525-2051	nsp14	1-527
19621-20658	ORF1b	2052-2397	nsp15	1-346
20659-21552	ORF1b	2398-2695	nsp16	1-298
13442-16236	ORF1ab	4393-5324	RdRP	1-932
16237-18039	ORF1ab	5325-5925	nsp13	1-601
18040-19620	ORF1ab	5926-6452	nsp14	1-527
19621-20658	ORF1ab	6453-6798	nsp15	1-346
20659-21552	ORF1ab	6799-7096	nsp16	1-298
21563-25381	ORF2 / NS2 / gp02		S	1-1273
25393-26217	NS3 / gp03		ORF3a	1-275
26245-26469	ORF4 / NS4 / gp04		E	1-75
26523-27188	ORF5 / NS5 / gp05		M	1-222
27202-27384	NS6 / gp06		ORF6	1-61
27394-27756	NS7a / gp07		ORF7a	1-121
27756-27884	NS7b / gp08		ORF7b	1-43
27894-28256	NS8 / gp09		ORF8	1-121
28274-29530	ORF9 / NS9 / gp10		N	1-419
29558-29671	NS10 / gp11		ORF10	1-38
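
For reference, the ORF1ab rows of the table reduce to a simple range lookup. The sketch below is illustrative and is not the autofill command's implementation; the names `ORF1AB_RANGES` and `orf1ab_to_gene` are hypothetical, and the ranges are copied from the ORF1ab "Alt AA position" column above.

```python
# (first ORF1ab position, last ORF1ab position, accepted gene)
ORF1AB_RANGES = [
    (1, 180, "nsp1"), (181, 818, "nsp2"), (819, 2763, "PLpro"),
    (2764, 3263, "nsp4"), (3264, 3569, "_3CLpro"), (3570, 3859, "nsp6"),
    (3860, 3942, "nsp7"), (3943, 4140, "nsp8"), (4141, 4253, "nsp9"),
    (4254, 4392, "nsp10"), (4393, 5324, "RdRP"), (5325, 5925, "nsp13"),
    (5926, 6452, "nsp14"), (6453, 6798, "nsp15"), (6799, 7096, "nsp16"),
]


def orf1ab_to_gene(position):
    """Convert an ORF1ab amino acid position to (gene, position in gene)."""
    for start, end, gene in ORF1AB_RANGES:
        if start <= position <= end:
            return gene, position - start + 1
    raise ValueError(f"position {position} is outside ORF1ab (1-7096)")


print(orf1ab_to_gene(4715))  # ('RdRP', 323)
```

With this lookup, ORF1ab:P4715L becomes RdRP:P323L, matching the display-format example used earlier on this page.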

If you have any issues or questions, please create a new issue.
