
Record Variants, isolates, and mutations

Kaiming Tao edited this page Jun 17, 2022 · 1 revision

The DRDB database stores tables of variants, isolates, and isolate mutations. This wiki page provides general guidance and rules for maintaining these tables.

Definitions

A variant is a set of sequences whose Spike amino acid mutations match a predefined list of amino acid mutations. For example, variant "Alpha" is any sequence that contains the Spike mutations Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H, with several acceptable exceptions.

An isolate, or a precise mutation pattern, is a set of sequences that are identical at the amino acid level. An isolate is the subclass unit of a variant. The variants table and the isolates table have a one-to-many relationship.

A mutation is defined as an amino acid difference from the Wuhan-Hu-1 reference sequence, an insertion, or a deletion.

An isolate mutation is a combination of isolate name, gene, and mutation. The isolates table and the isolate_mutations table have a one-to-many relationship.

Mutation format

An amino acid mutation always contains the following four aspects:

  1. Gene (gene)
  2. Position in the gene (position)
  3. Reference amino acid (refAA)
  4. Mutant amino acid / indel / stop (mutAA)

Two formats are used when a mutation needs to be represented as text:

Non-display format

This format is used mainly by isolates.isolate_name. It is the "internal" format of the database and is more machine-friendly.

  1. A mutation should start with the gene, followed by a colon and the position, and end with the mutAA. The refAA is not written. E.g. nsp2:106L, RdRP:323L, ORF3a:257del.
  2. For insertion mutations, "ins" should be used in place of the mutAA.
  3. For deletion mutations, "del" should be used in place of the mutAA.
  4. For stop codons, "stop" should be used in place of the mutAA.

To represent a list of mutations, the mutations must first be sorted by their location in the genome. The plus symbol "+" should be used to join the list. The gene should be omitted if the mutation is not the first of its gene. For example, RdRP:323L+S:69del+70del+144del+145del+501Y+570D+614G+681H+716I+982A+1118H.
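The non-display rules above can be sketched in a few lines of Python. This is only an illustration, not the DRDB implementation: the `Mutation` tuple and the `non_display` function are hypothetical names, and the genome order is assumed to follow the order of the valid gene list given in the Format section of this page.

```python
from typing import NamedTuple

# Assumed genome order of genes (same order as the valid gene list on this page).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


class Mutation(NamedTuple):
    """Hypothetical container for one mutation (not a DRDB type)."""
    gene: str
    position: int
    mut_aa: str  # amino acid letter, "ins", "del", or "stop"


def non_display(mutations):
    """Join mutations using the non-display format rules."""
    # Sort by gene order along the genome, then by position within the gene.
    ordered = sorted(
        mutations, key=lambda m: (GENE_ORDER.index(m.gene), m.position)
    )
    parts = []
    prev_gene = None
    for m in ordered:
        text = f"{m.position}{m.mut_aa}"
        # Only the first mutation of each gene carries the "gene:" prefix.
        if m.gene != prev_gene:
            text = f"{m.gene}:{text}"
        parts.append(text)
        prev_gene = m.gene
    return "+".join(parts)


example = [
    Mutation("S", 501, "Y"),
    Mutation("RdRP", 323, "L"),
    Mutation("S", 69, "del"),
    Mutation("S", 70, "del"),
]
print(non_display(example))  # RdRP:323L+S:69del+70del+501Y
```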

Display format

This format is used when the text will be shown on a website or in a program. It is more friendly to human readers.

  1. A mutation should start with the gene, followed by a colon, the refAA, and the position, and end with the mutAA. E.g. nsp2:P106L, RdRP:P323L, ORF3a:N257del, with the following two exceptions:
    1. For mutations of the Spike gene, the gene should be omitted: A222V, N501Y, D614G, etc.
    2. For deletions, the refAA and mutAA should be omitted and the Greek letter "Δ" should be placed before the position: Δ144, ORF3a:Δ257.
  2. Neighboring deletions should be represented as one, using a dash to connect the start and end positions: nsp6:Δ107-109, Δ69-70.
  3. For insertion mutations, "ins" should be used in place of refAA.
  4. For stop codons, "*" should be used in place of refAA.

To represent a list of mutations, the mutations must first be sorted by their location in the genome. A plus symbol with a space on each side (" + ") should be used to join the list. The gene should be omitted if the mutation is not the first of its gene. For example, RdRP:P323L + Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H.
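As a rough sketch of the display rules above, the following Python handles substitutions and deletions, including the merging of neighboring deletions. Insertions and stop codons are deliberately left out. The `Mutation` tuple and `display` function are hypothetical names, and the gene order is assumed to match the valid gene list in the Format section of this page.

```python
from typing import NamedTuple

# Assumed genome order of genes (same order as the valid gene list on this page).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


class Mutation(NamedTuple):
    """Hypothetical container for one mutation (not a DRDB type)."""
    gene: str
    position: int
    ref_aa: str
    mut_aa: str  # amino acid letter or "del" in this sketch


def display(mutations):
    """Join substitutions and deletions using the display format rules."""
    ordered = sorted(
        mutations, key=lambda m: (GENE_ORDER.index(m.gene), m.position)
    )
    parts = []
    prev_gene = None
    i = 0
    while i < len(ordered):
        m = ordered[i]
        if m.mut_aa == "del":
            # Extend the run while the next item is a deletion at the
            # following position of the same gene.
            j = i
            while (
                j + 1 < len(ordered)
                and ordered[j + 1].gene == m.gene
                and ordered[j + 1].mut_aa == "del"
                and ordered[j + 1].position == ordered[j].position + 1
            ):
                j += 1
            end = ordered[j].position
            text = f"Δ{m.position}" if end == m.position else f"Δ{m.position}-{end}"
            i = j + 1
        else:
            text = f"{m.ref_aa}{m.position}{m.mut_aa}"
            i += 1
        # Spike never gets a prefix; other genes only on their first mutation.
        if m.gene != "S" and m.gene != prev_gene:
            text = f"{m.gene}:{text}"
        parts.append(text)
        prev_gene = m.gene
    return " + ".join(parts)


example = [
    Mutation("S", 501, "N", "Y"),
    Mutation("S", 70, "V", "del"),
    Mutation("S", 69, "H", "del"),
    Mutation("RdRP", 323, "P", "L"),
]
print(display(example))  # RdRP:P323L + Δ69-70 + N501Y
```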

Variants

Naming rules

In general, the most widely used name should be used as the primary name of a variant/lineage. In addition, a modifier can be added after the variant's main name.

  1. If a variant is a WHO VOC or VOI, the WHO name (Alpha, Beta, etc.) should be used. The PANGO lineage name should be listed as the first synonym in variant_synonyms.csv.
  2. If a variant is neither a WHO VOC nor a VOI, the PANGO lineage name should be used.
  3. If a variant is a known sub-lineage of another lineage, the sub-lineage should be used as the variant name. E.g. use Q.1 rather than Alpha.
  4. A modifier can be added to a variant in the following formats:
    1. To indicate an additional mutation, use a slash followed by the mutation, including its reference amino acid. E.g. Alpha/E484K.
    2. To indicate a missing mutation, use "w/o" followed by the mutation, including its reference amino acid. E.g. Iota w/o E484K.
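The two modifier formats can be illustrated with a small helper. The function name `with_modifier` is hypothetical, and the sketch covers a single modifier only, since this page does not say how several modifiers should be combined.

```python
def with_modifier(variant, mutation, missing=False):
    """Append one modifier to a variant name.

    `mutation` is expected in display format (e.g. "E484K").
    """
    if missing:
        # Missing mutation: "<variant> w/o <mutation>"
        return f"{variant} w/o {mutation}"
    # Additional mutation: "<variant>/<mutation>"
    return f"{variant}/{mutation}"


print(with_modifier("Alpha", "E484K"))               # Alpha/E484K
print(with_modifier("Iota", "E484K", missing=True))  # Iota w/o E484K
```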

Variant modifiers: the rationale and when to use

The variant is a key aggregation factor that is used by our data summary program and has the potential to be used by others too. Modifiers make it easier to distinguish neutralization results that involve important mutations from those that do not.

Therefore, only important mutations should be added as modifiers. Here is an incomplete list of rules for identifying important mutations:

  1. The mutation must be a known resistance mutation;
  2. The mutation must be located in an important region (e.g. RBD/RBM); or
  3. The mutation must be the major topic of a study related to the variant.

Variant modifiers should use the mutation display format.

Synonyms

A variant can have multiple synonyms. In our program, the synonyms are displayed with the variant's primary name when space allows. The following can be added as synonyms:

  1. If a variant includes only one or two Spike mutations, the display format can be added as a synonym. For example, "D614G" can be added as a synonym of "B.1".
  2. If a variant uses a WHO name, the PANGO lineage should be added as a synonym. For example, "Alpha" as the primary name and "B.1.1.7" as the synonym.
  3. A name suggested by the author can also be added as a synonym. For example, "A.27/A227V" as the primary name and "A.27.RN" as the synonym.

Isolates

Naming rules

The isolate name is mostly used internally and is never shown to our website users. The following are several (non-enforced) good practices:

  1. Use the GISAID virus name / GenBank isolate name when possible: if a GISAID number and/or GenBank accession is provided, use the full name from the source and save the GISAID number under gisaid_id and the GenBank accession under genbank_accn. AVOID using the GISAID number/GenBank accession as the isolate name since it is not readable.
  2. Use a non-display format mutation list when GISAID/GenBank names are not available. AVOID using "<Variant> Spike" or "<Variant> full genome", since this makes it harder to tell the minor differences between isolates of the same variant.
  3. Use a combination of ref_name, var_name, and genomic region. For example, Truffot21 B.1.1.7 spike, Wang22 BA.2 full genome.
  4. For selection data, include patient characteristics and collection day. For example, Truffot21 72/M D10.
  5. For an extremely long mutation list, e.g. SARS-CoV or WIV1, using "SARS-CoV" or "WIV1" is acceptable.

Variant classification

Assigning a variant to an isolate can be somewhat tricky. The PANGOLIN program is not always reliable, and it does not weigh the important mutations. The following steps are considered good practice:

  1. If the sequence is available, use Sierra Program to tell the PANGO lineage and find out the mutation list. Go to 4.
  2. If the sequence is not available, find out if the PANGO lineage (or equivalent classification) and mutation list are provided by the author. Go to 4.
  3. If no PANGO lineage (or equivalent classification) is found, go to 7.
  4. Use Outbreak.info to find out the consensus mutations of the PANGO lineage. A query URL can be constructed like this: https://outbreak.info/situation-reports?pango=B.1.1.7.
  5. Compare the Sierra mutations with Outbreak.info's consensus mutations, especially the mutations of the Spike gene. Find out how many Spike mutations there are in total (numTotal), how many are added/removed (numDiff), and whether important mutations are added/removed.
  6. If numDiff ≤ 3 or numDiff divided by numTotal ≤ 50%:
    1. If no important mutations are added/removed, the isolate's variant should be the PANGO lineage.
    2. If important mutations are added/removed, the isolate's variant should be the PANGO lineage concatenated with the modifier.
  7. Else, the isolate should not be linked to any variant.
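Steps 4 to 7 above can be sketched as follows. This is a hedged illustration rather than the procedure used by any DRDB tool. The function names are hypothetical, numTotal is assumed to be the number of consensus Spike mutations (the steps do not say which list it counts), and the way several modifiers are chained is also an assumption.

```python
from urllib.parse import urlencode


def outbreak_url(pango):
    """Build the Outbreak.info query URL shown in step 4."""
    return "https://outbreak.info/situation-reports?" + urlencode({"pango": pango})


def classify(variant, isolate_spike, consensus_spike, important):
    """Return the variant name for an isolate, or None if it should not be linked.

    variant:         lineage/variant name, or None if no classification was found
    isolate_spike:   Spike mutations of the isolate (display format strings)
    consensus_spike: consensus Spike mutations of the lineage
    important:       mutations regarded as important (see the modifier rules)
    """
    if variant is None:
        return None  # step 3 -> step 7
    isolate_spike = set(isolate_spike)
    consensus_spike = set(consensus_spike)
    important = set(important)

    added = isolate_spike - consensus_spike
    removed = consensus_spike - isolate_spike
    num_diff = len(added) + len(removed)
    num_total = len(consensus_spike)  # assumption: counted on the consensus

    # Step 6: numDiff <= 3 or numDiff / numTotal <= 50%
    close = num_diff <= 3 or (num_total > 0 and num_diff / num_total <= 0.5)
    if not close:
        return None  # step 7

    name = variant
    for mut in sorted(added & important):
        name += f"/{mut}"
    for mut in sorted(removed & important):
        name += f" w/o {mut}"
    return name


alpha = {"N501Y", "A570D", "D614G", "P681H", "T716I", "S982A", "D1118H"}
print(classify("B.1.1.7", alpha | {"E484K"}, alpha, {"E484K"}))  # B.1.1.7/E484K
```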

Isolate mutations

Isolate mutations should be added to the isolate_mutations table when a new isolate is added.

  1. If the sequence is available, you can find out the mutation list using Sierra Program.
  2. If the sequence is not available, the author might provide the mutation list somewhere in the publication.
  3. If the author only provides the PANGO lineage or equivalent, use the consensus mutations from Outbreak.info.

Format

The mutation format is tightly constrained by the database. The following are the valid values (case-sensitive):

  • Field gene: nsp1, nsp2, PLpro, nsp4, _3CLpro, nsp6, nsp7, nsp8, nsp9, nsp10, RdRP, nsp13, nsp14, nsp15, nsp16, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10.
  • Field amino_acid: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, X (out-of-frame deletion), stop, del, and ins.

Note that nsp3 is PLpro, nsp5 is _3CLpro (an underscore is added before the "3" due to a program naming restriction), and nsp11/nsp12/nsp12b is (partly) RdRP.
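A pre-check of gene and amino_acid values before editing the CSVs might look like the sketch below. The function name `check_mutation` is hypothetical; the database itself enforces the real constraints. Only the two one-to-one renames from the note above are hinted at, because nsp11/nsp12/nsp12b also require a position shift.

```python
# Valid values copied from the lists above (case-sensitive).
VALID_GENES = {
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
}
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX") | {"stop", "del", "ins"}

# One-to-one renames mentioned in the note above.
RENAMED = {"nsp3": "PLpro", "nsp5": "_3CLpro"}


def check_mutation(gene, amino_acid):
    """Return a list of problems; an empty list means both values are valid."""
    problems = []
    if gene not in VALID_GENES:
        hint = RENAMED.get(gene)
        if hint:
            problems.append(f"gene {gene!r} should be written as {hint!r}")
        else:
            problems.append(f"unknown gene {gene!r}")
    if amino_acid not in VALID_AMINO_ACIDS:
        problems.append(f"unknown amino_acid {amino_acid!r}")
    return problems


print(check_mutation("S", "Y"))       # []
print(check_mutation("nsp3", "del"))  # ["gene 'nsp3' should be written as 'PLpro'"]
```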

SARS-CoV-2 genome / gene position conversion table

Apr 5, 2022 update: The recent update in hivdb/covid-drdb@854f1d6 added support for automatic conversion of gene and position in isolate_mutations CSVs via the autofill command. All of the synonym genes listed below are supported, and manual conversion is no longer necessary.

The reference is Wuhan-Hu-1. Pay extra attention to unmatched synonyms/refAA. Check the NA position if necessary. Open an issue if you have questions or experience difficulties.

| Genome NA position | Synonyms | Alt AA position | Acceptable gene | AA position |
|---|---|---|---|---|
| 266-805 | ORF1a / ORF1ab | 1-180 | nsp1 | 1-180 |
| 806-2719 | ORF1a / ORF1ab | 181-818 | nsp2 | 1-638 |
| 2720-8554 | ORF1a / ORF1ab | 819-2763 | PLpro | 1-1945 |
| 8555-10054 | ORF1a / ORF1ab | 2764-3263 | nsp4 | 1-500 |
| 10055-10972 | ORF1a / ORF1ab | 3264-3569 | _3CLpro | 1-306 |
| 10055-10972 | MPro / MainPro | | _3CLpro | 1-306 |
| 10973-11842 | ORF1a / ORF1ab | 3570-3859 | nsp6 | 1-290 |
| 11843-12091 | ORF1a / ORF1ab | 3860-3942 | nsp7 | 1-83 |
| 12092-12685 | ORF1a / ORF1ab | 3943-4140 | nsp8 | 1-198 |
| 12686-13024 | ORF1a / ORF1ab | 4141-4253 | nsp9 | 1-113 |
| 13025-13441 | ORF1a / ORF1ab | 4254-4392 | nsp10 | 1-139 |
| 13442-13468 | ORF1a / ORF1ab | 4393-4401 | RdRP | 1-9 |
| 13442-13468 | nsp11 | | RdRP | 1-9 |
| 13468-16236 | ORF1b / nsp12 / nsp12b | 1-923 | RdRP | 10-932 |
| 16237-18039 | ORF1b | 924-1524 | nsp13 | 1-601 |
| 18040-19620 | ORF1b | 1525-2051 | nsp14 | 1-527 |
| 19621-20658 | ORF1b | 2052-2397 | nsp15 | 1-346 |
| 20659-21552 | ORF1b | 2398-2695 | nsp16 | 1-298 |
| 13442-16236 | ORF1ab | 4393-5324 | RdRP | 1-932 |
| 16237-18039 | ORF1ab | 5325-5925 | nsp13 | 1-601 |
| 18040-19620 | ORF1ab | 5926-6452 | nsp14 | 1-527 |
| 19621-20658 | ORF1ab | 6453-6798 | nsp15 | 1-346 |
| 20659-21552 | ORF1ab | 6799-7096 | nsp16 | 1-298 |
| 21563-25381 | ORF2 / NS2 / gp02 | | S | 1-1273 |
| 25393-26217 | NS3 / gp03 | | ORF3a | 1-275 |
| 26245-26469 | ORF4 / NS4 / gp04 | | E | 1-75 |
| 26523-27188 | ORF5 / NS5 / gp05 | | M | 1-222 |
| 27202-27384 | NS6 / gp06 | | ORF6 | 1-61 |
| 27394-27756 | NS7a / gp07 | | ORF7a | 1-121 |
| 27756-27884 | NS7b / gp08 | | ORF7b | 1-43 |
| 27894-28256 | NS8 / gp09 | | ORF8 | 1-121 |
| 28274-29530 | ORF9 / NS9 / gp10 | | N | 1-419 |
| 29558-29671 | NS10 / gp11 | | ORF10 | 1-38 |
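The arithmetic behind the ORF1ab rows of the table can be illustrated with a short lookup. This is only a sketch of the conversion: the autofill command mentioned above is the supported route, and the name `orf1ab_to_gene` is hypothetical. The ranges are copied from the "Alt AA position" column for the ORF1ab synonym.

```python
# (start, end, gene) in ORF1ab amino acid coordinates, copied from the table.
ORF1AB_RANGES = [
    (1, 180, "nsp1"), (181, 818, "nsp2"), (819, 2763, "PLpro"),
    (2764, 3263, "nsp4"), (3264, 3569, "_3CLpro"), (3570, 3859, "nsp6"),
    (3860, 3942, "nsp7"), (3943, 4140, "nsp8"), (4141, 4253, "nsp9"),
    (4254, 4392, "nsp10"), (4393, 5324, "RdRP"), (5325, 5925, "nsp13"),
    (5926, 6452, "nsp14"), (6453, 6798, "nsp15"), (6799, 7096, "nsp16"),
]


def orf1ab_to_gene(position):
    """Convert an ORF1ab amino acid position to (gene, position in gene)."""
    for start, end, gene in ORF1AB_RANGES:
        if start <= position <= end:
            # Positions are 1-based within each gene.
            return gene, position - start + 1
    raise ValueError(f"ORF1ab position {position} is outside 1-7096")


print(orf1ab_to_gene(4715))  # ('RdRP', 323)
```

For instance, ORF1ab position 4715 falls in the 4393-5324 range, so it maps to RdRP position 4715 - 4393 + 1 = 323.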