Record Variants, isolates, and mutations
The DRDB database stores tables of variants, isolates, and isolate mutations. This wiki page provides general guidance and rules on how to maintain these tables.
A variant is a set of sequences whose Spike amino acid mutations match a predefined list of amino acid mutations. For example, the variant "Alpha" is any sequence that contains the Spike mutations Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H, with several exceptions being acceptable.
An isolate, or a precise mutation pattern, is a set of sequences that are identical at the amino acid level. The isolate is the subclass unit of the variant. The `variants` table and the `isolates` table have a one-to-many relationship.
A mutation is defined as an amino acid difference from the Wuhan-Hu-1 reference sequence, an insertion, or a deletion.
An isolate mutation is a combination of isolate name, gene, and mutation. The `isolates` table and the `isolate_mutations` table have a one-to-many relationship.
An amino acid mutation always contains the following four aspects:
- Gene (`gene`)
- Position in the gene (`position`)
- Reference amino acid (`refAA`)
- Mutant amino acid / indel / stop (`mutAA`)
Two formats are used when a mutation needs to be represented as text:
The non-display format is used mainly by `isolates.isolate_name`. It is the "internal" format of the database and is more machine-friendly.
- A mutation should start with the `gene`, followed by a colon and the `position`, and end with the `mutAA`. E.g. nsp2:106L, RdRP:323L, ORF3a:257del.
- For insertion mutations, "ins" should be used in place of `mutAA`.
- For deletion mutations, "del" should be used in place of `mutAA`.
- For stop codons, "stop" should be used in place of `mutAA`.
To represent a list of mutations, first sort the mutations by their location in the genome, then join them with the plus symbol "+". The `gene` should be omitted if the mutation is not the first of its gene. For example, `RdRP:323L+S:69del+70del+144del+145del+501Y+570D+614G+681H+716I+982A+1118H`.
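The sorting and joining rules for the non-display format can be sketched in Python as follows. This is only an illustration, not the DRDB implementation: the function name `format_mutation_list` and the `(gene, position, mutAA)` tuple layout are assumptions made for this example, and `GENE_ORDER` simply lists the accepted gene names in genome order.

```python
# Accepted gene names in genome order (used only for sorting).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


def format_mutation_list(mutations):
    """Join (gene, position, mutAA) tuples in the non-display format.

    Hypothetical helper written for this page.
    """
    ordered = sorted(mutations, key=lambda m: (GENE_ORDER.index(m[0]), m[1]))
    parts = []
    prev_gene = None
    for gene, position, mut_aa in ordered:
        text = f"{position}{mut_aa}"
        if gene != prev_gene:
            # Only the first mutation of each gene carries the gene prefix.
            text = f"{gene}:{text}"
        parts.append(text)
        prev_gene = gene
    return "+".join(parts)


print(format_mutation_list([
    ("S", 501, "Y"), ("S", 69, "del"), ("RdRP", 323, "L"), ("S", 70, "del"),
]))
# RdRP:323L+S:69del+70del+501Y
```

The input order does not matter because the helper sorts by gene and position before joining.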
The display format is used when the text will be shown on the website or in a program. It is more friendly to human readers.
- A mutation should start with the `gene`, followed by a colon, the `refAA`, and the `position`, and end with the `mutAA`. E.g. nsp2:P106L, RdRP:P323L, ORF3a:N257del. There are two exceptions:
  - For mutations of the Spike gene, the `gene` should be omitted: A222V, N501Y, D614G, etc.
  - For deletions, the `refAA` and `mutAA` should be omitted and the Greek letter "Δ" should be placed before the position: Δ144, ORF3a:Δ257.
- Neighboring deletions should be represented as one, using a dash to connect the first and last positions: nsp6:Δ107-109, Δ69-70.
- For insertion mutations, "ins" should be used in place of `refAA`.
- For stop codons, "*" should be used in place of `refAA`.
To represent a list of mutations, first sort the mutations by their location in the genome, then join them with a plus symbol that has one space on each side (" + "). The `gene` should be omitted if the mutation is not the first of its gene. For example, `RdRP:P323L + Δ69-70 + Δ144-145 + N501Y + A570D + D614G + P681H + T716I + S982A + D1118H`.
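The display-format rules for substitutions and deletions can be sketched in Python as follows. This is a hedged illustration rather than the program's actual code: the function name `format_display_list` and the `(gene, position, refAA, mutAA)` tuple layout are assumptions, deletions are marked by `mutAA == "del"`, and insertions and stop codons are deliberately not handled.

```python
# Accepted gene names in genome order (used only for sorting).
GENE_ORDER = [
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
]


def format_display_list(mutations):
    """Join (gene, position, refAA, mutAA) tuples in the display format.

    Hypothetical helper; handles substitutions and deletions only.
    """
    ordered = sorted(mutations, key=lambda m: (GENE_ORDER.index(m[0]), m[1]))
    # Entries are [gene, start, end, label]; label is None for deletions.
    merged = []
    for gene, pos, ref_aa, mut_aa in ordered:
        if mut_aa != "del":
            merged.append([gene, pos, pos, f"{ref_aa}{pos}{mut_aa}"])
        elif (merged and merged[-1][3] is None
              and merged[-1][0] == gene and merged[-1][2] == pos - 1):
            merged[-1][2] = pos  # extend the current run of neighboring deletions
        else:
            merged.append([gene, pos, pos, None])
    parts = []
    prev_gene = None
    for gene, start, end, label in merged:
        if label is None:
            label = f"Δ{start}" if start == end else f"Δ{start}-{end}"
        # Spike mutations never carry a gene prefix; other genes get it
        # only on the first mutation of that gene.
        if gene != "S" and gene != prev_gene:
            label = f"{gene}:{label}"
        parts.append(label)
        prev_gene = gene
    return " + ".join(parts)


print(format_display_list([
    ("RdRP", 323, "P", "L"), ("S", 69, "H", "del"), ("S", 70, "V", "del"),
    ("S", 144, "Y", "del"), ("S", 145, "Y", "del"),
    ("S", 501, "N", "Y"), ("S", 614, "D", "G"),
]))
# RdRP:P323L + Δ69-70 + Δ144-145 + N501Y + D614G
```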
In general, the most widely used name should be used as the primary name of a variant/lineage. In addition, modifiers can be added after the variant's main name.
- If a variant is a WHO VOC or VOI, the WHO name (Alpha, Beta, etc.) should be used. The PANGO lineage name should be listed as the first synonym in variant_synonyms.csv.
- If a variant is neither a WHO VOC nor a VOI, the PANGO lineage name should be used.
- If a variant is a known sub-lineage of another lineage, the sub-lineage name should be used as the variant name. E.g. Q.1 (a sub-lineage of Alpha) is recorded as Q.1, not Alpha.
- Modifiers can be added to a variant name in the following formats:
  - To indicate an additional mutation, use a slash followed by the mutation, including its reference amino acid. E.g. Alpha/E484K.
  - To indicate a missing mutation, use "w/o" followed by the mutation, including its reference amino acid. E.g. Iota w/o E484K.
The variant is a key aggregation factor used by our data summary program, and it may be used by others too. Modifiers help distinguish neutralization results for isolates that carry important mutations from results for those that do not.
Therefore, only important mutations should be added as modifiers. The following is an incomplete list of criteria for calling a mutation important:
- The mutation is a known resistance mutation;
- The mutation is located in an important region (e.g. RBD/RBM); or
- The mutation is the major topic of a study relevant to the variant.
Variant modifiers should use the mutation display format.
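The naming rules above can be sketched as a small Python helper. The function name `variant_name` and its parameters are hypothetical, and it accepts at most one added and one missing mutation, because this page does not specify how several modifiers should be combined.

```python
def variant_name(pango_lineage, who_name=None, added=None, missing=None):
    """Compose a variant name from a base name and optional modifiers.

    Hypothetical helper; `added` and `missing` are display-format
    mutations such as "E484K".
    """
    # Prefer the WHO name (VOC/VOI) when one exists, else the PANGO lineage.
    name = who_name or pango_lineage
    if added:
        name += f"/{added}"
    if missing:
        name += f" w/o {missing}"
    return name


print(variant_name("B.1.1.7", who_name="Alpha", added="E484K"))   # Alpha/E484K
print(variant_name("B.1.526", who_name="Iota", missing="E484K"))  # Iota w/o E484K
print(variant_name("A.27"))                                       # A.27
```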
A variant can have multiple synonyms. In our program, the synonyms are displayed alongside the variant's primary name when space allows. The following can be added as synonyms:
- If a variant includes only one or two Spike mutations, the display format can be added as a synonym. For example, "D614G" can be added as a synonym of "B.1".
- If a variant uses a WHO name, the PANGO lineage should be added as a synonym. For example, "Alpha" as the primary name and "B.1.1.7" as the synonym.
- A name suggested by the author can also be added as a synonym. For example, "A.27/A227V" as the primary name and "A.27.RN" as the synonym.
The isolate name is mostly used internally and is never shown to our website users. The following are several good practices (not enforced):
- Use the GISAID virus name / GenBank isolate name when possible: if a GISAID number and/or GenBank accession is provided, use the full name from the source and save the GISAID number under `gisaid_id` and the GenBank accession under `genbank_accn`. AVOID using the GISAID number/GenBank accession as the isolate name since it is not readable.
- Use a non-display format mutation list when GISAID/GenBank names are not available. AVOID using "<Variant> Spike" or "<Variant> full genome", since this makes it harder to tell the minor differences between isolates of the same variant.
- Use a combination of `ref_name`, `var_name`, and genomic region. For example, `Truffot21 B.1.1.7 spike`, `Wang22 BA.2 full genome`.
- For selection data, include patient characteristics and collection day. For example, `Truffot21 72/M D10`.
- For an extremely long mutation list, e.g. SARS-CoV or WIV1, using "SARS-CoV" or "WIV1" is acceptable.
Assigning a variant to an isolate can be tricky. The PANGOLIN program is not always reliable, and it does not weigh the important mutations. The following steps are considered good practice:
1. If the sequence is available, use the Sierra program to determine the PANGO lineage and the mutation list. Go to 4.
2. If the sequence is not available, check whether the PANGO lineage (or an equivalent classification) and the mutation list are provided by the author. Go to 4.
3. If no PANGO lineage (or equivalent classification) is found, go to 7.
4. Use Outbreak.info to find the consensus mutations of the PANGO lineage. A query URL can be constructed like this: https://outbreak.info/situation-reports?pango=B.1.1.7.
5. Compare the Sierra mutations with Outbreak.info's consensus mutations, especially the mutations of the Spike gene. Determine the total number of Spike mutations (`numTotal`), the number added/removed (`numDiff`), and whether important mutations are added/removed.
6. If `numDiff` ≤ 3 or `numDiff` divided by `numTotal` ≤ 50%:
   - If no important mutations are added/removed, the isolate's variant should be the PANGO lineage.
   - If important mutations are added/removed, the isolate's variant should be the PANGO lineage concatenated with the modifier.
7. Otherwise, the isolate should not be linked to any variant.
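The decision in the last steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a tool used by the project: the function names are hypothetical, mutations are passed as sets of display-format strings, `numTotal` is taken to be the number of Spike mutations in the isolate (the text does not say which list it counts), and several modifiers are simply appended one after another.

```python
from urllib.parse import urlencode


def outbreak_info_url(pango_lineage):
    """Build the Outbreak.info query URL for a PANGO lineage."""
    return "https://outbreak.info/situation-reports?" + urlencode({"pango": pango_lineage})


def assign_variant(pango_lineage, isolate_spike, consensus_spike, important):
    """Return the variant name for an isolate, or None if it stays unlinked.

    All mutation arguments are sets of display-format Spike mutations.
    """
    if pango_lineage is None:
        return None  # no lineage known: do not link to any variant
    added = isolate_spike - consensus_spike
    removed = consensus_spike - isolate_spike
    num_diff = len(added) + len(removed)
    # Assumption: numTotal counts the isolate's own Spike mutations.
    num_total = len(isolate_spike)
    close_enough = num_diff <= 3 or (num_total > 0 and num_diff / num_total <= 0.5)
    if not close_enough:
        return None  # too different from the lineage consensus
    name = pango_lineage
    for mutation in sorted(added & important):
        name += f"/{mutation}"
    for mutation in sorted(removed & important):
        name += f" w/o {mutation}"
    return name


ALPHA = {"Δ69-70", "Δ144-145", "N501Y", "A570D", "D614G",
         "P681H", "T716I", "S982A", "D1118H"}

print(outbreak_info_url("B.1.1.7"))
# https://outbreak.info/situation-reports?pango=B.1.1.7
print(assign_variant("B.1.1.7", ALPHA | {"E484K"}, ALPHA, important={"E484K"}))
# B.1.1.7/E484K
```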
Isolate mutations should be added to the `isolate_mutations` table when a new isolate is added.
- If the sequence is available, you can determine the mutation list using the Sierra program.
- If the sequence is not available, the author might provide the mutation list somewhere in the publication.
- If the author only provides the PANGO lineage or equivalent, use the consensus mutations from Outbreak.info.
The mutation format is tightly constrained by the database. The following are the valid values (case sensitive):
- Field `gene`: `nsp1`, `nsp2`, `PLpro`, `nsp4`, `_3CLpro`, `nsp6`, `nsp7`, `nsp8`, `nsp9`, `nsp10`, `RdRP`, `nsp13`, `nsp14`, `nsp15`, `nsp16`, `S`, `ORF3a`, `E`, `M`, `ORF6`, `ORF7a`, `ORF7b`, `ORF8`, `N`, and `ORF10`.
- Field `amino_acid`: `A`, `C`, `D`, `E`, `F`, `G`, `H`, `I`, `K`, `L`, `M`, `N`, `P`, `Q`, `R`, `S`, `T`, `V`, `W`, `Y`, `X` (out-of-frame deletion), `stop`, `del`, and `ins`.
Note that nsp3 is `PLpro`, nsp5 is `_3CLpro` (an underscore is added before the "3" due to a program naming restriction), and nsp11/nsp12/nsp12b are (partly) `RdRP`.
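Before committing a CSV, the allowed values can be checked with a few lines of Python. The sketch below only mirrors the lists on this page; the function name `validate_isolate_mutation` is hypothetical, and the database itself remains the authority on what is accepted.

```python
VALID_GENES = {
    "nsp1", "nsp2", "PLpro", "nsp4", "_3CLpro", "nsp6", "nsp7", "nsp8",
    "nsp9", "nsp10", "RdRP", "nsp13", "nsp14", "nsp15", "nsp16", "S",
    "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10",
}
# The 20 standard amino acids plus X (out-of-frame deletion), stop, del, ins.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | {"X", "stop", "del", "ins"}


def validate_isolate_mutation(gene, amino_acid):
    """Return a list of problems; an empty list means both values are valid.

    Hypothetical helper; comparisons are case sensitive.
    """
    errors = []
    if gene not in VALID_GENES:
        errors.append(f"invalid gene: {gene!r}")
    if amino_acid not in VALID_AMINO_ACIDS:
        errors.append(f"invalid amino_acid: {amino_acid!r}")
    return errors


print(validate_isolate_mutation("S", "Y"))     # []
print(validate_isolate_mutation("nsp3", "L"))  # ["invalid gene: 'nsp3'"]
```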
Apr 5, 2022 update: The recent update in hivdb/covid-drdb@854f1d6 added automatic conversion of gene and position in isolate_mutations CSVs via the `autofill` command. All of the synonym genes listed below are supported, and manual conversion is no longer necessary.
The reference is Wuhan-Hu-1. Pay extra attention to unmatched synonyms/refAA. Check the NA position if necessary. Open an issue if you have questions or experience difficulties.
Genome NA position | Synonyms | Alt AA position | Acceptable gene | AA position |
---|---|---|---|---|
266-805 | ORF1a / ORF1ab | 1-180 | nsp1 | 1-180 |
806-2719 | ORF1a / ORF1ab | 181-818 | nsp2 | 1-638 |
2720-8554 | ORF1a / ORF1ab | 819-2763 | PLpro | 1-1945 |
8555-10054 | ORF1a / ORF1ab | 2764-3263 | nsp4 | 1-500 |
10055-10972 | ORF1a / ORF1ab | 3264-3569 | _3CLpro | 1-306 |
10055-10972 | MPro / MainPro | | _3CLpro | 1-306 |
10973-11842 | ORF1a / ORF1ab | 3570-3859 | nsp6 | 1-290 |
11843-12091 | ORF1a / ORF1ab | 3860-3942 | nsp7 | 1-83 |
12092-12685 | ORF1a / ORF1ab | 3943-4140 | nsp8 | 1-198 |
12686-13024 | ORF1a / ORF1ab | 4141-4253 | nsp9 | 1-113 |
13025-13441 | ORF1a / ORF1ab | 4254-4392 | nsp10 | 1-139 |
13442-13468 | ORF1a / ORF1ab | 4393-4401 | RdRP | 1-9 |
13442-13468 | nsp11 | | RdRP | 1-9 |
13468-16236 | ORF1b / nsp12 / nsp12b | 1-923 | RdRP | 10-932 |
16237-18039 | ORF1b | 924-1524 | nsp13 | 1-601 |
18040-19620 | ORF1b | 1525-2051 | nsp14 | 1-527 |
19621-20658 | ORF1b | 2052-2397 | nsp15 | 1-346 |
20659-21552 | ORF1b | 2398-2695 | nsp16 | 1-298 |
13442-16236 | ORF1ab | 4393-5324 | RdRP | 1-932 |
16237-18039 | ORF1ab | 5325-5925 | nsp13 | 1-601 |
18040-19620 | ORF1ab | 5926-6452 | nsp14 | 1-527 |
19621-20658 | ORF1ab | 6453-6798 | nsp15 | 1-346 |
20659-21552 | ORF1ab | 6799-7096 | nsp16 | 1-298 |
21563-25381 | ORF2 / NS2 / gp02 | | S | 1-1273 |
25393-26217 | NS3 / gp03 | | ORF3a | 1-275 |
26245-26469 | ORF4 / NS4 / gp04 | | E | 1-75 |
26523-27188 | ORF5 / NS5 / gp05 | | M | 1-222 |
27202-27384 | NS6 / gp06 | | ORF6 | 1-61 |
27394-27756 | NS7a / gp07 | | ORF7a | 1-121 |
27756-27884 | NS7b / gp08 | | ORF7b | 1-43 |
27894-28256 | NS8 / gp09 | | ORF8 | 1-121 |
28274-29530 | ORF9 / NS9 / gp10 | | N | 1-419 |
29558-29671 | NS10 / gp11 | | ORF10 | 1-38 |
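
As a worked example of reading the table, the following Python sketch converts an ORF1ab polyprotein position into an accepted gene and position. It is not the `autofill` implementation: the function name `convert_orf1ab` is hypothetical, and the ranges are copied from the ORF1a / ORF1ab and ORF1ab rows above.

```python
# (first, last, gene) in ORF1ab polyprotein numbering, from the table above.
ORF1AB_RANGES = [
    (1, 180, "nsp1"), (181, 818, "nsp2"), (819, 2763, "PLpro"),
    (2764, 3263, "nsp4"), (3264, 3569, "_3CLpro"), (3570, 3859, "nsp6"),
    (3860, 3942, "nsp7"), (3943, 4140, "nsp8"), (4141, 4253, "nsp9"),
    (4254, 4392, "nsp10"), (4393, 5324, "RdRP"), (5325, 5925, "nsp13"),
    (5926, 6452, "nsp14"), (6453, 6798, "nsp15"), (6799, 7096, "nsp16"),
]


def convert_orf1ab(position):
    """Map an ORF1ab amino acid position to (gene, position in gene).

    Hypothetical helper written for this page.
    """
    for first, last, gene in ORF1AB_RANGES:
        if first <= position <= last:
            return gene, position - first + 1
    raise ValueError(f"ORF1ab position out of range: {position}")


print(convert_orf1ab(4715))  # ('RdRP', 323)
```

For instance, ORF1ab:P4715L falls in the 4393-5324 range, so it is recorded as RdRP:P323L.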
If you have any issues or questions, please create a new issue.