Mutation analysis #348
base: master
Conversation
The current script produces output by multiplying the bi-weekly case count data with the mutation proportions queried from covSPECTRUM and adding up the results across the different clusters. There are two output files created.

Points of discussion:

Advanced points:
Thank you for this Moira! This is a great start and I really appreciate your work!
I think this is the right set to query for ✔️
We should actually be able to query Nextstrain clades, which I think might be a nice starting point as we can avoid the nitty-gritty for the time being and see if we need to wade into it later. It should be in a format like "21L" (without the "(Omicron)") - here's a web link as an example: https://cov-spectrum.org/explore/Switzerland/AllSamples/Past6M/variants?nextstrainClade=21L&
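For reference, a minimal sketch of what such a clade query could look like. The endpoint, parameters, and response shape below are assumptions based on the public LAPIS/CoV-Spectrum documentation, not necessarily what the script in this PR uses:

```python
# Sketch only: query amino-acid mutation proportions for one Nextstrain clade.
# The LAPIS endpoint and field names are assumptions - check the CoV-Spectrum/LAPIS
# docs and adjust to whatever the script actually calls.
import requests

LAPIS_AA_MUTATIONS = "https://lapis.cov-spectrum.org/open/v1/sample/aa-mutations"  # assumed endpoint


def fetch_mutation_proportions(clade, country=None):
    """Return {mutation: proportion} for samples of a Nextstrain clade, e.g. '21L'."""
    params = {"nextstrainClade": clade}
    if country is not None:
        params["country"] = country
    response = requests.get(LAPIS_AA_MUTATIONS, params=params, timeout=60)
    response.raise_for_status()
    payload = response.json()
    # Assumed response shape: {"data": [{"mutation": "S:E484A", "proportion": 0.97, ...}, ...]}
    return {entry["mutation"]: entry["proportion"] for entry in payload.get("data", [])}


if __name__ == "__main__":
    proportions = fetch_mutation_proportions("21L", country="Switzerland")
    print(f"{len(proportions)} mutations returned for 21L in Switzerland")
```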
This seems fine for now, I think we're most interested in the more common mutations people encountered
I actually think this should be ok too.... since hopefully all defining mutations are in there and at high percents!
Let's get a list of the positions in S in the receptor binding domain (RBD) and N terminal domain (NTD) and pull just these out - they should be the most critical.
Ohhh, interesting point. Let's keep that one on the back burner. For now, how about starting with the list I put in the Slack?
Hm, good to produce both as I'm not sure what we may want long term, but intuitively I think it would be interesting to plot by country first - what's the estimated breakdown of which mutations have been seen? At the moment I'm not sure whether this will go on the site or not, so would you mind adding some simple plots of the files you're generating, so that the script outputs them as image files? No need for anything too fancy - but this will give us an idea as we go of what we're looking at!

I see you've got a bi-weekly mutation percent; I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query), or whether what we have is already good enough - what do you think? Could try both and see if it makes a difference, perhaps?

Do you think that gives some good next steps?
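A rough sketch of the "simple plots saved as image files" idea - the column names ('country', 'date', 'mutation', 'estimated_percent') are placeholders, not the actual format of the generated files:

```python
# Sketch only: dump a quick per-country plot to an image file.
# Column names are placeholders for whatever the output files actually contain.
import matplotlib
matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_country(df: pd.DataFrame, country: str, outfile: str) -> None:
    """Plot the bi-weekly mutation percentages for one country and save as an image."""
    subset = df[df["country"] == country]
    fig, ax = plt.subplots(figsize=(10, 5))
    for mutation, group in subset.groupby("mutation"):
        ax.plot(pd.to_datetime(group["date"]), group["estimated_percent"], label=mutation)
    ax.set_title(f"Estimated % of population that has seen each mutation - {country}")
    ax.set_xlabel("bi-weekly interval")
    ax.set_ylabel("estimated percent")
    ax.legend(fontsize="x-small", ncol=2)
    fig.tight_layout()
    fig.savefig(outfile, dpi=150)
    plt.close(fig)
```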
When querying with nextstrainClade (extracted with
These are not found on covSPECTRUM:
@emmahodcroft Do these missing ones have different nextstrainClades than what is in the data?
Where could I get this list from?
I got this running (querying by 2-week intervals); however, it takes a lot of time. I was wondering whether we could get the data downloaded directly, sorted by date, instead of querying for each interval separately, but I couldn't figure out whether that is possible. Do you by chance know?
Sorry Moira, somehow I missed this completely. Trying to come back to it a little now! I think one step forward that's probably worth making is simplifying the Nextstrain names. However, I can't seem to get around to doing that, so I've done the next best thing.

For the clusters that don't have Nextstrain names and so don't return hits, we can try to use Pango lineages... I've added this and am seeing if it works. Probably not so critical to include these, as they probably had few mutations that we're really interested in now, but if there's a simple way to include them, why not...
I think to keep it simple for now, I'm happy to go with a legit-looking paper's definition. For the RBD, this article includes "Thr333–Gly526 of the SARS-CoV-2 RBD", and this article says "we identified the region of SARS-CoV-2 RBD at residues 331 to 524 of S protein". For the NTD, this article says "residues 15–305 of the SARS-CoV-2 spike protein ... were aligned against the NTD sequences", whereas this one says "NTD (aa 13–318)". So I think anything in these ranges works! I have randomly picked 333-526 and 15-305 for the moment... So currently the script only stores mutations in these two regions.
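A small sketch of what that position filter could look like. The "S:E484K"-style mutation format is an assumption; the ranges are the ones picked above (RBD 333-526, NTD 15-305, inclusive):

```python
# Sketch only: keep spike mutations whose position falls in the RBD or NTD.
import re

RBD_RANGE = range(333, 527)   # Thr333-Gly526, inclusive
NTD_RANGE = range(15, 306)    # residues 15-305, inclusive

# Assumed mutation format, e.g. 'S:E484K' or 'S:H69-' (deletion)
SPIKE_MUTATION = re.compile(r"^S:[A-Z*](\d+)[A-Z*-]$")


def in_rbd_or_ntd(mutation: str) -> bool:
    match = SPIKE_MUTATION.match(mutation)
    if not match:
        return False  # not a spike amino-acid mutation in the expected format
    position = int(match.group(1))
    return position in RBD_RANGE or position in NTD_RANGE


assert in_rbd_or_ntd("S:E484K")       # 484 is in the RBD
assert not in_rbd_or_ntd("S:D614G")   # 614 is in neither domain
```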
Hm, I've now forgotten what the original was. I can also try to ask Chaoran if there's a better way, but I feel like I need to try and wrap my head around the scripts a little more before I formulate a question! Going through the script, I'm a little confused about the role of one thing in particular.
I really did forget to query by country. 🙈 How silly of me. Thanks for catching that! Interestingly, it doesn't seem to make too much of a difference. I compared some of the plots, and most of them look rather similar. Here's one with a stronger difference that I found (the pic above is from the new queries, the pic below from the old ones). This means that a given variant won't vary strongly between countries, right? I'm just wondering if we could somehow make use of this to make things more efficient... 🤔 But I guess the potential to catch those differences is still too interesting to discard per-country queries entirely...
@MoiraZuber in the above push I just added the one thing I (apparently) hadn't pushed up before - an additional check.
From chatting with Richard, he suggests we modify how we're storing the cases: instead of storing the cumulative number of cases that have seen mutation X, he suggests storing, for each 2-week period, the absolute number of cases that have seen mutation X (so just the cases in that 2-week period, calculated from the mutation frequency * case numbers).

The advantage of this is that from it we can get everything else (such as the cumulative count, just by summing) and have more flexibility in how we incorporate reinfection (and potentially even immune fading) in the future - the absolute number shouldn't change, just how many we end up putting towards "cumulative" or similar. (So all more complex calculations can be post-hoc to the calculation of how many people saw mutation X in a week as a very raw value - hope that makes sense!)

Building on this, we can expect, certainly if we're working through 2022/Omicron, that we'll end up with more than 100% of the population infected - i.e. each person is infected more than once on average. (A rough value according to some studies in South Africa would be ~1.7 infections per person, but this will vary a bit per country.) Infection is Poisson distributed (a rare event that happens randomly), so this allows us to calculate the probability that a person has not yet been infected as e^-1.7 (or whatever the overall infection rate is at any given point in time) - and from that, figure out how many of the "seeing mutation X" cases in a given week are "new" vs "reinfections" (aka don't add to the cumulative).

I hope this makes sense, but we can also chat through it! I'm still parsing it a bit in my brain too, but I think it sounds like a good idea - as we can then be much more flexible in how we "sum up" the weekly "absolute exposures"!
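A small sketch of the split this enables, assuming we have the absolute 2-weekly exposures for mutation X and an overall cumulative infections-per-person figure for each interval (all names below are illustrative, not from the PR's scripts):

```python
# Sketch only: split absolute 2-weekly exposures into first-time vs. reinfection,
# assuming infections are Poisson distributed across people.
import math


def split_new_vs_reinfection(interval_exposures, cumulative_infections_per_person):
    """
    interval_exposures: absolute cases that saw mutation X in each 2-week interval
                        (mutation frequency * case counts, as described above).
    cumulative_infections_per_person: overall infections per person so far at each
                        interval (e.g. ~1.7 for South Africa by 2022).
    Returns a list of (new, reinfection) pairs, one per interval.
    """
    split = []
    for exposures, rate in zip(interval_exposures, cumulative_infections_per_person):
        p_never_infected = math.exp(-rate)     # Poisson P(0 prior infections)
        new = exposures * p_never_infected     # these add to the cumulative total
        split.append((new, exposures - new))   # the rest count as reinfections
    return split


# At an overall rate of 1.7 infections per person, only exp(-1.7) ~= 18% of an
# interval's exposures would be counted as 'new'.
```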
Finally, I thought it might be worth putting our checklist from Friday 20th Jan '23 in here so things can actually be checked off! Top to-dos:
This PR aims to introduce a mutation analysis step to covariants. The goal is, for a given mutation and country, to obtain an estimate of what percentage of the population has encountered this mutation before. This is done by combining case count data (which estimates the number of cases per bi-weekly interval for each cluster in a given country) with mutation proportions taken from covSPECTRUM (i.e. which mutations have been seen at what percentage in a given cluster).
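A rough sketch of that combination step (the data layouts below are assumptions for illustration, not the actual file formats used in this PR):

```python
# Sketch only: combine per-cluster case counts with per-cluster mutation proportions.
import pandas as pd


def estimated_mutation_cases(case_counts: pd.DataFrame, mutation_proportions: dict) -> pd.DataFrame:
    """
    case_counts: rows of (country, date, cluster, cases), where 'cases' is the
                 estimated bi-weekly case count for that cluster in that country.
    mutation_proportions: {cluster: {mutation: proportion seen in that cluster}},
                 as queried from covSPECTRUM.
    Returns rows of (country, date, mutation, cases_with_mutation), summed over clusters.
    """
    rows = []
    for record in case_counts.itertuples(index=False):
        for mutation, proportion in mutation_proportions.get(record.cluster, {}).items():
            rows.append({
                "country": record.country,
                "date": record.date,
                "mutation": mutation,
                "cases_with_mutation": record.cases * proportion,
            })
    combined = pd.DataFrame(rows)
    return combined.groupby(["country", "date", "mutation"], as_index=False)["cases_with_mutation"].sum()
```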