ingest: Standardize steps for adding gene coverage to metadata #50
Comments
A simple form of a flowchart for "figure out the reference" would be something like, "Is there a RefSeq entry? If so, use that. If not, do a literature search or consult an expert in the field." (I realize that's not great, but I do think this is one of those areas where you kinda actually need to know something about what you're trying to do?) As for constructing a GFF, there are tools that we could point to? Presumably the most common starting point is going to be a GenBank file; if somebody is trying to start with a completely unannotated FASTA as the reference sequence, again, they're probably going to need more specialized support than we want to provide?
For sure! Richard has a script for generating the GFF from a GenBank accession, but I haven't personally tried it.
Just a quick clarification/precision: Nextclade technically does not require a GFF annotation - it can run with just a reference FASTA and a very minimal (almost empty) pathogen.json. Though, of course, without an annotation it would not know anything about CDSes and amino acid things. One idea for allowing faster bootstrapping of projects relying on Nextclade is to also not require annotations by default, where possible. This would result in a less useful analysis, but might encourage new learners and simplify their first steps. It will likely increase the complexity of workflows though.
Thanks for the clarification @ivan-aksamentov! I guess I didn't mean a minimum Nextclade dataset, but the minimum files needed to get the gene/CDS coverage, which does require a GFF annotation. I've updated the language above to be explicit.
Related to https://github.com/nextstrain/private/issues/102
It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.
This was previously done by @j23414 in dengue with nextstrain/dengue#36.
We can add these as standardized steps to the ingest template but one hiccup is it requires running sequences through Nextclade. This is easy if a Nextclade dataset already exists, but not as straightforward if users need to create a Nextclade dataset from scratch.
The minimal Nextclade dataset files for annotating gene coverage
The main stumbling block is figuring out which reference to use (currently ingest does not require a reference) and creating the GFF file. It seems like we should have a comprehensive guide on how to get past these blockers in the template as well.
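To make the idea of "gene coverage" concrete, here is a rough Python sketch of the calculation once a reference and a GFF annotation are in hand. It is not the dengue implementation or what Nextclade does internally. The function names (`parse_gff_genes`, `gene_coverage`) and the example data are hypothetical, and it assumes the sequence has already been aligned to the reference with insertions stripped, so that positions line up with the GFF coordinates.

```python
def parse_gff_genes(gff_text):
    """Return {gene_name: (start, end)} using 1-based inclusive GFF3 coordinates."""
    genes = {}
    for line in gff_text.splitlines():
        # Skip blank lines and GFF directives/comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A GFF3 feature line has 9 tab-separated columns; keep only "gene" features.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # Which attribute carries the gene name varies between annotations (assumption).
        name = attrs.get("Name") or attrs.get("gene") or attrs.get("ID")
        if name:
            genes[name] = (int(cols[3]), int(cols[4]))
    return genes


def gene_coverage(aligned_seq, genes):
    """Fraction of unambiguous A/C/G/T bases per gene in a reference-aligned sequence."""
    coverage = {}
    for name, (start, end) in genes.items():
        region = aligned_seq[start - 1:end].upper()
        length = end - start + 1
        # Ns, gaps, and positions past the end of the sequence count as uncovered.
        called = sum(base in "ACGT" for base in region)
        coverage[name] = round(called / length, 3)
    return coverage


# Made-up example data, for illustration only.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "ref\t.\tgene\t1\t10\t.\t+\t.\tName=geneA\n"
    "ref\t.\tgene\t11\t20\t.\t+\t.\tName=geneB\n"
)
EXAMPLE_ALIGNED = "ACGTACGTAC" + "ACGTANNNNN"

if __name__ == "__main__":
    print(gene_coverage(EXAMPLE_ALIGNED, parse_gff_genes(EXAMPLE_GFF)))
```

In an ingest workflow, the resulting per-gene fractions could be appended as extra metadata columns, one per gene.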