# Build a k=31 SRA metagenomes index #24
The index is built from pre-calculated signatures generated by https://wort.sourmash.bio, so the first step is to pull all the signatures that are currently available¹. wort signatures are in an S3 bucket, and syncing to a local dir can be done with s5cmd:
This was executed on a local dir (…), and took … to complete.
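The exact command is elided above; a minimal sketch of an s5cmd sync (the bucket name, worker count, and destination path are assumptions, not the exact command used) might look like:

```shell
# sync all wort SRA signatures from S3 to a local directory
# (bucket name and local path are assumptions)
s5cmd --numworkers 32 sync 's3://wort-sra/*' /data/wort/wort-sra/
```

s5cmd parallelizes transfers, which matters here given the millions of small signature files involved.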
## Select the metagenomes subset of the SRA datasets

The daily query for metagenomes in wort looks like this:
In order to grab all metagenomes up to 2024-04-20, the query is¹:
There are a couple of ways to execute this query; the easiest one is to go to https://www.ncbi.nlm.nih.gov/sra/, paste the query, then click "Search". On the results page, click "Send to", choose "Destination: File", "Format: Accession list", and then click "Create file".
Next we will use this file to figure out which wort signatures we have, and calculate a manifest for them. Clicking around with a browser is not very satisfying, because it is hard to automate; in this case it also goes way faster to download than using …. All commands in this section were executed on an Ubuntu Linux container started with:
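The container invocation is elided above; a sketch of what it might look like (image tag and mount points are assumptions):

```shell
# start an interactive Ubuntu container with the current dir mounted
# (image and mounts are assumptions, not the original command)
docker run -it --rm -v "$(pwd):/data" -w /data ubuntu:22.04 bash
```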
Install dependencies for installing NCBI EDirect:

Install EDirect:
Recommended: set up an NCBI API KEY to have higher limits
Finally, create the accession list:
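The command itself is elided above; a hedged sketch using NCBI EDirect (the query string and output file name are assumptions, not necessarily what was used here):

```shell
# search SRA for metagenomes up to the cutoff date and keep only
# the run accessions (query string and filenames are assumptions)
esearch -db sra -query '"metagenomic"[Source] AND ("1950/01/01"[PDAT] : "2024/04/20"[PDAT])' \
  | efetch -format runinfo \
  | cut -d',' -f1 \
  | grep -v '^Run$' > inputs/20240420.acclist
```

Setting the `NCBI_API_KEY` environment variable (as recommended above) raises the request rate limits for `esearch`/`efetch`.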
You may notice that this is only the accession list, and it misses a lot of useful metadata. Previous versions of this index were calculated from a Run Info file, which has way more metadata but is slower to download; as a result, when reaching ~800k entries in the Run Info file …. For smaller subsets, you can get a runinfo file by using the …. Finally, a better solution is to use BigQuery/Athena to do the query and also grab the metadata associated with the accessions, but that has potentially extra costs if going over the free-tier limits for GCP/AWS. So 🤷
## Creating a catalog for wort metagenomes in the accession list

In order to build an index we need a sourmash manifest describing the signatures. But before that we need to figure out which SRA metagenomes from the accession list are present in wort, so let's calculate the intersection between the accession list and the local copy of wort from the first step. This is a modified snakemake rule from the previous time the index was built, and it will be in the PR associated with this issue:

```python
rule catalog_metagenomes:
    output:
        catalog="outputs/20240420-metagenomes-catalog",
    input:
        acclist="inputs/20240420.acclist",
        basepath="/data/wort/wort-sra/",
    run:
        import csv
        from pathlib import Path

        # load all SRA IDs from the accession list,
        # skipping any repeated header rows
        sraids = set()
        with open(input.acclist) as fp:
            data = csv.DictReader(fp, delimiter=",")
            for dataset in data:
                if dataset['acc'] != 'acc':
                    sraids.add(dataset['acc'])

        # write out the path of every signature that exists on disk
        path = Path(input.basepath)
        with open(output.catalog, 'w') as out:
            for sra_id in sraids:
                sig_path = path / "sigs" / f"{sra_id}.sig"
                if sig_path.exists():
                    out.write(f"{sig_path}\n")
                    out.flush()
```

I still have the catalog used last time, so we can make some comparisons! The previous catalog had … entries. There are … datasets that are new in this catalog, and there are … datasets that were in the previous catalog but are missing from the new one.
This is something we observed before: metadata can change, and possible misclassifications (metagenome -> metatranscriptome, for example) and corrections will exclude datasets that were in the index from new versions. This also includes retracted submissions¹.
and so on. I feel tempted to bring the old ones back to avoid disruption for people that ran searches previously; if you got an Amplicon result it is likely real (even though we didn't validate 16S with scaled=1000).
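The old-vs-new comparison can be sketched with a small helper (hypothetical, not from the original issue; it assumes catalog files contain one signature path per line, with the accession as the file stem):

```python
from pathlib import Path


def compare_catalogs(old_catalog: str, new_catalog: str):
    """Return (added, removed) accession sets between two catalog files.

    Each catalog line is a path like '/data/wort/wort-sra/sigs/SRR123.sig';
    the accession is the file stem ('SRR123').
    """
    def accessions(path: str) -> set:
        with open(path) as fp:
            return {Path(line.strip()).stem for line in fp if line.strip()}

    old, new = accessions(old_catalog), accessions(new_catalog)
    # added: in the new catalog only; removed: in the old catalog only
    return new - old, old - new
```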
## Side-quest: cleaning up empty sigs in wort

With the catalog ready, the next step is calculating a manifest. But as soon as I started calculating it, there were errors due to sourmash not being able to load some signatures. 🤔 This led to a side quest to investigate what was wrong with these sigs. Turns out they are empty:
I wanted to check how many of those exist, and ran the following command¹:
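The command itself is elided above; one way to count zero-byte signature files (the directory layout is an assumption) is with find:

```shell
# count empty (zero-byte) .sig files under the local wort mirror
# (default path is an assumption; override with SIGS_DIR)
SIGS_DIR=${SIGS_DIR:-/data/wort/wort-sra/sigs}
find "$SIGS_DIR" -type f -name '*.sig' -size 0 | wc -l
```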
Yikes. I checked a couple in S3 with …, and they are indeed empty. Double yikes! A defensive measure is to check for file size when compressing sigs in wort, and only upload them if they are not empty. For now I excluded them from the catalog, so manifest calculation can proceed. Other follow-up tasks:
## Building a manifest

branchwater indices are built from a sourmash manifest¹ containing some metadata from the signatures. They usually look like this:
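The example is elided above; a sourmash manifest is a CSV with a version header, looking roughly like this (the row values here are made-up illustrations, not real data):

```csv
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
sigs/SRR0000001.sig,0123456789abcdef0123456789abcdef,01234567,31,DNA,0,1000,4242,1,SRR0000001,sigs/SRR0000001.sig
```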
The Rust crate under … provides the `manifest` subcommand used here, wrapped in this snakemake rule:

```python
rule manifest_from_catalog_full:
    output:
        manifest="outputs/20240420-metagenomes-s1000.manifest",
    input:
        catalog="outputs/20240420-metagenomes-catalog",
        removed="inputs/20240420.removed",
    threads: 24
    shell: """
        export RAYON_NUM_THREADS={threads}
        RUST_LOG=info {EXEC} manifest \
            --output {output} \
            --basepath /data/wort/wort-sra \
            <(cat {input.catalog} {input.removed})
    """
```

It can be run with …
… and generates this command for execution:
This took … to run. Two things to note above:
I'll use this issue to document the steps to build a `k=31,scaled=1000` index for SRA metagenomes. This is the same process used for the current `k=21,scaled=1000` index in branchwater.sourmash.bio, but considering the changes from #4, and bringing in new SRA datasets added after the cutoff of the current index (2023-08-17).