diff --git a/notebooks/20220913-fraction-mtx-after-subtracting-mgx.ipynb b/notebooks/20220913-fraction-mtx-after-subtracting-mgx.ipynb
index 8e46398..b9333bb 100644
--- a/notebooks/20220913-fraction-mtx-after-subtracting-mgx.ipynb
+++ b/notebooks/20220913-fraction-mtx-after-subtracting-mgx.ipynb
@@ -11,9 +11,15 @@
  "\n",
  "These values were estimated by using FracMinHash sketching. \n",
  "[FracMinhash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2.abstract) represent a fraction of the k-mers in the original sequence.\n",
+ "\n",
  "K-mers are words of length _k_ in nucleotide sequences.\n",
  "Here we used k-mer sizes 21, 31, and 51 because these can be used to estimate genus, species, and strain-level similarity respectively between sets of sequences.\n",
- "We also downsampled to 1/200th of all k-mers to facilitate faster comparisons.\n",
+ "\n",
+ "We also downsampled to 1/200th (0.5%) of all k-mers (`scaled = 200`) to facilitate faster comparisons.\n",
+ "Importantly, the same subset of k-mers is retained across samples when those samples contain the same sequences.\n",
+ "Within the sourmash ecosystem, scaled values of 1000 and 2000 are the most commonly used because they offer a good compromise between comparison speed and retaining enough k-mers to accurately represent microbial genomes.\n",
+ "Because viral genomes can be very small, we reduced the scaled value to 200 to retain more k-mers, so that even small viral genomes would still have a few representative k-mers in the sketch.\n",
+ "See [this issue](https://github.com/sourmash-bio/sourmash/issues/1340) for more details about how the scaled value and the k-mer size affect the number of k-mers retained for viral genomes of different sizes.\n",
  "\n",
  "After we sketched both the metatranscriptome and the metagenome, k-mers that occurred in the metagenome were removed from its paired metatranscriptome.\n",
  "\n",