Update subtraction code and add notebook with first round of results #4
Conversation
- What is the impact of downsampling to 1/200 here? If I understand correctly, you are comparing 0.05% of the metagenome to 0.05% of the transcriptome. Assuming the full transcriptome and the full metagenome are actually exactly the same, what would the baseline expectation be for seeing something in the downsampled transcriptome that is not represented in the downsampled metagenome?
- For downstream assembly, will the reads need to be adapter trimmed?
- Will you do the assembly on the downsampled and subtracted k-mers?
- It would be comparing 0.5%, but yes. I added more details to the notebook describing what the downsampling value is doing here. Because the same sequences are retained across samples, if a full transcriptome and a full metagenome were exactly the same, there would be no sequences in the downsampled transcriptome that weren't in the metagenome -- the sketches would be identical.
- Yes, for downstream analysis I will adapter- and k-mer-trim the reads. (I don't plan to do assembly, but will build assembly graphs; assembly graphs have better recall for things that are systematically missing from our databases, which is what we're targeting here.) It will be fast to intersect the fractions of the leftover sketches that are retained in the trimmed reads.
- No, I won't do assembly with the downsampled and subtracted k-mers. Because they're downsampled, that type of analysis is effectively meaningless (I've done it multiple times for multiple projects over the last 5 years, and it's always very disappointing). Instead, I'm using this analysis to inform which downstream samples to dig into more deeply with assembly graph methods.
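To illustrate why identical inputs give identical downsampled sketches: scaled downsampling keeps a k-mer only when its hash falls below a fixed threshold, so the "sample" is a deterministic function of the k-mer content, not a random draw. A toy sketch of that idea (using SHA-1 in place of sourmash's actual hash function; `sketch` is a hypothetical stand-in, not the sourmash API):

```python
import hashlib

def sketch(kmers, scaled=200):
    """Toy FracMinHash: keep a k-mer's hash only if it falls in the
    bottom 1/scaled fraction of the 64-bit hash space."""
    max_hash = 2**64 // scaled
    keep = set()
    for kmer in kmers:
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            keep.add(h)
    return keep

# The same threshold rule applied to the same k-mers always retains
# the same hashes, so two identical inputs yield identical sketches.
kmers = {f"kmer{i}" for i in range(100_000)}
a = sketch(kmers)
b = sketch(kmers)
assert a == b            # no k-mers "lost" to random sampling
assert len(a - b) == 0   # subtracting identical sketches leaves nothing
```

This is why, for truly identical transcriptome and metagenome inputs, the baseline expectation for the subtraction is zero leftover hashes.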
- I am really surprised at how high the mean_f_n_hashes is... Am I understanding this correctly, that in several cases >90% of unique hashes in the transcriptome are not in the metagenome? It looks like abundance weighting resolves some of this, which you are attributing to sequencing errors, etc., but that would be a super high error rate, right?
- Why did you choose k = 31? How do these fractions change with the more stringent k = 51?
- Yeah, this is fairly unintuitive, but remember that for every single sequencing error, we get up to 31 k-mers that are erroneous and will probably only occur in a data set once (think of a sliding window over the error, where the error appears at every position of that sliding window). This means that distinct k-mers representing errors really blow up -- abundance-weighting gets rid of a lot of these problems.
- I chose 31 because it's an ok default :) In reality, we could do it with all k-mer sizes, and I'm continuing the downstream taxonomic profiling with all k-mer sizes. The numbers decrease by ~3% for k = 21 and increase by ~3% for k = 51. These changes could be due to errors, though... errors have less of an impact at smaller k-sizes and a bigger impact at bigger k-sizes. Also, regarding the conversation we had about the euk RNA db (now recorded in "taxonomically profiling the subtracted mtx fractions with a eukaryotic RNA database?" #5), I only have that database for k = 31.
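The sliding-window blowup described above is easy to demonstrate: a single substitution in the middle of a read corrupts every k-mer whose window covers it, i.e. up to k distinct, likely-unique k-mers. A small self-contained check (the random sequence and error position are made up for illustration):

```python
import random

def kmers(seq, k=31):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

random.seed(0)
true_seq = "".join(random.choice("ACGT") for _ in range(200))

# Introduce a single substitution error in the middle of the read.
pos = 100
new_base = "A" if true_seq[pos] != "A" else "C"
err_seq = true_seq[:pos] + new_base + true_seq[pos + 1:]

# Every k-mer whose 31-bp window overlaps the error is now erroneous.
novel = kmers(err_seq) - kmers(true_seq)
print(len(novel))  # up to k = 31 novel k-mers from one error
```

One base-call mistake thus inflates the distinct-k-mer count by up to 31, which is why abundance weighting (discarding singleton hashes) removes so much of the apparent metatranscriptome-only signal.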
The way I had implemented the subtraction before was crawlingly slow (documented here: sourmash-bio/sourmash#2248), so I re-implemented it to use the sourmash CLI instead of the Python API, because for some reason the CLI was faster. I had to change the variable names for the output: before, I was using the paired sample identifier for the output sigs; now I'm using <MTX_ACCESSION>-minus-<MGX_ACCESSION>, because this helped the DAG solve correctly for snakemake outside of the Python API (previously my Python code took care of the naming; now I trick the snakemake DAG into running things correctly).
This PR also adds two rules to run `sourmash sig describe` on all of the signatures calculated by this snakefile. `sig describe` reports information about the signature, like the k-size and scaled value used, and most importantly, the number of distinct and total k-mers in each sketch. Note that I had to write this rule twice to cover both sets of output files. I had it written only once, but then `sig describe` would have run all at once on a single thread instead of being parallelized by snakemake across all of the threads on my instance. It was much faster to run it in parallel, so I took the hit of duplicated code.

The last file in this PR is a notebook that uses the output of `sourmash sig describe` to estimate the fraction of each metatranscriptome that remains after k-mers in the paired metagenome have been removed.
Next steps (in next PR) will be to taxonomically profile the mtx leftovers, and to make sure host has been removed.