Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update subtraction code and add notebook with first round of results #4

Merged
merged 5 commits into from
Sep 21, 2022

Conversation

taylorreiter
Copy link
Member

@taylorreiter taylorreiter commented Sep 13, 2022

The way I had implemented the subtraction before was crawlingly slow (documented here: sourmash-bio/sourmash#2248), so I re-implemented it to use the sourmash CLI instead of the python API...because for some weird reason the CLI was faster. I had to change the variable names for the output -- before, I was using the paired sample identifier for the output sigs. now I'm using <MTX_ACCESSIONS>-minus-<MGX_ACCESSION> because this helped the DAG solve correctly for snakemake outside of the python API (previously my python code was taking care of the naming, now I use trick the snakemake DAG into running things correctly).

This PR also adds two rules to run sourmash sig describe on all of the signatures calculated by this snakefile. sig describe reports information about the signature, like the k-size and scaled value used, and the most importantly, the number of distinct and total k-mers in each sketch. Note I had to write this rule twice to solve for both sets of output files. I had it written only once, but then sig describe would have to run all at once on a single thread instead of being parallelized by snakemake across all of the threads on my instance. It was much faster for me to run it in parallel, so I took the hit with duplicated code.

The last file in this PR is a notebook that uses the output of sourmash sig describe to estimate the fraction of each metatranscriptome that remains after k-mers in the paired metagenome have been removed.

Next steps (in next PR) will be to taxonomically profile the mtx leftovers, and to make sure host has been removed.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@@ -0,0 +1,555 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • what is the impact of downsampling to 1/200 here? If I understand correctly, you are comparing 0.05% of the metagenome to 0.05% of the transcriptome. assuming the full transcriptome and the full metagenome are actually exactly the same, what would the baseline expectation be that you would see something in the downsampled transcriptome not represented in the downsampled metagenome?
  • for downstream assembly will the reads need to be adapter trimmed?
  • will you do the assembly on the downsampled and subtracted k-mers ?

Reply via ReviewNB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. it would be comparing 0.5%, but yes. I added more details to the notebook describing what the downsampling value is doing here. Because the same sequences are retained across samples, if a full transcriptome and full metagenome were exactly the same, there would be no sequences in the downsampled transcriptome that weren't in the metagenome -- the sketches would be exactly the same.
  2. Yes for downstream analysis I will adapter and k-mer trim the reads ( I don't plan to do assembly, but will build assembly graphs; assembly graphs will have better recall for things that are systematically not in our databases, which is what we're targeting here). It will be fast to intersect the fractions of the the leftover sketches that are retained in the trimmed reads.
  3. No I won't do assembly with the downsampled and subtracted k-mers. Because they're downsampled, it becomes effectively meaningless to do that type of analysis with them (i've done it multiple times for multiple projects over the last 5 years and it's always very disappointing). Instead, I'm using this analysis to inform what downstream samples to dig into more with assembly graph methods.

@@ -0,0 +1,555 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I am really suprised at how high the mean_f_n_hashes is ....am i understanding this correctly, that in several cases >90% of unique hashes in the transcriptome are not in metagenome? it looks like abundance weighting resolves some of this, which you are attributing to sequencing errors, etc. but that would be a super high error rate right?
  • Why did you choose k = 31? How do these fractions change with more stringest k = 51?

Reply via ReviewNB

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Ya this is fairly un-intuitive, but remember that for every single sequencing error, we get 31 k-mers that are erroneous and will probably only occur in a data set once (think of a sliding window over the error, where the error occurs in every position during that sliding window). This means that distinct k-mers that represent errors really blow up -- abundance-weighting them gets rid of a lot of these problems.
  2. I chose 31 because it's an ok default :) in reality, we could do it with all k-mer sizes, and i'm continuing the downstream taxonomic profiling with all k-mer sizes. The numbers decrease by ~3% for k=21, and increase by 3% for k = 51. These changes could be due to errors tho ... errors will have less of an impact at smaller k-sizes and bigger impact at bigger ksizes. Also, the conversation we had about the euk rna db (now recorded in taxonomically profiling the subtracted mtx fractions with a eukaryotic RNA database? #5), I only have that database for k = 31.

@taylorreiter taylorreiter merged commit a33e96e into main Sep 21, 2022
@taylorreiter taylorreiter deleted the update_subtract branch September 21, 2022 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants