Update subtraction code and add notebook with first round of results #4
Conversation
- What is the impact of downsampling to 1/200 here? If I understand correctly, you are comparing 0.05% of the metagenome to 0.05% of the transcriptome. Assuming the full transcriptome and the full metagenome are actually exactly the same, what would the baseline expectation be for seeing something in the downsampled transcriptome that is not represented in the downsampled metagenome?
- For downstream assembly, will the reads need to be adapter trimmed?
- Will you do the assembly on the downsampled and subtracted k-mers?
- It would be comparing 0.5%, but yes. I added more details to the notebook describing what the downsampling value is doing here. Because the same sequences are retained across samples, if a full transcriptome and a full metagenome were exactly the same, there would be no sequences in the downsampled transcriptome that weren't in the metagenome -- the sketches would be identical.
- Yes, for downstream analysis I will adapter- and k-mer-trim the reads. (I don't plan to do assembly, but will build assembly graphs; assembly graphs have better recall for things that are systematically missing from our databases, which is what we're targeting here.) It will be fast to intersect the fractions of the leftover sketches that are retained in the trimmed reads.
- No, I won't do assembly with the downsampled and subtracted k-mers. Because they're downsampled, that type of analysis is effectively meaningless (I've done it multiple times for multiple projects over the last 5 years, and it's always very disappointing). Instead, I'm using this analysis to inform which downstream samples to dig into more deeply with assembly graph methods.
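To illustrate why identical inputs give identical downsampled sketches: scaled downsampling keeps a k-mer only when its hash falls below a fixed threshold, so the "sample" is a deterministic function of the k-mer content, not a random draw. A toy sketch of that idea (using SHA-1 in place of sourmash's actual hash function; `sketch` is a hypothetical stand-in, not the sourmash API):

```python
import hashlib

def sketch(kmers, scaled=200):
    """Toy FracMinHash: keep a k-mer's hash only if it falls in the
    bottom 1/scaled fraction of the 64-bit hash space."""
    max_hash = 2**64 // scaled
    keep = set()
    for kmer in kmers:
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            keep.add(h)
    return keep

# The same threshold rule applied to the same k-mers always retains
# the same hashes, so two identical inputs yield identical sketches.
kmers = {f"kmer{i}" for i in range(100_000)}
a = sketch(kmers)
b = sketch(kmers)
assert a == b            # no k-mers "lost" to random sampling
assert len(a - b) == 0   # subtracting identical sketches leaves nothing
```

This is why, for truly identical transcriptome and metagenome inputs, the baseline expectation for the subtraction is zero leftover hashes.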
- I am really surprised at how high the mean_f_n_hashes is... Am I understanding this correctly, that in several cases >90% of unique hashes in the transcriptome are not in the metagenome? It looks like abundance weighting resolves some of this, which you are attributing to sequencing errors, etc., but that would be a super high error rate, right?
- Why did you choose k = 31? How do these fractions change with the more stringent k = 51?
- Yeah, this is fairly unintuitive, but remember that for every single sequencing error, we get up to 31 k-mers that are erroneous and will probably only occur in a data set once (think of a sliding window over the error, where the error appears at every position of that sliding window). This means that distinct k-mers representing errors really blow up -- abundance-weighting gets rid of a lot of these problems.
- I chose 31 because it's an ok default :) In reality, we could do it with all k-mer sizes, and I'm continuing the downstream taxonomic profiling with all k-mer sizes. The numbers decrease by ~3% for k = 21 and increase by ~3% for k = 51. These changes could be due to errors, though... errors have less of an impact at smaller k-sizes and a bigger impact at bigger k-sizes. Also, regarding the conversation we had about the euk RNA db (now recorded in "taxonomically profiling the subtracted mtx fractions with a eukaryotic RNA database?" #5), I only have that database for k = 31.
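The sliding-window blowup described above is easy to demonstrate: a single substitution in the middle of a read corrupts every k-mer whose window covers it, i.e. up to k distinct, likely-unique k-mers. A small self-contained check (the random sequence and error position are made up for illustration):

```python
import random

def kmers(seq, k=31):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

random.seed(0)
true_seq = "".join(random.choice("ACGT") for _ in range(200))

# Introduce a single substitution error in the middle of the read.
pos = 100
new_base = "A" if true_seq[pos] != "A" else "C"
err_seq = true_seq[:pos] + new_base + true_seq[pos + 1:]

# Every k-mer whose 31-bp window overlaps the error is now erroneous.
novel = kmers(err_seq) - kmers(true_seq)
print(len(novel))  # up to k = 31 novel k-mers from one error
```

One base-call mistake thus inflates the distinct-k-mer count by up to 31, which is why abundance weighting (discarding singleton hashes) removes so much of the apparent metatranscriptome-only signal.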
The way I had implemented the subtraction before was crawlingly slow (documented here: sourmash-bio/sourmash#2248), so I re-implemented it to use the sourmash CLI instead of the Python API, because for some reason the CLI was faster. I had to change the variable names for the output: before, I was using the paired sample identifier for the output sigs; now I'm using <MTX_ACCESSION>-minus-<MGX_ACCESSION>, because this helped the DAG solve correctly for snakemake outside of the Python API (previously my Python code took care of the naming; now I trick the snakemake DAG into running things correctly).
This PR also adds two rules to run `sourmash sig describe` on all of the signatures calculated by this snakefile. `sig describe` reports information about the signature, like the k-size and scaled value used, and most importantly, the number of distinct and total k-mers in each sketch. Note that I had to write this rule twice to cover both sets of output files. I had it written only once, but then `sig describe` would have run all at once on a single thread instead of being parallelized by snakemake across all of the threads on my instance. It was much faster to run it in parallel, so I took the hit of duplicated code.

The last file in this PR is a notebook that uses the output of `sourmash sig describe` to estimate the fraction of each metatranscriptome that remains after k-mers in the paired metagenome have been removed.
Next steps (in next PR) will be to taxonomically profile the mtx leftovers, and to make sure host has been removed.