-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decontamination has a huge effect on real data tests #20
Comments
I will rerun the |
There must be a bug @leoisl , these differences are too big. Can you just ping martin and ask for 15 mins of his time, and show him the results with/out decontam? |
|
I do agree with this reasoning and this was my first thought, but then running |
I think so
Just have one sample for now ( PS: looks like number of SNPs in the SNPs file don't matter much, as we have a single SNP entry per position in the genome, which might call ref or alt (IDK why it is made like this, I just copied the pipeline) |
TLDR: with decontamination, we have ~3.2M Raw data:
|
update: after a chat with @martinghunt , we both think the issue is due to the lack of decontamination. I will rerun |
@martinghunt found that the non-decontaminated consensus has ~382k more Raw data:
|
small update:
|
No, sorry these are old symlinks to data basecalled with a much older version of guppy. This is the rule where the raw basecalled data is output https://github.com/mbhall88/head_to_head_pipeline/blob/d7b30eda3cca5fbaf2dc22dbcacd79f7c3a876c9/data/QC/Snakefile#L172 The most recent version of guppy I basecalled with is 5.0.16. The data you were using was 3.4.5, so that will contribute hugely to the differences you are seeing. To get the paths for the raw data you will need the guppy version, nanopore run name and sample - which are both in the samplesheet. An example is For comparing the consensus files, why not use For the mykrobe results its probably sufficient to check the susceptiblity calls match, no need to check the whole file. |
Thanks for the explanation about the different guppy reads. I am now using the correct ONT raw reads, basecalled with
|
I got a totally different analysis with
Basically, I thought a big difference between the two tools was that I am quite confused why we have such opposing statements from these tools |
Comparing nucmer, mash dist and psdmAll done in codon: Running nucmer
Nucmer outputIt shows 437 alignment blocks, each with ~90% of identity. NOTE: Martin says this Nucmer output is meaningless because the consensus is full of
Although I think Nucmer output:
Running mash
i.e. Running psdm
i.e. identical output, 0 difference. Conclusion:
|
i think an N means mash ignores many bases around Ns also, whereas psdm does not |
psdm ignores all
|
If there is an |
Trust psdm |
This is the output:
If we say that |
PS: I note that |
OK so that is a bit of an issue, as losing 1963 calls is probably losing 80% of calls. I think we'd better count how many non-ref calls we lose. We can afford to lose some, but 1900 is a lot. Is it that bad for all samples Leandro? |
Didn't check for all samples yet, but I want to be sure about what I am stating. Actually, for this sample, we have 1963 more Looking at this sample,
The difference is: raw ONT - 219 more alt |
Yes, just PASS calls are applied. Something which could be quite informative is if the tbpore VCFs have less calls, check the counts on the filters being applied. That might hint at the problem. For instance, if there is lots of FRS filter, then contamination is likely. You could also run hap.py/varifier using the head to head VCF as the truth and use that narrow in on the variants that are missing? |
Also could run decontamination just using the non-human references and see what that does. Oof. |
Am I right to assume we just care about non-ref calls for this? If so, that is on target! (I know it is just one sample, have to check for all others, but can't do this today, on training whole day):
Without decontamination we have ~95% non-
Would this be better/more informative than looking at the
Should I do that in a branch? And how do I measure if decontamination indeed improved results? Looking at |
Hi @leoisl - i agree with making a version without decontam first. Then, on a branch, we could put in the decontam (using the code michael has linked to), using a set of references of NTMs etc but not human, and see what SNPs we get and if the resulrs are closer to head2head than without-decontam. Does that make sense? If it works, and if the RAM use is not too big, i think we should just do that. Finally, @leoisl , you asked if looking at fRS is better than using hap.py/varifier. @mbhall88 is basically wondering if the places where with/out decontam differ are always the same place across genomes, or whether there is some pattern. I guess they are just alternative ways of probing/digging. |
Two other ways of comparing the decontam and non-decontam would be comparing the distance between the SNP distance matrices. i.e. if samples A and B have a distance of 50 with decontam, and 51 without decontam, then that's pretty good. |
I used this snakemake pipeline to run
tbpore
on 91 madagascar samples from Michael. I wanted to compare thehead_to_head_pipeline
results (described here) andtbpore
results when inputting the raw ONT reads, but had difficulties comparing themykrobe
output files as well as the VCFs, due to a high number of differences, so I decided to just compare the easiest output: the consensus fastas. I tried to use edit distance, but it was too slow, so I switched tomash dist
. However, the differences between thehead_to_head_pipeline
andtbpore
on these 91 samples were extremely high (3rd column is themash
distance, first pair of samples have amash
distance if 12.5%, the second 22.9% and so on):We are looking here at
head_to_head_pipeline
andtbpore
having consensus distances between 5% and 25% which I think is way too high, incomparable, especially for TB. So I made a small test: I got themada_2-42
sample, which has amash distance
of 25.4%, andtbpore
with the raw nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/madagascar/nanopore/mada_2-42/mada_2-42.nanopore.fastq.gz
- @mbhall88 could you please confirm this is the path to the raw nanopore reads?) and thenmash dist
to confirm that we are indeed getting amash distance
of 25.4% and there is nothing wrong with the pipeline:tbpore
with the decontaminated nanopore reads (/hps/nobackup/iqbal/mbhall/tech_wars/data/QC/filtered/madagascar/nanopore/mada_2-42/mada_2-42.filtered.fastq.gz
- @mbhall88 could you please confirm this is the path to decontaminated nanopore reads?) and thenmash dist
to now get amash distance
of only 0.05%:So it seems to me decontamination is essential to
tbpore
The text was updated successfully, but these errors were encountered: