Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about Depth Coverage parameter #128

Closed
nbargues opened this issue Aug 3, 2022 · 20 comments
Closed

Question about Depth Coverage parameter #128

nbargues opened this issue Aug 3, 2022 · 20 comments
Labels
enhancement New feature or request

Comments

@nbargues
Copy link

nbargues commented Aug 3, 2022

Hello , I use clair3 on my Nanopore data.

I have a question about the Depth.
I determine the overall depth of my bam file with MosDepth (https://github.com/brentp/mosdepth).
And when I look up to the Depth of a specific variant (parameter DP on the VCF file) on the clair3 VCF output, I see that the Depth is significantly lower that the overall depth ( for example a sample with a overall Depth of 900X will have only 100X on certain variant ).

My question are, how to justify this fluctuation of Depth ? what are the criteria of filtering used by clair3 ?
and I see also if I compare duplicat from two different run that I found also a big fluctuation for the depth. Why ?

Thanks in advance

@aquaskyline
Copy link
Member

Clair3 downsamples input to either 144 or 89-fold coverage if the input coverage is higher than 144.

@nbargues
Copy link
Author

nbargues commented Aug 3, 2022

Thanks for the quick answer. In this image you can see that some variant of around 900X and some around 100X :

image

So how can I explain that difference ? Why some variant are downsamples and other don't ?

Sorry if I understant incorrectly your response

@aquaskyline
Copy link
Member

Even though it is not totally impossible, this level of fluctuation is not normal. Would you be able to show us an IGV screenshot of these sites, or simply send me a bam of this small region so I could look into the details?

@nbargues
Copy link
Author

nbargues commented Aug 4, 2022

Here is a first snp with a DP in the VCF of 184 :
image
And here a second snp with a DP in the VCF of 185 :
image

I have also another information : I also use at the same time LongShot caller (https://github.com/pjedge/longshot) and for the same variant, it founds around 900X.

@aquaskyline
Copy link
Member

It could be because of mapping quality. Clair3 doesn't consider alignments with MQ<5 by default.

@nbargues
Copy link
Author

nbargues commented Aug 8, 2022

After some QC, it is not a quality problem ( all alignment have a MQ > 5 ). Do you have another lead that can explain my issue ? Thanks

@aquaskyline
Copy link
Member

We need a bam of a few concerned positions to tell.

@nbargues
Copy link
Author

nbargues commented Aug 9, 2022

bc01_minimap2_ont.bam.gz

Here is the BAM that correspond with the table of SNP shown above. Thanks

@aquaskyline
Copy link
Member

aquaskyline commented Aug 11, 2022

Thanks for your patience. I confirmed that it's a defect of reporting DP in the r11 version only. Before r11, in which we switched to using C implementation, both pileup-calling and full-alignment-calling were reporting DP as the coverage "after subsampling". But in r11, pileup-calling still reports "after subsampling" while full-alignment-calling reports "raw coverage". We are fixing the problem and preparing to release r12 in the next week to resume reporting "coverage after subsampling" for both pileup-calling and full-alignment-calling in the new release. Before the new version, we can confirm that in r11, despite the reported coverages having discrepancies, the other characteristics of the called variants should be correct.

@nbargues
Copy link
Author

Thanks @aquaskyline for your feedback. I will wait for the new release. When the new release are up, the link of the docker image (docker://hkubal/clair3:latest) will be updated ?

@aquaskyline
Copy link
Member

Correct, all installation options will be updated in sync.

@jftaly
Copy link

jftaly commented Aug 17, 2022

Hello @aquaskyline,

I am working with Nicolas on this issue.
Sorry if I ask naive questions as I don't have a proper knowledge of your tool.
I would like to be sure I understood correctly what would be the VCF output after the bug correction.

Is it correct that in the exemple Nicolas gave, the variants showing a DP ~ 1000X were called with a full-aligment model and therefore they should show a DP ~ 180X, as the other variants (called with pileup), in version r12?

Therfore, whatever the amount of reads with give to Clair3, the DP values in the VCF file should be in the range of 100X-200X?

Many thanks for your help

@aquaskyline
Copy link
Member

aquaskyline commented Aug 17, 2022

@jftaly thanks for asking. To facilitate my following explanation, I use your case as an example, and I define 100x~200x as "subsampled coverage", and ~1000x as "raw coverage". In my previous reply to Nicolas, I said I propose to report "subsampled coverage" for both pileup- and full-alignment-based variant calls in r12. But after discussion with my team, and some experiments, we found that reporting "raw coverage" for both pileup- and full-alignment-based is doable without having any changes to the variant calling results (it is important for us to maintain the result consistency between versions except for sufficiently-benchmarked performance upgrades). So our new plan is to output "raw coverage" for all variant calls. We believe the new plan matches more users' expectations of the DP tag.

In short, in your example, using the upcoming r12, the DP values in the VCF file should be in the range of ~1000x.

@nbargues
Copy link
Author

Thanks @aquaskyline for your response.

I have another question regarding the fact taht Clair3 downsamples input to either 144 or 89-fold coverage if the input coverage is higher than 144.

As you can see on the table in the beginning of this topic, the depth is around 100x~200x but you state that it's maximum 144. So why we have certain value such as 185 ?

Thanks in advance for the clarification

@aquaskyline
Copy link
Member

In pileup calling, subsampling is for lowering the coverage to a value that has ample representatives during model training. The subsampling in pileup calling sets the target average coverage at 144. But because of the fluctuation of coverage along the sequence, both above and below 144 are possible.

In full-alignment calling, the height of the input is limited to 89, thus after subsampling, the coverage can only be 89 or below.

@aquaskyline aquaskyline added the enhancement New feature or request label Aug 19, 2022
@aquaskyline
Copy link
Member

v0.1-r12 is now released.

@nbargues
Copy link
Author

Thanks for the release.

I have a question regarding your method of subsampling ; is it a random subsampling or do you have a specific method for the downsampling ?

And when you said that you have a fluctuation in the coverage ( around 144 ), do you have a max/min coverage where the coverage can't be above/below ?

@aquaskyline
Copy link
Member

For subsampling, we used samtools view -s.

No. using samtools view -s, the subsampling is rescaled by a factor f, the original min/max will become min*f/max*f.

@nbargues
Copy link
Author

Ok. And when you use the parameter -s, do you use a random seed or a harcoded seed ?

Because it seems that when I do a repetability analysis on the same BAM file with clair3, I obtain exactly the same VCF.

@aquaskyline
Copy link
Member

We used a hardcoded seed, thus you should expect the same VCF output. But I need to add one more thing, samtools view -s was used in r11 and the versions before. In r12, in order to retain the DP information, we implemented the "rescaled by a factor f" ourselves, thus not relying on samtools view -s anymore. Nonetheless, the subsampling logic remains the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants