Unify mark and remove duplicates and validate BAM file workflows #1095
Comments
Continued discussion from issue #1053: Implementing an intermediate step in Sentieon Dedup removes all instances of duplicate reads.

Looking in more detail at the usage of Sentieon Dedup (https://support.sentieon.com/manual/usages/general/), I observed some arguments of interest.
They seemed to be able to solve the issue with the missing-mate errors by removing all instances of a duplicate read "regardless of whether they are primary or non-primary". The documentation didn't say anything about unmapped reads, but I decided to give it a try and implemented the intermediate step in the parallel alignment PR (#1090).
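A hedged sketch of what that intermediate step looks like, paraphrasing the usage described in the Sentieon manual linked above (option and file names are written from memory here and should be verified against the manual):

```bash
# 1. Collect duplicate scores as usual.
sentieon driver -t 16 -i sample.sorted.bam \
    --algo LocusCollector --fun score_info score.txt.gz

# 2. Intermediate step: write the names of ALL duplicate reads
#    (primary and non-primary) to a file instead of marking them.
sentieon driver -t 16 -i sample.sorted.bam \
    --algo Dedup --score_info score.txt.gz \
    --output_dup_read_name dup_read_names.txt.gz

# 3. Remove every record whose read name is in the list.
sentieon driver -t 16 -i sample.sorted.bam \
    --algo Dedup --rmdup --dup_read_name dup_read_names.txt.gz \
    sample.dedup.bam
```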
And it seems to work as intended: running ValidateSamFile on these BAM files (as in the sketch below) finds no errors. Perhaps this solution can be implemented in the standard pipeline. If a read is marked as a duplicate, I can see no reason why unmapped mates or non-primary alignments of that read should be kept in the BAM file.

If this is true, the question remains whether removing these additional reads is important enough to justify adding this extra step, which will add some computation time. The "important enough" part is difficult to answer, as it can be hard to trace the effects of these reads on downstream analysis tools. If I were to guess, the downstream effects of keeping these reads are probably in the range of minor to non-existent; the cleanest option, however, is to remove them. I think the most important deciding factor for including it or not is the effect on runtime. If the effect is in the range of adding ~30 minutes to the total analysis of a WGS T/N case, I think it may be worth it. But this is just my thinking at the moment. To begin with, I will run the same WGS case 3 times each, with and without the extra dedup step, and compare the runtimes.
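For reference, the validation call was along these lines (a minimal sketch; the file name is a placeholder):

```bash
# Summarise any remaining SAM/BAM validation errors by type.
picard ValidateSamFile I=sample.dedup.bam MODE=SUMMARY
```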
Adding the extra dedup step does not significantly increase the runtime of the Dedup rule. I ran the dedup rule on 2 WGS samples, tumor WGS120X and normal WGS30X, belonging to case uniquewolf, and summarised the runtime for all steps run in the rule within each of the categories "elapsed", "user" and "sys" (see the timing sketch below).
Below are the averages for each type:
Below is a table with the differences "3 step - 2 step" for the averages.
In conclusion from these tests, adding this extra dedup step does not significantly affect the runtime and should be fine to include, increasing the average runtime of a 120X WGS sample by only ~12 minutes.
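For context, "elapsed", "user" and "sys" are the categories reported by GNU time; a minimal sketch of how such per-step timings can be captured (the wrapper and file names are hypothetical, not BALSAMIC code):

```bash
#!/usr/bin/env bash
# Run any command under GNU time and keep only the three summarised categories.
/usr/bin/time -v "$@" 2> step_time.log
grep -E 'Elapsed \(wall clock\) time|User time|System time' step_time.log
```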
Commenting on the possibility of running fastp as the deduplication stage instead of the Sentieon Dedup or Picard MarkDuplicates tool: I wonder if it is possible, given that we want to keep the workflow parallel upstream of generating the final BAM file. To explain further, I don't see any option for passing multiple pairs of FASTQ files to fastp, which would be necessary to be sure we are removing all duplicates at this stage, since the same fragment may have duplicate reads across different lanes. Example: if fragment X has a read pair on each of four lanes and each lane is deduplicated separately, removing duplicates with fastp would still leave 3 duplicates for fragment X. If fastp is limited in this way then I think we can decide right away that we cannot rely ENTIRELY on this tool.

Perhaps it could be implemented later on as a means of speeding up the analysis a bit by removing some of the duplicates before alignment. But I think this would not be a very high-priority thing, and it would make collecting the QC values for % duplicates a bit trickier too.

A possible workaround with fastp dedup is output splitting: https://github.com/OpenGene/fastp#output-splitting. However, if we do this, I don't think there is a way we can preserve the original lane information. The input files would be concatenated like before, and the split will most likely produce FASTQ files that are mixed across lanes. So in the end I'm skeptical that fastp can be our dedup solution!
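A rough sketch of that output-splitting workaround, assuming concatenated lanes as input (file names and the split count are placeholders):

```bash
# Concatenate lanes so fastp sees every copy of a fragment, then dedup
# and split the output into 4 chunks for parallel alignment.
cat sample_L00*_R1.fastq.gz > sample_R1.fastq.gz
cat sample_L00*_R2.fastq.gz > sample_R2.fastq.gz
fastp --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
      --out1 dedup_R1.fastq.gz --out2 dedup_R2.fastq.gz \
      --dedup --split 4
# Note: the split chunks (0001.dedup_R1.fastq.gz, ...) no longer map to lanes.
```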
Thank you for the update @mathiasbio. Based on the comments above, I suggest we continue with Sentieon/Picard Markdup. We can also add --dont_eval_duplication to fastp and fetch the information from the
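A minimal sketch of what adding that flag to the fastp call might look like (file names and placement are placeholders; the actual BALSAMIC rule may differ):

```bash
fastp --in1 R1.fastq.gz --in2 R2.fastq.gz \
      --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
      --dont_eval_duplication --json fastp.json --html fastp.html
```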
Sounds like a good idea! I'll add it!
Description
An increased percentage of duplication is observed, as described in #1059. Looking carefully at how the percentage of duplicates is reported and removed, it is observed that the method for reporting, marking and removal of duplicates is not unified in BALSAMIC. Three bioinformatic tools, i.e. fastp, picard and Sentieon, are used for reporting, marking and removal of duplicates. The SAM validation error identified in the mergeBam rule led to a comprehensive validation check of the BAM files and the identification of invalid flags for mates. It is also relevant to quote the suggestion from the "invalid bam flag using picardtools" issue:
"The bam files generated multiple errors showing an invalid bam flag for numerous reads. Samtools could fix a few of them, but not all. The Picard tools FixMateInformation command performs better than samtools and fixes the invalid bam flags. The possible origin of these errors could be the concatenated fastq files. The current PR fixes the issue, but it would be better to use each fastq as input and merge bam files instead."
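For reference, the Picard fix referenced in the quote looks roughly like this (file names are placeholders):

```bash
# Rewrite mate-pair flags and insert sizes so both mates of a pair agree.
picard FixMateInformation \
    I=sample.merged.bam \
    O=sample.fixmate.bam \
    ADD_MATE_CIGAR=true
```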
It is possible that the problems described in the issues above are linked; a careful overview is required to make sure that the standard and recommended workflow is implemented in BALSAMIC for the generation of final, valid BAM files.
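A hedged sketch of the "each fastq as input, merge BAMs" alternative suggested in the quote above (lane names, reference and thread count are placeholders):

```bash
# Align each lane separately so mate information stays consistent per lane,
# then merge the per-lane BAMs into a single file.
for lane in L001 L002; do
    bwa mem -t 16 -R "@RG\tID:${lane}\tSM:sample" ref.fasta \
        sample_${lane}_R1.fastq.gz sample_${lane}_R2.fastq.gz \
        | samtools sort -o sample_${lane}.bam -
done
samtools merge sample.merged.bam sample_L001.bam sample_L002.bam
```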
Suggested solution
This requires identifying and fixing all the problems associated with validation errors in the final BAM file. This includes:

- FASTQ files
- fastp, picard and Sentieon mark and remove duplicates in the TGA and WGS (TN/TO) workflows
- unify the workflow

This can be closed when:
Blocked by
#1069
#1090
DAGs:
TN_WGS.pdf
TN_TGA.pdf