add clumpify-based dedup #970
base: master
Conversation
add bbmap.BBMapTool().dedup_clumpify(), along with unit tests
pass JVMmemory to bbmap and clumpify; add rmdup_clumpify_bam to read_utils.py; change TestRmdupUnaligned unit tests for bbmap to use read_utils.py::rmdup_clumpify_bam; add dedup_bam WDL task to tasks_read_utils.wdl
replace mvicuna-based read deduplication in taxon_filter.py::deplete() with clumpify-based deduplication that occurs farther upstream, in advance of BWA-based depletion; add dedup_bam WDL workflow; in the dedup_bam WDL task, create and emit a FastQC report of only the de-duplicated reads; update unit test input to include duplicate reads, and update expected output for the test_taxon_filter::TestDepleteHuman integration tests to reflect the difference between clumpify output and the previous mvicuna output
DNAnexus seems to have replaced their wiki with a new documentation page ( https://documentation.dnanexus.com/downloads ) and the old download URLs along with it
Let's hold off on merging this for now. While the tests pass here, the slow tests from DNAnexus (for which we do not have feedback on GitHub) are currently failing during dedup, with a NullPointerException in Picard. For example: https://platform.dnanexus.com/projects/F8PQ6380xf5bK0Qk0YPjB17P/monitor/job/FZZbVxj0xf5jZ07ZJ71gpFJ6
Allow containments (where one sequence is shorter) when using bbmap clumpify to deduplicate
We should either remove …
Some specific comments in a few places above, mostly on the WDL, and broader question as a line-comment on the taxon_filter.deplete step.
One thing I can't quite tell from the code diffs: how well do clumpify.sh and our wrapper code handle the preservation of bam headers? This is something we go to extreme lengths to preserve in all our other fastq-based tool invocations (to the extent of splitting out into separate files by RG, running things like novoalign separately on each set, and re-merging together at the end, to maintain proper RG-to-read mappings).
pipes/WDL/workflows/demux_metag.wdl
Outdated
}
}

scatter(reads_bam in dedup.dedup_bam) {
I wouldn't double-scatter. You can just keep this as a single scatter block on the raw_reads and put all the task calls together in that single scatter. WDL interpreters/compilers are smart enough to figure out the DAG and parallelization opportunities within the scatter based on the dependencies between their inputs and outputs.
So specifically:
scatter(raw_reads in illumina_demux.raw_reads_unaligned_bams) {
  call reads.dedup_bam as dedup {
    input:
      in_bam = raw_reads
  }
  call reports.spikein_report as spikein {
    input:
      reads_bam = dedup.dedup_bam
  }
  call taxon_filter.deplete_taxa as deplete {
    input:
      raw_reads_unmapped_bam = dedup.dedup_bam
  }
  call assembly.assemble as spades {
    input:
      assembler = "spades",
      reads_unmapped_bam = deplete.cleaned_bam
  }
}
Two more thoughts on this topic:
- Importantly, merging them would allow the execution platform to keep going on the linear DAG portions of each sample as they become ready without waiting for all samples to complete dedup before proceeding to the next steps.
- I wonder if we should consider running the spikein counting step on raw / non-deduplicated reads... ERCCs are so short that we might quickly hit an artificial upper bound on the counts if we do it on dedup output.
@@ -5,18 +5,26 @@ import "tasks_metagenomics.wdl" as metagenomics
import "tasks_taxon_filter.wdl" as taxon_filter
If you think this is ready and want to try it out, shouldn't you remove #DX_SKIP_WORKFLOW?
pipes/WDL/workflows/demux_metag.wdl
Outdated
@@ -39,6 +47,6 @@ workflow demux_metag {
}
call metagenomics.kaiju as kaiju {
input:
We haven't really used kaiju regularly via WDL yet, but I'm betting that we may want to consider moving it to a scatter-on-single-sample execution mode (like everything else in our WDLs except kraken). Its database is about 4x smaller (I'm guessing the localization time is just a few minutes) and the execution time of the algorithm is much slower, so the cost efficiency (algorithmic compute time vs VM wall clock time) of kaiju on a single sample is much better than kraken... so we might as well move it within the same scatter block as well.
pipes/WDL/workflows/demux_metag.wdl
Outdated
@@ -27,7 +35,7 @@ workflow demux_metag {

  call metagenomics.krakenuniq as kraken {
    input:
-     reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams,
+     reads_unmapped_bam = dedup.dedup_bam
  }
  call reports.aggregate_metagenomics_reports as metag_summary_report {
Can we call aggregate_metagenomics_reports a second time on the kaiju outputs as well?
metagenomics.py::taxlevel_summary() hasn't been adapted/tested to read kaiju summary files yet. I'd like that to be a separate PR (this one is already way beyond its initial scope).
pipes/WDL/workflows/demux_plus.wdl
Outdated
}
}

scatter(reads_bam in dedup.dedup_bam) {
See my comment in demux_metag about combining the scatter blocks.
taxon_filter.py
Outdated
-    sanitize = not args.do_not_sanitize) as bamToDeplete:
+    sanitize = not args.do_not_sanitize) as bam_to_dedup:

     read_utils.rmdup_mvicuna_bam(bam_to_dedup, args.rmdupBam, JVMmemory=args.JVMmemory)
In this new world, should we consider:
- dropping deduplication entirely from taxon_filter.deplete -- since you now include it in all the pipelines prior to depletion anyway, and since it never really fit in the scope of the name of the command. It was historically embedded in a funny place between depletion steps primarily because of its performance profile: it was slower than bmtagger (so we ran it after that) but faster than blastn (so we ran it before that)
- dropping mvicuna altogether if we think bbmap is better
…on raw rather than de-duped reads
fix bug in conda command quiet calling ('-q -y' must be after 'conda <command>')
for bbmap clumpify de-dup, merge like-library RGs and perform deduplication on each, then gather the IDs of kept reads, and filter the input sam based on the list of IDs to keep so as to maintain header and RG information. Move most of the processing to bbmap.py::dedup_clumpify so it has a simpler interface that accepts one bam and emits one bam. TODO: parallelize across LBs
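As a rough sketch of the ID-list filtering idea described in this commit (illustrative only; it assumes pysam, and filter_bam_by_read_ids is a hypothetical name, not the actual code in bbmap.py or read_utils.py):

import pysam

def filter_bam_by_read_ids(in_bam, read_id_list, out_bam):
    # Keep only reads whose IDs clumpify retained; template=inf copies the
    # original header (including @RG lines) so read-group info is preserved.
    with open(read_id_list) as f:
        keep = set(line.strip() for line in f if line.strip())
    with pysam.AlignmentFile(in_bam, "rb", check_sq=False) as inf, \
         pysam.AlignmentFile(out_bam, "wb", template=inf) as outf:
        for read in inf:
            if read.query_name in keep:
                outf.write(read)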
pipes/WDL/workflows/demux_plus.wdl
Outdated
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams
# classify de-duplicated reads to taxa via krakenuniq
call metagenomics.krakenuniq as krakenuniq {
Oh, I don't think I'd bump this one inside the scatter, unless maybe you want to take the opportunity to collect some data from this branch on cost-efficiency changes (I've always been curious). The kraken and krakenuniq dbs have always burned about 10-15 minutes per job on db localization and unpacking, which, multiplied by the number of samples on highmem machines, adds up (and motivated the batched approach). In principle I've always wanted to improve our staging time and move it within the scatter, but I assumed we'd never get there until we move to kraken2 (much smaller databases). That said, what's the dollars-per-demux_plus on our test/CI flowcell comparing scattered kraken vs batched kraken? If it's only a couple bucks extra, I might be fine with moving it now.
I was curious about the cost delta, but CI builds don't show us costs since they're billed to the DNAnexus Science Team. I'll just move it back outside the scatter...
change to clumpify for pre-depletion dedup; deduplication can likely be removed from depletion entirely in the future once all calls in the codebase have been updated to take one fewer arg
remove rmdup from depletion call, remove *.rmdup.bam from positional arguments for depletion CLI parser, remove *.rmdup.bam from inputs where depletion is called (test cases, WDL), remove *.rmdup.bam from expected depletion outputs. Change the Snakemake merge_one_per_sample rule to call rmdup_clumpify_bam rather than rmdup_mvicuna_bam
tools/bbmap.py
Outdated
for line in inf:
    if (line_num % 4) == 0:
        idVal = line.rstrip('\n')[1:]
        if idVal.endswith('/1'):
Does this handle novaseq-like read IDs? https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html
Do you have an example of what you mean? NovaSeq-like read IDs look the same as HiSeq ones in the few bam files I compared. Since we're doing the conversion to fastq files it should hopefully be fine? This is also the same code we've been using for mvicuna dedup, so the behavior should be the same (though that's not to say we haven't missed a code path that needs updating for NovaSeq support).
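For reference, a couple of illustrative read-ID formats (hypothetical values, for illustration only). Per the Illumina bulletin linked above, Casava 1.8+ instruments (HiSeq and NovaSeq alike) report the mate number in the comment field rather than appending /1 or /2 to the ID itself:

# hypothetical example IDs, for illustration only
old_style = "@HWUSI-EAS100R:6:73:941:1973#0/1"                    # pre-Casava-1.8: mate number in the ID
new_style = "@A00123:45:HVKJJDSXX:1:1101:1000:1016 1:N:0:ATCACG"  # NovaSeq-like: mate number in the comment field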
tools/bbmap.py
Outdated
for fq in (outFastq1, outFastq2):
    with util.file.open_or_gzopen(fq, 'rt') as inf:
        line_num = 0
        for line in inf:
for line_num, line in enumerate(inf)
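For example, a minimal sketch of the suggested enumerate() form applied to the loop above (illustrative; iter_fastq_ids is a hypothetical name, not a function in the repo):

def iter_fastq_ids(fastq_path):
    # yield the read ID from each fastq header line (every 4th line),
    # using enumerate() in place of a manually incremented counter
    with open(fastq_path, 'rt') as inf:
        for line_num, line in enumerate(inf):
            if line_num % 4 == 0:
                yield line.rstrip('\n')[1:]  # drop the leading '@'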
per_lb_read_lists.append(per_lb_read_list)
# merge per-library read lists together
util.file.concat(per_lb_read_lists, readListAll)
Can't the read lists be written to a single file to start with instead of concat'ing later?
The read lists are separate in prep for parallelizing across libraries.
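For what it's worth, a rough sketch of the parallelize-then-merge shape this sets up (illustrative only; dedup_one_library is a hypothetical stand-in for the per-library clumpify step, not a function in bbmap.py):

import concurrent.futures
import shutil

def dedup_one_library(lb_input_bam, read_list_path):
    # placeholder: run clumpify on one library's reads and write the IDs
    # of the kept reads to read_list_path
    return read_list_path

def dedup_all_libraries(per_lb_inputs, read_list_all):
    # run each library's dedup in its own worker, then merge the per-library
    # read lists into one file (the role util.file.concat plays above)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        per_lb_read_lists = list(pool.map(lambda args: dedup_one_library(*args), per_lb_inputs))
    with open(read_list_all, 'w') as outf:
        for read_list in per_lb_read_lists:
            with open(read_list) as inf:
                shutil.copyfileobj(inf, outf)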
single-end reads do not have /1 /2 mate suffix, so pass through IDs missing the suffix
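A minimal sketch of the suffix handling this commit describes (illustrative, not the actual read_utils/bbmap code): strip a trailing /1 or /2 when present, and pass single-end IDs through unchanged.

def strip_mate_suffix(read_id):
    # paired-end fastq IDs may end in /1 or /2; single-end IDs have no suffix
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id

# hypothetical IDs: the paired read loses its suffix, the single-end ID passes through
assert strip_mate_suffix("M01234:1:000000000-ABCDE:1:1101:15589:1338/1") == "M01234:1:000000000-ABCDE:1:1101:15589:1338"
assert strip_mate_suffix("M01234:1:000000000-ABCDE:1:1101:15589:1338") == "M01234:1:000000000-ABCDE:1:1101:15589:1338"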
Here's another one; time to port it over to viral-core?