add clumpify-based dedup #970
base: master
Conversation
add bbmap.BBMapTool().dedup_clumpify(), along with unit tests
pass JVMmemory to bbmap and clumpify; add rmdup_clumpify_bam to read_utils.py; change TestRmdupUnaligned unit tests for bbmap to use read_utils.py::rmdup_clumpify_bam; add dedup_bam WDL task to tasks_read_utils.wdl
replace mvicuna-based read deduplication in taxon_filter.py::deplete() with clumpify-based deduplication that occurs farther upstream, in advance of BWA-based depletion; add dedup_bam WDL workflow; in the dedup_bam WDL task, create and emit a FastQC report of only the de-duplicated reads; update unit test input to include duplicate reads, and update expected output for the test_taxon_filter::TestDepleteHuman integration tests to reflect the difference between clumpify output and the previous mvicuna output
DNAnexus seems to have replaced their wiki with a new documentation page ( https://documentation.dnanexus.com/downloads ) and the old download URLs along with it
Let's hold off on merging this for now. While the tests pass here, the slow tests from DNAnexus (for which we do not have feedback on GitHub) are currently failing during dedup, with a NullPointerException in Picard. For example: https://platform.dnanexus.com/projects/F8PQ6380xf5bK0Qk0YPjB17P/monitor/job/FZZbVxj0xf5jZ07ZJ71gpFJ6
Allow containments (where one sequence is shorter) when using bbmap clumpify to deduplicate
We should either remove …
Some specific comments in a few places above, mostly on the WDL, and broader question as a line-comment on the taxon_filter.deplete step.
One thing I can't quite tell from the code diffs: how well do clumpify.sh and our wrapper code handle the preservation of bam headers? This is something we go to extreme lengths to preserve in all our other fastq-based tool invocations (to the extent of splitting out into separate files by RG, running things like novoalign separately on each set, and re-merging together at the end, to maintain proper RG-to-read mappings).
pipes/WDL/workflows/demux_metag.wdl
Outdated
}
}

scatter(reads_bam in dedup.dedup_bam) {
I wouldn't double-scatter. You can just keep this as a single scatter block on the raw_reads and put all the task calls together in that single scatter. WDL interpreters/compilers are smart enough to figure out the DAG and parallelization opportunities within the scatter based on the dependencies between their inputs and outputs.
So specifically:
scatter(raw_reads in illumina_demux.raw_reads_unaligned_bams) {
  call reads.dedup_bam as dedup {
    input:
      in_bam = raw_reads
  }
  call reports.spikein_report as spikein {
    input:
      reads_bam = dedup.dedup_bam
  }
  call taxon_filter.deplete_taxa as deplete {
    input:
      raw_reads_unmapped_bam = dedup.dedup_bam
  }
  call assembly.assemble as spades {
    input:
      assembler = "spades",
      reads_unmapped_bam = deplete.cleaned_bam
  }
}
Two more thoughts on this topic:
- Importantly, merging them would allow the execution platform to keep going on the linear DAG portions of each sample as they become ready without waiting for all samples to complete dedup before proceeding to the next steps.
- I wonder if we should consider running the spikein counting step on raw / non-deduplicated reads... ERCCs are so short that we might quickly hit an artificial upper bound on the counts if we do it on dedup output.
@@ -5,18 +5,26 @@ import "tasks_metagenomics.wdl" as metagenomics
import "tasks_taxon_filter.wdl" as taxon_filter
If you think this is ready and want to try it out, shouldn't you remove #DX_SKIP_WORKFLOW?
pipes/WDL/workflows/demux_metag.wdl
Outdated
@@ -39,6 +47,6 @@ workflow demux_metag {
}
call metagenomics.kaiju as kaiju {
input:
We haven't really used kaiju regularly via WDL yet, but I'm betting that we may want to consider moving it to a scatter-on-single-sample execution mode (like everything else in our WDLs except kraken). Its database is about 4x smaller (I'm guessing the localization time is just a few minutes) and the execution time of the algorithm is much slower, so the cost efficiency (algorithmic compute time vs VM wall clock time) of kaiju on a single sample is much better than kraken... so we might as well move it within the same scatter block as well.
pipes/WDL/workflows/demux_metag.wdl
Outdated
@@ -27,7 +35,7 @@ workflow demux_metag {

  call metagenomics.krakenuniq as kraken {
    input:
-     reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams,
+     reads_unmapped_bam = dedup.dedup_bam
  }
  call reports.aggregate_metagenomics_reports as metag_summary_report {
Can we call aggregate_metagenomics_reports a second time on the kaiju outputs as well?
metagenomics.py::taxlevel_summary() hasn't been adapted/tested to read kaiju summary files yet. I'd like that to be a separate PR (this one is already way beyond its initial scope).
pipes/WDL/workflows/demux_plus.wdl
Outdated
}
}

scatter(reads_bam in dedup.dedup_bam) {
See my comment in demux_metag about combining the scatter blocks.
taxon_filter.py
Outdated
-    sanitize = not args.do_not_sanitize) as bamToDeplete:
+    sanitize = not args.do_not_sanitize) as bam_to_dedup:

     read_utils.rmdup_mvicuna_bam(bam_to_dedup, args.rmdupBam, JVMmemory=args.JVMmemory)
In this new world, should we consider:
- dropping deduplication entirely from taxon_filter.deplete -- since you now include it in all the pipelines prior to depletion anyway, and since it never really fit in the scope of the name of the command. It was historically embedded in a funny place between depletion steps primarily because of its performance profile: it was slower than bmtagger (so we ran it after that) but faster than blastn (so we ran it before that)
- dropping mvicuna altogether if we think bbmap is better
…on raw rather than de-duped reads
fix bug in conda command quiet calling ('-q -y' must be after 'conda <command>')
for bbmap clumpify de-dup, merge like-library RGs and perform deduplication on each, then gather the IDs of kept reads, and filter the input sam based on the list of IDs to keep so as to maintain header and RG information. Move most of the processing to bbmap.py::dedup_clumpify so it has a simpler interface that accepts one bam and emits one bam. TODO: parallelize across LBs
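As a rough sketch of the ID-list filtering idea described in this commit (illustrative only; it assumes pysam, and filter_bam_by_read_ids is a hypothetical name, not the actual code in bbmap.py or read_utils.py):

import pysam

def filter_bam_by_read_ids(in_bam, read_id_list, out_bam):
    # Keep only reads whose IDs clumpify retained; template=inf copies the
    # original header (including @RG lines) so read-group info is preserved.
    with open(read_id_list) as f:
        keep = set(line.strip() for line in f if line.strip())
    with pysam.AlignmentFile(in_bam, "rb", check_sq=False) as inf, \
         pysam.AlignmentFile(out_bam, "wb", template=inf) as outf:
        for read in inf:
            if read.query_name in keep:
                outf.write(read)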
pipes/WDL/workflows/demux_plus.wdl
Outdated
input:
reads_unmapped_bam = illumina_demux.raw_reads_unaligned_bams
# classify de-duplicated reads to taxa via krakenuniq
call metagenomics.krakenuniq as krakenuniq {
Oh, I don't think I'd bump this one inside the scatter, unless maybe you want to take the opportunity to collect some data from this branch on cost-efficiency changes (I've always been curious). The kraken and krakenuniq dbs have always burned about 10-15 minutes per job on db localization and unpacking, which, multiplied by the number of samples on highmem machines, adds up (and motivated the batched approach). In principle I've always wanted to improve our staging time and move it within the scatter, but I assumed we'd never get there until we move to kraken2 (much smaller databases). That said, what's the dollars-per-demux_plus on our test/CI flowcell comparing scattered kraken vs batched kraken? If it's only a couple bucks extra, I might be fine with moving it now.
I was curious about the cost delta, but CI builds don't show us costs since they're billed to the DNAnexus Science Team. I'll just move it back outside the scatter...
change to clumpify for pre-depletion dedup; deduplication can likely be removed from depletion entirely in the future once all calls in the codebase have been updated to take one fewer arg
remove rmdup from depletion call, remove *.rmdup.bam from positional arguments for depletion CLI parser, remove *.rmdup.bam from inputs where depletion is called (test cases, WDL), remove *.rmdup.bam from expected depletion outputs. Change the Snakemake merge_one_per_sample rule to call rmdup_clumpify_bam rather than rmdup_mvicuna_bam
tools/bbmap.py
Outdated
for line in inf:
    if (line_num % 4) == 0:
        idVal = line.rstrip('\n')[1:]
        if idVal.endswith('/1'):
Does this handle novaseq-like read IDs? https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html
Do you have an example of what you mean? NovaSeq-like read IDs look the same as HiSeq ones in the few bam files I compared. Since we're doing the conversion to fastq files it should hopefully be fine? This is also the same code we've been using for mvicuna dedup, so the behavior should be the same (though that's not to say we haven't missed a code path that needs updating for NovaSeq support).
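For reference, a couple of illustrative read-ID formats (hypothetical values, for illustration only). Per the Illumina bulletin linked above, Casava 1.8+ instruments (HiSeq and NovaSeq alike) report the mate number in the comment field rather than appending /1 or /2 to the ID itself:

# hypothetical example IDs, for illustration only
old_style = "@HWUSI-EAS100R:6:73:941:1973#0/1"                    # pre-Casava-1.8: mate number in the ID
new_style = "@A00123:45:HVKJJDSXX:1:1101:1000:1016 1:N:0:ATCACG"  # NovaSeq-like: mate number in the comment field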
tools/bbmap.py
Outdated
for fq in (outFastq1, outFastq2):
    with util.file.open_or_gzopen(fq, 'rt') as inf:
        line_num = 0
        for line in inf:
for line_num, line in enumerate(inf)
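For example, a minimal sketch of the suggested enumerate() form applied to the loop above (illustrative; iter_fastq_ids is a hypothetical name, not a function in the repo):

def iter_fastq_ids(fastq_path):
    # yield the read ID from each fastq header line (every 4th line),
    # using enumerate() in place of a manually incremented counter
    with open(fastq_path, 'rt') as inf:
        for line_num, line in enumerate(inf):
            if line_num % 4 == 0:
                yield line.rstrip('\n')[1:]  # drop the leading '@'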
per_lb_read_lists.append(per_lb_read_list)
# merge per-library read lists together
util.file.concat(per_lb_read_lists, readListAll)
Can't the read lists be written to a single file to start with instead of concat'ing later?
The read lists are separate in prep for parallelizing across libraries.
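For what it's worth, a rough sketch of the parallelize-then-merge shape this sets up (illustrative only; dedup_one_library is a hypothetical stand-in for the per-library clumpify step, not a function in bbmap.py):

import concurrent.futures
import shutil

def dedup_one_library(lb_input_bam, read_list_path):
    # placeholder: run clumpify on one library's reads and write the IDs
    # of the kept reads to read_list_path
    return read_list_path

def dedup_all_libraries(per_lb_inputs, read_list_all):
    # run each library's dedup in its own worker, then merge the per-library
    # read lists into one file (the role util.file.concat plays above)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        per_lb_read_lists = list(pool.map(lambda args: dedup_one_library(*args), per_lb_inputs))
    with open(read_list_all, 'w') as outf:
        for read_list in per_lb_read_lists:
            with open(read_list) as inf:
                shutil.copyfileobj(inf, outf)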
single-end reads do not have /1 /2 mate suffix, so pass through IDs missing the suffix
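A minimal sketch of the suffix handling this commit describes (illustrative, not the actual read_utils/bbmap code): strip a trailing /1 or /2 when present, and pass single-end IDs through unchanged.

def strip_mate_suffix(read_id):
    # paired-end fastq IDs may end in /1 or /2; single-end IDs have no suffix
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id

# hypothetical IDs: the paired read loses its suffix, the single-end ID passes through
assert strip_mate_suffix("M01234:1:000000000-ABCDE:1:1101:15589:1338/1") == "M01234:1:000000000-ABCDE:1:1101:15589:1338"
assert strip_mate_suffix("M01234:1:000000000-ABCDE:1:1101:15589:1338") == "M01234:1:000000000-ABCDE:1:1101:15589:1338"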
Here's another one; time to port it over to viral-core?