Add group-wise co-assembly and different mapping strategies #146

skrakau · 2020-12-16T16:35:19Z

I added the option to perform group-wise co-assembly. For MEGAHIT, samples are grouped and handed over as lists, while for SPAdes the reads are actually merged before.
Adresses #21.

Also added different mapping strategies to compute the co-abundances used for binning: mapping the assemblies against all, group or own reads. The latter can only be specified, if no co-assembly was performed.
Addresses #91.

I will add a more detailed description to the docs afterwards, in a separate PR.

Feedback highly appreciated ;)

PR checklist

This comment contains a description of changes (with reason)
CHANGELOG.md is updated
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated
If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo

…emble_group' param

d4straub

Added some comments, also, manifest format needs to be explained and input (requires ".tsv").

main.nf

d4straub · 2020-12-18T12:44:16Z

main.nf

+        tag "$name"
+
+        input:
+        set val(name), val(grp), file(reads1), file(reads2) from ch_grouped_short_reads


file(reads1), file(reads2) doesn't work with single-end, does it? file(reads2) wouldn't exists? Doesn't it error than? We actually do not have a single end test case, do we?

reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990

But I have to admit, I didn't really test this (also because I would not be surprised, if it would cause problems already for previous versions). I will try to test.

reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990

I see, great.

But I have to admit, I didn't really test this (also because I would not be surprised, if it would cause problems already for previous versions). I will try to test.

A new test config could be added, such as test_single.config, with already available data but only forward reads?

the tests using single end only fail for MetaBAT, also for previous releases of the pipeline. Maybe we can add this in future PRs, anyway, I also wanted to add tests using the new functionalities. However, the tests take quite a while already and maybe we can optimize something there...

my bad, I probably forgot to set min_length_unbinned_contigs = 1 and max_unbinned_contigs = 2...

the tests take quite a while already and maybe we can optimize something there...

the nf-core template suggests parallelization of CI tests by strategy.matrix. Haven't researched or tested that yet.

main.nf

d4straub · 2020-12-18T12:58:59Z

nextflow_schema.json

@@ -385,6 +385,10 @@
            "description": "",
            "default": "",
            "properties": {
+                "coassemble_group": {
+                    "type": "boolean",
+                    "description": "Co-assemble samples within one group, instead of assembling each sample separately."


probably add: "with --input manifest.tsv only" or such

ah, no, if not using input.tsv but "reads_R{1,2}.fastq.gz", all reads will end up in the same group (0), and thus can also be co-assembed.

I will add this information to the docs, it might be a bit too much for help

so by default, the group information is not used, neither for co-assembly nor for computing co-abundances ...

That seems to be also a really important info.

Would it be better to make default for co-abundances to group? In case the input is without manifest.tsv, all samples have "group" 0 and it still would equal "all", correct?

For co-assembly, I am not sure whats the best default. If we choose here "params.coassemble_group=true", than without manifest.tsv, one assembly would be generated, correct? I am a little afraid that people (incl. me) will produce a neat manifest but forget to specify --coassemble_group...
edit: probably make params.coassemble_group=true default when specifying --input manifest.tsv?

I agree regarding setting the default of 'binning_map_mode' to group

When setting the default of coassemble_group to true, I could try to add different groups for each sample when not using a manifest.tsv. Setting different defaults might be confusing.

hm... having each sample in a different group is also not nice, since the assemblies will be named groupX then.

Why not call group like the name of the sample right when parsing input instead of "0"?

e.g. https://github.com/skrakau/mag/blob/f773cb74863cc5e07bc1e681b5890d5e2f91a094/main.nf#L183
instead of
.map { row -> [ row[0], 0, [ file(row[1][0], checkIfExists: true) ] ] }
use
.map { row -> [ row[0], row[0], [ file(row[1][0], checkIfExists: true) ] ] }
might that work?

true, anyway all samples require unique names in this case

see below...

d4straub · 2020-12-18T13:00:59Z

nextflow_schema.json

+                    "type": "string",
+                    "default": "all",
+                    "description": "Defines mapping strategy to compute co-abundances for binning, i.e. which samples will be mapped against assembly.",
+                    "help_text": "Available: 'all', 'group' or 'own'. Note that 'own' cannot be specififed in combination with --coassemble_group."


Probably add more explanation, e.g. that bowtie processes increase exponentially with samples or such.

skrakau · 2020-12-22T16:28:08Z

Ok, concerning the default behaviour and different input types again: from my view it would be best (and intuitively) to keep the following setting:

default: coassemble_group = false
default: binning_map_mode = group
input FASTQ files: assign to one group

For FASTQ input, if one would assign each sample to a different group, then one would need to change the default of binning_map_mode to all for this input type, otherwise this would cause a major difference to previous mag versions. And changing defaults for different input types would be confusing.
Moreover, assigning all to different groups with (co-assembly enabled by default) would still be counterintuitive (pooling processes are run, files are renamed "pooled", and also the new name is "group[group_id]", since I assumed group IDs might be rather short). Assigning all samples to one group would also have the advantage, that the user could still decide to turn on co-assembly for this input.

Regarding the default of coassemble_group, wouldn't it be reasonable to disable this by default, since there is an ongoing discussion, if or when samples should be pooled for this (so rather force people to consciously activate this than the other way round)? The group information can anyway be used for the two different layers, and it might not be necessary to use it by default for both.

I added some brief documentation.

d4straub

Looks good to me! Great work!

skrakau · 2021-01-07T13:51:15Z

Thanks @d4straub !

skrakau force-pushed the add_pooling branch from 915de41 to 01c7d2a Compare December 16, 2020 16:55

skrakau added 15 commits December 17, 2020 16:28

Init deprecated manifest param to avoid warning

2003f14

Add pooling for coassembly

c0408ab

Add group to filenames

9be1524

Update help and summary

d205eb4

Update schema

4965586

Add different binning mapping strategies

611ace0

Check if param 'binning_map_mode' is valid in combination with 'coass…

ef1a4c2

…emble_group' param

Check if 'binning_map_mode' param valid

4bad754

Update schema

75f90f0

Update CHANGELOG

8ceb62d

Renamed mapping modes

dee5f9d

Fix channels for direct fastq input

5dc282c

Replce Channel.from() with Channel.empty()

0580745

Changed comments

500a7a0

Remove view()s

2c41ac0

skrakau force-pushed the add_pooling branch from 01c7d2a to 2c41ac0 Compare December 17, 2020 18:42

skrakau marked this pull request as ready for review December 18, 2020 12:29

skrakau requested a review from d4straub December 18, 2020 12:30

d4straub reviewed Dec 18, 2020

View reviewed changes

skrakau added 8 commits December 18, 2020 14:35

Rephrase parameter checking

2a5290a

Renamed 'grp' variable to 'group'

445d80c

Removed unnecessary echo command

a4b5e5d

Fix for channel for single_end case

8c8f29c

Changed default of binning_map_mode to group

7b918ec

Update input description in help

f16b64c

Update input description in schema

9a5ede3

Update binning_map_mode description in schema

88b3acb

skrakau force-pushed the add_pooling branch 2 times, most recently from 351e3a4 to f3372ca Compare December 22, 2020 11:04

skrakau force-pushed the add_pooling branch 2 times, most recently from 7d90d0d to aaf4109 Compare December 22, 2020 15:22

skrakau added 2 commits December 22, 2020 16:33

Add input description to usage.md

cd2ad61

Add links to input description

d59fcc9

skrakau force-pushed the add_pooling branch from c9b1123 to edbf2ee Compare December 22, 2020 15:33

skrakau added 3 commits December 22, 2020 16:44

Added group info to output.md

05f755c

Describe group-wise co-assemby and co-abundance computation in README

2ae8221

Pool reads only if assembly will be run

8703ba6

skrakau force-pushed the add_pooling branch from edbf2ee to 8703ba6 Compare December 22, 2020 15:45

d4straub approved these changes Jan 7, 2021

View reviewed changes

skrakau merged commit 741df09 into nf-core:dev Jan 7, 2021

skrakau deleted the add_pooling branch May 31, 2021 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add group-wise co-assembly and different mapping strategies #146

Add group-wise co-assembly and different mapping strategies #146

skrakau commented Dec 16, 2020 •

edited

Loading

d4straub left a comment

d4straub Dec 18, 2020

skrakau Dec 18, 2020

d4straub Dec 18, 2020

skrakau Dec 18, 2020

skrakau Dec 18, 2020

d4straub Dec 18, 2020

d4straub Dec 18, 2020

skrakau Dec 18, 2020

skrakau Dec 18, 2020

skrakau Dec 18, 2020

d4straub Dec 18, 2020 •

edited

Loading

skrakau Dec 18, 2020

skrakau Dec 18, 2020

d4straub Dec 18, 2020

skrakau Dec 21, 2020

skrakau Dec 22, 2020

d4straub Dec 18, 2020

skrakau Dec 22, 2020

skrakau commented Dec 22, 2020

d4straub left a comment

skrakau commented Jan 7, 2021

Add group-wise co-assembly and different mapping strategies #146

Add group-wise co-assembly and different mapping strategies #146

Conversation

skrakau commented Dec 16, 2020 • edited Loading

PR checklist

d4straub left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

d4straub Dec 18, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

skrakau commented Dec 22, 2020

d4straub left a comment

Choose a reason for hiding this comment

skrakau commented Jan 7, 2021

skrakau commented Dec 16, 2020 •

edited

Loading

d4straub Dec 18, 2020 •

edited

Loading