Add group-wise co-assembly and different mapping strategies #146
…emble_group' param
Added some comments. Also, the manifest format needs to be explained, as does the input requirement (the file must end in ".tsv").
main.nf
Outdated
tag "$name"

input:
set val(name), val(grp), file(reads1), file(reads2) from ch_grouped_short_reads
file(reads1), file(reads2) doesn't work with single-end data, does it? file(reads2) wouldn't exist. Doesn't it error then? We don't actually have a single-end test case, do we?
reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990
But I have to admit, I didn't really test this (also because I would not be surprised if it already caused problems in previous versions). I will try to test it.
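To make the single-end case concrete, here is a minimal Python sketch of the tuple shape described above. It is not the pipeline's Nextflow code: the helper name `make_read_tuple` and the file names are hypothetical, and the only point taken from the thread is that `reads2` is an empty list for single-end input.

```python
from typing import List, Tuple

ReadTuple = Tuple[str, int, List[str], List[str]]


def make_read_tuple(name: str, group: int, reads: List[str]) -> ReadTuple:
    """Build (name, group, reads1, reads2); reads2 is [] for single-end input."""
    if len(reads) not in (1, 2):
        raise ValueError(f"expected 1 or 2 read files for {name}, got {len(reads)}")
    reads1 = [reads[0]]
    # Single-end input yields an empty list rather than a missing value,
    # so downstream code always receives a four-element tuple.
    reads2 = [reads[1]] if len(reads) == 2 else []
    return (name, group, reads1, reads2)


# Hypothetical file names, for illustration only.
paired = make_read_tuple("sample1", 0, ["s1_R1.fastq.gz", "s1_R2.fastq.gz"])
single = make_read_tuple("sample2", 0, ["s2_R1.fastq.gz"])
```

Whether a downstream process accepts the empty list is a separate question, which is what a single-end test case would have to confirm.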
> reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990
I see, great.
> But I have to admit, I didn't really test this (also because I would not be surprised, if it would cause problems already for previous versions). I will try to test.
Could a new test config be added, such as test_single.config, using already available data but only the forward reads?
The single-end-only tests fail for MetaBAT, and this is also the case for previous releases of the pipeline. Maybe we can add this in a future PR; I also wanted to add tests for the new functionality anyway. However, the tests already take quite a while, so maybe we can optimize something there...
My bad, I probably forgot to set min_length_unbinned_contigs = 1 and max_unbinned_contigs = 2 ...
> the tests take quite a while already and maybe we can optimize something there...
The nf-core template suggests parallelizing the CI tests via strategy.matrix. I haven't researched or tested that yet.
@@ -385,6 +385,10 @@
"description": "",
"default": "",
"properties": {
"coassemble_group": {
"type": "boolean",
"description": "Co-assemble samples within one group, instead of assembling each sample separately."
Probably add: "with --input manifest.tsv only" or such.
Ah, no: if not using input.tsv but "reads_R{1,2}.fastq.gz", all reads will end up in the same group (0), and thus can also be co-assembled.
I will add this information to the docs; it might be a bit too much for the help text.
So by default, the group information is used neither for co-assembly nor for computing co-abundances ...
That also seems to be really important info.
Would it be better to make "group" the default for co-abundances? If the input is given without manifest.tsv, all samples have "group" 0 and it would still equal "all", correct?
For co-assembly, I am not sure what the best default is. If we choose "params.coassemble_group=true" here, then without manifest.tsv one assembly would be generated, correct? I am a little afraid that people (incl. me) will produce a neat manifest but forget to specify --coassemble_group ...
edit: probably make params.coassemble_group=true the default when specifying --input manifest.tsv?
I agree regarding setting the default of 'binning_map_mode' to group.
When setting the default of coassemble_group to true, I could try to add different groups for each sample when not using a manifest.tsv. Setting different defaults might be confusing.
Hm... having each sample in a different group is also not nice, since the assemblies will then be named groupX.
Why not name the group after the sample right when parsing the input, instead of using "0"?
e.g. https://github.com/skrakau/mag/blob/f773cb74863cc5e07bc1e681b5890d5e2f91a094/main.nf#L183
instead of
.map { row -> [ row[0], 0, [ file(row[1][0], checkIfExists: true) ] ] }
use
.map { row -> [ row[0], row[0], [ file(row[1][0], checkIfExists: true) ] ] }
might that work?
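The difference between the two `.map` variants can be illustrated with a small Python sketch. It only mirrors the `[name, group, files]` shape of the rows quoted above; the function `parse_rows`, its flag, and the sample data are hypothetical, and the file-existence check from the Nextflow code is left out.

```python
from typing import List, Sequence, Tuple, Union

Row = Tuple[str, Sequence[str]]


def parse_rows(rows: List[Row], group_by_sample_name: bool) -> List[list]:
    """Turn (sample_name, read_files) rows into [name, group, files] entries."""
    parsed = []
    for name, files in rows:
        # Variant 1: constant group 0 for every sample.
        # Variant 2: the sample name doubles as the group.
        group: Union[str, int] = name if group_by_sample_name else 0
        parsed.append([name, group, list(files)])
    return parsed


# Hypothetical input without a manifest.
rows = [("sampleA", ["sampleA_R1.fastq.gz"]), ("sampleB", ["sampleB_R1.fastq.gz"])]

shared = parse_rows(rows, group_by_sample_name=False)
per_sample = parse_rows(rows, group_by_sample_name=True)

shared_groups = {entry[1] for entry in shared}          # one group for all samples
per_sample_groups = {entry[1] for entry in per_sample}  # one group per sample
```

With the sample name as group, every sample forms its own group only as long as the sample names are unique.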
True; all samples require unique names in this case anyway.
see below...
nextflow_schema.json
Outdated
"type": "string",
"default": "all",
"description": "Defines mapping strategy to compute co-abundances for binning, i.e. which samples will be mapped against assembly.",
"help_text": "Available: 'all', 'group' or 'own'. Note that 'own' cannot be specified in combination with --coassemble_group."
Probably add more explanation, e.g. that bowtie processes increase exponentially with samples or such.
done
351e3a4 to f3372ca
7d90d0d to aaf4109
Ok, concerning the default behaviour and the different input types again: from my point of view it would be best (and most intuitive) to keep the following setting:
For FASTQ input, if one would assign each sample to a different group, then one would need to change the default of Regarding the default of I added some brief documentation.
Looks good to me! Great work!
Thanks @d4straub!
I added the option to perform group-wise co-assembly. For MEGAHIT, samples are grouped and handed over as lists, while for SPAdes the reads are actually merged beforehand.
Addresses #21.
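As a rough illustration of the grouping step described above, the following Python sketch collects reads per group and builds MEGAHIT-style read arguments as comma-separated lists. It is not the code from this PR: the function names and sample data are hypothetical, and it relies only on MEGAHIT accepting comma-separated file lists for `-1`/`-2` (paired-end) and `-r` (single-end).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, str, List[str], List[str]]  # (name, group, reads1, reads2)


def group_reads(samples: Iterable[Sample]) -> Dict[str, Tuple[List[str], List[str]]]:
    """Collect all forward and reverse read files per group."""
    grouped = defaultdict(lambda: ([], []))
    for _name, group, reads1, reads2 in samples:
        grouped[group][0].extend(reads1)
        grouped[group][1].extend(reads2)
    return dict(grouped)


def megahit_read_args(reads1: List[str], reads2: List[str]) -> List[str]:
    """Hand grouped reads to MEGAHIT as comma-separated lists."""
    if reads2:
        return ["-1", ",".join(reads1), "-2", ",".join(reads2)]
    return ["-r", ",".join(reads1)]  # single-end: reads2 is an empty list


# Hypothetical samples in two groups.
samples = [
    ("s1", "A", ["s1_R1.fq.gz"], ["s1_R2.fq.gz"]),
    ("s2", "A", ["s2_R1.fq.gz"], ["s2_R2.fq.gz"]),
    ("s3", "B", ["s3_R1.fq.gz"], ["s3_R2.fq.gz"]),
]
grouped = group_reads(samples)
args_a = megahit_read_args(*grouped["A"])
```

For SPAdes, the same per-group lists would instead be concatenated into one file per read direction before assembly, as the description says.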
Also added different mapping strategies to compute the co-abundances used for binning: mapping the assemblies against 'all', 'group' or 'own' reads. The latter can only be specified if no co-assembly was performed.
Addresses #91.
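The three modes can be sketched in Python as a function that lists which sample is mapped against which assembly. This is an assumption-laden illustration, not the pipeline's implementation: `mapping_pairs` and the example data are hypothetical, and for 'own' it assumes per-sample assemblies whose id equals the sample name.

```python
from typing import List, Tuple


def mapping_pairs(
    assemblies: List[Tuple[str, str]],  # (assembly_id, group)
    samples: List[Tuple[str, str]],     # (sample_name, group)
    mode: str,
    coassemble_group: bool = False,
) -> List[Tuple[str, str]]:
    """Return the (assembly_id, sample_name) pairs to map for a given mode."""
    if mode == "own" and coassemble_group:
        raise ValueError("'own' cannot be combined with group-wise co-assembly")
    if mode == "all":
        return [(a, s) for a, _ in assemblies for s, _ in samples]
    if mode == "group":
        return [(a, s) for a, ga in assemblies for s, gs in samples if ga == gs]
    if mode == "own":
        # Assumption: one assembly per sample, named after that sample.
        return [(a, s) for a, _ in assemblies for s, _ in samples if a == s]
    raise ValueError(f"unknown mapping mode: {mode}")


# Four hypothetical samples in two groups, each assembled separately.
samples = [("s1", "A"), ("s2", "A"), ("s3", "B"), ("s4", "B")]
assemblies = list(samples)

n_all = len(mapping_pairs(assemblies, samples, "all"))      # 4 x 4 = 16
n_group = len(mapping_pairs(assemblies, samples, "group"))  # 2 x 2 + 2 x 2 = 8
n_own = len(mapping_pairs(assemblies, samples, "own"))      # 4
```

In this sketch, 'all' with per-sample assemblies produces one mapping job per assembly and sample, so the job count grows with the square of the sample number, which is the kind of scaling the review comment about Bowtie processes refers to.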
I will add a more detailed description to the docs afterwards, in a separate PR.
Feedback highly appreciated ;)
PR checklist
CHANGELOG.md is updated
docs is updated