Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add group-wise co-assembly and different mapping strategies #146

Merged
merged 28 commits into from
Jan 7, 2021

Conversation

skrakau
Copy link
Member

@skrakau skrakau commented Dec 16, 2020

I added the option to perform group-wise co-assembly. For MEGAHIT, samples are grouped and handed over as lists, while for SPAdes the reads are actually merged before.
Adresses #21.

Also added different mapping strategies to compute the co-abundances used for binning: mapping the assemblies against all, group or own reads. The latter can only be specified, if no co-assembly was performed.
Addresses #91.

I will add a more detailed description to the docs afterwards, in a separate PR.

Feedback highly appreciated ;)

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo

@skrakau skrakau marked this pull request as ready for review December 18, 2020 12:29
@skrakau skrakau requested a review from d4straub December 18, 2020 12:30
Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, also, manifest format needs to be explained and input (requires ".tsv").

main.nf Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
main.nf Outdated
tag "$name"

input:
set val(name), val(grp), file(reads1), file(reads2) from ch_grouped_short_reads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file(reads1), file(reads2) doesn't work with single-end, does it? file(reads2) wouldn't exists? Doesn't it error than? We actually do not have a single end test case, do we?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990

But I have to admit, I didn't really test this (also because I would not be surprised, if it would cause problems already for previous versions). I will try to test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reads2 should be an empty list in that case, see https://github.com/skrakau/mag/blob/2c41ac0b6949819c9ed93d53fa9a2c6a1e3e8c72/main.nf#L990

I see, great.

But I have to admit, I didn't really test this (also because I would not be surprised, if it would cause problems already for previous versions). I will try to test.

A new test config could be added, such as test_single.config, with already available data but only forward reads?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests using single end only fail for MetaBAT, also for previous releases of the pipeline. Maybe we can add this in future PRs, anyway, I also wanted to add tests using the new functionalities. However, the tests take quite a while already and maybe we can optimize something there...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, I probably forgot to set min_length_unbinned_contigs = 1 and max_unbinned_contigs = 2...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests take quite a while already and maybe we can optimize something there...

the nf-core template suggests parallelization of CI tests by strategy.matrix. Haven't researched or tested that yet.

main.nf Outdated Show resolved Hide resolved
main.nf Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
main.nf Show resolved Hide resolved
@@ -385,6 +385,10 @@
"description": "",
"default": "",
"properties": {
"coassemble_group": {
"type": "boolean",
"description": "Co-assemble samples within one group, instead of assembling each sample separately."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably add: "with --input manifest.tsv only" or such

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, no, if not using input.tsv but "reads_R{1,2}.fastq.gz", all reads will end up in the same group (0), and thus can also be co-assembed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this information to the docs, it might be a bit too much for help

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so by default, the group information is not used, neither for co-assembly nor for computing co-abundances ...

Copy link
Collaborator

@d4straub d4straub Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to be also a really important info.

Would it be better to make default for co-abundances to group? In case the input is without manifest.tsv, all samples have "group" 0 and it still would equal "all", correct?

For co-assembly, I am not sure whats the best default. If we choose here "params.coassemble_group=true", than without manifest.tsv, one assembly would be generated, correct? I am a little afraid that people (incl. me) will produce a neat manifest but forget to specify --coassemble_group...
edit: probably make params.coassemble_group=true default when specifying --input manifest.tsv?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree regarding setting the default of 'binning_map_mode' to group

When setting the default of coassemble_group to true, I could try to add different groups for each sample when not using a manifest.tsv. Setting different defaults might be confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... having each sample in a different group is also not nice, since the assemblies will be named groupX then.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call group like the name of the sample right when parsing input instead of "0"?

e.g. https://github.com/skrakau/mag/blob/f773cb74863cc5e07bc1e681b5890d5e2f91a094/main.nf#L183
instead of
.map { row -> [ row[0], 0, [ file(row[1][0], checkIfExists: true) ] ] }
use
.map { row -> [ row[0], row[0], [ file(row[1][0], checkIfExists: true) ] ] }
might that work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, anyway all samples require unique names in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see below...

"type": "string",
"default": "all",
"description": "Defines mapping strategy to compute co-abundances for binning, i.e. which samples will be mapped against assembly.",
"help_text": "Available: 'all', 'group' or 'own'. Note that 'own' cannot be specififed in combination with --coassemble_group."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably add more explanation, e.g. that bowtie processes increase exponentially with samples or such.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@skrakau skrakau force-pushed the add_pooling branch 2 times, most recently from 351e3a4 to f3372ca Compare December 22, 2020 11:04
@skrakau skrakau force-pushed the add_pooling branch 2 times, most recently from 7d90d0d to aaf4109 Compare December 22, 2020 15:22
@skrakau
Copy link
Member Author

skrakau commented Dec 22, 2020

Ok, concerning the default behaviour and different input types again: from my view it would be best (and intuitively) to keep the following setting:

  • default: coassemble_group = false
  • default: binning_map_mode = group
  • input FASTQ files: assign to one group

For FASTQ input, if one would assign each sample to a different group, then one would need to change the default of binning_map_mode to all for this input type, otherwise this would cause a major difference to previous mag versions. And changing defaults for different input types would be confusing.
Moreover, assigning all to different groups with (co-assembly enabled by default) would still be counterintuitive (pooling processes are run, files are renamed "pooled", and also the new name is "group[group_id]", since I assumed group IDs might be rather short). Assigning all samples to one group would also have the advantage, that the user could still decide to turn on co-assembly for this input.

Regarding the default of coassemble_group, wouldn't it be reasonable to disable this by default, since there is an ongoing discussion, if or when samples should be pooled for this (so rather force people to consciously activate this than the other way round)? The group information can anyway be used for the two different layers, and it might not be necessary to use it by default for both.

I added some brief documentation.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Great work!

@skrakau
Copy link
Member Author

skrakau commented Jan 7, 2021

Thanks @d4straub !

@skrakau skrakau merged commit 741df09 into nf-core:dev Jan 7, 2021
@skrakau skrakau deleted the add_pooling branch May 31, 2021 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants