Keep large unbinned contigs for downstream steps #29
Conversation
Checks seem to fail because of internet connectivity? Weird.
Ok, BUSCO is obviously now available in a v4 beta (we are using v3) and the location of the database changed. I hope I fixed this now. The default database is just 8 MB; maybe we could either upgrade to v4 or store the v3 default database in nf-core/test-datasets so that we do not lose it again... edit: seriously, the file seems to have changed as well? Wow.
Do you mind if I review/merge this after 1.0.0? I'd like the first release to be done before Christmas 😄 Concerning BUSCO, I changed the URL in dev; I can make a PR to update to v4 (after 1.0.0! 😁)
Fine by me, but I think that's a major flaw. The solution proposed here works fine with test data, but with real data a pooled bin that is produced right now and forwarded to downstream processes could be too big and need a change. If you manage to get your release out before I optimize this step (on holiday right now, not working on it at the moment), go ahead ;)
A pooled bin being too big seems like a separate issue? Since you are not forwarding the unbinned contigs to downstream processes? (Happy holidays!) |
D4straub v0.2
Still some linting errors, but coming closer ...
No errors anymore; please review.
bin/split_fasta.py
Outdated
out_base = os.path.splitext(input_file)[0]

# Read file
fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
Files should be opened with the with context manager; that way the file handles will close automatically at the end of the scope.
i.e.

with open(input_file) as f:
    fasta_sequences = SeqIO.parse(f, 'fasta')
    # rest of the code that needs the file to be open
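For a fuller picture, here is a minimal self-contained sketch of the suggested pattern; the input file name, length threshold, and per-contig output naming are illustrative assumptions, not taken from the actual split_fasta.py:

from Bio import SeqIO

input_file = "assembly.fasta"  # hypothetical input, for illustration only
min_length = 100000            # hypothetical size cutoff

# The handle is closed automatically when the with block ends,
# even if parsing raises an exception.
with open(input_file) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        if len(record.seq) >= min_length:
            # write each sufficiently large contig to its own file
            SeqIO.write(record, record.id + ".fasta", "fasta")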
Right! Changed in 8effb0b
main.nf
Outdated
@@ -906,12 +913,12 @@ process metabat {
     def name = "${assembler}-${sample}"
     """
     jgi_summarize_bam_contig_depths --outputDepth depth.txt ${bam}
-    metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${name}" -m ${min_size}
+    metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${name}" -m ${min_size} --seed 1 --unbinned
Shouldn't there rather be no seed by default, but an option to fix the seed?
My aim was to make sure that the same results are created every time the workflow runs; therefore I fixed the seed here. But later I realized that at least SPAdes does not produce identical results; obviously there is also some randomness in there, and I haven't seen an option to fix the seed for it. So I no longer feel that this is necessary at all.
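To illustrate what such an option could look like, here is a hedged Python sketch of a hypothetical command builder; it is not the pipeline's actual code, and only the MetaBAT2 flags already shown in the diff above are taken from the source:

import subprocess

def metabat2_command(assembly, depth, out_prefix, threads, min_size, seed=None):
    # Hypothetical wrapper: by default no seed is passed, so MetaBAT2 stays
    # non-deterministic; a seed is forwarded only when explicitly given.
    cmd = ["metabat2", "-t", str(threads), "-i", assembly,
           "-a", depth, "-o", out_prefix, "-m", str(min_size), "--unbinned"]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd

# Reproducible run, e.g.:
# subprocess.run(metabat2_command("assembly.fa", "depth.txt", "MetaBAT2/sample", 8, 1500, seed=1), check=True)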
Changed in 17065a9
This solves #27