Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to busco 5.1.0 and enable automated lineage selection #179

Merged
merged 44 commits into from
Apr 26, 2021

Conversation

skrakau
Copy link
Member

@skrakau skrakau commented Apr 8, 2021

Among others in preparation for adding GTDB-tk (#178).

This updates Busco from 4.1.4 to 5.1.0 and enables the use of the automated lineage selection:

  • by default now the BUSCO parameter --auto-lineage is used and the data is downloaded automatically (also for the tests currently)
  • the lineage can still be specified by providing a lineage dataset via the mag parameter --busco_reference (uses BUSCO parameter --lineage_dataset). This still requires the download of a file_versions.tsv file, which is used to check if the newest lineage dataset is used (BUSCO warns if not).
  • to run BUSCO in offline mode, the mag parameter --busco_download_path can be used (BUSCO: --download_path currently only in combination with --auto-lineage or --auto-lineage-prok)
  • additionally the mag parameter --busco_auto_lineage_prok can be used to ignore eukaryotes (BUSCO: --auto_lineage_prok)
  • the mag parameter --save_busco_reference is also used to save the lineage datasets downloaded by BUSCO, I added an extra process for this to only do this once and not for each BUSCO process

Error handling:

  • failed analysis due to no matching BUSCO genes or failed placements:
    • if run with --busco_reference the number of marker genes for the corresponding lineage is used and all output files just contain a "100% Missing" etc.
    • if run with --auto-lineage number of marker genes is unknown and no busco results file is generated -> these contigs are missing in the MultiQC BUSCO report. In the final busco_summary.txt I put a NA since I still thought those contigs should be listed (could also be done differently)
      • since in this case the "short_summary*" file from busco is missing, I added additional output files for failed analyses which get processed in the downstream BUSCO_SUMMARY process:${bin}_busco.failed_bins.txt containing just the bin name
      • I added a Nextflow warning if busco analysis failed for bins due these error types (not for kept unbinned contigs though!)
  • in my testdata there are a two unbinned contigs for which the analysis works when selecting the lineage, but fails for auto selection during placement. Since this only happens for unbinned contigs with > 96% missing I thought maybe it is OK, and we can just catch this error as well: warn if this happens for proper bins, and report this as 100% missing with NA as total number in the busco_summary.txt. What do you think?

Open todos:

  • when using --busco_reference: add also a warning if for a binned genome 100% are missing?
  • Placements failed: could be memory problem?
  • compress files when saving dowloaded lineage datasets
  • test if analysis can be reproduced with saved downloaded data?
  • update docs
  • update to 5.1.2: maybe in another PR, since I had problems with the current singularity images

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Apr 8, 2021

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 5e9c00b

+| ✅ 117 tests passed       |+
#| ❔   4 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
  • conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
  • conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

Run details

  • nf-core/tools version 1.13.3
  • Run at 2021-04-26 11:43:10

@skrakau skrakau marked this pull request as ready for review April 8, 2021 18:11
@skrakau skrakau requested a review from d4straub April 8, 2021 18:12
@d4straub
Copy link
Collaborator

d4straub commented Apr 9, 2021

This is an awesome upgrade!

Error handling:

if run with --busco_reference the number of marker genes for the corresponding lineage is used and all output files just contain a "100% Missing" etc.

Sounds good

if run with --auto-lineage number of marker genes is unknown and no busco results file is generated -> these contigs are missing in the MultiQC BUSCO report. In the final busco_summary.txt I put a NA since I still thought those contigs should be listed (could also be done differently)

I guess you mean "bin" instead of "contig" here. Missing in the MultiQC report is fine. I absolutely agree that in busco_summary.txt bins with failed BUSCO analysis should appear anyhow. NA is ok, but if easily possible 100% Missing would be also great, e.g. by producing a dummy "short_summary*" file (and using NA only in the field for number of BUSCOs, i.e. "Total number"), but thats probably too much work for such a small improvement.

since in this case the "short_summary*" file from busco is missing, I added additional output files for failed analyses which get processed in the downstream BUSCO_SUMMARY process:${bin}_busco.failed_bins.txt containing just the bin name

Not sure if this is needed since busco_summary.txt already contains that info, if I understood correctely? Nevertheless, one more file is fine as well. Just the bin name is fine.

I added a Nextflow warning if busco analysis failed for bins due these error types (not for kept unbinned contigs though!)

I agree that this isn't really relevant for kept unbinned contigs. However, I'd hope that those still appear in the busco_summary.txt?

Open todos:

when using --busco_reference: add also a warning if for a binned genome 100% are missing?

Yes that would be great.

edit:

Placements failed: could be memory problem?

I think this would be great to check out.

for (bin in busco_failed_bins) {
failed_bins += " ${bin}\n"
}
log.info "-${colors.purple}[$workflow.manifest.name]${colors.red} For ${busco_failed_bins.size()} bin(s) the BUSCO analysis failed because no genes where found or placements failed:\n${failed_bins}See ${params.outdir}/GenomeBinning/QC/BUSCO/[bin]_busco.err for further information.${colors.reset}-"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm so this is also when placements failed, is that the same problem? I am uncertain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, in both cases bins would be listed here.

The problem with the placement error is it says:
ERROR: Placements failed. Try to rerun increasing the memory or select a lineage manually.

In my case it was definitely not a memory problem. However, I am afraid that this could be caused by memory issues and then we would block retrying (such error messages do not really help...). That is why I pointed to the ${bin}_busco.err file, so the user could at least discover this. It would of course not be nice, but I also don't know how else to handle this.

Comment on lines +81 to +84
if [ \${#summaries[@]} -ne 1 ]; then
echo "ERROR: none or multiple 'BUSCO/short_summary.specific.*.BUSCO.txt' files found. Expected one."
exit 1
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious: in what case can there be several specific summary files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know :D I think I just wanted to check if it is exactly 1

@skrakau
Copy link
Member Author

skrakau commented Apr 9, 2021

I guess you mean "bin" instead of "contig" here. Missing in the MultiQC report is fine. I absolutely agree that in busco_summary.txt bins with failed BUSCO analysis should appear anyhow. NA is ok, but if easily possible 100% Missing would be also great, e.g. by producing a dummy "short_summary*" file (and using NA only in the field for number of BUSCOs, i.e. "Total number"), but thats probably too much work for such a small improvement.

Currently the busco_summary.txt looks like this when using auto selection:

GenomeBin       %Complete       %Complete and single-copy       %Complete and duplicated        %Fragmented     %Missing        Total number
SPAdes-test_minigut.1.fa        17.7    17.7    0.0     0.8     81.5    124
SPAdes-test_minigut.2.fa        13.7    13.7    0.0     3.2     83.1    124
MEGAHIT-test_minigut.1.fa       12.1    12.1    0.0     3.2     84.7    124
MEGAHIT-test_minigut.2.fa       18.5    18.5    0.0     0.8     80.7    124
SPAdes-test_minigut_sample2.unbinned.1.fa       0.0%    0.0%    0.0%    0.0%    100.0%  NA
SPAdes-test_minigut_sample2.unbinned.2.fa       0.0%    0.0%    0.0%    0.0%    100.0%  NA
...

I adjusted the summary.busco.py script for this.

since in this case the "short_summary*" file from busco is missing, I added additional output files for failed analyses which get processed in the downstream BUSCO_SUMMARY process:${bin}_busco.failed_bins.txt containing just the bin name

Not sure if this is needed since busco_summary.txt already contains that info, if I understood correctely? Nevertheless, one more file is fine as well. Just the bin name is fine.

busco_summary.txt does not yet contain the info, because for those bins no summary file from BUSCO arrives. In theory the bin name would be sufficient yes, but I didn't get it done passing the bin names as val types, because collect() and passing it to the BUSCO_SUMMARY process only worked for files. Maybe there is a solution, but I didn't see a straight forward one. Do you know one?

I added a Nextflow warning if busco analysis failed for bins due these error types (not for kept unbinned contigs though!)

I agree that this isn't really relevant for kept unbinned contigs. However, I'd hope that those still appear in the busco_summary.txt?

sure :)

@skrakau skrakau force-pushed the update_busco_5.1.0 branch from ab890e8 to 60204c6 Compare April 9, 2021 16:21
@skrakau skrakau force-pushed the update_busco_5.1.0 branch from bd53286 to 61b541d Compare April 22, 2021 11:38
@skrakau
Copy link
Member Author

skrakau commented Apr 22, 2021

ok, finally another update :) roughly the following changes were done:

  • added Nextflow warning that BUSCO did not find any matching genes also when using --busco_reference
  • added used lineage dataset to busco_summary.tsv
  • when run in auto lineage selection mode: results for both the selected domain and a more specific lineage are provided if available (short_summary.domain.[lineage].[assembler]-[bin].txt, short_summary.specific_lineage.[lineage].[assembler]-[bin].txt, assembler]-[bin]_buscos.[lineage].fna.gz, [assembler]-[bin]_buscos.[lineage].faa.gz and in busco_summary.tsv)
  • domain results are now retrieved even if placement fails
  • add 'Viruses' for domain in busco_summary.tsv
  • fixed saving of datasets downloaded by BUSCO when using auto selection in online mode: since the complete reference data provided by BUSCO is > 100GB big (which could be used to ensure reproducibility), I wanted a setting which allows saving only the by BUSCO downloaded datasets. The downloaded data depends on the lineage selection. Therefore the process BUSCO_SAVE_DOWNLOAD is now run on a subset of the download folders (from different BUSCO runs), one for each selected specific lineage to make sure to retrieve all used datasets, while not overwriting already saved files. This now can be used to reproduce the BUSCO analysis (with same settings and same input data).
  • regarding the saving of the downloaded files, I did not compress them yet as they did not take that much space yet, and like this the folder can directly be used as input for --busco_download_path
  • docs output.md and usage.md updated

and the last time I forgot to mention:

  • the process BUSCO_PLOT does not output anymore the sample-wise summary files [assembler]-[sample]-busco_summary.txt, if this is needed I could easily add it again (generated with the BUSCO_SUMMARY process). Just wondered if they are used at all since they were also not mentioned in the output.md so far?

@d4straub
Copy link
Collaborator

d4straub commented Apr 22, 2021

Really helpful changes!

the process BUSCO_PLOT does not output anymore the sample-wise summary files [assembler]-[sample]-busco_summary.txt, if this is needed I could easily add it again (generated with the BUSCO_SUMMARY process). Just wondered if they are used at all since they were also not mentioned in the output.md so far?

[assembler]-[sample]-busco_summary.txt are not used at all, because all this information is supposed to be in busco_summary.tsv anyway.

@skrakau skrakau force-pushed the update_busco_5.1.0 branch from 61b541d to 83804e7 Compare April 22, 2021 13:54
@skrakau skrakau requested a review from d4straub April 22, 2021 15:41
@skrakau
Copy link
Member Author

skrakau commented Apr 22, 2021

Thanks a lot @d4straub for all your feedback! Ready for re-review again

@skrakau
Copy link
Member Author

skrakau commented Apr 26, 2021

Correction:
unfortunately there are a few differences regarding the handling of mollicutes lineages and failed placements when running BUSCO in online and offline mode. Thus the saved datasets downloaded automatically by BUSCO can currently not be used to reproduce the analysis by specifying the already downloaded files via --busco_download_path. I adjusted the docs accordingly.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@skrakau skrakau merged commit af7c17d into nf-core:dev Apr 26, 2021
@skrakau skrakau deleted the update_busco_5.1.0 branch May 31, 2021 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants