Standardises GTDB execution and allow pre-uncompressed GTDB input #477

Merged: jfy133 merged 11 commits into dev from improve-database-handling on Aug 10, 2023

Conversation

jfy133
Member

jfy133 commented Jul 13, 2023

To close #424

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions

github-actions bot commented Jul 13, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e6a2c71

  • ✅ 156 tests passed
  • ❔ 1 test was ignored
  • ❗ 1 test had warnings

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.9
  • Run at 2023-08-07 16:10:53

jfy133 marked this pull request as draft July 13, 2023 08:24
@jfy133
Member Author

jfy133 commented Jul 13, 2023

Need to test: currently GTDB is not being executed at all.

@jfy133
Member Author

jfy133 commented Jul 29, 2023

First pass tests:

  • Standard test profile (i.e., skipping GTDB-Tk ensures no download of the database)

    nextflow run ../main.nf -profile singularity,test --outdir ./results

  • Standard test profile (still skipping) but with a GTDB-Tk database path supplied: still results in no download/execution

    nextflow run ../main.nf -profile singularity,test --outdir ./results --gtdb_db ~/cache/databases/gtdbtk_r202_data.tar.gz

  • Run GTDB-Tk with a pre-supplied tar archive (should run DB prep on it, then run GTDB-Tk)

    nextflow run ../main.nf -profile singularity,test --outdir ./results --gtdb_db /home/james/cache/databases/gtdbtk_r202_data.tar.gz --skip_gtdbtk false

    Working: but test dataset gets no completeness (all bins 'discarded' during ch_bins_metric)
    Trying new data (subset Maixner 2021).
    Working: but broken database

  • Run GTDB-Tk with an already decompressed tar archive, supplied as a directory (no DB prep, but GTDB-Tk still runs)

    time nextflow run ../main.nf -profile singularity,test --input "*_{R1,R2}.fastq.gz" --outdir ./results --gtdb_db /home/james/cache/databases/database --skip_gtdbtk false -dump-channels -resume --input samplesheet.2612.csv

    Working: but broken database

  • Run GTDB-Tk with no supplied database (i.e., should auto-download)

    time nextflow run ../main.nf -profile singularity,test --input "*_{R1,R2}.fastq.gz" --outdir ./results --skip_gtdbtk false -dump-channels -resume --input samplesheet.2612.csv

    Working: but broken database

  • Run command but skipping bin QC (should not auto-download)

Remember: remove print and dumps!
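
The database cases exercised in the list above can be summarised as a small decision sketch. This is only an illustration of the expected behaviour described in this comment, not the pipeline's actual Nextflow logic: the function name `gtdb_db_action` and the action labels it prints are hypothetical, and the bin QC skipping case is left out.

```shell
# Hypothetical helper (not part of nf-core/mag): report what the pipeline is
# expected to do for a given --gtdb_db value and --skip_gtdbtk setting.
gtdb_db_action() {
    db_path="$1"       # value passed to --gtdb_db; may be empty
    skip_gtdbtk="$2"   # "true" or "false"

    if [ "$skip_gtdbtk" = "true" ]; then
        echo "skip"            # no download, no DB prep, no GTDB-Tk run
    elif [ -z "$db_path" ]; then
        echo "download"        # nothing supplied: auto-download the database
    elif [ -d "$db_path" ]; then
        echo "use-directory"   # already unpacked: no DB prep needed
    else
        case "$db_path" in
            *.tar.gz) echo "untar" ;;        # archive: DB prep, then GTDB-Tk
            *)        echo "unsupported" ;;  # anything else is not handled here
        esac
    fi
}

gtdb_db_action "" false
gtdb_db_action "gtdbtk_r202_data.tar.gz" false
gtdb_db_action "." false
gtdb_db_action "gtdbtk_r202_data.tar.gz" true
```

Checking for a directory before looking at the file extension means an unpacked database is used as-is even if its folder name happens to end in `.tar.gz`.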

@jfy133
Member Author

jfy133 commented Jul 29, 2023

TODO: same for BUSCO

@jfy133
Member Author

jfy133 commented Jul 30, 2023

Changed my mind: BUSCO is a little more tricky, so I will do that in a follow-up PR (if I do more than just accepting directory input)

jfy133 marked this pull request as ready for review July 30, 2023 06:07
Contributor

prototaxites left a comment

Changes look fine to me, but I don't have time (or a downloaded copy of the GTDB database) to test it myself!

> Working: but broken database

Can I check what this means?

@jfy133
Member Author

jfy133 commented Jul 31, 2023

It was saying something about not being able to find a 'tigram database' or something like that, regardless of whether I used the auto-downloaded or manually downloaded GTDB release 202 🤷. Unless the error message was a bit funny and it meant it couldn't find features within the data, maybe.

But regardless, the module definitely executed and got halfway through :)

@prototaxites
Contributor

Poking around - the r202 database release is listed with a maximum GTDB-Tk version compatibility of 1.7.0 - the version in mag dev at the moment is 2.1.1.

Worth retrying with the R214 or R207v2 release, which are listed as compatible? A little bit of googling suggests Tigram is some kind of HMM database - maybe this has since been added to the GTDB database. Might be we have to bump the default database version as well.

https://ecogenomics.github.io/GTDBTk/installing/index.html
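
The mismatch described above can be checked mechanically. The sketch below is only illustrative: `version_le` is a hypothetical helper, it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils), and the two version numbers are the ones quoted in this comment rather than values read from the tool or the database.

```shell
# version_le A B: succeeds when version A <= version B under `sort -V` ordering.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="2.1.1"      # GTDB-Tk version reported for mag dev in this thread
max_for_r202="1.7.0"   # maximum GTDB-Tk version listed for the r202 release

if version_le "$installed" "$max_for_r202"; then
    echo "r202 is listed as compatible with GTDB-Tk $installed"
else
    echo "r202 is not listed as compatible with GTDB-Tk $installed"
fi
```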

@jfy133
Member Author

jfy133 commented Jul 31, 2023

Huh, interesting... then I will check that later. I guess somehow the module got updated at some point but not the URL?

@jfy133
Member Author

jfy133 commented Jul 31, 2023

Thanks for the investigation :D (will likely finish next week though, as I'm teaching all this week)

@CarsonJM
Contributor

@jfy133 @prototaxites I think it would be great to get the r214 database set as default, since it is a pretty significant increase over r207. I'm happy to work on updating that this week if that would be helpful!

@jfy133
Member Author

jfy133 commented Jul 31, 2023

That would be great @CarsonJM !

I honestly think it will just take updating the URL in the nextflow.config and maybe some docs. You're welcome to try it on my branch if you want and push to the current one!

If you could also test on your own (small) data for the database cases (auto download, supplied as a tar.gz, and also an unpacked tar, i.e. a directory), that would also be really helpful.

The database is too large for the GitHub CI nodes, so I fear it's not tested sufficiently :(, thus the more manual tests the better.

We should also maybe consider a release checklist to also run a range of e.g. a full AWS run/local HPC run with all the large databases activated...

@CarsonJM
Contributor

Thanks for the guidance @jfy133, I will work on that this week and keep you all posted!

@CarsonJM
Contributor

CarsonJM commented Aug 7, 2023

Sorry for falling behind on this. I made the code changes and started some tests last Thursday, but sorely underestimated the amount of time I would need to request for downloading/unpacking this database. Re-running the tests now!

@CarsonJM
Contributor

CarsonJM commented Aug 7, 2023

Finished running all three tests (auto-download, .tar.gz, and directory): all worked great, and it looks like all CI tests are going to pass as well. One thought on this would be to add the label "process_high_memory" to GTDBTK_DB_PREPARATION so that by default it requests a lot of memory. Would that make sense?

@jfy133
Member Author

jfy133 commented Aug 7, 2023

Thanks for adding that @CarsonJM ! Why do you think that process needs lots of memory? Isn't it just running untar?

@CarsonJM
Contributor

CarsonJM commented Aug 7, 2023

Good point @jfy133! I initially ran this without modifying the resource request, and it failed after > 4 hrs (our default queue's max runtime). When I requested 16 threads and 100 GB of memory, it completed in 40 min. After your comment I looked at the trace, and the memory requirement is definitely low! Would it be running faster because more cores were available?

Trace below:

    task_id:    5
    hash:       8b/0ce534
    native_id:  62494
    name:       NFCORE_MAG:MAG:GTDBTK:GTDBTK_DB_PREPARATION (gtdbtk_r214_data.tar.gz)
    status:     COMPLETED
    exit:       0
    submit:     2023-08-07 07:33:11.880
    duration:   40m 25s
    realtime:   40m 23s
    %cpu:       37.7%
    peak_rss:   6.7 MB
    peak_vmem:  11.1 MB
    rchar:      158.6 GB
    wchar:      161.9 GB
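
As a rough reading of the trace row above, the average read rate can be estimated from the rchar and realtime values. This is only a back-of-the-envelope sketch: `read_rate_mb_s` is a hypothetical helper, it assumes 1 GB = 1024 MB, and rchar counts bytes passed through read calls rather than bytes that actually hit the disk.

```shell
# read_rate_mb_s GB SECONDS: average rate in MB/s, assuming 1 GB = 1024 MB.
read_rate_mb_s() {
    awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / s }'
}

rchar_gb=158.6                  # rchar value from the trace
realtime_s=$((40 * 60 + 23))    # realtime of 40m 23s, in seconds

echo "average read rate: $(read_rate_mb_s "$rchar_gb" "$realtime_s") MB/s"
```

Together with the 37.7% CPU usage and 6.7 MB peak RSS, a sustained rate of this size hints that the step may be limited by I/O more than by memory or cores, although a single trace row cannot confirm that.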

@jfy133
Member Author

jfy133 commented Aug 8, 2023

AFAIK it's also single-core... so I have no idea. Maybe the first time, the node you were sent to had lots of RAM-intensive jobs going on?

I didn't have that issue myself personally (40m every time) when running on my laptop, so I'm more inclined to leave it as is for now?

Otherwise, if you're happy with the PR @CarsonJM please give the ✔️ and then we can merge, and dare I say it, make the release?

jfy133 merged commit b004f03 into dev Aug 10, 2023
jfy133 deleted the improve-database-handling branch August 10, 2023 13:45