Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. #8561

Merged: samuelklee merged 10 commits into master from sl_python_version_update on Jul 9, 2024

samuelklee (Contributor) commented Oct 23, 2023

Copying over some discussion from Slack, with some slight modifications:

I took a quick stab at updating the environment for gCNV. Even taking out TensorFlow (assuming that the CNN will not be supported by this environment), it's a difficult task:

  1. The goal is to update Python from 3.6 to 3.10+, since Terra now requires the latter for officially supported images.
  2. However, gCNV relies on the PyMC3 package. PyMC3 3.1 is currently used in GATK master. 3.1 was released in 2017, not long before our release of gCNV in 2018, but it's very old now.
  3. The latest version of Python that is supported by PyMC3 3.1 in conda is Python 3.6.
  4. @asmirnov239 has a draft PR (Add pytorch to the conda environment #8094) that updates PyMC3 to 3.5 and Python to 3.7, which clearly still falls short of Python 3.10+. This PR also updated some gCNV code to make it compatible with PyMC3 3.5. (It also removed TensorFlow and added PyTorch.)
  5. @asmirnov239 also merged a PR that added tests for the numerical reproducibility of GermlineCNVCaller in cohort mode (Added gCNV integration test to detect numerical differences in the outputs. #7889).
  6. The earliest version of PyMC that supports Python 3.10+ is PyMC 4, released in 2022.
  7. However, PyMC 4 introduces API changes, which will also require additional gCNV code changes and numerical testing.
  8. These API changes stem from an update of PyMC's underlying computational backend from Theano (think of this as an old alternative to TensorFlow) to Aesara.
  9. Since then, PyMC 5.9 has been released and the underlying backend has been updated again, from Aesara to PyTensor.
  10. So if we are going to update the environment to support Python 3.10+, it probably makes sense to go all the way to PyMC 5.9.

I've made some strides in this PR; as of 6b08f3a, I've made enough updates to accommodate API changes so that cohort-mode inference for both GermlineCNVCaller and DetermineGermlineContigPloidy runs successfully under Python 3.10 and PyMC 5.9.0---although note that 5.9.1 has been released in the interim!

However, our work has just begun. Results now produced in the numerical tests mentioned above are quite far off from the original expected results. It remains to be seen whether this is due to the randomness of inference, some slight changes to the model prior that were necessitated by the API changes, or some bugs introduced in other code updates. (I believe Andrey's PR in item 4 already broke these tests, although the numerical differences were much smaller and more reasonable---but perhaps he can confirm. I also think determinism is still broken as of this commit---PyTensor/PyMC seeding has changed, so our previous Theano/PyMC3 hack no longer applies.)
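
For anyone who wants to poke at the determinism question in isolation, here is a minimal sketch of the kind of check I have in mind. It does not touch PyMC or PyTensor at all: `noisy_fit` is a hypothetical stand-in for a stochastic inference run, and `is_deterministic` is a hypothetical helper that simply reruns it with the same seed and demands identical output.

```python
import random


def noisy_fit(seed, n_draws=1000):
    """Stand-in for a stochastic inference run (NOT gCNV code).

    Returns a Monte Carlo estimate of the mean of a standard normal,
    driven entirely by a generator seeded from `seed`.
    """
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) for _ in range(n_draws)) / n_draws


def is_deterministic(run, seed, repeats=3):
    """Return True if `run(seed)` gives the identical result on every repeat."""
    first = run(seed)
    return all(run(seed) == first for _ in range(repeats - 1))


if __name__ == "__main__":
    print("seeded run is deterministic:", is_deterministic(noisy_fit, seed=42))
```

The real check would swap `noisy_fit` for an actual tool invocation and compare output files rather than a single float, but the shape is the same: any randomness that does not flow from the explicit seed shows up as a mismatch.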

So I think the next step is to just go to scientific-level testing and see what the fallout is. Ideally, we'd still get good performance (or perhaps better! at least on the runtime side, hopefully...) and we can just update the numerical tests. But if performance tanks, then we might need to see whether I've introduced any bugs. @mwalker174 @asmirnov239 perhaps you can comment on what might be the appropriate test suite here---1kGP?

I'll also highlight again that this PR will remove TensorFlow and might require that the corresponding CNN implementations be supported by an alternate strategy, at least until the PyTorch implementation goes in.

samuelklee marked this pull request as draft October 23, 2023 18:31
samuelklee changed the title from "Sl python version update" to "Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment." Oct 23, 2023
samuelklee force-pushed the sl_python_version_update branch 2 times, most recently from 6534430 to 558ccaf, November 9, 2023 20:48

mwalker174 (Contributor) commented:

Thanks for your work on this @samuelklee! Testing on both WES and WGS would be ideal. For WGS we can use the gatk-sv reference panel, which is our standard (I can help with this once a Docker image is ready). For WES, 1kGP would work, although it's definitely showing its age. Are the integration test differences large?

samuelklee (Contributor, Author) commented Dec 8, 2023

OK, I think things are looking good! Updated a bunch of things, including the following:

  • conda 23.1.0 -> 23.10.0; in the base Docker, also disabled conda auto-updating and set the solver to the much faster libmamba (NOTE: before this PR went in, this change was actually made in Update the GATK base image to a newer LTS ubuntu release #8610)
  • python 3.6.10 -> 3.10.13
  • pymc 3.1 -> 5.10.0
  • theano 1.0.4 -> pytensor 2.18.1
  • added pytorch 2.1.0
  • removed tensorflow 1.15.0 and other CNN dependencies
  • added libblas-dev to the base Docker; I think MKL versions of all packages are being used, but we should verify!

These and other packages (numpy, scipy, etc.) are all pretty much at the latest available versions for python 3.10. I've also bumped version numbers for our internal python packages.
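
If you want to sanity-check a local environment against the pins above, something like the following sketch should do. The version numbers come from the list; the distribution names are my assumption (in particular, PyTorch is usually installed under the name `torch`), and `installed_version` / `find_mismatches` are hypothetical helpers, not part of GATK.

```python
import sys
from importlib import metadata

# Pins taken from the list above. The distribution names are assumptions;
# adjust them if your environment registers the packages differently.
EXPECTED = {"pymc": "5.10.0", "pytensor": "2.18.1", "torch": "2.1.0"}


def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to an (expected, found) pair; `found` is None if it is not installed."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("python", ".".join(str(part) for part in sys.version_info[:3]))
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The `lookup` parameter is only there so the comparison can be exercised without the packages actually being installed.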

I also made all of the changes to the gCNV code to accommodate any changes introduced by PyMC/PyTensor. For the most part, these were minor renamings of theano/tt/etc. to pytensor/pt/etc.
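
To give a flavor of what "minor renamings" means, here is a rough sketch of the mechanical part. This is not the script used for the PR (the changes were made by hand); `RENAMES` and `rename_backend` are hypothetical, and the table only covers the two names mentioned above.

```python
import re

# Hypothetical rename table covering only the names mentioned above.
# Word boundaries keep "tt" from matching inside identifiers such as "attr".
RENAMES = [
    (re.compile(r"\btheano\b"), "pytensor"),
    (re.compile(r"\btt\b"), "pt"),
]


def rename_backend(source):
    """Apply the Theano -> PyTensor renames to a string of Python source."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source


if __name__ == "__main__":
    old = "import theano.tensor as tt\ny = tt.exp(x)"
    print(rename_backend(old))
```

A blind textual substitution like this would also rewrite matching words inside comments and strings, and it does nothing for calls whose behavior or signature changed, so it is only a starting point for review.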

However, there were some more nontrivial changes, including to 1) model priors (since some of the distributions previously used were removed or are now supported differently), 2) the implementation of posterior sampling, 3) some shape/dimshuffle operations, and other things along these lines.

Using a single test shard of 20 1kGP WES samples x 1000 intervals, I have verified determinism/reproducibility for DetermineGermlineContigPloidy COHORT/CASE modes, GermlineCNVCaller COHORT/CASE modes, and PostprocessGermlineCNVCalls. Numerical results are also relatively close to those from 4.4.0.0 for all identifiable call and model quantities (albeit far outside any reasonable exact-match thresholds, most likely due to differences in RNG, sampling, and the aforementioned priors).
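
To make "relatively close, but far outside exact-match thresholds" a bit more concrete, this is roughly the kind of comparison involved, as a sketch. The helper names and tolerances are illustrative assumptions, not what the integration tests actually use.

```python
import math


def max_relative_difference(expected, actual):
    """Largest |a - e| / max(|e|, |a|) over paired values.

    Pairs where both values are zero contribute 0.0.
    """
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    worst = 0.0
    for e, a in zip(expected, actual):
        scale = max(abs(e), abs(a))
        if scale > 0.0:
            worst = max(worst, abs(a - e) / scale)
    return worst


def all_close(expected, actual, rel_tol=1e-6, abs_tol=0.0):
    """True if the sequences have equal length and every pair is within tolerance."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )


if __name__ == "__main__":
    old = [1.0, 2.0, 3.0]   # illustrative values, not real gCNV output
    new = [1.0, 2.1, 2.9]
    print("max relative difference:", max_relative_difference(old, new))
    print("passes tight threshold:", all_close(old, new, rel_tol=1e-6))
    print("passes loose threshold:", all_close(old, new, rel_tol=0.1))
```

Reporting the maximum relative difference alongside the pass/fail result makes it easier to tell "RNG-level drift" apart from "something is actually broken".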

Some remaining TODOs:

  • Rebuild and push the base Docker. EDIT: Mostly covered by Update the GATK base image to a newer LTS ubuntu release #8610, but this also includes an addition of libblas-dev.
  • Update expected results for integration tests, and perhaps add any that might be missing. EDIT: These were generated on WSL Ubuntu 20.04.2; we'll see if things pass on 22.04. Note that changing the ARD priors does change the names of the expected files, since the transform is appended to the corresponding variable name. DetermineGermlineContigPloidy and PostprocessGermlineCNVCalls are missing exact-match tests and should probably have some, but I'll leave that to someone else.
  • Update other python integration tests.
  • Clean up some of the changes to the priors.
  • Clean up some TODO comments that I left to track code changes that might result in changed numerics. I'll try to go through and convert these to PR comments in an initial review pass.
  • Test over multiple shards on WGS and WES. Probably some scientific tests on ~100 samples in both cohort and case mode would do the trick. We should also double check runtime/memory performance (I noted ~1.5x speedups, but didn't measure carefully; I also want to make sure the changes to posterior sampling didn't introduce any memory issues). @mwalker174 will ping you when a Docker is ready! Might be good to loop in Isaac and/or Jack as well.
  • Perhaps add back the fix for 2-interval shards in Number of intervals edge case gCNV fix #8180, which I removed since the required functionality wasn't immediately available in PyTensor. Not sure if this actually broke things though---need to check. (However, I don't actually think this is a very important use case to support...)
  • Delete/deprecate/etc. CNN tools/tests as appropriate. Note that this has to be done concurrently, since we remove TensorFlow. @droazen perhaps I can take a first stab at this in a subsequent commit to this PR once more of the gCNV dust settles and/or has undergone a preliminary review? EDIT: Disabled integration/WDL tests. We should add some deprecation messages to the tools---we can note that they should still work in previous environments but will be untested. I might set up a separate PR for deletion, to be done at the appropriate time (but I call dibs on this, can't have @davidbenjamin overtaking my all-time record for number of lines deleted 😛).

gatk-bot commented Dec 8, 2023

GitHub Actions tests reported job failures from actions build 7143821808.
Failures in the following jobs:

Test Type | JDK       | Job ID       | Logs
conda     | 17.0.6+10 | 7143821808.3 | logs

matthdsm commented Jul 1, 2024

Hi all,

Any chance this will make it into a release soon? I was hoping this would get merged along with the recent Docker image overhaul.

Thanks
Matthias

samuelklee (Contributor, Author) commented:

@matthdsm this was intentionally left out of the recent 4.6 release, but should go into the next minor release. Would of course appreciate any testing/feedback from the community before then!

samuelklee marked this pull request as ready for review July 2, 2024 04:11

samuelklee (Contributor, Author) commented Jul 2, 2024

Released gatkbase-3.3.0 to broadinstitute/gatk:gatkbase-3.3.0, but getting Permission "artifactregistry.repositories.uploadArtifacts" denied on resource "projects/broad-gatk/locations/us/repositories/us.gcr.io" when trying to push to us.gcr.io/broad-gatk/gatk:gatkbase-3.3.0.

samuelklee (Contributor, Author) commented:

Just added @DeprecatedFeature tags to the CNN tools. @droazen will help me push broadinstitute/gatk:gatkbase-3.3.0 to us.gcr.io/broad-gatk/gatk:gatkbase-3.3.0 (since it appears I no longer have permission, perhaps due to the recent migration). Then a thumbs up from him or @ldgauthier and I think this is good to go in!

ldgauthier (Contributor) left a comment

All the comparisons look great and I am confident in David's CNN->NV update plan -- let's do it!

samuelklee merged commit ddaf66f into master Jul 9, 2024
20 checks passed
samuelklee deleted the sl_python_version_update branch July 9, 2024 20:08

droazen (Contributor) commented Jul 9, 2024

Woohoo, thank you @samuelklee !!

matthdsm commented:

@droazen, do you think this warrants a new point release? That way we can finally fix the gatk-gcnvkernel recipe over at bioconda and make the conda recipe usable again 😄

droazen (Contributor) commented Jul 10, 2024

@matthdsm Yes definitely -- there will be another release fairly soon to get this out. Before we can release, though, we do need to merge a couple of PRs that have been waiting on this change (in particular, a replacement tool for CNNScoreVariants that uses PyTorch). We're currently targeting the late July / early August timeframe for the next release.

Are you the maintainer of the GATK bioconda recipes, by the way? Let us know if there's anything else we can do in the upcoming release to fix bioconda-related issues!

matthdsm commented:

I'm a bioconda maintainer, one of many, but I've got a vested interest in a functional gatk recipe 😅
At the moment, we're unable to get the latest version of the GATK to build because of the requirements for the gcnvkernel.
A new version with the changes above would fix most if not all of the issues we're currently seeing.

matthdsm commented:

Hi @droazen,
Any updates on the timeframe for this new release? We're eagerly waiting for the next version so we can start updating everything on our side!
