
Added regularization to covariance in GMM maximization step to fix convergence issues in VQSR. #7709

Merged — 4 commits merged into master from sl_vqsr_convergence on Mar 10, 2022

Conversation

@samuelklee (Contributor) commented Mar 7, 2022

A quick patch to help out the Variants team, which is struggling with a problematic callset.

Note that a similar regularization of the effective number per component probably should have been applied to solve the issue in #6425. I'm not sure if the lack of this regularization will still lead to convergence issues, but I would hope that the fix that was implemented instead (treating vanishing components as a special case and skipping computation) suffices. As discussed there, we may also want to eventually remove the idiosyncratic finalize step; it's likely that step is the source of the issues here, since the correct Bayesian M step is already regularized by the prior.
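(As a rough sketch of what "regularized by the prior" means here, assuming a standard Normal-Inverse-Wishart prior with scale matrix $\Psi_0$, strength $\kappa_0$, and degrees of freedom $\nu_0$: the MAP covariance update for component $k$ is approximately

$$\Sigma_k = \frac{\Psi_0 + S_k + \frac{\kappa_0 N_k}{\kappa_0 + N_k}\,(\bar{x}_k - m_0)(\bar{x}_k - m_0)^\top}{\nu_0 + N_k + D + 1},$$

where $N_k$ is the effective number of points in the component, $\bar{x}_k$ and $S_k$ are the weighted mean and scatter, and $D$ is the dimension. The $\Psi_0$ term in the numerator keeps $\Sigma_k$ positive definite even as $N_k \to 0$, which is exactly the role an explicit covariance regularizer plays.)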

The covariance regularization term added here is standard (cf., e.g., https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09bcc2eaeba98f7e737aac2ac782f0e5f1/sklearn/mixture/_gaussian_mixture.py#L154), but it may result in non-negligible changes to VQSLODs. As just discussed with the Variants team, we can probably use the WARP validation to convince ourselves that the results are functionally equivalent.
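For reference, here is a minimal sketch of what this kind of regularization looks like in a GMM M step, written in the style of the linked scikit-learn code rather than the GATK implementation; the function name and the `reg_covar` parameter are illustrative only:

```python
import numpy as np

def m_step_covariances(X, resp, means, reg_covar=1e-6):
    """M-step covariance update with diagonal regularization (illustrative only).

    X: (n_samples, n_features) data
    resp: (n_samples, n_components) responsibilities from the E step
    means: (n_components, n_features) component means
    """
    n_components, n_features = means.shape
    # Floor the effective number of points per component at a tiny epsilon,
    # so nearly empty components do not divide by zero.
    nk = resp.sum(axis=0) + 10 * np.finfo(resp.dtype).eps
    covariances = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        covariances[k] = (resp[:, k] * diff.T) @ diff / nk[k]
        # The regularization: add reg_covar to the diagonal so the covariance
        # stays positive definite even when nk[k] is vanishingly small.
        covariances[k].flat[:: n_features + 1] += reg_covar
    return covariances
```

The key line is the diagonal bump: without it, a component whose responsibilities collapse to (near) zero can produce a singular covariance and derail convergence.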

I updated the exact-match tests without much close examination (by simply forcing IntegrationTestSpec.assertEqualTextFiles to overwrite the old expected files), so someone may want to sanity check them. There were also more interactions between the integration tests for different tools than I anticipated: some tests use output generated by an upstream tool as input, which breaks encapsulation.

@samuelklee force-pushed the sl_vqsr_convergence branch 2 times, most recently from 612bcc3 to 9eb886c on March 7, 2022 at 14:06
@ldgauthier (Contributor) commented:

@samuelklee I will build the Docker image and run the WARP tests, but if we do this again I'm going to teach you to fish. Wasn't there a VariantRecalibrator integration test that had some sort of random seed hijinx? Hopefully your fix means we don't need any shenanigans anymore -- can you see if there are any other tests we can clean up?

@samuelklee (Contributor, Author) commented:

Thanks for the quick review, @ldgauthier!

I don't think my fix will address any non-determinism in the integration tests. I'm inclined to just do better with the new tools---there does seem to be enough duct tape in the integration tests regarding re/setting the RNG so that the exact-match tests consistently pass.

As for learning how to run the WARP tests, I think that would indeed be pretty useful---for anyone who might have to update code for VQSR or the new tools in the future! Can we teach everyone to fish? Isn't this what CARROT is for?

@ldgauthier (Contributor) commented:

See "smart tests" here: broadinstitute/warp#639 Should finish in maybe an hour and a half? Then I'll look at the VerifyNA12878 results.

@ldgauthier (Contributor) commented Mar 8, 2022

It's this I'm hoping we can take out: [permalink to the random-int draw in the VariantRecalibrator integration test]

Kevin has kindly volunteered to help me port the warp tests to Carrot, if I can ever make time. :-(

@ldgauthier (Contributor) commented:

Warp exome JG tests had slightly different metrics, so they failed, but the NA12878 results look like this:

test:

| type | TP | FP | FN | RECALL | PRECISION |
| --- | --- | --- | --- | --- | --- |
| SNP | 17444 | 47 | 843 | 0.954 | 0.997 |
| INDEL | 300 | 73 | 103 | 0.744 | 0.804 |

truth:

| type | TP | FP | FN | RECALL | PRECISION |
| --- | --- | --- | --- | --- | --- |
| SNP | 17443 | 48 | 844 | 0.954 | 0.997 |
| INDEL | 304 | 74 | 99 | 0.754 | 0.804 |

(Over the exome) negligible change in SNPs, slight recall decrease in indels, which I would say is "in the noise" given how few truth indels are in the exome.

@samuelklee (Contributor, Author) commented Mar 8, 2022

Thanks @ldgauthier! That canary certainly looks alive to me. Happy to merge whenever you and/or the Variants team approve.

@ldgauthier (Contributor) commented:

I will approve after you take a look at the integration test I linked above.

@samuelklee (Contributor, Author) commented Mar 9, 2022

Sorry, I guess I didn't see your edit pointing out that line of code.

I've indeed looked at that test and more---there's a lot of similar duct tape and inconsistent resetting of the RNG throughout the entire test suite. But since I think we can reasonably assume there's enough duct tape to make things deterministic overall, I don't think it's worth cleaning it all up just to get neater, but equally deterministic, behavior. (Or perhaps you can point to instances of persisting non-determinism, e.g., random failures in Travis?)

In any case, I think it makes more sense to focus effort on making cleaner tests for the new tools, rather than making an 11th-hour effort to revamp these existing tests. Do you agree?

See e.g. #6112 for some related discussion.

I also added a note mentioning that the original GATK3 expected results have been updated, although now, looking back at the commit history, I'm not sure whether that was already the case.

@ldgauthier (Contributor) commented:

I just want to know if the need for that RNG draw is obviated by your convergence fix. If it is, then we can remove that line. If not, then we know there are still some issues yet to be addressed.

@samuelklee (Contributor, Author) commented:

As far as I can tell, even master passes when that line is removed. So the comments in the integration test are already misleading or out of date. Do you see differently?

@samuelklee (Contributor, Author) commented Mar 10, 2022

Just for future reference, note that comments in testVariantRecalibratorSNPMaxAttempts are also incorrect or out of date. The test passes even if you limit it to one attempt.

// For this test, we deliberately *DON'T* sample a single random int as above; this causes
// the tool to require 4 attempts to acquire enough negative training data to succeed

So again, the tests were already "broken." But still, rather than attempt to fix them, I think it's best to follow the principle of not changing both production and test code, to the extent that's possible in this scenario. We've already updated enough exact-match expected results to make me a bit uncomfortable!

Someone else may want to tackle fixing the tests in a separate push, but I think it makes sense for me to focus on avoiding these sorts of issues when writing tests for the new tools.

EDIT: For the record, I confirmed that the undesired behavior in this test that the RNG hack was trying to avoid was fixed (and hence, the test was "broken") in #6425. This probably wasn't noticed because this is the only non-exact-match test, and the test isn't strict enough to check that attempts 1-3 fail; it only checks that we succeed by attempt 4. Again, someone else may feel free to examine the actual coverage of this test and whether it's safe to remove it and/or clean up all the duct tape---but at some point, it becomes difficult to tell which pieces of duct tape are load-bearing!

@samuelklee merged commit b14f810 into master on Mar 10, 2022
@samuelklee deleted the sl_vqsr_convergence branch on March 10, 2022 at 17:46