-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reset before request in order to work around VCert's "unmatched key modulus" #3
Conversation
bcbe7c5
to
fcfb524
Compare
…if it exists) while retrieving the certificate"
fcfb524
to
67c6b47
Compare
In Venafi#269, I got convinced that it would be a good solution to call "Reset" only if the "Retrieve" call returned a known message. Later on, we realized that there was a bad interaction between "Request" and "Reset(restart=true)". For some reason, when a problem arises (such as CA being down), TPP returns the old certificate, and vcert ends up showing the message "unmatched key modulus". We realized that calling "Reset(restart=false)" before Request prevents this bug. Although that's one extra HTTP call, it seems this call is very inexpensive. One downside that was brought up during the PR Venafi#269 was that any extra HTTP call would slow the TPP server because the HTTP called are "queued" (not concurrently processed).
67c6b47
to
fceb463
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some very minor remarks.
The best way to test this is using integration tests, which are not available on this fork.
So, I would like to merge this PR without adding tests to fix this issue quickly, we did manually test the fix.
Once the remarks are addressed, it think it is ready to be merged.
I added unit tests by mocking the HTTP calls. The reason I wanted to add these tests is because it is the only way to "review" what the error messages look like. I think it is useful to be able to review the error messages since it is a UI element. I also slightly changed the |
@maelvls All of my feedback has been resolved, feel free to merge this PR. |
Note: the issue addressed by this PR has not been fixed in v4.24.0 Venafi#273 (comment) so this fork is still used by cert-manager. |
We found out that Venafi#273 was causing venafi-enhanced-issuer to show the error:
We found that it was caused by running
request
andreset(restart=true)
back to back.To work around this issue, we decided to unconditionally reset any prior failed enrollment with
reset(restart=false)
so that we don't get stuck with "Fix any errors, and then click Retry." (60% of the time) or "WebSDK CertRequest" (40% of the time), and to revert the changes made in Venafi#269.For context, we compared two approaches: (1) unconditionally reset, and (2) conditionally reset synchronously.
Unconditionally Reset (this PR)
We will go with this solution as an "intermediate" solution.
Instead of conditionally resetting, we always reset. Since we don’t reset and retrieve back to back, we avoid the race condition that was found in Venafi#273.
I stress-tested TPP to calculate the impact of
reset(restart=false)
under load. The VM was on GCP, I used TPP 23.1 with ExpressSQL, a pd-balanced disk, 4 vCPUs, 16 GB memory.request; retrieve_every_second
1000 times in parallel = 1 min 57 secreset(restart=false); request; retrieve_every_second
1000 times in parallel = 2 min 11 secThe slow down is 9%.
Conditionally Reset Synchronously
Dmitry Philimonov suggested that we use
WorkToDoTimeout
to prevent the constant polling. And since we don’t reset and retrieve back to back, we avoid the race condition that was found in Venafi#273.This solution will have to be implemented outside of RetrieveCertificate and RequestCertificate, which means users of the Go SDK will need to change the way they are using the library.
For example: