Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset before request in order to work around VCert's "unmatched key modulus" #3

Merged
merged 6 commits into from
Jan 27, 2023

Conversation

maelvls
Copy link
Member

@maelvls maelvls commented Jan 20, 2023

⚠️ THIS PR WAS NOT MERGED UPSTREAM YET ⚠️

We found out that Venafi#273 was causing venafi-enhanced-issuer to show the error:

vcert error: your data contains problems: request doesn't match certificate: unmatched key modulus

We found that it was caused by running request and reset(restart=true) back to back.

To work around this issue, we decided to unconditionally reset any prior failed enrollment with reset(restart=false) so that we don't get stuck with "Fix any errors, and then click Retry." (60% of the time) or "WebSDK CertRequest" (40% of the time), and to revert the changes made in Venafi#269.

For context, we compared two approaches: (1) unconditionally reset, and (2) conditionally reset synchronously.

Unconditionally Reset (this PR)

We will go with this solution as an "intermediate" solution.

Instead of conditionally resetting, we always reset. Since we don’t reset and retrieve back to back, we avoid the race condition that was found in Venafi#273.

reset(restart=false)
request
loop {
    retrieve
    if err, return the error.
    if 200, return the cert.
    if 202, continue.
}

I stress-tested TPP to calculate the impact of reset(restart=false) under load. The VM was on GCP, I used TPP 23.1 with ExpressSQL, a pd-balanced disk, 4 vCPUs, 16 GB memory.

  • Scenario 1: I ran request; retrieve_every_second 1000 times in parallel = 1 min 57 sec
  • Scenario 2: I ran reset(restart=false); request; retrieve_every_second 1000 times in parallel = 2 min 11 sec

The slow down is 9%.

Conditionally Reset Synchronously

Dmitry Philimonov suggested that we use WorkToDoTimeout to prevent the constant polling. And since we don’t reset and retrieve back to back, we avoid the race condition that was found in Venafi#273.

loop {
    request(work_todo_timeout=60)
    if 500 and "Fix any errors, and then click Retry" {
        reset(restart=false)
        continue
    }
    if other error, return error.
    if 200, return cert.
}

This solution will have to be implemented outside of RetrieveCertificate and RequestCertificate, which means users of the Go SDK will need to change the way they are using the library.

For example:

for {
    cert, err := connector.RequestCertificate(vcert.Request{WorkToDoTimeout: 60 * time.Second})
    if vcert.ErrIsClickToRetry(err) {
        err = connector.ResetCertificate(vcert.Request{WorkToDoTimeout: 0, Restart: false})
        if err != nil {
            return err
        }
        continue
    }
    if err != nil {
        return err
    }
    return cert
}

…if it exists) while retrieving the certificate"
@maelvls maelvls force-pushed the reset-before-request branch from fcfb524 to 67c6b47 Compare January 24, 2023 14:02
In Venafi#269, I got convinced that it
would be a good solution to call "Reset" only if the "Retrieve" call
returned a known message.

Later on, we realized that there was a bad interaction between "Request"
and "Reset(restart=true)". For some reason, when a problem arises (such
as CA being down), TPP returns the old certificate, and vcert ends up
showing the message "unmatched key modulus".

We realized that calling "Reset(restart=false)" before Request prevents
this bug. Although that's one extra HTTP call, it seems this call is
very inexpensive. One downside that was brought up during the PR Venafi#269
was that any extra HTTP call would slow the TPP server because the HTTP
called are "queued" (not concurrently processed).
@maelvls maelvls force-pushed the reset-before-request branch from 67c6b47 to fceb463 Compare January 24, 2023 17:11
pkg/venafi/tpp/connector.go Outdated Show resolved Hide resolved
pkg/venafi/tpp/tpp.go Outdated Show resolved Hide resolved
Copy link

@inteon inteon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some very minor remarks.
The best way to test this is using integration tests, which are not available on this fork.
So, I would like to merge this PR without adding tests to fix this issue quickly, we did manually test the fix.
Once the remarks are addressed, it think it is ready to be merged.

@maelvls
Copy link
Member Author

maelvls commented Jan 26, 2023

I added unit tests by mocking the HTTP calls. The reason I wanted to add these tests is because it is the only way to "review" what the error messages look like. I think it is useful to be able to review the error messages since it is a UI element.

I also slightly changed the reset func that you had written. The "not found" check was a bit weird, so I decided to change it so that it returns an error that we can errors.As.

@maelvls maelvls marked this pull request as ready for review January 26, 2023 18:14
@maelvls maelvls changed the title Reset before request Reset before request in order to work around VCert's "unmatched key modulus" Jan 27, 2023
@inteon
Copy link

inteon commented Jan 27, 2023

@maelvls All of my feedback has been resolved, feel free to merge this PR.

@irbekrm
Copy link

irbekrm commented May 22, 2023

Note: the issue addressed by this PR has not been fixed in v4.24.0 Venafi#273 (comment) so this fork is still used by cert-manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants