Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use new client token when CreateVolume returns IdempotentParameterMismatch #2075

Conversation

ConnorJC3
Copy link
Contributor

@ConnorJC3 ConnorJC3 commented Jun 25, 2024

Is this a bug fix or adding new feature?

Fixes #1951

What is this PR about? / Why do we need it?

If CreateVolume fails for any reason (examples: the user provides an invalid KMS key, due to an EBS-side issue, etc), we retry with the same client token. However, despite the fact that the CreateVolume call corresponds to a volume that was never created, our client token is burned.

In the scenario where we detect this has occurred, this PR tries again with a different token. However, to prevent volume leaks, the token must follow a predictable pattern. To do this, I append -2 to the volume id pre-hashing for the token, and if that request fails, I instead append -3, -4, etc. This is done strictly in order, so a CSI driver that crashes (or restarts for any other reason, such as an upgrade) can consistently reuse the same tokens until it reaches the 'correct' token.

To keep track of which token we have most recently used during runtime, I use an expiring cache, similar to the existing "likely bad names" cache implementation. In order to DRY up the code, the first two commits of this PR migrate that implementation to the util package (commit 1) and then migrate the existing likelyBadNames implementation to use the util version (commit 2). Finally, this PR implements the above described change using this cache (commit 3).

What testing is done?

Added/updated unit tests, CI, manual

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 25, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 25, 2024
@ConnorJC3 ConnorJC3 force-pushed the fix-createvolume-idempotency-race-condition branch from dadb2ef to 0aaf246 Compare June 25, 2024 20:01
Copy link

github-actions bot commented Jun 25, 2024

Code Coverage Diff

File Old Coverage New Coverage Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/cloud.go 84.7% 85.4% 0.7
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/expiringcache/expiring_cache.go Does not exist 100.0%

@ConnorJC3 ConnorJC3 force-pushed the fix-createvolume-idempotency-race-condition branch 2 times, most recently from 8725a56 to f0bb44e Compare June 25, 2024 22:05
@ConnorJC3 ConnorJC3 changed the title [WIP] Use new client token when CreateVolume returns IdempotentParameterMismatch Use new client token when CreateVolume returns IdempotentParameterMismatch Jun 25, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2024
@ConnorJC3
Copy link
Contributor Author

/retest

1 similar comment
@ConnorJC3
Copy link
Contributor Author

/retest

pkg/cloud/cloud.go Outdated Show resolved Hide resolved
pkg/cloud/cloud.go Show resolved Hide resolved
pkg/cloud/cloud.go Show resolved Hide resolved
pkg/util/expiring_cache.go Outdated Show resolved Hide resolved
pkg/util/expiring_cache.go Outdated Show resolved Hide resolved
pkg/util/expiring_cache.go Outdated Show resolved Hide resolved
pkg/cloud/devicemanager/allocator.go Outdated Show resolved Hide resolved
pkg/util/expiring_cache.go Outdated Show resolved Hide resolved
@ConnorJC3 ConnorJC3 force-pushed the fix-createvolume-idempotency-race-condition branch 2 times, most recently from 1395040 to c691175 Compare July 2, 2024 18:08
Signed-off-by: Connor Catlett <conncatl@amazon.com>
Signed-off-by: Connor Catlett <conncatl@amazon.com>
…match

Signed-off-by: Connor Catlett <conncatl@amazon.com>
@ConnorJC3 ConnorJC3 force-pushed the fix-createvolume-idempotency-race-condition branch from c691175 to 0a0d5ef Compare July 3, 2024 15:21
Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 5, 2024
@AndrewSirenko
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit c257e5b into kubernetes-sigs:master Jul 8, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible race condition in EBS volume creation
5 participants