
Possible race condition in EBS volume creation #1951

Closed
Sinjo opened this issue Feb 29, 2024 · 4 comments · Fixed by #2075
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Sinjo commented Feb 29, 2024

Hi 👋🏻

I think I've found a race condition that happens very occasionally when creating a PV for a new PVC. The outcome of the race condition is a PVC that is permanently stuck in Pending and has to be deleted and recreated.

As a ballpark estimate, based on the number of CreateVolume calls per day in CloudTrail and how frequently we see this issue, I would guess that this happens on the order of 1 in 100,000 volume creations.

I've substituted the actual PV and EBS volume names below with placeholders like pvc-foo for ease of reading. The identifier of each PVC corresponds directly to the identifier of the EBS volume (i.e. PV pvc-foo corresponds to EBS volume vol-foo).

Summary

Based on CloudTrail events, it seems that aws-ebs-csi-driver sometimes creates and then shortly after deletes an EBS volume to provide a PV for a new PVC.

Once that has happened, the controller repeatedly retries the creation of the EBS volume, but it always fails because of how the API request to AWS is formed. Specifically:

  • The name of the PV is generated from the UID of the PVC.
  • The clientToken, which is used as an idempotency key in the request to the AWS API, is generated from the name of the PV.
  • This means that the clientToken will always be the same for a specific PVC, so the retries will fail indefinitely (see the sketch below).
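
As a minimal sketch of why those retries can never succeed (illustrative Go only; the SHA-256 scheme and names below are assumptions, not the driver's actual code): a token derived deterministically from the CSI volume name is identical on every retry, so once vol-foo has been deleted, EC2 keeps rejecting the same token instead of creating a replacement volume.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// clientToken derives a deterministic idempotency token from the CSI volume
// name (which the external-provisioner generates from the PVC UID). The exact
// hashing scheme is assumed here purely for illustration.
func clientToken(volumeName string) string {
	sum := sha256.Sum256([]byte(volumeName))
	return hex.EncodeToString(sum[:])
}

func main() {
	// The first CreateVolume call and every later retry for the same PVC
	// send an identical clientToken to the EC2 API...
	first := clientToken("pvc-foo")
	retry := clientToken("pvc-foo")
	fmt.Println(first == retry) // true

	// ...so after vol-foo is deleted, EC2 answers every retry with
	// Client.IdempotentParameterMismatch and the PVC stays Pending.
}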

From the driver side, every retry results in the following error:

E0228 05:00:15.227309       1 driver.go:155] "GRPC error" err="rpc error: code = AlreadyExists desc = Could not create volume \"pvc-foo\": Parameters on this idempotent request are inconsistent with parameters used in previous request(s)"

and in the CloudTrail event, the error message is:

The client token you have provided is associated with a resource that is already deleted. Please use a different client token.

All of the CloudTrail events have the same userIdentity (the service account assumed by the driver) and source IP address (the driver's pod IP).

Possible EBS volume ID confusion

One really weird thing I saw in the driver logs was this:

E0228 05:00:14.031881       1 driver.go:155] "GRPC error" err=<
        rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-bar' does not exist.
                status code: 400, request id: [redacted]
 >

It seems to suggest that the driver was looking for an EBS volume with the ID vol-bar when creating the PV pvc-foo. That other volume belonged to a PV that was created about 10 minutes earlier and deleted while pvc-foo was in the process of being provisioned (see timeline below).

In our application, due to backup schedules, we do a fair amount of disk creation and deletion on the hour mark. I'm wondering if the increased load at those times is enough to trigger a race condition inside the driver.
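
Purely as a hypothetical illustration of the kind of race that could explain this (simplified Go, not the driver's actual code): if concurrent CreateVolume calls share one "most recently created volume ID" instead of tracking the ID per request, a creation can end up waiting on another request's volume and will see InvalidVolume.NotFound once that other volume is deleted.

package main

import (
	"fmt"
	"sync"
)

// lastVolumeID is process-wide state shared by all in-flight creations.
// That sharing is the hypothetical bug being illustrated, not real code
// from aws-ebs-csi-driver.
var lastVolumeID string

func createVolume(pvcName, volID string, wg *sync.WaitGroup) {
	defer wg.Done()
	lastVolumeID = volID // racy write: the other goroutine may overwrite it

	// The caller then "waits for its volume" via the shared variable rather
	// than its own volID, so under concurrency it can poll a volume that
	// belongs to a different PVC, or one that has just been deleted.
	fmt.Printf("PVC %s polling %s\n", pvcName, lastVolumeID)
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go createVolume("pvc-foo", "vol-foo", &wg)
	go createVolume("pvc-bar", "vol-bar", &wg)
	wg.Wait()
	// Depending on scheduling, pvc-foo can end up "polling vol-bar",
	// mirroring the "volume 'vol-bar' does not exist" error above.
}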

Timeline

All times in UTC

04:50:03 - CloudTrail - EBS volume vol-bar created for pvc-bar
05:00:09 - CloudTrail - EBS volume vol-foo created for pvc-foo
05:00:13 - CloudTrail - EBS volume vol-bar deleted
05:00:14 - aws-ebs-csi-driver - Could not create PV pvc-foo because it couldn't find EBS volume vol-bar in AWS (see above section "Possible EBS volume ID confusion")
05:00:14 - CloudTrail - EBS volume vol-foo deleted
05:00:15 - CloudTrail - EBS volume vol-foo creation attempted again, but fails with Client.IdempotentParameterMismatch (this repeats indefinitely)

Note: I can't be sure of the order of events between aws-ebs-csi-driver and CloudTrail. In particular, the two events at 05:00:14 may have happened in a different order.

Software versions

Kubernetes: v1.26.12-eks-5e0fdde
aws-ebs-csi-driver: v1.28.0

Potentially related issues

I don't think any of these are the same issue, as they affect every PVC created by the people reporting them. They seem more like config or compatibility issues on their end, whereas this seems more like a race condition.

More info

I'm happy to get any more info you need to debug this. I think CloudTrail events stick around for 90 days and our own logs should stick around for a decent amount of time too.

/kind bug

@k8s-ci-robot added the kind/bug label Feb 29, 2024

Sinjo commented Mar 4, 2024

So I just got another two of these today, right at the same time and in the same Kubernetes cluster. What I found really interesting was that the logs showed them both pointing to a single EBS volume that was created for a different PVC:

E0304 11:46:12.516570       1 driver.go:155] "GRPC error" err=<
        rpc error: code = Internal desc = Could not create volume "pvc-foo": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
                status code: 400, request id: [redacted]
 >
E0304 11:46:12.525748       1 driver.go:155] "GRPC error" err=<
        rpc error: code = Internal desc = Could not create volume "pvc-bar": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-baz' does not exist.
                status code: 400, request id: [redacted]
 >

CloudTrail shows the EBS volume for that PVC being deleted at 11:46:11, which fits the pattern from last time.


ConnorJC3 commented Mar 15, 2024

Hi, thanks for the great writeup - we've been vaguely aware of this issue for some time and it's on our backlog to fix. In the meantime, deleting and recreating the PVC/VolumeSnapshot object that is affected should act as a workaround for the rare times you hit this issue.

@vivek-siva

For us it's happening all the time (maybe 95/100); any updates would be appreciated.


ConnorJC3 commented Apr 23, 2024

For us it's happening all the time (maybe 95/100); any updates would be appreciated.

We're working on a fix for this issue. However, that failure rate is highly unusual and likely indicates your volume(s) are failing to create because of invalid parameters (for example, an invalid KMS key or one that the driver doesn't have access to).
