Race Condition - PVs don't get reused when starting new node #1797

Open
koreyGambill opened this issue Oct 24, 2023 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@koreyGambill

/kind bug

What happened?

I'm running into an issue where existing PVs are only re-used if a node is available at time of PVC creation.

When I scale up the pods, they create new PVCs right away (while the pod moves into ContainerCreating). If there are nodes available, an existing PV is bound to the PVC right away. If there are no nodes available, the PVC stays Pending, and as soon as a new node moves into a Ready state, a new PV is provisioned and bound, even though there are 100+ existing PVs that meet the requirements. If I then schedule another pod on that new node, an existing PV is used for it and for subsequent pods on the node. It's worth noting that I am scaling nodes with Karpenter and have locked it down to a single availability zone, so all PVs are in a single zone.

I've ended up with hundreds of PVs for something that dynamically scales between 0 and 6 pods. This is an actions-runner from the actions-runner-controller, used to run GitHub Actions on EKS.

Additional Testing

I deleted all the PVs.
Then, in a single AZ, I created 60 pods, which created 60 PVs.
Then I scaled to 0, waited a while and made sure everything was available, then scaled to 60. This created 28 more PVs for a total of 88. The rest of the PVCs were bound to existing PVs.
Then I did it again, and this time it created 25 more PVs for a total of 113. It created fewer this time because there were some 2xl nodes, which allowed more pods to join each node.
It seems that the first pod to join a node creates a new PV, while the second (and sometimes third) pod to join uses an existing PV.
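
To make counts like these easier to verify between scale-ups, the PVs can be tallied by phase. The following is a minimal Python sketch and was not part of the original test: the function name `count_pvs_by_phase` and the sample data are hypothetical, and it assumes input shaped like the output of `kubectl get pv -o json` (a list object whose items carry `spec.storageClassName` and `status.phase`).

```python
import json
from collections import Counter


def count_pvs_by_phase(pv_list: dict, storage_class: str) -> Counter:
    """Tally PVs of one StorageClass by phase (Available, Bound, Released, ...)."""
    counts = Counter()
    for pv in pv_list.get("items", []):
        if pv.get("spec", {}).get("storageClassName") != storage_class:
            continue  # ignore PVs that belong to other storage classes
        counts[pv.get("status", {}).get("phase", "Unknown")] += 1
    return counts


# Hypothetical sample in the shape of `kubectl get pv -o json`.
SAMPLE = json.loads("""
{"items": [
  {"spec": {"storageClassName": "arc-cache-infra-tests"}, "status": {"phase": "Available"}},
  {"spec": {"storageClassName": "arc-cache-infra-tests"}, "status": {"phase": "Available"}},
  {"spec": {"storageClassName": "arc-cache-infra-tests"}, "status": {"phase": "Bound"}},
  {"spec": {"storageClassName": "gp2"}, "status": {"phase": "Bound"}}
]}
""")

if __name__ == "__main__":
    print(dict(count_pvs_by_phase(SAMPLE, "arc-cache-infra-tests")))
```

Against a real cluster, the input could come from `json.loads(subprocess.check_output(["kubectl", "get", "pv", "-o", "json"]))`. A total that keeps growing while many volumes remain `Available` after each scale-up would match the behavior described above.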

Relevant Logs

The only logs the csi-controller produces are:

I1024 14:16:00.140116       1 cloud.go:713] "Waiting for volume state" volumeID="vol-07da60cb4e75fa23b" actual="attaching" desired="attached"
I1024 14:16:45.736756       1 cloud.go:713] "Waiting for volume state" volumeID="vol-017224c77fe3e01f6" actual="attaching" desired="attached"
And the ebs-csi-node pod that comes up in response to the new node shows:
Defaulted container "ebs-plugin" out of: ebs-plugin, node-driver-registrar, liveness-probe
I1024 14:16:39.952320       1 driver.go:75] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.19.0"
I1024 14:16:39.952362       1 node.go:85] "regionFromSession Node service" region=""
I1024 14:16:39.952371       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I1024 14:16:39.953346       1 metadata.go:92] "ec2 metadata is available"
I1024 14:16:39.953741       1 metadata_ec2.go:25] "Retrieving EC2 instance identity metadata" regionFromSession=""
I1024 14:16:49.743118       1 mount_linux.go:517] Disk "/dev/nvme1n1" appears to be unformatted, attempting to format as type: "ext4" with options: [-F -m0 /dev/nvme1n1]
I1024 14:16:50.081231       1 mount_linux.go:528] Disk successfully formatted (mkfs): ext4 - /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/ad9bcd0a40bcd21382425af4ee754c0bd51e9e1a07000680a9e75a86ab0bb7d5/globalmount
I1024 14:16:50.081317       1 mount_linux.go:245] Detected OS without systemd

These seem to pertain to the root volume provisioning (which is working well), but I'm concerned about the mounted volume:

          volumeMounts:
            - name: var-lib-docker
              mountPath: /var/lib/docker
...
  volumeClaimTemplates:
    - metadata:
        name: var-lib-docker
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 22Gi
        storageClassName: arc-cache-infra-tests

which uses the storage class

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: arc-cache-infra-tests
  labels:
    content: arc-cache-infra-tests
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Expected Behavior

PVs should be reused when a new node starts. New PVs should only be created when existing PVs are unavailable.
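
The expectation can be stated as a selection rule. The sketch below is a simplified Python model for illustration only; it is not the code of the Kubernetes scheduler, the PV controller, or the EBS CSI driver. The `PV` and `PVC` classes, the `find_reusable_pv` function, and the sample volume names are all hypothetical, and the matching criteria (phase, storage class, capacity, access modes, zone) and the smallest-fit tie-break are assumptions about what "meets the requirements" means here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PV:
    name: str
    storage_class: str
    capacity_gi: int
    access_modes: frozenset
    zone: str
    phase: str = "Available"


@dataclass(frozen=True)
class PVC:
    storage_class: str
    request_gi: int
    access_modes: frozenset


def find_reusable_pv(pvc: PVC, node_zone: str, pvs: Iterable[PV]) -> Optional[PV]:
    """Return the smallest Available PV that satisfies the claim, or None."""
    candidates = [
        pv for pv in pvs
        if pv.phase == "Available"
        and pv.storage_class == pvc.storage_class
        and pv.capacity_gi >= pvc.request_gi
        and pvc.access_modes <= pv.access_modes  # claim's modes are a subset
        and pv.zone == node_zone                 # EBS volumes are zonal
    ]
    return min(candidates, key=lambda pv: pv.capacity_gi, default=None)


# Hypothetical data mirroring the 22Gi ReadWriteOnce claim in this issue.
RWO = frozenset({"ReadWriteOnce"})
claim = PVC("arc-cache-infra-tests", 22, RWO)
pool = [
    PV("pv-bound", "arc-cache-infra-tests", 22, RWO, "us-west-2a", phase="Bound"),
    PV("pv-free", "arc-cache-infra-tests", 22, RWO, "us-west-2a"),
]

match = find_reusable_pv(claim, "us-west-2a", pool)
print(match.name if match else "provision a new volume")
```

Under this model, provisioning a new volume would be the fallback only when the function returns `None`, which is the behavior this issue asks for.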

Reproduction Steps

How to reproduce it (as minimally and precisely as possible)?
You can use the actions-runners, but I have also simulated this with StatefulSets to make it easier to reproduce.

# StorageClass yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: arc-cache-infra-tests
  labels:
    content: arc-cache-infra-tests
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StatefulSet yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: busybox-statefulset
  namespace: actions-runner-system
spec:
  serviceName: "busybox"
  replicas: 20
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      serviceAccountName: runner-sa
      tolerations:
        - key: purpose
          operator: Equal
          value: github-runner
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: purpose
                    operator: In
                    values:
                      - github-runner
      containers:
      - name: busybox
        image: busybox
        command: ["tail", "-f", "/dev/null"]
        resources:
          requests:
            cpu: "1500m"
            memory: "1500Mi"
          limits:
            cpu: "1500m"
            memory: "1500Mi"
        volumeMounts:
        - name: var-lib-docker
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 22Gi
      storageClassName: arc-cache-infra-tests
# Karpenter Provisioner and AWSNodeTemplate
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: github-runner-testing-cpu-75-c7a
spec:
  weight: 50
  limits:
    resources:
      cpu: '300'
  providerRef:
    name: github-runner-75
  consolidation:
    enabled: false
  ttlSecondsUntilExpired: 600  #  10 mins
  ttlSecondsAfterEmpty: 600  #  10 mins
  taints:
    - key: purpose
      value: github-runner
      effect: NoSchedule
  labels:
    scheduler: karpenter
    purpose: github-runner
    constraint: cpu  # cpu or memory
    size: large
    lifecycle: ephemeral  # ephemeral or persistent
    usage: testing
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c7a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [xlarge]
    - key: topology.kubernetes.io/zone
      operator: In
      values: [us-west-2a]
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: github-runner-75
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 75Gi
        volumeType: gp3
        encrypted: true
  subnetSelector:
    karpenter.sh/discovery: primary-cluster
  securityGroupSelector:
    karpenter.sh/discovery: primary-cluster
  instanceProfile: github-instance-profile
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: optional

Environment
AWS EKS

  • Kubernetes version (use kubectl version):
    Client Version: v1.28.1
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.27.4-eks-2d98532
  • Driver version: 1.19
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 24, 2023
@danavatavu

I see the same issue for the following environment:
Environment
AWS EKS

Kubernetes version (use kubectl version):
Client Version: v1.26.11
Kustomize Version: v4.5.7
Server Version: v1.25.16-eks-8cb36c9
Driver version: 2.22.0

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 14, 2024
@AndrewSirenko
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 8, 2024