
PVCs do not get assigned to correct AZ as where the pod needs to be scheduled #49906

Closed
d-shi opened this issue Jul 31, 2017 · 21 comments
Assignees
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@d-shi

d-shi commented Jul 31, 2017

/kind bug

What happened: I am trying to deploy a stateful set with PVCs to a set of 5 instances spread across 2 AZs. The stateful set has 5 pods, and I have set anti affinity so that there will be 1 pod per instance. The instances came up in AZs a-b-a-b-a. When I deploy the stateful set, their PVCs come up in AZs b-a-b-a-b. Thus the last pod in the set cannot be scheduled due to NoVolumeZoneConflict.

What you expected to happen: The PVCs should be assigned to the same AZ as the instance where the pod that requires it gets scheduled. Is there any way for me to make this happen?

How to reproduce it (as minimally and precisely as possible):
Deploy a stateful set of 5 replicas with PVCs and anti pod affinity to a cluster with 5 nodes on 2 AZs:

spec:
  replicas: 5
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - foo
            topologyKey: kubernetes.io/hostname
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 20Gi

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.1", GitCommit:"1dc5c66f5dd61da08412a74221ecc79208c2165b", GitTreeState:"clean", BuildDate:"2017-07-14T02:00:46Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a):
Linux 4.4.65-k8s x86_64 GNU/Linux
  • Install tools: custom
  • Others:
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 31, 2017
@d-shi
Author

d-shi commented Jul 31, 2017

/sig aws

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 31, 2017
@msau42
Member

msau42 commented Aug 2, 2017

PVC binding is currently done independently of the pod scheduler, and is therefore not integrated with pod anti-affinity policies. I am working on making PVC binding more integrated with pod scheduling, but it's a big design change and will take a few releases to be fully functional.

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Aug 2, 2017
@msau42
Member

msau42 commented Aug 2, 2017

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2018
@msau42
Member

msau42 commented Jan 11, 2018

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 11, 2018
@toidiu

toidiu commented Feb 2, 2018

Is there a workaround until this is addressed?

Currently what I am thinking of doing is attaching a label to each node based on its AZ and then putting a node selector policy on the pod. But this seems like an awful workaround to me, since Kubernetes is supposed to abstract away the underlying cloud provider.
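The node-selector workaround described above might be sketched like this (assumed values: `us-east-1a` is an example zone, and `failure-domain.beta.kubernetes.io/zone` was the well-known zone label on clusters of this era):

```yaml
# Sketch of the workaround discussed above: pin pods to one AZ via a node label.
# The zone value "us-east-1a" is an example, not taken from this issue.
spec:
  template:
    spec:
      nodeSelector:
        failure-domain.beta.kubernetes.io/zone: us-east-1a
```

Note this forces pods and their volumes into a single zone, so it trades away zone spreading.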

@msau42
Member

msau42 commented Feb 2, 2018

Actually, looking at the OP's example again, I'm not clear on why it failed. Each pod in the StatefulSet should have followed the zone where its PV was provisioned. The only time it would fail scheduling is if you run out of resources in the zones where the PVs were provisioned.

@toidiu are you seeing the same issue as the OP? Can you paste your example here?

@msau42
Member

msau42 commented Feb 2, 2018

Oh nm I see the problem. The OP's nodes came up a-b-a-b-a, but the volumes were provisioned as b-a-b-a-b.

The only workarounds I see are to:

  • Overprovision your nodes by one
  • Use an even number of replicas
  • Don't require hard spreading across nodes. The scheduler already tries to prefer spreading across nodes even when you don't specify anti-affinity.

@toidiu, I'm not sure I see how a node selector would help, unless you break up your StatefulSet to one per zone.

@toidiu

toidiu commented Feb 2, 2018

@msau42 I was seeing problems when working with only a handful of nodes. Once I over-provisioned this seems to be less of a problem.

I believe this is still an issue if not enough memory/CPU is available in the AZ the PVC is attached to. Kube will not automatically move pods to another node (nor should it), and therefore you might need to manually delete pods in a particular AZ.

So the problem is that one still needs to think about what the underlying cloud provider is doing, which Kube should be able to abstract.

@msau42
Member

msau42 commented Feb 3, 2018

Exactly, it's mostly an issue when you are tight on resources. Unfortunately there are not really any good workarounds today.

I'm hoping to get PV scheduling to beta in 1.10. But it will still be at least another release before dynamic provisioning support will be added. Then, we can migrate existing zonal PVs to use it and solve this problem.

Ref kubernetes/enhancements#490

@nrmitchi

Just following up here, I'm having a similar problem that is not resource based, but still the same root problem, and may be a helpful consideration.

I have a StatefulSet deploying across 3 AZs (using node affinity on instance group labels, and an anti-affinity to prevent colocating). The StatefulSet is limited to only those 3 AZs (a/b/c), despite the cluster operating across more (a/b/c/d/e).

When deployed, I'm seeing the volume created in zone e, which does not work with the hard node affinity.

@toidiu

toidiu commented Mar 13, 2018

@nrmitchi are you using a custom storage class to restrict your volumes to only those AZs? https://kubernetes.io/docs/concepts/storage/storage-classes/#aws
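A storage class restricted to specific AZs (per the doc linked above) might look like the following sketch; the class name and zone list are example values:

```yaml
# Sketch, assuming example zones; restricts where aws-ebs volumes are provisioned.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-restricted   # example name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zones: us-east-1a, us-east-1b, us-east-1c   # only provision volumes in these AZs
```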

@nrmitchi

I was not; I had looked at it but dismissed it because I thought it would only restrict to a single AZ (which appears to have been incorrect).

Thanks!

@msau42
Member

msau42 commented Mar 27, 2018

Design proposal for integrating dynamic provisioning into the scheduler is here: kubernetes/community#1857

@msau42
Member

msau42 commented Sep 21, 2018

Topology-aware dynamic provisioning for the gce-pd, aws-ebs and azure disks is available as beta in 1.12

/close

@k8s-ci-robot
Contributor

@msau42: Closing this issue.

In response to this:

Topology-aware dynamic provisioning for the gce-pd, aws-ebs and azure disks is available as beta in 1.12

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ghost

ghost commented Sep 18, 2019

I am facing the same issue again in Kubernetes version 1.13.

@Sewci0

Sewci0 commented Sep 19, 2019

I am hitting the same issue, both in Kubernetes 1.13 and 1.14.

@msau42
Member

msau42 commented Sep 19, 2019

Please try https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode

@ghost

ghost commented Sep 20, 2019

With the default storage class, volumeBindingMode is set to Immediate, so the PVC is provisioned without any knowledge of the pod. Set volumeBindingMode: WaitForFirstConsumer in your storage class and it will work fine.
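A minimal sketch of such a storage class (the name ebs-wait is an example):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-wait   # example name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer   # delay binding until a pod is scheduled
```

Pods would then reference it via storageClassName: ebs-wait in their volumeClaimTemplates, so the volume is provisioned in whatever zone the scheduler picks for the pod.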

@Sewci0

Sewci0 commented Sep 20, 2019

@msau42 @vbasavani That seems to have fixed it. Thanks.
