Volume mount affinity issues: Volume in different zone than server #735

Closed
aktech opened this issue Jul 14, 2021 · 3 comments
Labels
type: bug 🐛 Something isn't working

Comments

@aktech
Member

aktech commented Jul 14, 2021

This is the case with AWS. Imagine the following situation:

Nodes

We have two nodes in the following zones:
general - us-east-2a
user_worker - us-east-2b
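For reference, the scheduler sees this zone placement through the standard topology label on each node, roughly like this (node names are illustrative):

```yaml
# Sketch of the zone labels on the two nodes; node names are illustrative.
# Older clusters expose the same information under failure-domain.beta.kubernetes.io/zone.
apiVersion: v1
kind: Node
metadata:
  name: general-node                # illustrative
  labels:
    topology.kubernetes.io/zone: us-east-2a
---
apiVersion: v1
kind: Node
metadata:
  name: user-worker-node            # illustrative
  labels:
    topology.kubernetes.io/zone: us-east-2b
```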

Volume

Now imagine that the volume mounts get provisioned in us-east-2b.
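EBS volumes are zonal, so a PersistentVolume provisioned in us-east-2b carries node affinity that pins any consumer to that zone. A minimal sketch of what such a PV looks like (name, size, and volume ID are placeholders, not taken from the cluster):

```yaml
# Illustrative EBS-backed PV as provisioned in us-east-2b; the nodeAffinity
# block is what creates the scheduling conflict described below.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: conda-store-pv                       # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0          # placeholder
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-2b                 # volume can only be attached in this zone
```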

Server

The conda-store server is set to be deployed on the general node:

https://github.com/Quansight/qhub/blob/2dd321a0a5c56672398df734e4b63dc8da053e3c/qhub/template/%7B%7B%20cookiecutter.repo_directory%20%7D%7D/infrastructure/kubernetes.tf#L87-L115
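The pinning in the linked Terraform renders, roughly, into a pod spec like the one below. The label key, image, and claim name are assumptions for illustration; the real selector lives in kubernetes.tf.

```yaml
# Rough, illustrative equivalent of the node pinning from the linked Terraform;
# not the exact rendered manifest.
apiVersion: v1
kind: Pod
metadata:
  name: qhub-conda-store                     # illustrative
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: general     # assumed label key; the actual key may differ
  containers:
    - name: conda-store-server
      image: quansight/conda-store-server    # illustrative
      volumeMounts:
        - name: storage
          mountPath: /opt/conda-store
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: conda-store-pvc           # hypothetical claim bound to the us-east-2b PV
```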

In this case conda-store-server gets stuck, because:

  • general (us-east-2a): It cannot be deployed on the general node because the volume mounts are in us-east-2b.
  • user_worker (us-east-2b): It cannot be deployed on the user_worker node because its affinity is set to the general node.

Hence the following events (for the dev/qhub-conda-store pod):

Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  96s (x208 over 46m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 node(s) didn't match node selector
  Warning  FailedScheduling   36s (x34 over 46m)   default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.

Same with the hub pod:

Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   48m (x2 over 48m)      default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp  27m (x26 over 47m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 node(s) had no available volume zone
  Normal   NotTriggerScaleUp  2m48s (x219 over 47m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 node(s) didn't match node selector
  Warning  FailedScheduling   108s (x32 over 47m)    default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.
@dharhas
Member

dharhas commented Jul 14, 2021

Meta question: why are we deploying things in multiple zones?

@tylerpotts
Contributor

@dharhas AWS requires the EKS cluster to span at least 2 availability zones, e.g. us-east-1a and us-east-1b.
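For illustration, this multi-AZ requirement looks roughly like the eksctl-style config below; QHub itself drives cluster creation through Terraform, so this is only a sketch of the constraint, not QHub's actual configuration.

```yaml
# Illustrative eksctl-style config: EKS needs subnets in at least two
# availability zones. QHub configures this via Terraform, not eksctl.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: qhub-dev                 # hypothetical cluster name
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
```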

@aktech
Member Author

aktech commented Jul 28, 2021

This should be fixed by #740; we can reopen if we see this again. For now I am closing this.
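For context, a common mitigation for this class of problem (not necessarily what #740 implements) is a StorageClass that delays volume binding until the consuming pod is scheduled, and optionally restricts provisioning to a single zone:

```yaml
# Common mitigation sketch, not necessarily the fix in #740: with
# WaitForFirstConsumer the EBS volume is created in the zone of the node the
# pod lands on; allowedTopologies can additionally pin new volumes to one zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: single-zone-gp2                      # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-2a                       # zone of the general node
```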

@aktech aktech closed this as completed Jul 28, 2021