Volume mount affinity issues: Volume in different zone than server #735

Closed
aktech opened this issue Jul 14, 2021 · 3 comments
Labels
type: bug 🐛 Something isn't working

Comments

@aktech
Member

aktech commented Jul 14, 2021

This is the case with AWS. Imagine the following situation:

Nodes

We have two nodes in the following zones:
general - us-east-2a
user_worker - us-east-2b
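For reference, the scheduler sees this zone placement through the standard topology label on each node, roughly like this (node names are illustrative):

```yaml
# Sketch of the zone labels on the two nodes; node names are illustrative.
# Older clusters expose the same information under failure-domain.beta.kubernetes.io/zone.
apiVersion: v1
kind: Node
metadata:
  name: general-node                # illustrative
  labels:
    topology.kubernetes.io/zone: us-east-2a
---
apiVersion: v1
kind: Node
metadata:
  name: user-worker-node            # illustrative
  labels:
    topology.kubernetes.io/zone: us-east-2b
```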

Volume

Now imagine that the volume mounts get provisioned in us-east-2b.
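EBS volumes are zonal, so a PersistentVolume provisioned in us-east-2b carries node affinity that pins any consumer to that zone. A minimal sketch of what such a PV looks like (name, size, and volume ID are placeholders, not taken from the cluster):

```yaml
# Illustrative EBS-backed PV as provisioned in us-east-2b; the nodeAffinity
# block is what creates the scheduling conflict described below.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: conda-store-pv                       # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0          # placeholder
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-2b                 # volume can only be attached in this zone
```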

Server

The conda-store server is set to be deployed on the general node:

https://github.com/Quansight/qhub/blob/2dd321a0a5c56672398df734e4b63dc8da053e3c/qhub/template/%7B%7B%20cookiecutter.repo_directory%20%7D%7D/infrastructure/kubernetes.tf#L87-L115
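The pinning in the linked Terraform renders, roughly, into a pod spec like the one below. The label key, image, and claim name are assumptions for illustration; the real selector lives in kubernetes.tf.

```yaml
# Rough, illustrative equivalent of the node pinning from the linked Terraform;
# not the exact rendered manifest.
apiVersion: v1
kind: Pod
metadata:
  name: qhub-conda-store                     # illustrative
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: general     # assumed label key; the actual key may differ
  containers:
    - name: conda-store-server
      image: quansight/conda-store-server    # illustrative
      volumeMounts:
        - name: storage
          mountPath: /opt/conda-store
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: conda-store-pvc           # hypothetical claim bound to the us-east-2b PV
```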

In this case conda-store-server gets stuck, because:

  • general (us-east-2a): It cannot be deployed on the general node because the volume mounts are in us-east-2b.
  • user_worker (us-east-2b): It cannot be deployed on the user_worker node because its affinity is set to the general node.

Hence the following events (for the dev/qhub-conda-store pod):

Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  96s (x208 over 46m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 node(s) didn't match node selector
  Warning  FailedScheduling   36s (x34 over 46m)   default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.

Same with the hub pod:

Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   48m (x2 over 48m)      default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp  27m (x26 over 47m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 node(s) had no available volume zone
  Normal   NotTriggerScaleUp  2m48s (x219 over 47m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 node(s) didn't match node selector
  Warning  FailedScheduling   108s (x32 over 47m)    default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.
@dharhas
Member

dharhas commented Jul 14, 2021

Meta question: why are we deploying things in multiple zones?

@tylerpotts
Contributor

@dharhas AWS requires the EKS cluster to span at least 2 availability zones, e.g. us-east-1a and us-east-1b.
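For illustration, this multi-AZ requirement looks roughly like the eksctl-style config below; QHub itself drives cluster creation through Terraform, so this is only a sketch of the constraint, not QHub's actual configuration.

```yaml
# Illustrative eksctl-style config: EKS needs subnets in at least two
# availability zones. QHub configures this via Terraform, not eksctl.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: qhub-dev                 # hypothetical cluster name
  region: us-east-1
availabilityZones:
  - us-east-1a
  - us-east-1b
```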

@aktech
Member Author

aktech commented Jul 28, 2021

This should be fixed by #740; we can reopen if we see this again. For now I am closing this.
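For context, a common mitigation for this class of problem (not necessarily what #740 implements) is a StorageClass that delays volume binding until the consuming pod is scheduled, and optionally restricts provisioning to a single zone:

```yaml
# Common mitigation sketch, not necessarily the fix in #740: with
# WaitForFirstConsumer the EBS volume is created in the zone of the node the
# pod lands on; allowedTopologies can additionally pin new volumes to one zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: single-zone-gp2                      # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-2a                       # zone of the general node
```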

@aktech aktech closed this as completed Jul 28, 2021