Conda-store pods get evicted on AWS #738

Closed · aktech opened this issue Jul 15, 2021 · 8 comments · Fixed by #740
Labels
type: bug 🐛 Something isn't working

Comments


aktech commented Jul 15, 2021

Describe the bug

It seems the conda-store pods get evicted on AWS on a fresh deployment. This was tested with the latest main commit: e799211

Output of kubectl describe on the pod:

Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  51s (x2 over 52s)  default-scheduler   0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
  Normal   TriggeredScaleUp  41s                cluster-autoscaler  pod triggered scale-up: [{eks-20bd5579-b270-ddc9-c256-f021f1d7978b 1->2 (max: 5)}]
  Warning  FailedScheduling  6s (x2 over 6s)    default-scheduler   0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
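The disk-pressure taint points at a node whose root disk is filling up. A quick way to see which node is reporting it, as a sketch assuming kubectl access to the cluster (<node-name> is a placeholder):

# Print each node's DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'

# Inspect the taints on the affected node
kubectl describe node <node-name> | grep -iA3 taints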

aktech commented Jul 15, 2021

This is most likely related to #735


aktech commented Jul 16, 2021

Some more logging

                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                  From                                                Message
  ----     ------             ----                 ----                                                -------
  Normal   NotTriggerScaleUp  23m                  cluster-autoscaler                                  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict, 2 node(s) didn't match node selector
  Warning  FailedScheduling   18m (x7 over 23m)    default-scheduler                                   0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity.
  Normal   Scheduled          18m                  default-scheduler                                   Successfully assigned dev/qhub-conda-store-7d455445fc-47wk9 to ip-10-10-10-77.us-west-2.compute.internal
  Warning  FailedMount        2m50s (x5 over 14m)  kubelet, ip-10-10-10-77.us-west-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[nfs-export-fast], unattached volumes=[conda-environments nfs-export-fast default-token-m49k4]: timed out waiting for the condition
  Warning  FailedMount        108s (x16 over 18m)  kubelet, ip-10-10-10-77.us-west-2.compute.internal  MountVolume.SetUp failed for volume "pvc-5b5438a6-d6ee-4cae-ae90-5f7bed6681fb" : mount failed: exit status 32
Mounting command: mount
Mounting arguments:  -o bind /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-west-2a/vol-0007f44b94ed6de07 /var/lib/kubelet/pods/30a166bc-f5cb-4745-bda1-741d95cf3c3e/volumes/kubernetes.io~aws-ebs/pvc-5b5438a6-d6ee-4cae-ae90-5f7bed6681fb
Output: mount: /var/lib/kubelet/pods/30a166bc-f5cb-4745-bda1-741d95cf3c3e/volumes/kubernetes.io~aws-ebs/pvc-5b5438a6-d6ee-4cae-ae90-5f7bed6681fb: special device /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-west-2a/vol-0007f44b94ed6de07 does not exist.
  Warning  FailedMount  36s (x3 over 16m)  kubelet, ip-10-10-10-77.us-west-2.compute.internal  Unable to attach or mount volumes: unmounted volumes=[nfs-export-fast], unattached volumes=[nfs-export-fast default-token-m49k4 conda-environments]: timed out waiting for the condition
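The volume node affinity conflict plus the missing vol-0007f44b94ed6de07 device suggests the EBS volume sits in a different AZ than the node it was scheduled onto. A rough way to compare the two, assuming kubectl access (the zone label name depends on the Kubernetes version; older clusters only set the beta label):

# Zone labels on the PV backing the conda-store volume (PV name taken from the log above)
kubectl get pv pvc-5b5438a6-d6ee-4cae-ae90-5f7bed6681fb --show-labels

# Zone of each node
kubectl get nodes -L topology.kubernetes.io/zone,failure-domain.beta.kubernetes.io/zone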


aktech commented Jul 20, 2021

This is due to the default root disk size of the nodes on AWS, which is 20 GB. We need to make this customisable. For GCP it's 100 GB, IIRC.
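A rough way to confirm the small root disk on the EKS nodes, as a sketch assuming kubectl and AWS CLI access (<node-name> and <instance-id> are placeholders):

# Capacity/Allocatable include ephemeral-storage, which reflects the node's root disk
kubectl describe node <node-name> | grep -A7 'Capacity:'

# Or check the EBS volumes attached to the instance on the AWS side
aws ec2 describe-volumes --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'Volumes[].{Id:VolumeId,Size:Size,AZ:AvailabilityZone}' --output table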

@costrouc

@aktech does this mean that the EKS-provisioned nodes need to have larger disk space? Or is it the size of the shared filesystem?


aktech commented Jul 27, 2021

It's the former: the EKS nodes need to have larger disk space.

@iameskild

After my redeployment - and installing from the latest commit of qhub (0dff706) - I ran into the same problems described above: both the conda-store and hub pods fail to scale up and are stuck in a Pending state.

conda-store pod:

  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   24m (x8 over 25m)     default-scheduler   0/3 nodes are available: 3 node(s) were unschedulable.
  Warning  FailedScheduling   21m (x2 over 23m)     default-scheduler   0/2 nodes are available: 2 node(s) were unschedulable.
  Warning  FailedScheduling   19m (x7 over 20m)     default-scheduler   0/3 nodes are available: 1 node(s) didn't match node selector, 2 node(s) were unschedulable.
  Normal   NotTriggerScaleUp  19m                   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached, 1 node(s) didn't match node selector
  Warning  FailedScheduling   17m (x4 over 19m)     default-scheduler   0/1 nodes are available: 1 node(s) didn't match node selector.
  Warning  FailedScheduling   17m (x2 over 17m)     default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  4m53s (x60 over 16m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 node(s) had no available volume zone
  Warning  FailedScheduling   29s (x13 over 16m)    default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.

hub pod:

  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Normal   NotTriggerScaleUp  37m                    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 max node group size reached
  Warning  FailedScheduling   36m (x5 over 37m)      default-scheduler   0/1 nodes are available: 1 node(s) didn't match node selector.
  Warning  FailedScheduling   35m (x2 over 36m)      default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   NotTriggerScaleUp  7m39s (x30 over 35m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had no available volume zone, 1 node(s) didn't match node selector
  Normal   NotTriggerScaleUp  2m38s (x163 over 35m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 node(s) had no available volume zone
  Warning  FailedScheduling   79s (x25 over 35m)     default-scheduler   0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict.

Deploying qhub on AWS.
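One way to cross-check the "no available volume zone" / "volume node affinity conflict" messages is to compare where the cluster's EBS volumes were created against where the nodes are running. A sketch, assuming the AWS CLI is configured for the same account/region and that the volumes carry the tags the in-tree EBS provisioner adds:

# AZ and size of every EBS volume the cluster created for a PVC
aws ec2 describe-volumes --filters Name=tag-key,Values=kubernetes.io/created-for/pvc/name \
  --query 'Volumes[].{Id:VolumeId,AZ:AvailabilityZone,Size:Size}' --output table

# AZ of each node
kubectl get nodes -L topology.kubernetes.io/zone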


aktech commented Jul 28, 2021

I think you would need a new deployment; if the volume is already spun up in a conflicting zone, it's unlikely to be moved simply by updating to the latest version.

@iameskild

That makes sense. Confirming in the AWS console, the Availability Zone of the general node that these pods were running on was us-east-2a, whereas the 50 GB volume mounts are in us-east-2b.

To get back to a working state, I drained the general node:

kubectl drain ip-10-10-4-189.us-east-2.compute.internal --ignore-daemonsets --delete-emptydir-data --force

Then I manually killed any pods that wouldn't drain on their own. Draining puts the node in a "cordoned" state, and a new node should spin up soon after; if you're lucky and the new node is launched in the same AZ as your volume mounts, the pods that were drained will be rescheduled onto it.
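A sketch of that manual cleanup, assuming kubectl access (the pod name below is taken from the earlier log and is only illustrative; substitute whatever is stuck on the drained node, and the right namespace):

# Force-delete a pod that refuses to drain
kubectl delete pod qhub-conda-store-7d455445fc-47wk9 -n dev --grace-period=0 --force

# Watch for the replacement node and check which AZ it lands in
kubectl get nodes -w -L topology.kubernetes.io/zone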
