Incident: Kubernetes couldn't schedule pods #287

Closed
peterbourgon opened this issue Jan 13, 2016 · 6 comments

@peterbourgon

Timeline

All times UTC on Wed 13 Jan 2016

  • 0943 @pidster noticed his proxy container in prod wasn't starting
  • 0944 Checking kubectl get events -w showed lots of Pod FailedScheduling events, Failed for reason PodExceedsFreeCPU, and possibly others (commands sketched after this timeline)
  • 0947 @peterbourgon begins investigating and notices the minions are at high CPU
  • 0951 At @pidster's request, @peterbourgon identifies which customers are affected
  • 1008 A total of 6 customers have their instances in a bad state (Pending)
  • 1015 @peterbourgon researches mitigation
  • 1031 @peterbourgon tries increasing the dev autoscaling group size
  • 1041 New minions are created and join the dev cluster without complication
  • 1115 Process repeated for prod
  • 1126 New nodes in prod; all customer instances in a good state (Running)
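
A rough sketch of the diagnosis and mitigation commands above; the flags and names (for example the dev autoscaling group name and sizes) are assumptions, not the exact commands run during the incident.

    # Watch cluster events for scheduling failures (0944)
    kubectl get events -w

    # Grow the dev autoscaling group so new minions come up (1031);
    # group name and sizes are hypothetical
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name dev-minions \
      --max-size 6 --desired-capacity 6

    # Confirm the new minions registered and are Ready (1041)
    kubectl get nodes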

Downtime

The oldest Pending container was in a bad state for a total of 18 hours.

Root cause

We hit cluster resource limits faster than anticipated: the CPU requested by pods exceeded the free CPU on every minion (hence the PodExceedsFreeCPU events), so Kubernetes refused to schedule new workloads.
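
One way to see this on a cluster, sketched as an assumption rather than what we actually ran (the exact kubectl describe output varies by Kubernetes version): compare each minion's CPU capacity with the CPU already requested by the pods scheduled on it.

    # Per-node CPU capacity and current CPU requests
    kubectl describe nodes | grep -i -E "capacity|cpu"

    # The events of a stuck pod show the scheduling failure and its reason
    kubectl describe pod <pending-pod-name>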

Fix

  • Short-term: added minions to the clusters. Procedure documented in the infra/ README.
  • Medium-term: determine how Kubernetes decides that its resource limits have been reached, and possibly set quotas accordingly (a sketch follows this list). Update monitoring to keep an eye on these things.
  • Long-term: rearchitect Service so that it does not require one App per customer.
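
One possible shape for the medium-term quota idea, assuming a per-namespace ResourceQuota; the namespace and numbers are hypothetical, and note that once a cpu quota is set, pods in that namespace must declare CPU requests.

    # Cap the total CPU and pod count the prod namespace may request, so we
    # hit a limit we control (and can alert on) before the cluster itself is full
    kubectl create quota customer-instances --hard=cpu=40,pods=100 --namespace=prod

    # Check current consumption against the quota
    kubectl describe quota customer-instances --namespace=prod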

Action items

  • Link to other tickets here
@2opremio

We hit cluster resource limits faster than anticipated.

The problem wasn't really that. top reveals that the Scope probes (v0.11.1) I deployed yesterday as part of #282 are consuming 70% CPU per node, peaking at 90%.

I think that is the culprit.
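
For the record, a rough sketch of how a per-node check like this can be done (not necessarily how it was done here); <minion> is a placeholder.

    # One-shot sample of CPU per process on a minion
    ssh <minion> top -b -n 1 | head -n 20

    # CPU per container, assuming the probes run as Docker containers on the node
    ssh <minion> docker stats --no-stream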

@2opremio

As an action item, we should at the very least have an alarm for when the cluster runs out of resources. K8s knows when that happens, so I guess there should be a way to trigger an alarm from the k8s events.
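
A minimal sketch of that idea, not an agreed design: watch for FailedScheduling events and forward them to an alerting hook. ALERT_URL is a hypothetical endpoint.

    # Forward scheduling failures to an alerting webhook as they happen
    kubectl get events -w \
      | grep --line-buffered FailedScheduling \
      | while read -r line; do
          curl -s -X POST --data-urlencode "text=$line" "$ALERT_URL"
        done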

@2opremio

Related: weaveworks/scope#812

@2opremio

I will stop the Scope probes in production until weaveworks/scope#812 is resolved.

@2opremio

Another cause is that the kubelet is consuming ~40% of CPU on the production nodes (almost 80% on the dev nodes). We should investigate this.
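
A quick check for this, sketched as an assumption rather than something we ran; <node> is a placeholder.

    # Sample the kubelet's CPU usage and uptime directly on a node
    ssh <node> ps -o %cpu,etime,args -C kubelet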

@2opremio

2opremio commented Feb 3, 2016

This is already tracked by other tickets, so I am closing:
