Incident: Kubernetes couldn't schedule pods #287

Closed
peterbourgon opened this issue Jan 13, 2016 · 6 comments

@peterbourgon

Timeline

All times UTC on Wed 13 Jan 2016

  • 0943 @pidster noticed his proxy container in prod wasn't starting
  • 0944 Checking kubectl get events -w showed lots of Pod FailedScheduling events, Failed for reason PodExceedsFreeCPU, and possibly others (commands sketched after this timeline)
  • 0947 @peterbourgon begins investigating and notices the minions are at high CPU
  • 0951 At @pidster's request, @peterbourgon identifies which customers are affected
  • 1008 A total of 6 customers have their instances in a bad state (Pending)
  • 1015 @peterbourgon researches mitigation
  • 1031 @peterbourgon tries increasing the dev autoscaling group size
  • 1041 New minions are created and join the dev cluster without complication
  • 1115 Process repeated for prod
  • 1126 New nodes in prod; all customer instances in a good state (Running)
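
A rough sketch of the diagnosis and mitigation commands above; the flags and names (for example the dev autoscaling group name and sizes) are assumptions, not the exact commands run during the incident.

    # Watch cluster events for scheduling failures (0944)
    kubectl get events -w

    # Grow the dev autoscaling group so new minions come up (1031);
    # group name and sizes are hypothetical
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name dev-minions \
      --max-size 6 --desired-capacity 6

    # Confirm the new minions registered and are Ready (1041)
    kubectl get nodes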

Downtime

The oldest Pending container was in a bad state for a total of 18 hours.

Root cause

We hit cluster resource limits faster than anticipated: the CPU requested by pods exceeded the free CPU on every minion (hence the PodExceedsFreeCPU events), so Kubernetes refused to schedule new workloads.
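
One way to see this on a cluster, sketched as an assumption rather than what we actually ran (the exact kubectl describe output varies by Kubernetes version): compare each minion's CPU capacity with the CPU already requested by the pods scheduled on it.

    # Per-node CPU capacity and current CPU requests
    kubectl describe nodes | grep -i -E "capacity|cpu"

    # The events of a stuck pod show the scheduling failure and its reason
    kubectl describe pod <pending-pod-name>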

Fix

  • Short-term: added minions to the clusters. Procedure documented in the infra/ README.
  • Medium-term: determine how Kubernetes decides that its resource limits have been reached, and possibly set quotas accordingly (a sketch follows this list). Update monitoring to keep an eye on these things.
  • Long-term: rearchitect Service so that it does not require one App per customer.
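
One possible shape for the medium-term quota idea, assuming a per-namespace ResourceQuota; the namespace and numbers are hypothetical, and note that once a cpu quota is set, pods in that namespace must declare CPU requests.

    # Cap the total CPU and pod count the prod namespace may request, so we
    # hit a limit we control (and can alert on) before the cluster itself is full
    kubectl create quota customer-instances --hard=cpu=40,pods=100 --namespace=prod

    # Check current consumption against the quota
    kubectl describe quota customer-instances --namespace=prod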

Action items

  • Link to other tickets here
@2opremio

We hit cluster resource limits faster than anticipated.

The problem wasn't really that. top reveals that the Scope probes (v0.11.1) I deployed yesterday as part of #282 are consuming 70% CPU per node, peaking at 90%.

I think that is the culprit.
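
For the record, a rough sketch of how a per-node check like this can be done (not necessarily how it was done here); <minion> is a placeholder.

    # One-shot sample of CPU per process on a minion
    ssh <minion> top -b -n 1 | head -n 20

    # CPU per container, assuming the probes run as Docker containers on the node
    ssh <minion> docker stats --no-stream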

@2opremio

As an action item, we should at the very least have an alarm for when the cluster runs out of resources. K8s knows when that happens, so I guess there should be a way to trigger an alarm from the k8s events.
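
A minimal sketch of that idea, not an agreed design: watch for FailedScheduling events and forward them to an alerting hook. ALERT_URL is a hypothetical endpoint.

    # Forward scheduling failures to an alerting webhook as they happen
    kubectl get events -w \
      | grep --line-buffered FailedScheduling \
      | while read -r line; do
          curl -s -X POST --data-urlencode "text=$line" "$ALERT_URL"
        done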

@2opremio

Related: weaveworks/scope#812

@2opremio

I will stop the Scope probes in production until weaveworks/scope#812 is resolved.

@2opremio

Another cause is that the kubelet is consuming ~40% of CPU on the production nodes (almost 80% on the dev nodes). We should investigate this.
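
A quick check for this, sketched as an assumption rather than something we ran; <node> is a placeholder.

    # Sample the kubelet's CPU usage and uptime directly on a node
    ssh <node> ps -o %cpu,etime,args -C kubelet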

@2opremio

2opremio commented Feb 3, 2016

This is already tracked by other tickets, so I am closing:
