
Controller stops accepting jobs from the cluster queue #302

Open
aressem opened this issue Apr 8, 2024 · 8 comments · May be fixed by #442

Comments

@aressem

aressem commented Apr 8, 2024

We have agent-stack-k8s up and running and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs is (we turned on debug):

2024-04-08T11:38:23.100Z	DEBUG	limiter	scheduler/limiter.go:77	max-in-flight reached	{"in-flight": 25}

We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.

I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.
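
A minimal sketch of the suspected failure mode, assuming a simple token-channel limiter (this is illustrative only, not the actual agent-stack-k8s code): if the release path for a finished or failed job is ever skipped, its token is never returned, and once all tokens are held the limiter blocks new work indefinitely.

```go
package main

import "fmt"

// Limiter hands out one token per in-flight job, up to max-in-flight.
type Limiter struct {
	tokens chan struct{}
}

func NewLimiter(maxInFlight int) *Limiter {
	l := &Limiter{tokens: make(chan struct{}, maxInFlight)}
	for i := 0; i < maxInFlight; i++ {
		l.tokens <- struct{}{}
	}
	return l
}

// Acquire blocks until a token is available (i.e. in-flight < max-in-flight).
func (l *Limiter) Acquire() { <-l.tokens }

// Release must be called exactly once per completed job; if a job finishes or
// fails through a path that never calls it, the token leaks.
func (l *Limiter) Release() { l.tokens <- struct{}{} }

func main() {
	l := NewLimiter(2)
	l.Acquire()
	l.Acquire()
	// A third Acquire would now block forever unless some job releases a token,
	// which matches the "max-in-flight reached" log with no pods in the cluster.
	fmt.Println("in-flight at max; further Acquire calls block until Release is called")
}
```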

@DrJosh9000
Contributor

Hi @aressem, did you discover anything with your tests where the number is set to 0?

@aressem
Author

aressem commented Apr 23, 2024

@DrJosh9000, the pipeline works as expected with in-flight set to 0. I don't know what that number might be now, but I suspect it is steadily increasing :)

@artem-zinnatullin
Contributor

Same issue when testing with max-in-flight: 1 on v0.11.0: at some point the controller stops taking new jobs even though there are no jobs/pods running in the namespace besides the controller itself.

2024-05-21T21:31:57.923Z	DEBUG	limiter	scheduler/limiter.go:79	max-in-flight reached	{"in-flight": 1}

@calvinbui

I saw the same issue: num-in-flight does not decrease, so available-tokens eventually reaches 0 and no new jobs are run.

@DrJosh9000
Contributor

num-in-flight and available-tokens are now somewhat decoupled, so it would be useful to compare available-tokens against the number of job pods actually pending or running in the k8s cluster.

🤔 Maybe the controller should periodically survey the cluster, and adjust tokens accordingly.
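
A rough sketch of that survey idea, using client-go (the namespace and label selector below are placeholders, not necessarily what agent-stack-k8s uses): periodically count the controller's jobs that are still incomplete in the cluster and reconcile the limiter's in-flight count against that number, so leaked tokens are eventually recovered.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// List the controller-owned jobs; "buildkite" and the selector are assumptions.
		jobs, err := clientset.BatchV1().Jobs("buildkite").List(context.Background(),
			metav1.ListOptions{LabelSelector: "app=buildkite-job"})
		if err != nil {
			log.Printf("survey failed: %v", err)
			continue
		}
		active := 0
		for _, j := range jobs.Items {
			if j.Status.CompletionTime == nil {
				active++
			}
		}
		// Here the controller could reset its token count to max-in-flight minus active.
		log.Printf("survey: %d jobs still incomplete in the cluster", active)
	}
}
```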

@artem-zinnatullin
Contributor

We just had a CI outage partially caused by this behavior; here is the gist:

  1. We've gradually (by percentage) switched CI jobs from https://github.com/EmbarkStudios/k8s-buildkite-plugin to https://github.com/buildkite/agent-stack-k8s
  2. Due to how this controller works, it adds its own (agent, etc.) containers to the Job definition, which raises the resource requests of the actual container definition. That is fine and we override it, but the EmbarkStudios plugin runs agents completely separately and never added anything on top of the K8s Job definition, even though the overall overhead is the same (again, it's fine, but it is an important detail)
  3. We have cron jobs for benchmarking which try to acquire an entire CI K8s node so that other jobs can't affect their performance; we did this by setting resources.requests.memory very close to the node limit (we should probably have used taints)
  4. Due to the overhead of the additional containers added to the K8s Job by https://github.com/buildkite/agent-stack-k8s, we overshot on these few benchmark jobs and they got stuck in K8s because they couldn't fit into any node's memory
  5. For about a week these benchmark jobs accumulated in the Buildkite queue
  6. At some point https://github.com/buildkite/agent-stack-k8s v0.18.0 stopped taking new Buildkite jobs, at 93 jobs in our case, even though we have max-in-flight: 250
  7. I tried deploying max-in-flight: 0 to remove the limit, but the controller still wasn't taking any new jobs even though people were pushing more PRs and those builds would have fit the nodes; kubectl get jobs was only displaying these 93 jobs that could never run in the cluster
  8. The controller didn't pick up any new jobs until I cancelled all the stale pending benchmark jobs in the Buildkite UI; then other jobs started to be processed

The logs indicated that there were still available tokens, yet it got stuck at a lower number.

2024-11-12T17:59:37.861Z	DEBUG	limiter	scheduler/limiter.go:87	Create: job is already in-flight	{"uuid": "01931db6-67ea-403c-8687-e01ab64e8e94", "num-in-flight": 93, "available-tokens": 162}

We will be adding alarms for stale Buildkite jobs in the queue, but something still seems wrong with the controller because it should've still scheduled other K8S Jobs into the cluster.

@evict

evict commented Nov 21, 2024

I am also running into this issue; I have to restart my Kubernetes deployment basically every day. 😅

@DrJosh9000
Contributor

I'm still looking into this one.

I have a new theory: k8s jobs can be successfully created, but fail without ever starting a pod. This state isn't handled properly: the job remains present until the TTL, so it can't be recreated under the same name. It remains the oldest job available, so before #427 the controller repeatedly tries and fails to recreate it. With #427 other jobs get a shot at being created instead, but this isn't much help if the jobs are failing because the cluster is very busy.
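
A hedged illustration of that theory: a Kubernetes Job can carry a Failed condition while its status counters show that no pod ever ran for it. The helper name below is hypothetical, and this is not how the controller currently classifies jobs; it only sketches the state described above.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failedWithoutStarting reports whether the Job has a Failed condition while
// its pod counters indicate that no pod was ever counted against it.
func failedWithoutStarting(job *batchv1.Job) bool {
	failed := false
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			failed = true
		}
	}
	noPods := job.Status.Active == 0 && job.Status.Succeeded == 0 && job.Status.Failed == 0
	return failed && noPods
}

func main() {
	// A job with a Failed condition but no pods ever recorded: the state that
	// lingers until the TTL and blocks recreation under the same name.
	j := &batchv1.Job{}
	j.Status.Conditions = []batchv1.JobCondition{{Type: batchv1.JobFailed, Status: corev1.ConditionTrue}}
	fmt.Println(failedWithoutStarting(j)) // prints: true
}
```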

DrJosh9000 linked a pull request Dec 3, 2024 that will close this issue
5 participants