
tasks: Introduce elastic cloud runner mode #615

Closed · wants to merge 2 commits

Conversation

martinpitt
Member

@martinpitt martinpitt commented Apr 8, 2024

As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.

Define "run out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking if
run-queue encountered an empty queue, as that is more prone to
terminating only some of the instances while some others keep picking
up brand new queue entries.

Introduce an "idle poweroff" mode in which the cockpit-tasks main loop
exits with code 100 instead of slumbering when work is running low.
Configure the slice to automatically power off the machine once all
cockpit-tasks instances exited cleanly (we don't want this on failures,
so that we can ssh in and examine them). Use the poweroff-immediate
heavy hammer there, to avoid potential hangs on shutdown -- there is
nothing to rescue from the instance anyway.

Plumb that through the AWS Ansible role and document it.
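The exit-code convention described above could be sketched roughly as follows. This is not the actual cockpit-tasks code; `queue_length` is a hypothetical helper standing in for however the job-runner entries in the AMQP queue are counted, while the threshold of 10 and the exit code 100 come from the description:

```shell
# Threshold below which the permanent PSI runners can keep up on
# their own (from the PR description).
IDLE_POWEROFF_THRESHOLD=10

# Sketch of the "idle poweroff" check in the cockpit-tasks main
# loop.  queue_length is a hypothetical helper that prints the
# number of job-runner entries in the AMQP queue.
idle_poweroff_check() {
    local length
    length=$(queue_length)
    if [ "$length" -lt "$IDLE_POWEROFF_THRESHOLD" ]; then
        # Work is running low: exit 100 instead of slumbering, so
        # that the unit machinery can power off the machine once
        # all instances have exited cleanly.
        exit 100
    fi
}
```

Checking a global queue depth rather than "did my own dequeue come up empty" is what makes this robust against some instances terminating while others keep picking up brand-new entries.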


There was surprisingly little to fix -- only a self-inflicted incompatibility with Fedora CoreOS.

I tested the auto-poweroff approach locally with the new "localvm" playbook. The AMQP queue is currently empty, so the containers ran out of work and the VM powered off immediately.

@martinpitt martinpitt marked this pull request as draft April 8, 2024 03:40
@martinpitt
Member Author

Meh - I tested idle_poweroff on AWS, and while it does power down the machine, it doesn't actually terminate it, i.e. the instance_initiated_shutdown_behavior: terminate flag doesn't seem to work on metal instances? 😢

Putting this on the shelf then, but we at least need the other three commits.
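For reference, the flag in question is a parameter of the `amazon.aws.ec2_instance` module; a task setting it might look like this (a sketch with illustrative names, not the actual role):

```yaml
- name: Launch elastic tasks runner  # hypothetical task name
  amazon.aws.ec2_instance:
    name: tasks-runner               # illustrative
    instance_type: c5.metal          # illustrative bare metal type
    image_id: "{{ ami_id }}"
    # Ask AWS to terminate (not merely stop) the instance when the
    # OS powers itself off; per the comment above, this appears to
    # be ignored on bare metal instance types.
    instance_initiated_shutdown_behavior: terminate
```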

martinpitt and others added 2 commits April 8, 2024 09:09
With this we can react to all tasks containers shutting down. It also
makes it easier to stop all instances without having to resort to
globbing or knowing how many instances a host has.

Note: Avoid a `-` in the name, as that will create a
sub-directory/slice.
As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.

Define "run out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking if
`run-queue` encountered an empty queue, as that is more prone to
terminating only *some* of the instances while some others keep picking
up brand new queue entries.

Introduce an "idle poweroff" mode in which the `cockpit-tasks` main loop
exits with code 100 instead of slumbering when work is running low.
Configure the slice to automatically power off the machine once all
cockpit-tasks instances exited cleanly (we don't want this on failures,
so that we can ssh in and examine them). Use the `poweroff-immediate`
heavy hammer there, to avoid potential hangs on shutdown -- there is
nothing to rescue from the instance anyway.

Plumb that through the AWS Ansible role and document it.
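On the systemd side, the commit message's "exit code 100 counts as clean, power off when all instances are done, but not on failures" behaviour could be wired up along these lines. This is a hedged sketch, not the actual unit files from the role; unit names and paths are hypothetical, and the real implementation gates the poweroff on the whole slice:

```ini
# cockpit-tasks@.service (sketch)
[Service]
ExecStart=/usr/local/bin/cockpit-tasks
# Treat the "idle poweroff" exit code as a clean exit, so a unit
# that stops this way does not count as failed.
SuccessExitStatus=100
# Fires only on clean exit -- failed instances stay around so the
# machine can still be examined over ssh.
OnSuccess=idle-poweroff.service

# idle-poweroff.service (sketch): the poweroff-immediate heavy
# hammer, skipping an orderly shutdown that could hang; there is
# nothing to rescue from the instance anyway.
[Service]
Type=oneshot
ExecStart=systemctl poweroff --force --force
```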
@martinpitt
Member Author

This was already proposed two years ago in #477, and Lis didn't like it. So, manual for now.

@martinpitt martinpitt closed this Apr 8, 2024
@martinpitt martinpitt deleted the tasks-oneshot branch April 8, 2024 07:10