tasks: Introduce elastic cloud runner mode #615
Closed
As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.
Define "running out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking whether
`run-queue` encountered an empty queue, as the latter is more prone to
terminating only some of the instances while others keep picking up
brand-new queue entries.
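The global criterion boils down to a single comparison; a minimal Python sketch (not the actual cockpit-tasks code, and the queue name/connection details are assumptions):

```python
# Sketch of the "ran out of work" criterion. The real code asks the AMQP
# broker for the queue depth; with the pika client that can be read via a
# passive declare, e.g.:
#   channel.queue_declare(queue=name, passive=True).method.message_count

IDLE_THRESHOLD = 10  # below this, the permanent PSI runners can keep up

def ran_out_of_work(queue_depth: int) -> bool:
    """Global criterion: the shared AMQP queue is nearly drained."""
    return queue_depth < IDLE_THRESHOLD
```

Because every elastic runner looks at the same global depth, they all reach the same conclusion, instead of racing each other on "did my last fetch come up empty".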
Introduce an "idle poweroff" mode in which the `cockpit-tasks`
main loop exits with code 100 instead of slumbering when work is running low.
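In outline, the changed main loop behaves like this sketch (function names and the slumber interval are illustrative; only the exit code 100 and the threshold of 10 come from the description):

```python
import sys
import time

EXIT_IDLE = 100      # "queue ran dry" -- a clean, deliberate exit
IDLE_THRESHOLD = 10  # below this, the permanent PSI runners keep up

def main_loop(get_queue_depth, run_one_job, idle_poweroff=False, slumber=60):
    """Keep picking up jobs; when work runs low, either slumber
    (permanent runner) or exit with EXIT_IDLE (elastic cloud runner)."""
    while True:
        if get_queue_depth() < IDLE_THRESHOLD:
            if idle_poweroff:
                sys.exit(EXIT_IDLE)  # unit treats this as clean; machine powers off
            time.sleep(slumber)      # permanent runner just waits for more work
        else:
            run_one_job()
```

The distinct exit code lets the service layer tell "ran out of work" apart from genuine failures, which should leave the machine running for inspection.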
Configure the slice to automatically power off the machine once all
cockpit-tasks instances have exited cleanly (we don't want this on
failures, so that we can ssh in and examine them). Use the
`poweroff-immediate` heavy hammer there, to avoid potential hangs on
shutdown -- there is nothing to rescue from the instance anyway.
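The systemd wiring could look roughly like the following sketch; the unit names and the split between slice and service are assumptions, not copied from the PR. `SuccessAction=poweroff-immediate` powers off without running the orderly shutdown sequence:

```ini
# cockpit-tasks.slice (sketch): once every runner in the slice has
# stopped cleanly, power the machine off hard, skipping orderly shutdown
[Unit]
Description=Cockpit tasks runners
SuccessAction=poweroff-immediate
```

```ini
# cockpit-tasks@.service drop-in (sketch): exit code 100 ("ran out of
# work") counts as success; real failures keep the machine up for ssh
[Service]
Slice=cockpit-tasks.slice
SuccessExitStatus=100
```

Because `SuccessAction` only fires on clean deactivation, a crashed runner leaves the instance alive for debugging, exactly as intended.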
Plumb that through the AWS Ansible role and document it.
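On the Ansible side, the plumbing could be as small as an opt-in variable that the role forwards into the container environment. The `idle_poweroff` / `IDLE_POWEROFF` spelling follows the discussion on this PR, but everything else here is illustrative, not the repository's actual AWS role:

```yaml
# defaults/main.yml (sketch)
idle_poweroff: false

# tasks/main.yml (sketch)
- name: Start cockpit-tasks container
  containers.podman.podman_container:
    name: cockpit-tasks
    image: "{{ tasks_container_image }}"   # assumed role variable
    env:
      IDLE_POWEROFF: "{{ '1' if idle_poweroff else '' }}"
```

An elastic runner would then be launched ad hoc with something like `ansible-playbook ... -e idle_poweroff=1`, while the permanent runners keep the default.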
There was surprisingly little to fix -- only a self-inflicted incompatibility with Fedora CoreOS.
I tested the auto-poweroff approach locally with the new "localvm" playbook. The AMQP queue is currently empty, so the containers ran out of work and the VM powered off immediately.
(Review discussion fragments: `IDLE_POWEROFF` handling; `-e idle_poweroff=1` mode.)