
tasks: Introduce elastic cloud runner mode #615

Closed · wants to merge 2 commits

Conversation

martinpitt
Member

@martinpitt martinpitt commented Apr 8, 2024

As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.

Define "run out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking if
run-queue encountered an empty queue, as that is more prone to
terminating only some of the instances while some others keep picking
up brand new queue entries.

Introduce an "idle poweroff" mode in which the cockpit-tasks main loop
exits with code 100 instead of slumbering when work is running low.
Configure the slice to automatically power off the machine once all
cockpit-tasks instances exited cleanly (we don't want this on failures,
so that we can ssh in and examine them). Use the poweroff-immediate
heavy hammer there, to avoid potential hangs on shutdown -- there is
nothing to rescue from the instance anyway.

Plumb that through the AWS Ansible role and document it.
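The exit-code convention described above could be sketched roughly as follows. This is not the actual cockpit-tasks code; `queue_length` is a hypothetical helper standing in for however the job-runner entries in the AMQP queue are counted, while the threshold of 10 and the exit code 100 come from the description:

```shell
# Threshold below which the permanent PSI runners can keep up on
# their own (from the PR description).
IDLE_POWEROFF_THRESHOLD=10

# Sketch of the "idle poweroff" check in the cockpit-tasks main
# loop.  queue_length is a hypothetical helper that prints the
# number of job-runner entries in the AMQP queue.
idle_poweroff_check() {
    local length
    length=$(queue_length)
    if [ "$length" -lt "$IDLE_POWEROFF_THRESHOLD" ]; then
        # Work is running low: exit 100 instead of slumbering, so
        # that the unit machinery can power off the machine once
        # all instances have exited cleanly.
        exit 100
    fi
}
```

Checking a global queue depth rather than "did my own dequeue come up empty" is what makes this robust against some instances terminating while others keep picking up brand-new entries.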


There was surprisingly little to fix -- only a self-inflicted incompatibility with Fedora CoreOS.

I tested the auto-poweroff approach locally with the new "localvm" playbook. The AMQP queue is currently empty, so the containers ran out of work and the VM powered off immediately.

@martinpitt martinpitt marked this pull request as draft April 8, 2024 03:40
@martinpitt
Member Author

Meh - I tested idle_poweroff on AWS, and while it does power down the machine, it doesn't actually terminate it, i.e. the instance_initiated_shutdown_behavior: terminate flag doesn't seem to work on metal instances? 😢

Putting this on the shelf then, but we at least need the other three commits.
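For reference, the flag in question is a parameter of the `amazon.aws.ec2_instance` module; a task setting it might look like this (a sketch with illustrative names, not the actual role):

```yaml
- name: Launch elastic tasks runner  # hypothetical task name
  amazon.aws.ec2_instance:
    name: tasks-runner               # illustrative
    instance_type: c5.metal          # illustrative bare metal type
    image_id: "{{ ami_id }}"
    # Ask AWS to terminate (not merely stop) the instance when the
    # OS powers itself off; per the comment above, this appears to
    # be ignored on bare metal instance types.
    instance_initiated_shutdown_behavior: terminate
```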

martinpitt and others added 2 commits April 8, 2024 09:09
With this we can react to all tasks containers shutting down. It also
makes it easier to stop all instances without having to resort to
globbing or knowing how many instances a host has.

Note: Avoid a `-` in the name, as that will create a
sub-directory/slice.
As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.

Define "run out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking if
`run-queue` encountered an empty queue, as that is more prone to
terminating only *some* of the instances while some others keep picking
up brand new queue entries.

Introduce an "idle poweroff" mode in which the `cockpit-tasks` main loop
exits with code 100 instead of slumbering when work is running low.
Configure the slice to automatically power off the machine once all
cockpit-tasks instances exited cleanly (we don't want this on failures,
so that we can ssh in and examine them). Use the `poweroff-immediate`
heavy hammer there, to avoid potential hangs on shutdown -- there is
nothing to rescue from the instance anyway.

Plumb that through the AWS Ansible role and document it.
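On the systemd side, the commit message's "exit code 100 counts as clean, power off when all instances are done, but not on failures" behaviour could be wired up along these lines. This is a hedged sketch, not the actual unit files from the role; unit names and paths are hypothetical, and the real implementation gates the poweroff on the whole slice:

```ini
# cockpit-tasks@.service (sketch)
[Service]
ExecStart=/usr/local/bin/cockpit-tasks
# Treat the "idle poweroff" exit code as a clean exit, so a unit
# that stops this way does not count as failed.
SuccessExitStatus=100
# Fires only on clean exit -- failed instances stay around so the
# machine can still be examined over ssh.
OnSuccess=idle-poweroff.service

# idle-poweroff.service (sketch): the poweroff-immediate heavy
# hammer, skipping an orderly shutdown that could hang; there is
# nothing to rescue from the instance anyway.
[Service]
Type=oneshot
ExecStart=systemctl poweroff --force --force
```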
@martinpitt
Member Author

This was already proposed two years ago in #477, and Lis didn't like it. So, manual for now.

@martinpitt martinpitt closed this Apr 8, 2024
@martinpitt martinpitt deleted the tasks-oneshot branch April 8, 2024 07:10