tasks: Introduce elastic cloud runner mode
As the EC2 bare metal instances are expensive, we don't want them to run
permanently, but only in times of high demand. They should then
terminate themselves when the queue is running out of work.

Define "run out of work" as "the number of job-runner entries in the
AMQP queue drops below 10". At that level, our permanent PSI runners can
keep up. This is a more robust global criterion than checking if
`run-queue` encountered an empty queue, as that is more prone to
terminating only *some* of the instances while some others keep picking
up brand new queue entries.
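
The low-work criterion above can be sketched in shell as follows. This is a minimal illustration rather than the commit's actual code: the function names `count_jobs` and `queue_running_low` are hypothetical, and the queue dump is assumed to arrive on stdin with each job-runner entry on its own line containing `"job":`, mirroring the `grep --count '"job":'` filter in the `cockpit-tasks` change.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of cockpit-tasks.

# Count the lines of a queue dump (read from stdin) that contain a job entry.
# grep --count exits with status 1 when nothing matches, so tolerate that
# to stay safe in scripts that run under `set -e`.
count_jobs() {
    grep --count '"job":' || true
}

# Succeed (status 0) if the queue dump on stdin has fewer than $1 job entries.
# The default of 10 is the threshold named in the commit message.
queue_running_low() {
    local threshold="${1:-10}"
    local num_jobs
    num_jobs=$(count_jobs)
    [ "$num_jobs" -lt "$threshold" ]
}
```

Assuming `./inspect-queue` prints the queue in that shape, a caller could write `./inspect-queue | queue_running_low 10 && exit 100`. Because every instance evaluates the same global queue length, all of them tend to reach the same decision at about the same time.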

Introduce an "idle poweroff" mode in which the `cockpit-tasks` main loop
exits with code 100 instead of slumbering when work is running low.
Configure the slice to automatically power off the machine once all
cockpit-tasks instances have exited cleanly (we don't want this on failures,
so that we can ssh in and examine them). Use the `poweroff-immediate`
heavy hammer there, to avoid potential hangs on shutdown -- there is
nothing to rescue from the instance anyway.
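
How the exit status drives the restart behavior can be modeled with a small shell function. This is only an illustrative model of the unit settings in this commit (`Restart=always` combined with `RestartPreventExitStatus=100`); systemd implements the real logic, and the function name `restart_decision` is hypothetical.

```shell
#!/bin/bash
# Illustrative model only; systemd itself makes this decision for the service.

# Print what the unit settings imply for a given cockpit-tasks exit status:
#   "stop"    for the idle-poweroff status 100 (RestartPreventExitStatus=100)
#   "restart" for every other status, clean or not (Restart=always)
restart_decision() {
    local status="$1"
    if [ "$status" -eq 100 ]; then
        echo "stop"
    else
        echo "restart"
    fi
}
```

For example, `restart_decision 100` prints `stop`, while `restart_decision 1` prints `restart`. A crashing instance therefore keeps being restarted and keeps the slice in use, so the machine stays up for debugging.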

Plumb that through the AWS Ansible role and document it.
martinpitt committed Apr 8, 2024
1 parent 14208d6 commit 902ebf3
Showing 4 changed files with 27 additions and 0 deletions.
2 changes: 2 additions & 0 deletions ansible/aws/README.md
@@ -57,6 +57,8 @@ Create and configure the instance:

If you run more than one at a time, set a custom host name with `-e hostname=cockpit-aws-tasks-2` or similar, so that GitHub test statuses remain useful to identify where a test runs.

There is also an "elastic" mode where the tasks bots keep running until the AMQP queue runs low. Use that for situations where AWS instances act as extra high-demand capacity instead of being the primary runners. Enable that mode with `-e idle_poweroff=1`.

Webhook setup
-------------
AWS runs our primary webhook. Deploy or update it with:
1 change: 1 addition & 0 deletions ansible/roles/tasks-systemd/tasks/main.yml
@@ -146,4 +146,5 @@
export INSTANCES={{ instances | default(1) }}
export TEST_NOTIFICATION_MX={{ notification_mx | default('') }}
export TEST_NOTIFICATION_TO={{ notification_to | default('') }}
export IDLE_POWEROFF={{ idle_poweroff | default('') }}
/run/install-service
10 changes: 10 additions & 0 deletions tasks/container/cockpit-tasks
@@ -20,7 +20,17 @@ function update_bots() {
}

# wait between 1 and 10 minutes, with an override to speed up tests
# in IDLE_POWEROFF mode, also check queue size
function slumber() {
if [ -n "${IDLE_POWEROFF:-}" ] && [ -e ./inspect-queue ]; then
# only consider job-runner entries, not statistics or webhook
NUM_JOBS=$(./inspect-queue | grep --count '"job":')
if [ "$NUM_JOBS" -lt 10 ]; then
echo "Job queue running low, exiting"
exit 100
fi
fi

if [ -n "${SLUMBER:-}" ]; then
sleep "$SLUMBER"
else
14 changes: 14 additions & 0 deletions tasks/install-service
@@ -39,6 +39,9 @@ After=podman.socket
[Service]
Slice=cockpittasks.slice
Restart=always
# cockpit-tasks exits with 100 in IDLE_POWEROFF mode when queue is running low
SuccessExitStatus=100
RestartPreventExitStatus=100
RestartSec=60
# give image pull enough time
TimeoutStartSec=10min
@@ -63,13 +66,24 @@ ExecStart=/usr/bin/podman run --name=cockpit-tasks-%i --hostname=${CONTAINER_HOS
--env=GIT_AUTHOR_EMAIL=cockpituous@cockpit-project.org \
--env=TEST_NOTIFICATION_MX=${TEST_NOTIFICATION_MX} \
--env=TEST_NOTIFICATION_TO=${TEST_NOTIFICATION_TO} \
--env=IDLE_POWEROFF=${IDLE_POWEROFF:-} \
ghcr.io/cockpit-project/tasks cockpit-tasks --verbose
ExecStop=/usr/bin/podman rm -f cockpit-tasks-%i
[Install]
WantedBy=multi-user.target
EOF

# mode for elastic cloud runners
if [ -n "${IDLE_POWEROFF:-}" ]; then
mkdir -p /etc/systemd/system/cockpittasks.slice.d
cat <<EOF > /etc/systemd/system/cockpittasks.slice.d/poweroff.conf
[Unit]
StopWhenUnneeded=yes
SuccessAction=poweroff-immediate
EOF
fi

systemctl daemon-reload

for i in `seq $INSTANCES`; do systemctl enable --now cockpit-tasks@$i; done
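
To inspect the slice drop-in without touching the real `/etc`, the heredoc above can be wrapped in a function that takes a root directory. This is a sketch under that assumption; `write_poweroff_dropin` and its root-directory parameter are hypothetical and not part of `install-service`.

```shell
#!/bin/bash
# Hypothetical helper: write the slice drop-in below a configurable root
# directory so the result can be examined before installing it for real.
write_poweroff_dropin() {
    local root="${1:-}"
    local dir="$root/etc/systemd/system/cockpittasks.slice.d"
    mkdir -p "$dir"
    cat <<EOF > "$dir/poweroff.conf"
[Unit]
StopWhenUnneeded=yes
SuccessAction=poweroff-immediate
EOF
}
```

Calling `write_poweroff_dropin "$(mktemp -d)"` writes the file into a scratch directory. Calling it with no argument targets the same path as the script above.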
