Maximise GH runner space step failing in some repositories' CIs #813
Comments
Thank you for reporting your feedback to us! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5300.
There's a similar issue filed in the action's repo: easimon/maximize-build-space#38
Looking at the runner's storage with the different Ubuntu image versions, it seems that the distribution of space on the runner has changed, so the action is no longer able to free as much space on the root filesystem. Note that the freed-up space at the end should not be affected, because the extra space in the temp disk will be utilised.
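For context, a minimal diagnostic step like the following (hypothetical, not taken from the affected CIs) shows how the runner's disk layout can be inspected to compare image versions:

```yaml
# Hypothetical diagnostic step, not from the affected workflows:
# prints the runner image identifiers and the block device /
# filesystem layout so the space distribution can be compared
# across runner image versions.
- name: Show runner storage layout
  run: |
    echo "Image: $ImageOS $ImageVersion"  # env vars set on GitHub-hosted runners
    lsblk
    df -h
```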
I tested with a decreased root-reserve-mb; it would be ideal if we could make the input dynamic, but I'm not sure if that's feasible.
Thanks for looking into it @NohaIhab! I think the approach is right, we need to decrease the root-reserve-mb value.
Due to a change in the GH runners' storage, we can no longer reserve more than ~30GB of storage with the easimon/maximize-build-space action, as we'd be hitting the following error: fallocate: invalid length value specified. This commit changes the value to 29696. Part of canonical/bundle-kubeflow#813
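A sketch of what the adjusted step would look like (the action's version tag is illustrative; root-reserve-mb is the input named in the commit above):

```yaml
# Sketch of the fix described above: reserve less space on the
# root filesystem so the action's fallocate call stays within
# the new runner storage limits. The version tag is illustrative.
- name: Maximise GH runner space
  uses: easimon/maximize-build-space@v8
  with:
    root-reserve-mb: 29696  # value from the commit referenced above
```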
ci: Decrease root-reserve-mb to fit the new runner storage. Addresses canonical/bundle-kubeflow#813 Co-authored-by: Noha Ihab <49988746+NohaIhab@users.noreply.github.com>
* ci: Decrease root-reserve-mb to fit the new runner storage (#223) to fix canonical/bundle-kubeflow#813
* fix: use mysql-k8s `8.0/edge` in the integration tests and bundles (#225)
* fix: use mysql-k8s edge
* fix: add comment to revert to stable
I still see the issue, e.g. in canonical/kfp-operators#416. It looks like there is not enough disk space even after @NohaIhab's PR. I debugged by SSHing into the worker after the failed tests; disk space is almost completely exhausted.
I can also see pods not being scheduled because of disk pressure on the node.
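A hedged sketch of how this could be confirmed on the runner (microk8s ships kubectl; exact output formats may vary):

```yaml
# Hypothetical diagnostic step: checks whether the microk8s node
# reports the DiskPressure condition or a corresponding taint,
# which would explain the unschedulable pods.
- name: Check node disk pressure
  run: |
    microk8s kubectl get nodes
    microk8s kubectl describe nodes | grep -iE 'pressure|taint'
```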
Did some more digging, and the main problem is the images in microk8s which we use in the kfp-bundle tests, which take ~15GB of disk space. Some of these images are deployed multiple times as pods.
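One way to identify those images (a sketch; microk8s embeds containerd, whose ctr CLI lists cached images with their sizes):

```yaml
# Hypothetical diagnostic step: lists the images cached in
# microk8s' containerd store, including their sizes, to spot
# the largest ones.
- name: List microk8s images by size
  run: microk8s ctr images ls
```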
In investigating canonical/kfp-operators#426 I noticed how the output of the disk space step changed.
[two collapsed log snippets from the disk space step: a step running echo "Memory and swap:" and echo "Available storage:", showing the swap table header (NAME TYPE SIZE USED PRIO) and the available storage]
Probably this happened at the same time as the GH runner change that is discussed in this issue.
FWIW, I've found the jlumbroso/free-disk-space action with default settings is working better, leaving a runner with ~45GB free after execution. An example of this is in kfp's tests.
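For reference, adopting that action with its defaults is a one-step change (a sketch; the version tag matches the commit below):

```yaml
# Sketch: jlumbroso/free-disk-space with default settings, pinned
# to the version referenced in the commit below. With no inputs,
# the action removes its default set of preinstalled packages
# and caches.
- name: Free disk space
  uses: jlumbroso/free-disk-space@v1.3.1
```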
Use jlumbroso/free-disk-space@v1.3.1 Ref canonical/bundle-kubeflow#813
I haven't seen this issue anymore, and we have changed a lot of CIs around. I will close it because all our CIs that use the maximise runner space action are no longer failing (see all the attached PRs and commits).
Bug Description
Running the automated integration tests in the CI on a PR is not possible, as the Maximise GH runner space step is failing with the following message: `fallocate: invalid length value specified`.
To Reproduce
Create a PR in any of the Charmed Kubeflow-owned repositories where the aforementioned step is used.
Environment
CI environment.
Relevant Log Output
Besides the one provided already, you can refer to this error.
Affected repositories (from PRs)