Maximise GH runner space step failing in some repositories CIs #813

Closed
DnPlas opened this issue Feb 4, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@DnPlas
Contributor

DnPlas commented Feb 4, 2024

Bug Description

Running the automated integration tests in the CI on a PR is not possible because the Maximise GH runner space step is failing with the following message:

Unmounting and removing swap file.
Creating LVM Volume.
  Creating LVM PV on root fs.
fallocate: invalid length value specified
Error: Process completed with exit code 1.

To Reproduce

Create a PR in any of the Charmed Kubeflow owned repositories where the aforementioned step is used.

Environment

CI environment.

Relevant Log Output

Besides the one provided already, you can refer to this error

Affected repositories (from PRs)

@DnPlas DnPlas added the bug Something isn't working label Feb 4, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5300.

This message was autogenerated

@DnPlas DnPlas changed the title Maximise GH runner space step failing in some repositories' CIs Maximise GH runner space step failing in some repositories CIs Feb 4, 2024
@NohaIhab
Contributor

NohaIhab commented Feb 5, 2024

There's a similar issue filed in the action's repo easimon/maximize-build-space#38

@NohaIhab
Contributor

NohaIhab commented Feb 5, 2024

Looking at the runner's storage under the two Ubuntu image versions:
image version 20240126.1.0 (the older one):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   54G   30G  65% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  1.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/loop0       64M   64M     0 100% /snap/core20/2015
/dev/loop1       41M   41M     0 100% /snap/snapd/20290
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/loop2       92M   92M     0 100% /snap/lxd/24061
/dev/sdb1        63G  4.1G   56G   7% /mnt
tmpfs           1.6G     0  1.6G   0% /run/user/1001

image version 20240131.1.0 (the current one):

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   54G   19G  75% /
devtmpfs        7.9G     0  7.9G   0% /dev
tmpfs           7.9G  4.0K  7.9G   1% /dev/shm
tmpfs           1.6G  1.2M  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/loop0       64M   64M     0 100% /snap/core20/2105
/dev/sdb15      105M  6.1M   99M   6% /boot/efi
/dev/loop2       41M   41M     0 100% /snap/snapd/20671
/dev/loop1       92M   92M     0 100% /snap/lxd/24061
/dev/sda1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G     0  1.6G   0% /run/user/1001

It seems that the distribution of space on the runner has changed:

  • the root disk dropped from a total of 84G to 73G, i.e. a decrease of 11G
  • the temp disk mounted on /mnt grew from a total of 63G to 74G, i.e. an increase of 11G

So the action can no longer free up the 40G on the root filesystem specified in the input root-reserve-mb.
In the past the action was able to raise the available space on root from 30G to 40G, so now it should be able to raise it from 19G to 29G. That means setting the input root-reserve-mb to 29696 (the input is in MB).

Note that the total space freed at the end should not be affected, because the extra space on the temp disk will be used instead.
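
The arithmetic behind the value 29696 can be checked with a short shell sketch. The helper name gib_to_mib is made up for illustration, and the sketch assumes 1G = 1024 MB, which is what the figure 29696 implies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a whole number of G into MB,
# assuming 1G = 1024 MB (the unit implied by 29696).
gib_to_mib() {
  echo $(( $1 * 1024 ))
}

# Root filesystem totals reported by df -h for the two image versions.
old_root_gib=84
new_root_gib=73
shrink_gib=$(( old_root_gib - new_root_gib ))   # 11

# The old setting (40G) minus the space the root disk lost.
new_reserve_gib=$(( 40 - shrink_gib ))          # 29

# Value to use for root-reserve-mb.
gib_to_mib "$new_reserve_gib"
```

Running the script prints the number that was used for root-reserve-mb in the test run.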

@NohaIhab
Contributor

NohaIhab commented Feb 5, 2024

I tested with root-reserve-mb set to 29696 in this run

It would be ideal if we could make the input dynamic, but I'm not sure whether that's feasible.
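
One way a dynamic value might be derived, as a rough sketch only: an earlier step reads the space available on / and caps the desired reserve so it never exceeds what is actually there. The helper name capped_reserve_mb, the 40960 target and the 1024 MB margin are assumptions for illustration. Passing the result to the action's root-reserve-mb input (for example through a step output) is not shown, and the sketch does not account for any space the action frees itself before it creates the volume.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap a desired reserve (MB) so that it never exceeds
# the space available on root minus a safety margin (MB).
capped_reserve_mb() {
  local desired=$1 avail=$2 margin=$3
  local limit=$(( avail - margin ))
  if (( limit < 0 )); then limit=0; fi
  if (( desired < limit )); then echo "$desired"; else echo "$limit"; fi
}

# Space currently available on / in MB.
# df -Pk prints one line per filesystem in 1K blocks; column 4 is "Available".
avail_mb=$(df -Pk / | awk 'NR == 2 { print int($4 / 1024) }')

# Ask for 40960 MB, but never more than what is available minus 1024 MB.
capped_reserve_mb 40960 "$avail_mb" 1024
```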

@DnPlas
Contributor Author

DnPlas commented Feb 5, 2024

Thanks for looking into it, @NohaIhab! I think the approach is right: we need to decrease the root-reserve-mb number to fit the new storage size. Let's do a cannon run for all the affected repositories.

DnPlas added a commit to canonical/charmed-kubeflow-workflows that referenced this issue Feb 6, 2024
Due to a change in the GH runners storage, we cannot longer reserve more than ~30GB of storage with
the easimon/maximize-build-space action, as we'd be hitting the following error:
  fallocate: invalid length value specified
This commit changes the value to 29696.

Part of canonical/bundle-kubeflow#813
DnPlas pushed a commit to canonical/charmed-kubeflow-workflows that referenced this issue Feb 6, 2024
Due to a change in the GH runners storage, we cannot longer reserve more than ~30GB of storage with the easimon/maximize-build-space action, as we'd be hitting the following error:
fallocate: invalid length value specified
This commit changes the value to 29696 to avoid issues.

Part of canonical/bundle-kubeflow#813
orfeas-k pushed a commit to canonical/katib-operators that referenced this issue Feb 9, 2024
orfeas-k pushed a commit to canonical/seldonio-rocks that referenced this issue Feb 9, 2024
orfeas-k pushed a commit to canonical/kfp-operators that referenced this issue Feb 9, 2024
DnPlas added a commit to canonical/seldon-core-operator that referenced this issue Feb 14, 2024
ci: Decrease root-reserve-mb to fit the new runner storage
addresses canonical/bundle-kubeflow#813

Co-authored-by: Noha Ihab <49988746+NohaIhab@users.noreply.github.com>
NohaIhab added a commit to canonical/mlflow-operator that referenced this issue Mar 6, 2024
* ci: Decrease root-reserve-mb to fit the new runner storage (#223)
to fix canonical/bundle-kubeflow#813
* fix: use mysql-k8s `8.0/edge` in the integration tests and bundles (#225)
* fix: use mysql-k8s edge
* fix: add comment to revert to stable
NohaIhab added a commit to canonical/katib-operators that referenced this issue Mar 6, 2024
NohaIhab added a commit to canonical/mlflow-operator that referenced this issue Mar 6, 2024
NohaIhab added a commit to canonical/kfp-operators that referenced this issue Mar 6, 2024
@misohu
Member

misohu commented Apr 5, 2024

I still see the issue e.g. in canonical/kfp-operators#416

It looks like there is not enough disk space even after @NohaIhab's PR. I debugged by SSHing into the worker after the tests failed, and disk space is almost completely exhausted:

runner@fv-az572-42:~/work/kfp-operators/kfp-operators$ df
Filesystem                  1K-blocks     Used Available Use% Mounted on
/dev/root                    76026616 74390372   1619860  99% /
devtmpfs                      8183156        0   8183156   0% /dev
tmpfs                         8187672        4   8187668   1% /dev/shm
tmpfs                         1637536     3212   1634324   1% /run
tmpfs                            5120        0      5120   0% /run/lock
tmpfs                         8187672        0   8187672   0% /sys/fs/cgroup
/dev/sdb15                     106858     6186    100673   6% /boot/efi
/dev/loop0                      65536    65536         0 100% /snap/core20/2182
/dev/loop1                      40064    40064         0 100% /snap/snapd/21184
/dev/loop2                      94080    94080         0 100% /snap/lxd/24061
/dev/sda1                    76829444 71714284   1166720  99% /mnt
tmpfs                         1637532        0   1637532   0% /run/user/1001
/dev/mapper/buildvg-buildlv  76661516   264724  76380408   1% /home/runner/work/kfp-operators/kfp-operators
/dev/loop5                     106496   106496         0 100% /snap/core/16928
/dev/loop6                      76032    76032         0 100% /snap/core22/1122
/dev/loop7                     152192   152192         0 100% /snap/lxd/27049
tmpfs                            1024        0      1024   0% /var/snap/lxd/common/ns
/dev/loop8                      93568    93568         0 100% /snap/juju/25751
/dev/loop9                        256      256         0 100% /snap/jq/6
/dev/loop10                     28032    28032         0 100% /snap/charm/712
/dev/loop11                     29312    29312         0 100% /snap/charmcraft/2453
/dev/loop12                      1536     1536         0 100% /snap/juju-bundle/25
/dev/loop13                     12544    12544         0 100% /snap/juju-crashdump/271
/dev/loop14                     57088    57088         0 100% /snap/core18/2812
/dev/loop15                    167552   167552         0 100% /snap/microk8s/6575
/dev/loop16                     12288    12288         0 100% /snap/kubectl/3206
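
To pick the exhausted filesystems out of a listing like the one above, a small filter can help. This is a sketch only: the function name nearly_full and the 95% threshold are arbitrary choices for illustration, and loop devices are skipped because read-only snap mounts always show 100%.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read df output on stdin and print the mount points
# whose Use% is at or above the given threshold, ignoring /dev/loop* snaps.
# Mount points containing spaces are not handled.
nearly_full() {
  awk -v limit="$1" 'NR > 1 && $1 !~ /^\/dev\/loop/ {
    use = $5
    sub(/%/, "", use)          # strip the trailing percent sign
    if (use + 0 >= limit) print $6
  }'
}

# -P keeps every filesystem on a single line, which makes column parsing safe.
df -P | nearly_full 95
```

Fed the listing above, the filter would report / and /mnt, the two filesystems at 99%.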

I can also see pods not being scheduled because of disk pressure:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m13s  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

I did more digging, and the main problem is the images in microk8s that we use in the kfp-bundle tests, which take 15 GB of disk space. Some of these images are deployed multiple times as pods.

@ca-scribner
Contributor

While investigating canonical/kfp-operators#426, I noticed that the easimon/maximize-build-space action no longer frees up nearly as much space as it used to:

  • Prior to Jan 2024, the action left us with ~40GB free space
  • After Jan 2024, the action left us with ~29GB free space

We can see that in the CI logs:

Prior to Jan 2024:

log snippet from disk space step

Run echo "Memory and swap:"
echo "Memory and swap:"
free
echo
swapon --show
echo

echo "Available storage:"
df -h
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
Memory and swap:
              total        used        free      shared  buff/cache   available
Mem:       16375356      689700    13991684       34480     1693972    15297100
Swap:       4194300           0     4194300

NAME      TYPE      SIZE USED PRIO
/dev/dm-0 partition   4G   0B   -2

Available storage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   44G   40G  52% /
...

After Jan 2024:

log snippet from disk space step

Run echo "Memory and swap:"
Memory and swap:
              total        used        free      shared  buff/cache   available
Mem:       16375356      707068    14054144       34340     1614144    15280072
Swap:       4194300           0     4194300

NAME      TYPE      SIZE USED PRIO
/dev/dm-0 partition   4G   0B   -2

Available storage:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   44G   29G  60% /
...

This probably happened at the same time as the GH runner change discussed in this issue.

@ca-scribner
Contributor

fwiw, I've found the jlumbroso/free-disk-space action with default settings is working better, leaving a runner with ~45GB free after execution. An example of this is in kfp's tests

orfeas-k added a commit to canonical/training-operator that referenced this issue Apr 29, 2024
@DnPlas
Contributor Author

DnPlas commented Jun 3, 2024

I haven't seen this issue anymore, and we have changed a lot of CIs around. I will close it because none of our CIs that use the maximise runner space action are failing anymore (see all the attached PRs and commits).

@DnPlas DnPlas closed this as completed Jun 3, 2024