
Add support for Talos OS #87

Merged · 3 commits · Apr 28, 2023

Conversation


@gerhard gerhard commented Apr 21, 2023

/etc on Talos OS is read-only, which means that creating new PVs will fail.

This change makes the hard-coded /etc/lvm configurable via the --hostwritepath flag.

This is a requirement for:


⚠️ This also changes the current /run/lock/lvm to /etc/lvm/lock

@gerhard gerhard changed the title Add Talos OS support Add support for Talos OS Apr 21, 2023
gerhard added a commit to gerhard/helm-charts that referenced this pull request Apr 22, 2023
`/etc` on Talos OS is read-only, which means that this chart will fail
to install as is. The new `hostWritePath` option allows the default
`/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath:
/var/etc/lvm` fixes the install issue. This is just half the story.

After this installs, creating PVs will fail since the controller has
some hard-coded paths starting with `/etc`. Here is the second half of
this fix: metal-stack/csi-driver-lvm#87

The first step is to get the other PR merged, then update the image
versions in this one before merging this change.

Learn more about the Talos file system:
https://www.talos.dev/v1.4/learn-more/architecture/

Resolves metal-stack/csi-driver-lvm#86

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
@gerhard gerhard force-pushed the add-talos-support branch 6 times, most recently from 29b339f to 890312f Compare April 22, 2023 21:56

gerhard commented Apr 22, 2023

Given the following local diff:

diff --git a/tests/bats/test.bats b/tests/bats/test.bats
index 6f4d684..bc4e003 100644
--- a/tests/bats/test.bats
+++ b/tests/bats/test.bats
@@ -1,7 +1,7 @@
 #!/usr/bin/env bats -p

 @test "deploy csi-lvm-controller" {
-    run helm upgrade --install --repo ${HELM_REPO} csi-driver-lvm csi-driver-lvm --values values.yaml --wait --timeout=120s
+    run helm upgrade --install csi-driver-lvm helm-charts/charts/csi-driver-lvm --values values.yaml --wait --timeout=120s
     [ "$status" -eq 0 ]
 }

diff --git a/tests/values.yaml b/tests/values.yaml
index e9703d7..2888828 100644
--- a/tests/values.yaml
+++ b/tests/values.yaml
@@ -11,3 +11,4 @@ provisionerImage:

 lvm:
   devicePattern: /dev/loop100,/dev/loop101
+  hostWritePath: /var/etc/lvm

Given a local tests/helm-charts repo with my changes from metal-stack/helm-charts#64, `make test` passed locally:

[screenshot: `make test` passing locally]

FWIW, the first integration test run failed 👇 (a flaky test, I assume), but the second one passed 👆. I refactored all tests to avoid hard-coded sleeps, to test a single thing each, and to leverage `kubectl wait --for=jsonpath='<CONDITION>' --timeout=10s` as much as possible.

[screenshot: integration test runs]

Next, I am going to deploy all these changes in my Talos OS cluster and check that PVs are now created as expected. That will be the test that matters the most to me.

I am unable to check if the GitHub workflow still works. Will need one of the maintainers to approve this PR before it runs.

Is there anything that needs to change before this gets merged?


After this goes in, I can finish metal-stack/helm-charts#64.

@gerhard gerhard marked this pull request as ready for review April 22, 2023 21:57
@gerhard gerhard requested a review from a team as a code owner April 22, 2023 21:57
@gerhard gerhard force-pushed the add-talos-support branch from 890312f to 381c492 Compare April 22, 2023 22:06
gerhard added a commit to gerhard/helm-charts that referenced this pull request Apr 22, 2023
@gerhard gerhard force-pushed the add-talos-support branch 7 times, most recently from e11d28d to 6c55448 Compare April 23, 2023 13:33

gerhard commented Apr 24, 2023

I deployed https://github.com/users/gerhard/packages/container/csi-driver-lvm/87639324?tag=2023-04-23.2 & https://github.com/users/gerhard/packages/container/csi-driver-lvm-provisioner/87639357?tag=2023-04-23.2 in Talos OS v1.4.0 running K8s v1.26.4 and I am able to provision, resize & delete PVs as expected.

This is what the PV looks like:

# k get pv/pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:25Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  resourceVersion: "957454"
  uid: a01ef0a6-1253-4c8a-bb9d-8340cd1d5581
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mariadb
    namespace: wp
    resourceVersion: "957400"
    uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
  csi:
    driver: lvm.csi.metal-stack.io
    fsType: xfs
    volumeAttributes:
      RequiredBytes: "1000000000"
      fsType: xfs
      storage.kubernetes.io/csiProvisionerIdentity: 1682259423049-8081-lvm.csi.metal-stack.io
      type: linear
    volumeHandle: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.lvm.csi/node
          operator: In
          values:
          - h12-london
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
status:
  phase: Bound

And this is the related PVC:

# k get pvc/mariadb -n wp -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"labels":{"app":"wordpress"},"name":"mariadb","namespace":"wp"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1G"}},"storageClassName":"csi-driver-lvm-linear-xfs"}}
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
    volume.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: wordpress
  name: mariadb
  namespace: wp
  resourceVersion: "957456"
  uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
  volumeName: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  phase: Bound

This is what vgdisplay returns:

nsenter --uts --ipc --mount --pid --target 1 vgdisplay

  --- Volume group ---
  VG Name               csi-lvm
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  20
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <465.76 GiB
  PE Size               4.00 MiB
  Total PE              119234
  Alloc PE / Size       478 / <1.87 GiB
  Free  PE / Size       118756 / 463.89 GiB
  VG UUID               5XscWv-q1Ba-xg2a-XvQz-Jvuz-cgiC-9wzMEN

And here are `pvs` & `lvs`:

nsenter --uts --ipc --mount --pid --target 1 pvs
  PV         VG      Fmt  Attr PSize    PFree
  /dev/sda   csi-lvm lvm2 a--  <465.76g 463.89g

nsenter --uts --ipc --mount --pid --target 1 lvs
  LV                                       VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 csi-lvm -wi-ao---- 956.00m
  pvc-99830868-a06a-4dff-904f-483e0dfb5121 csi-lvm -wi-ao---- 956.00m

Docker Build GitHub Actions workflow passed on my fork: https://github.com/gerhard/csi-driver-lvm/actions/runs/4775196008

What else needs to happen before this gets merged?

gerhard added 3 commits April 24, 2023 13:05
`/etc` on Talos OS is read-only, which means that creating new PVs will
fail. This change makes the hard-coded `/etc/lvm` configurable via the
`--hostwritepath` flag.

NOTE that this also changes the current `/run/lock/lvm` to
`/etc/lvm/lock`.

This is a requirement for
metal-stack/helm-charts#64

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
After this change, people who expect integration tests to be
self-contained will not be disappointed. It took me a while to figure
out why **some** integration tests were failing locally. I eventually
found this requirement in this doc page:
https://docs.metal-stack.io/stable/external/csi-driver-lvm/README/. The
GitHub Actions workflow also helped. Even then, the `mknod` command was
not mentioned anywhere, and my NixOS host did not have the special files
/dev/loop100 & /dev/loop101. With this change, `make test` is
self-contained & should work the same on all Linux hosts, whether that
is a local development workstation or GitHub Actions.

Speaking of GitHub Actions, we do not want to run the build-platforms
job if the DOCKER_REGISTRY_TOKEN secret is not set. If we don't check
for this, the job will fail in repo forks, where these secrets are not
available by default. FWIW, `${{ secrets.* }}` expressions are not
available in `if` conditions. The secret value needs to be exposed as an
env var for the `if` condition to work correctly. FTR:
https://github.com/orgs/community/discussions/26726
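The workaround can be sketched like this (the job and secret names come from the discussion above; the step name and `make push` command are hypothetical):

```yaml
jobs:
  build-platforms:
    runs-on: ubuntu-latest
    # ${{ secrets.* }} cannot be referenced in an `if` condition directly;
    # expose the secret as an env var first, then test the env var.
    env:
      DOCKER_REGISTRY_TOKEN: ${{ secrets.DOCKER_REGISTRY_TOKEN }}
    steps:
      - name: Build and push
        if: ${{ env.DOCKER_REGISTRY_TOKEN != '' }}
        run: make push
```

In forks without the secret, the env var is empty and the step is skipped instead of failing.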

I also remembered to remove the loop devices as part of `make
test-cleanup` & to double-check that the loop device has actually been
removed. I have hit a situation where the file was deleted, but
/dev/loop100 was still left dangling; I had to `sudo dmsetup remove` it.

Lastly, Docker CLI is configured to ignore the *.img files. These are
created in the same directory and should not be sent to Docker when
running `docker build`.

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
Remove all hard-coded sleeps **except** the last one, when we delete the
csi-lvm-controller, otherwise PVCs may not get deleted before the
controller is deleted. When this happens, the loop devices will not be
cleared correctly when running `make test-cleanup`.

We also want to test one thing per test, otherwise we may not know why a
test failed. We leverage `kubectl wait --for=jsonpath=` as much as
possible. This way the tests do not need to check for specific strings,
we let `--for=jsonpath=` do that. The best part with this approach is
that we can use the `--timeout` flag. This brings the **entire**
integration test suite duration to 70 seconds. Before this change, the
sleeps alone (170s) would take longer than that.

To double-check for race conditions or flaky tests, I ran all tests
locally 100 times with `RERUN=100 make test`. All 100 runs passed. This
looks good to me!

Separately, I have also tested this in Talos v1.4.0 running K8s 1.26.4.
Everything works as expected now. See this PR comment for more details:
metal-stack#87 (comment)

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
@gerhard gerhard force-pushed the add-talos-support branch from 6c55448 to 53ea613 Compare April 24, 2023 12:16

gerhard commented Apr 24, 2023

Thanks for the approval @majst01! I improved the commit messages; all code changes remained the same.

Just waiting for your 👀 & 👍 @mwennrich


gerhard commented Apr 27, 2023

WDYT @mwennrich ?


@mwennrich mwennrich left a comment


LGTM


jleeh commented Apr 27, 2023

Thanks for making this change @gerhard. I wanted to use this on a Talos cluster, hit the issue, and stumbled across this PR. I can confirm that both your chart and this change are working on our cluster too.


gerhard commented Apr 28, 2023

So glad that this helped @jleeh.

Thanks for the 👍 @mwennrich. Can this be merged now & a new release cut?

Once that is done, I can finish #86 so that a new chart version can be made available.

FWIW, I have deployed this into production last weekend. So far, so good 🤞
[screenshot: production deployment]


gerhard commented Apr 28, 2023

⭐️⭐️⭐️⭐️⭐️ @majst01

@gerhard gerhard deleted the add-talos-support branch April 28, 2023 06:25

Gerrit91 commented May 2, 2023

Awesome additions! Thank you!

gerhard added a commit to gerhard/helm-charts that referenced this pull request Jun 14, 2023
Resolves metal-stack/csi-driver-lvm#86

`/etc` on Talos OS is read-only, which means that this chart will fail
to install as is. The new `hostWritePath` option allows the default
`/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath:
/var/etc/lvm` fixes the install issue. This is just half the story.

After this installs, creating PVs will fail since the controller has
some hard-coded paths starting with `/etc`. Here is the second half of
this fix: metal-stack/csi-driver-lvm#87

The first step is to get that PR 👆 merged, then update the image
versions in this one before merging this change.

Learn more about the Talos file system:
https://www.talos.dev/v1.4/learn-more/architecture/

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
Gerrit91 pushed a commit to metal-stack/helm-charts that referenced this pull request Jun 15, 2023
* Add support for Talos OS

Resolves metal-stack/csi-driver-lvm#86

`/etc` on Talos OS is read-only, which means that this chart will fail
to install as is. The new `hostWritePath` option allows the default
`/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath:
/var/etc/lvm` fixes the install issue. This is just half the story.

After this installs, creating PVs will fail since the controller has
some hard-coded paths starting with `/etc`. Here is the second half of
this fix: metal-stack/csi-driver-lvm#87

The first step is to get that PR 👆 merged, then update the image
versions in this one before merging this change.

Learn more about the Talos file system:
https://www.talos.dev/v1.4/learn-more/architecture/

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>

* Bump version in csi-driver-lvm chart

Otherwise linting fails:
https://github.com/metal-stack/helm-charts/actions/runs/5264062650/jobs/9518891737?pr=64#step:6:32

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>

---------

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>