Add support for Talos OS #87
Conversation
`/etc` on Talos OS is read-only, which means that this chart will fail to install as is. The new `hostWritePath` option allows the default `/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath: /var/etc/lvm` fixes the install issue.

This is just half the story. After this installs, creating PVs will fail since the controller has some hard-coded paths starting with `/etc`. Here is the second half of this fix: metal-stack/csi-driver-lvm#87. The first step is to get the other PR merged, then update the image versions in this one before merging this change.

Learn more about Talos' file system: https://www.talos.dev/v1.4/learn-more/architecture/

Resolves metal-stack/csi-driver-lvm#86

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
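To make the option concrete, here is a minimal sketch of a Helm values file for a Talos OS install using the `hostWritePath` option described above (the `devicePattern` value is only an example taken from this PR's test values; adjust it to your nodes):

```yaml
# values.yaml -- sketch for installing csi-driver-lvm on Talos OS
lvm:
  # example device pattern; adjust to the disks on your nodes
  devicePattern: /dev/loop100,/dev/loop101
  # Talos OS mounts /etc read-only, so point the writable LVM state elsewhere
  hostWritePath: /var/etc/lvm
```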
Force-pushed from 29b339f to 890312f
Given the following local diff:

```diff
diff --git a/tests/bats/test.bats b/tests/bats/test.bats
index 6f4d684..bc4e003 100644
--- a/tests/bats/test.bats
+++ b/tests/bats/test.bats
@@ -1,7 +1,7 @@
 #!/usr/bin/env bats -p

 @test "deploy csi-lvm-controller" {
-  run helm upgrade --install --repo ${HELM_REPO} csi-driver-lvm csi-driver-lvm --values values.yaml --wait --timeout=120s
+  run helm upgrade --install csi-driver-lvm helm-charts/charts/csi-driver-lvm --values values.yaml --wait --timeout=120s
   [ "$status" -eq 0 ]
 }
diff --git a/tests/values.yaml b/tests/values.yaml
index e9703d7..2888828 100644
--- a/tests/values.yaml
+++ b/tests/values.yaml
@@ -11,3 +11,4 @@ provisionerImage:
 lvm:
   devicePattern: /dev/loop100,/dev/loop101
+  hostWritePath: /var/etc/lvm
```

FWIW, the first integration test run failed 👇 - a flaky test, I am assuming - but the second one passed 👆. I refactored all tests to not use hard-coded sleeps, test a single thing, and leverage `kubectl wait`.

Next, I am going to deploy all these changes in my Talos OS cluster and check that PVs are now created as expected. That will be the test that matters the most to me.

I am unable to check if the GitHub workflow still works. Will need one of the maintainers to approve this PR before it runs.

Is there anything that needs to change before this gets merged? After this goes in, I can finish metal-stack/helm-charts#64.
Force-pushed from 890312f to 381c492
Force-pushed from e11d28d to 6c55448
I deployed https://github.com/users/gerhard/packages/container/csi-driver-lvm/87639324?tag=2023-04-23.2 & https://github.com/users/gerhard/packages/container/csi-driver-lvm-provisioner/87639357?tag=2023-04-23.2 in Talos OS.

This is what the PV looks like:

```yaml
# k get pv/pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:25Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  resourceVersion: "957454"
  uid: a01ef0a6-1253-4c8a-bb9d-8340cd1d5581
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mariadb
    namespace: wp
    resourceVersion: "957400"
    uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
  csi:
    driver: lvm.csi.metal-stack.io
    fsType: xfs
    volumeAttributes:
      RequiredBytes: "1000000000"
      fsType: xfs
      storage.kubernetes.io/csiProvisionerIdentity: 1682259423049-8081-lvm.csi.metal-stack.io
      type: linear
    volumeHandle: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.lvm.csi/node
          operator: In
          values:
          - h12-london
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
status:
  phase: Bound
```

And this is the related PVC:

```yaml
# k get pvc/mariadb -n wp -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"labels":{"app":"wordpress"},"name":"mariadb","namespace":"wp"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1G"}},"storageClassName":"csi-driver-lvm-linear-xfs"}}
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
    volume.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: wordpress
  name: mariadb
  namespace: wp
  resourceVersion: "957456"
  uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
  volumeName: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  phase: Bound
```

This is what `vgdisplay` shows:

```shell
nsenter --uts --ipc --mount --pid --target 1 vgdisplay
  --- Volume group ---
  VG Name               csi-lvm
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  20
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <465.76 GiB
  PE Size               4.00 MiB
  Total PE              119234
  Alloc PE / Size       478 / <1.87 GiB
  Free  PE / Size       118756 / 463.89 GiB
  VG UUID               5XscWv-q1Ba-xg2a-XvQz-Jvuz-cgiC-9wzMEN
```

And here are `pvs` & `lvs`:

```shell
nsenter --uts --ipc --mount --pid --target 1 pvs
  PV         VG      Fmt  Attr PSize    PFree
  /dev/sda   csi-lvm lvm2 a--  <465.76g 463.89g

nsenter --uts --ipc --mount --pid --target 1 lvs
  LV                                       VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 csi-lvm -wi-ao---- 956.00m
  pvc-99830868-a06a-4dff-904f-483e0dfb5121 csi-lvm -wi-ao---- 956.00m
```

What else needs to happen before this gets merged?
`/etc` on Talos OS is read-only, which means that new PVs will fail to create. This change makes the hard-coded `/etc/lvm` configurable via the `--hostwritepath` flag. NOTE that this also changes the current `/run/lock/lvm` to `/etc/lvm/lock`.

This is a requirement for metal-stack/helm-charts#64

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
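For context, here is a minimal sketch of how the `--hostwritepath` flag might be wired into the plugin's pod spec (the container name and mount name are assumptions for illustration; the actual wiring lives in the Helm chart, and the path shown is the Talos OS value used in this PR):

```yaml
# Sketch: pass --hostwritepath to the plugin and mount a matching writable
# hostPath, since /etc is read-only on Talos OS
containers:
- name: csi-driver-lvm-plugin   # container name is an assumption
  args:
  - --hostwritepath=/var/etc/lvm
  volumeMounts:
  - name: lvm-state
    mountPath: /var/etc/lvm
volumes:
- name: lvm-state
  hostPath:
    path: /var/etc/lvm
    type: DirectoryOrCreate
```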
After this change, people who expect integration tests to be self-contained will not be disappointed. It took me a while to figure out why **some** integration tests were failing locally. I eventually found out about this requirement in this doc page: https://docs.metal-stack.io/stable/external/csi-driver-lvm/README/. The GitHub Actions workflow also helped. Even then, the `mknod` command was not mentioned anywhere. My NixOS host did not have the special files /dev/loop100 & /dev/loop101 created. With this change, `make test` is self-contained & it should work the same on all Linux hosts, whether it's a local development workstation or GitHub Actions.

Speaking of GitHub Actions, we do not want to run the build-platforms job if the DOCKER_REGISTRY_TOKEN secret is not set. If we don't check for this, the job will fail in repo forks, where these secrets will not be available by default. FWIW, `${{ secrets. }}` is not available in `if` conditions. The secret value needs to be exposed as an env for the `if` condition to work correctly. FTR: https://github.com/orgs/community/discussions/26726

I also remembered to remove the loop devices as part of `make test-cleanup` & double-check that the loop device has actually been removed. I have hit a situation where the file was deleted, but /dev/loop100 was still left dangling. I had to `sudo dmsetup remove` it.

Lastly, the Docker CLI is configured to ignore the *.img files. These are created in the same directory and should not be sent to Docker when running `docker build`.

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
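A sketch of the secrets-in-`if` workaround described above (the job and step names are assumptions; the pattern of exposing the secret as an env var follows the GitHub community discussion linked above):

```yaml
# Expose the secret as an env var at the workflow level, then gate the step
# on it: `secrets.*` cannot be used directly in `if`, but `env.*` can be
# checked at the step level, and it is empty in forks without the secret.
env:
  DOCKER_REGISTRY_TOKEN: ${{ secrets.DOCKER_REGISTRY_TOKEN }}

jobs:
  build-platforms:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build & push images
        if: env.DOCKER_REGISTRY_TOKEN != ''
        run: echo "Secret is available, building..."
```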
Remove all hard-coded sleeps **except** the last one, when we delete the csi-lvm-controller, otherwise PVCs may not get deleted before the controller is deleted. When this happens, the loop devices will not be cleared correctly when running `make test-cleanup`.

We also want to test one thing per test, otherwise we may not know why a test failed. We leverage `kubectl wait --for=jsonpath=` as much as possible. This way the tests do not need to check for specific strings; we let `--for=jsonpath=` do that. The best part of this approach is that we can use the `--timeout` flag.

This brings the **entire** integration test suite duration to 70 seconds. Before this change, the sleeps alone (170s) would take longer than that. To double-check for race conditions or flaky tests, I ran all tests locally 100 times with `RERUN=100 make test`. All 100 runs passed. This looks good to me!

Separately, I have also tested this in Talos v1.4.0 running K8s 1.26.4. Everything works as expected now. See this PR comment for more details: metal-stack#87 (comment)

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
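To illustrate the `kubectl wait --for=jsonpath=` pattern described above, here is a sketch of what one such bats test might look like (the test name, PVC name, and timeout are illustrative, not the PR's exact code):

```shell
@test "pvc gets bound" {
  # No hard-coded sleep: kubectl polls the jsonpath until it matches
  # or the timeout expires, and the exit status tells us which happened
  run kubectl wait pvc/lvm-pvc \
    --for=jsonpath='{.status.phase}'=Bound \
    --timeout=60s
  [ "$status" -eq 0 ]
}
```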
Force-pushed from 6c55448 to 53ea613
Thanks for the approval @majst01! I improved the commit messages; all code changes remained the same. Just waiting for your 👀 & 👍 @mwennrich
WDYT @mwennrich?
LGTM
Thanks for making this change @gerhard. I just wanted to use this on a Talos cluster, hit the issue, and stumbled across this PR. I can confirm that both your chart and this change are working on our cluster too.
So glad that this helped @jleeh. Thanks for the 👍 @mwennrich. Can this be merged now & a new release cut? Once that is done, I can finish #86 so that a new chart version can be made available. FWIW, I deployed this into production last weekend. So far, so good 🤞
⭐️⭐️⭐️⭐️⭐️ @majst01

Awesome additions! Thank you!
Resolves metal-stack/csi-driver-lvm#86

`/etc` on Talos OS is read-only, which means that this chart will fail to install as is. The new `hostWritePath` option allows the default `/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath: /var/etc/lvm` fixes the install issue.

This is just half the story. After this installs, creating PVs will fail since the controller has some hard-coded paths starting with `/etc`. Here is the second half of this fix: metal-stack/csi-driver-lvm#87. The first step is to get that PR 👆 merged, then update the image versions in this one before merging this change.

Learn more about Talos' file system: https://www.talos.dev/v1.4/learn-more/architecture/

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
* Add support for Talos OS

  Resolves metal-stack/csi-driver-lvm#86

  `/etc` on Talos OS is read-only, which means that this chart will fail to install as is. The new `hostWritePath` option allows the default `/etc/lvm` to be overwritten. For Talos OS, setting `hostWritePath: /var/etc/lvm` fixes the install issue. This is just half the story. After this installs, creating PVs will fail since the controller has some hard-coded paths starting with `/etc`. Here is the second half of this fix: metal-stack/csi-driver-lvm#87. The first step is to get that PR 👆 merged, then update the image versions in this one before merging this change. Learn more about Talos' file system: https://www.talos.dev/v1.4/learn-more/architecture/

* Bump version in csi-driver-lvm chart

  Otherwise linting fails: https://github.com/metal-stack/helm-charts/actions/runs/5264062650/jobs/9518891737?pr=64#step:6:32

Signed-off-by: Gerhard Lazu <gerhard@lazu.ch>
`/etc` on Talos OS is read-only, which means that new PVs will fail to create. This change makes the hard-coded `/etc/lvm` configurable via the `--hostwritepath` flag. This also changes the current `/run/lock/lvm` to `/etc/lvm/lock`. This is a requirement for metal-stack/helm-charts#64.