Talos Linux support #5707

Closed
Hex4dec1mal opened this issue Nov 15, 2023 · 3 comments · Fixed by #5766
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
  • reported-by/end-user: Issues reported by end users.

Comments

@Hex4dec1mal

Talos Linux does not ship modprobe, so the install-cni.sh script fails and the CNI cannot be installed, even though the necessary module is already built into the kernel.

In addition, CAP_SYS_MODULE is blocked in Talos Linux, which also makes it impossible to install the CNI with the standard configuration.

Finally, because the file system in Talos Linux is read-only, the install-cni.sh script cannot install the binaries it needs onto the host.

Is it possible to solve this problem?

@Hex4dec1mal added the kind/feature label Nov 15, 2023
@antoninbas
Contributor

I think we can take care of this by adding the right configuration options to the Helm chart. I see that Talos has installation instructions for Cilium, which includes the correct Helm values to provide: https://www.talos.dev/v1.5/kubernetes-guides/network/deploying-cilium/

2 things I am not sure about:

  1. CNI binary installation: there is no way around it actually, the CNI binary must be copied to the host file system (/opt/cni/bin). This is the case for all K8s CNI plugins. Are you sure that Talos doesn't make an exception for this directory?
  2. Antrea Agent runs in privileged mode. Would that be a problem for Talos?

@antoninbas added the priority/awaiting-more-evidence label (Lowest priority. Possibly useful, but not yet enough support to actually get it done.) Nov 15, 2023
@dm3ch

dm3ch commented Nov 15, 2023

  1. /opt/cni/bin may be writable, because it's mentioned in the docs. But there are other paths in the install script, and one of them might fail.
  2. CAP_SYS_MODULE is restricted by Talos (https://www.talos.dev/v1.5/learn-more/process-capabilities/) and the Helm chart currently has no option to disable it - https://github.com/antrea-io/antrea/blob/main/build/charts/antrea/templates/agent/daemonset.yaml#L100
  3. Running a pod in privileged mode is not a problem.
  4. But modprobe in the install script is a problem - https://github.com/antrea-io/antrea/blob/main/build/images/scripts/install_cni#L58 , because Talos doesn't include it, as far as I understand. The needed kernel modules should already be built into the kernel - https://github.com/siderolabs/pkgs/blob/252a59ffe374ce98c71b0c9b959e691addd38919/kernel/build/config-amd64#L1687-L1690 . So we need to somehow bypass modprobe.
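
For anyone who wants to confirm the built-in claim against a kernel config file, here is a minimal shell sketch. It assumes the relevant option is `CONFIG_OPENVSWITCH` (the kernel's Kconfig symbol for the Open vSwitch datapath; the linked config lines are not reproduced here) and that a copy of the config file is available locally. `kconfig_state` is a hypothetical helper name, not something provided by Antrea or Talos.

```shell
#!/bin/sh
# kconfig_state OPTION CONFIG_FILE
# Hypothetical helper: prints "builtin" if OPTION=y, "module" if OPTION=m,
# and "unset" otherwise, based on a plain-text kernel config file.
kconfig_state() {
    option="$1"
    config_file="$2"
    if grep -q "^${option}=y\$" "$config_file"; then
        echo builtin   # compiled into the kernel image, no modprobe needed
    elif grep -q "^${option}=m\$" "$config_file"; then
        echo module    # built as a loadable module
    else
        echo unset     # disabled or not present in this config
    fi
}
```

Usage would look like `kconfig_state CONFIG_OPENVSWITCH config-amd64`, where `config-amd64` is a local copy of the linked file.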

@antoninbas self-assigned this Nov 15, 2023
antoninbas added a commit to antoninbas/antrea that referenced this issue Nov 16, 2023
When running on some K8s distributions, users may want to adjust the
securityContext for antrea-agent containers. This is reserved for "power
users", and most users should not modify the default values. When
modifying the securityContext, some Antrea functions may break.

For antrea-io#5707

Signed-off-by: Antonin Bas <abas@vmware.com>
antoninbas added a commit that referenced this issue Nov 20, 2023
When running on some K8s distributions, users may want to adjust the
securityContext for antrea-agent containers. This is reserved for "power
users", and most users should not modify the default values. When
modifying the securityContext, some Antrea functions may break.

For #5707

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas removed the priority/awaiting-more-evidence label Nov 23, 2023
@antoninbas
Contributor

An update on this:

  • we have merged "Make Pod securityContext configurable in antrea Helm chart" (#5718), which lets users adjust the securityContext for Antrea containers and, in particular, drop capabilities
  • we still need a way to skip loading kernel modules in both the install-cni initContainer and the antrea-ovs container (the OVS start script will also try to run modprobe openvswitch)
    • when running Talos in Docker (for dev only), modprobe will always fail (see footnote 1)
    • even when running Talos in VMs (QEMU in my case), modprobe fails, which surprises me. The /lib/modules/$(uname -r)/modules.builtin file seems correct, and modprobe should not try to load modules listed in this file (and hence should not fail). I have opened an issue for this: "Invalid /lib/modules/6.1.61-talos/modules.builtin.alias.bin file?" (siderolabs/talos#7980)

Footnotes

  1. This is the reason modprobe fails in Docker:
    In a Talos node, module information is in /lib/modules/6.1.61-talos/ (the kernel version changes with the Talos version), but uname reports the kernel version of the host (in my case 6.4.16-linuxkit, given that I am using Docker Desktop). Tools usually check for built-in or available modules under /lib/modules/$(uname -r), which does not work here because of the kernel mismatch.
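
To illustrate the footnote, here is a minimal shell sketch of the kind of lookup described above. `is_builtin` is a hypothetical helper, not part of modprobe or Antrea, and it only approximates what real tools do: it searches the plain-text `modules.builtin` file and ignores the binary index files and hyphen/underscore aliasing. Passing the modules directory explicitly shows why the default `$(uname -r)` path misses when the container and the host disagree on the kernel version.

```shell
#!/bin/sh
# is_builtin MODULE [MODULES_DIR]
# Hypothetical helper: succeeds if MODULE is listed in MODULES_DIR/modules.builtin.
# MODULES_DIR defaults to /lib/modules/$(uname -r), the conventional location.
# Returns 2 if the modules.builtin file cannot be read at all.
is_builtin() {
    module="$1"
    modules_dir="${2:-/lib/modules/$(uname -r)}"
    builtin_file="$modules_dir/modules.builtin"
    [ -r "$builtin_file" ] || return 2
    # modules.builtin lists one path per line, for example:
    #   kernel/net/openvswitch/openvswitch.ko
    grep -q "/${module}\.ko\$" "$builtin_file"
}
```

With the versions quoted in the footnote, `is_builtin openvswitch /lib/modules/6.1.61-talos` points at the directory that exists in the Talos node, whereas a bare `is_builtin openvswitch` would look under the host's 6.4.16-linuxkit directory and return 2 if that directory is absent.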

antoninbas added a commit to antoninbas/antrea that referenced this issue Nov 29, 2023
In order to support some specialized distributions, we may need to
provide users with the ability to skip loading kernel modules. In
particular, this is required to support Talos Linux (see antrea-io#5707).

Signed-off-by: Antonin Bas <abas@vmware.com>
antoninbas added a commit that referenced this issue Dec 1, 2023
In order to support some specialized distributions, we may need to
provide users with the ability to skip loading kernel modules. In
particular, this is required to support Talos Linux (see #5707).

The Antrea Agent may try to load modules in 2 places:

 1. in the install-cni initContainer: we try to load modules, mostly as
    a sanity check. If loading the openvswitch module fails, the
    container fails.
 2. in the antrea-ovs container: this is outside of our direct control,
    but the ovs-ctl start script will try to load the openvswitch module
    if not detected.

For install-cni, we introduce an environment variable,
SKIP_LOADING_KERNEL_MODULES. If set, we do not run modprobe at all.

For antrea-ovs, we introduce a new flag, `--skip-kmod`, to the start_ovs
script. If provided, we ensure that ovs-ctl will not try to run
modprobe, by replacing the ovs-kmod-ctl utility script by a no-op.

To simplify usage, we introduce a new Helm configuration value,
`agent.dontLoadKernelModules`. If set to true, we will take care of both
configurations above. It will also cause the host's /lib/modules to no
longer be mounted.

Note that even when skipping "explicit" Kernel module loading, the
module will still be automatically loaded on the host when starting OVS
if needed. This seems to be expected for recent Linux Kernel versions.

With this change, Antrea can run on Talos Linux (confirmed with both the
Docker and QEMU provisioners).

As part of this change, we also introduce the `agent.antreaOVS.extraEnv`
Helm value, to inject arbitrary environment variables in the antrea-ovs
container. This is for parity with other antrea-agent containers, and is
not strictly required.

Signed-off-by: Antonin Bas <abas@vmware.com>
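
The two mechanisms described in the commit message above can be sketched in shell as follows. This is not the actual install_cni or start_ovs code. Only the `SKIP_LOADING_KERNEL_MODULES` variable name and the idea of replacing a helper script with a no-op come from the commit message. The function names, the `MODPROBE` override, and the choice to treat any non-empty value as "set" are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of an environment-variable guard around modprobe.
# MODPROBE is a hypothetical override so the function can be exercised
# without root privileges or a real modprobe binary.
load_kernel_modules() {
    if [ -n "${SKIP_LOADING_KERNEL_MODULES:-}" ]; then
        # Assumption: any non-empty value means "skip".
        echo "skipping kernel module loading"
        return 0
    fi
    "${MODPROBE:-modprobe}" openvswitch || {
        echo "failed to load the openvswitch module" >&2
        return 1
    }
}

# Sketch of the "replace a helper script by a no-op" idea.
# TARGET is whatever script should stop doing work; the real path of
# ovs-kmod-ctl is not given in the thread, so it is a parameter here.
make_noop() {
    target="$1"
    printf '#!/bin/sh\nexit 0\n' > "$target" && chmod +x "$target"
}
```

With the guard in place, a failure to load the module only fails the container when the variable is unset, which matches the behavior the commit message describes for install-cni.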
antoninbas added a commit to antoninbas/antrea that referenced this issue Dec 1, 2023
Starting with Antrea v1.15, Antrea can be used as the CNI for Talos
clusters. This requires custom Helm values.

This support was tested using both the Docker provisioner and the QEMU
provisioner.

Fixes antrea-io#5707

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas added the lifecycle/active label Dec 1, 2023
@antoninbas added this to the Antrea v1.15 release milestone Dec 1, 2023
antoninbas added a commit that referenced this issue Dec 4, 2023
Starting with Antrea v1.15, Antrea can be used as the CNI for Talos
clusters. This requires custom Helm values.

This support was tested using both the Docker provisioner and the QEMU
provisioner.

Fixes #5707

Signed-off-by: Antonin Bas <abas@vmware.com>
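
As a rough illustration of the "custom Helm values" mentioned above, the shell sketch below writes a values file containing the one setting named in this thread, `agent.dontLoadKernelModules`. It is not the documented Talos configuration: other values, such as the securityContext adjustments enabled by #5718, may also be required, and they are omitted because their names do not appear in this thread. `write_talos_values` is a hypothetical helper, and the chart repository URL, release name, and namespace in the trailing comments are assumptions to check against the Antrea documentation.

```shell
#!/bin/sh
# write_talos_values FILE
# Hypothetical helper: writes a minimal Helm values file that sets
# agent.dontLoadKernelModules, the value introduced for this issue.
write_talos_values() {
    cat > "$1" <<'EOF'
agent:
  dontLoadKernelModules: true
EOF
}

# Example invocation (not executed here; needs helm and a running cluster,
# and the repository URL and namespace below are assumptions):
#   write_talos_values talos-values.yaml
#   helm repo add antrea https://charts.antrea.io
#   helm install antrea antrea/antrea --namespace kube-system -f talos-values.yaml
```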
@tnqn added the reported-by/end-user label Dec 11, 2023