F34: systemd.path doesn't always work on first boot #861

Closed
wkruse opened this issue Jun 11, 2021 · 16 comments · Fixed by coreos/fedora-coreos-config#1113

@wkruse

wkruse commented Jun 11, 2021

Describe the bug
We are using Typhoon (https://typhoon.psdn.io/fedora-coreos/bare-metal/) to provision Fedora CoreOS and Kubernetes on bare metal. Since upgrading to F34 (last tested version 34.20210518.3.0), the following systemd path unit often, but not always, fails to activate on first boot:

variant: fcos
version: 1.2.0
systemd:
  units:
    - name: kubelet.path
      enabled: true
      contents: |
        [Unit]
        Description=Watch for kubeconfig
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target

The directory is created by Ignition:

storage:
  directories:
    - path: /etc/kubernetes

The logs show that kubelet.path is written by Ignition and started by systemd, and that the kubeconfig is moved to /etc/kubernetes/kubeconfig, yet kubelet.path keeps waiting:

Jun 11 14:37:23 localhost ignition[876]: INFO     : files: op(21): [started]  processing unit "kubelet.path"
Jun 11 14:37:23 localhost ignition[876]: INFO     : files: op(21): op(22): [started]  writing unit "kubelet.path" at "/sysroot/etc/systemd/system/kubelet.path"
Jun 11 14:37:23 localhost ignition[876]: INFO     : files: op(21): op(22): [finished] writing unit "kubelet.path" at "/sysroot/etc/systemd/system/kubelet.path"
Jun 11 14:37:23 localhost ignition[876]: INFO     : files: op(21): [finished] processing unit "kubelet.path"
...
Jun 11 14:37:24 localhost ignition[876]: INFO     : files: op(2a): [started]  setting preset to enabled for "kubelet.path"
Jun 11 14:37:24 localhost ignition[876]: INFO     : files: op(2a): [finished] setting preset to enabled for "kubelet.path"
...
Jun 11 14:37:39 xxx systemd[1]: Started Watch for kubeconfig.
...
Jun 11 14:38:07 xxx sudo[4306]:     core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/usr/bin/mv /var/home/core/kubeconfig /etc/kubernetes/kubeconfig

Restarting kubelet.path fixes it:

Jun 11 16:22:54 xxx sudo[11343]:     core : TTY=pts/1 ; PWD=/var/home/core ; USER=root ; COMMAND=/usr/bin/systemctl restart kubelet.path
...
Jun 11 16:22:54 xxx systemd[1]: kubelet.path: Deactivated successfully.
Jun 11 16:22:54 xxx systemd[1]: Stopped Watch for kubeconfig.
Jun 11 16:22:54 xxx systemd[1]: Stopping Watch for kubeconfig.
Jun 11 16:22:54 xxx systemd[1]: Started Watch for kubeconfig.
...
Jun 11 16:22:54 xxx systemd[1]: Starting Kubelet (System Container)...

Is this a systemd bug? What can we do to avoid the manual step during provisioning?

Expected behavior
kubelet.path triggers as soon as /etc/kubernetes/kubeconfig exists.

Actual behavior
A manual restart of kubelet.path is needed after the first boot.

System details

  • Bare Metal
  • Fedora CoreOS 34.20210518.3.0
@dustymabe
Member

It would be useful to know the last release of FCOS where it worked. Then we can pinpoint the package set and try to determine the culprit.

Even better would be to go through our testing-devel history and find the first testing-devel version where it failed. This would give us a smaller set of changed packages to investigate. https://builds.coreos.fedoraproject.org/browser?stream=testing-devel
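
The search described above can be run as a bisection over the build list. What follows is a minimal sketch, assuming the versions are passed oldest to newest and the newest one is known to fail. `first_bad` and the `is_bad` callback are hypothetical names, not part of any FCOS tooling; in practice `is_bad` would provision a machine with the given build and check whether the path unit triggers. Because the failure is intermittent, such a check would likely need several attempts per build.

```shell
#!/bin/sh
# first_bad IS_BAD VERSION...
# Binary-search an oldest-to-newest list of versions for the first one on
# which the command IS_BAD succeeds. Assumes the last version is bad and
# that every version after the first bad one is bad too.
first_bad() {
    is_bad=$1
    shift
    lo=1
    hi=$#
    # Invariant: the first bad version sits at a position in [lo, hi].
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$is_bad" "$candidate"; then
            hi=$mid
        else
            lo=$(( mid + 1 ))
        fi
    done
    eval "printf '%s\n' \"\${$lo}\""
}
```

Usage would look like `first_bad my_check v1 v2 v3 ...`, where `my_check` is whatever reproducer is available and returns 0 when the build shows the bug.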

@wkruse
Author

wkruse commented Jun 14, 2021

@dustymabe FCOS 33.20210429.20.0 was the last release that worked. It broke in the first F34 build, 34.20210429.20.0.

@dustymabe
Member

That one has systemd 246.7-1.fc33.x86_64 → 248-2.fc34.x86_64. I scanned the history of path.c for commits that are new between 246 and 248. Here's what I see:

@dustymabe
Member

Just for clarity, Ignition creates /etc/kubernetes, what creates /etc/kubernetes/kubeconfig?

@dghubble
Member

Yeah, I've observed systemd path units not activating recently as well. It does seem to be around F34/systemd 248 timeframe and doesn't get a ton of eyes (Typhoon bare-metal and DigitalOcean use path units, but most platforms don't).

When I observe a path unit ignoring file existence, no amount of ssh'ing in to touch the file, move the file away and back, write the file, etc. can trick the waiting path unit into activating, even though the file exists. It's unusual behavior. What CAN activate the path unit is restarting the path unit (as the OP mentioned) or touching the parent directory. I'm guessing this may be somehow related to systemd path units being implemented atop inotify.

The file (kubeconfig in this report) is rsync'd to bare-metal machines out-of-band, once sshd becomes available (Terraform just loops trying). Just as when tweaking the file manually, the path unit never activates.

A workaround seems to be to temporarily use mkdir instead of having Ignition create the parent directory. systemd is then able to observe the file existing, as expected. I'm not sure why this would matter or why it suddenly matters to systemd / inotify now (this isn't a new approach).

-  directories:
-    - path: /etc/kubernetes

@dghubble
Member

Actually, running touch /etc/kubernetes during provisioning is probably the better workaround. It's still unclear to me why this is now needed.
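
The touch workaround can be sketched as a small provisioning helper. This is only an illustration of the step described in the comments above, not a fix for the underlying problem. `nudge_watch_dir` is a hypothetical name, and the default directory is the one from this report.

```shell
#!/bin/sh
# Workaround sketch: touch the parent directory of the watched file.
# Per the comments in this thread, that was enough to activate a path
# unit stuck in the waiting state. nudge_watch_dir is a hypothetical name.
nudge_watch_dir() {
    dir=${1:-/etc/kubernetes}
    # touch on an existing directory only updates its timestamps.
    touch -- "$dir"
}
```

On a provisioned host this would need root (for example `sudo touch /etc/kubernetes`) and would run after the kubeconfig has been copied into place. `systemctl status kubelet.path` then shows whether the unit has left the waiting state.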

@dustymabe
Member

Thanks @dghubble - the extra context helps. We might be able to isolate a small reproducer (i.e. excluding typhoon/kubernetes) and home in on the problem now.

@dghubble
Member

I may have a smaller repro.

---
variant: fcos
version: 1.2.0
systemd:
  units:
    - name: hello.service
      contents: |
        [Unit]
        Description=Hello
        [Service]
        ExecStart=/usr/bin/yes
        [Install]
        WantedBy=multi-user.target
    - name: hello.path
      enabled: true
      contents: |
        [Unit]
        Description=Watch hello
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target
storage:
  directories:
    - path: /etc/kubernetes
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "MY-PUBKEY"

SSH to the machine and try to create /etc/kubernetes/kubeconfig. The path will stay waiting.

sudo touch /etc/kubernetes/kubeconfig
systemctl status hello.path
● hello.path - Watch hello
     Loaded: loaded (/etc/systemd/system/hello.path; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2021-06-16 18:50:55 UTC; 2min 47s ago
   Triggers: ● hello.service

Now here is where I may be going insane. Let's name our directory something else, like /etc/hello and make a new machine. Repeat. This time touching /etc/hello/kubeconfig does activate the unit. 🤔 Something about /etc/kubernetes is different. But these are the complete butane configs.

---
variant: fcos
version: 1.2.0
systemd:
  units:
    - name: hello.service
      contents: |
        [Unit]
        Description=Hello
        [Service]
        ExecStart=/usr/bin/yes
        [Install]
        WantedBy=multi-user.target
    - name: hello.path
      enabled: true
      contents: |
        [Unit]
        Description=Watch hello
        [Path]
        # renamed from /etc/kubernetes/kubeconfig
        PathExists=/etc/hello/kubeconfig
        [Install]
        WantedBy=multi-user.target
storage:
  directories:
    - path: /etc/hello  # renamed from /etc/kubernetes
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - "MY-PUBKEY"

@wkruse
Author

wkruse commented Jun 16, 2021

@dghubble In our case, not all of the machines fail every time we provision a cluster. Maybe reprovisioning a couple of times with /etc/hello would break as well…

@dghubble
Member

I've been able to see the behaviors mentioned in each setup, reliably each time. Though perhaps with even more attempts, case 2 could fail too. This at least shows you don't need a Kubernetes cluster to repro though, just a single machine and butane config, if someone else can confirm. I used stable 34.20210529.3.0.

@jlebon
Member

jlebon commented Jun 21, 2021

Thanks for the reproducer! Looks like the culprit is our good old friend SELinux:

Jun 21 15:26:00 localhost audit[1]: AVC avc:  denied  { watch } for  pid=1 comm="systemd"
path="/etc/kubernetes" dev="vda4" ino=18874496 scontext=system_u:system_r:init_t:s0
tcontext=system_u:object_r:kubernetes_file_t:s0 tclass=dir permissive=0

Filed containers/container-selinux#135.
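
To look for the same denial on another machine, the journal or audit log can be filtered for `watch` denials against systemd. This is a minimal sketch, assuming each AVC record arrives on a single line (the excerpt above is wrapped for readability). `watch_denials` is a hypothetical helper name, and the patterns are written against the record format shown in this comment only.

```shell
#!/bin/sh
# Read log lines on stdin and print the SELinux type of every target that
# systemd was denied permission to watch, one type per line.
watch_denials() {
    # Keep AVC denials whose permission set contains exactly "watch".
    grep -E 'avc: +denied +\{ ([a-z_]+ )*watch ([a-z_]+ )*\}' |
        grep -F 'comm="systemd"' |
        # tcontext is user:role:type:level; print the third field.
        sed -n 's/.*tcontext=[^:]*:[^:]*:\([^: ]*\).*/\1/p' |
        sort -u
}
```

For example, `sudo journalctl -b | watch_denials` would print `kubernetes_file_t` for the record above. `ls -dZ /etc/kubernetes` shows the label on the directory itself, which is where the working /etc/hello case differs.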

@dustymabe
Copy link
Member

Thanks @wkruse @dghubble!

miabbott added a commit to miabbott/fedora-coreos-config that referenced this issue Jun 21, 2021
Adds two tests for the ability of `systemd` to read and watch files
labeled with `kubernetes_file_t`.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1973418
See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135
miabbott added a commit to miabbott/fedora-coreos-config that referenced this issue Jun 23, 2021
Adds two tests for the ability of `systemd` to read and watch files
labeled with `kubernetes_file_t`.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1973418
See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135
@dustymabe
Member

opened https://bugzilla.redhat.com/show_bug.cgi?id=1980560 to track this

@dustymabe
Member

@dghubble @wkruse - this made it into testing-devel 34.20210714.20.1. Will be in the next testing release.

miabbott added a commit to miabbott/fedora-coreos-config that referenced this issue Jul 15, 2021
Adds two tests for the ability of `systemd` to read and watch files
labeled with `kubernetes_file_t`.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1973418
See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135
miabbott added a commit to miabbott/fedora-coreos-config that referenced this issue Jul 15, 2021
Adds a test for the ability of `systemd` to watch files
labeled with `kubernetes_file_t`.

See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135
miabbott added a commit to miabbott/fedora-coreos-config that referenced this issue Jul 16, 2021
Adds a test for the ability of `systemd` to watch files
labeled with `kubernetes_file_t`.

See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135

Co-authored-by: Dusty Mabe <dusty@dustymabe.com>
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Jul 17, 2021
Adds a test for the ability of `systemd` to watch files
labeled with `kubernetes_file_t`.

See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135

Co-authored-by: Dusty Mabe <dusty@dustymabe.com>
@dustymabe
Member

The fix for this went into testing stream release 34.20210725.2.0. Please try out the new release and report issues.

dustymabe added the status/pending-stable-release label and removed the status/pending-testing-release label on Aug 9, 2021
@dustymabe
Member

The fix for this went into stable stream release 34.20210725.3.0.
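
To check whether a stable-stream host already carries the fix, its version can be compared against 34.20210725.3.0. This is a minimal sketch, assuming GNU `sort -V` is available and that both versions come from the same stream. Judging by the versions in this thread, the third field identifies the stream, so cross-stream comparisons are not meaningful. `version_at_least` is a hypothetical helper name.

```shell
#!/bin/sh
# version_at_least HAVE WANT
# Succeed if version HAVE sorts at or after version WANT (GNU sort -V).
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

The booted version is printed by `rpm-ostree status`. With that value in hand, `version_at_least "$booted" 34.20210725.3.0 || echo "fix not present"` reports hosts that still need the workaround.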

dustymabe removed the status/pending-stable-release label on Aug 25, 2021
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
Adds a test for the ability of `systemd` to watch files
labeled with `kubernetes_file_t`.

See: coreos/fedora-coreos-tracker#861
See: containers/container-selinux#135

Co-authored-by: Dusty Mabe <dusty@dustymabe.com>