Kairos user ids change on upgrade, breaking ssh login #2797

Closed
Tracked by #2804
robarnold opened this issue Aug 7, 2024 · 17 comments · Fixed by kairos-io/packages#1011
Labels
bug Something isn't working

Comments

@robarnold
Contributor

Kairos version, CPU architecture, OS, and Version:
I am upgrading from quay.io/kairos/ubuntu:23.10-standard-amd64-generic-v3.0.11-k3sv1.29.3-k3s1 to quay.io/kairos/ubuntu:24.04-standard-amd64-generic-v3.1.1-k3sv1.30.2-k3s1

Describe the bug
After the upgrade, I am unable to SSH into the new box with SSH keys. I am still able to SSH in with password authentication.

To Reproduce
Perform the upgrade described above.

Expected behavior
The kairos UID does not change during upgrades

Logs
Pre-upgrade, I see this when running id:

uid=65535(kairos) gid=65535(kairos) groups=65535(kairos),900(admin)

Post-upgrade, I see this when running id:

uid=1001(kairos) gid=1001(kairos) groups=1001(kairos),900(admin) context=system_u:system_r:kernel_t:s0

Files in the home directory, including important ones like .ssh/authorized_keys, keep their pre-upgrade user/group ownership (65535), which causes sshd to ignore them.

My Kairos config does not set a UID/GID for the kairos user; it only sets passwd and ssh_authorized_keys.
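One way to see which files still carry the old owner is to list everything under the home directory that is not owned by the user's current UID. OpenSSH's sshd, with its default StrictModes yes, checks the ownership of the user's files before accepting a key, which is why a leftover 65535-owned authorized_keys gets ignored. The helper name list_misowned is hypothetical; it is a small sketch built on find's "! -user" test and needs root to descend into other users' directories.

```shell
#!/bin/sh
# Hypothetical helper: print every path under a home directory whose owner is
# not the user's *current* UID (e.g. files still owned by 65535 after the upgrade).
list_misowned() {
  user=$1
  homedir=${2:-/home/$user}
  # find resolves the user name to its current UID; "!" negates the match
  find "$homedir" ! -user "$user" -print
}

# Example (as root):
#   list_misowned kairos
```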

@robarnold robarnold added bug Something isn't working triage Add this label to issues that should be triaged and prioretized in the next planning call unconfirmed labels Aug 7, 2024
@jimmykarily
Contributor

It's because of this change: mudler/entities#15, which consumes mauromorales/xpasswd#3 and is consumed in mudler/yip#159.

The ID 65535 that we were assigning is special according to systemd (it is the 16-bit "-1", right after the "nobody" user, 65534). This was causing a bug, something about a service not starting. Unfortunately, I can't find a link back to the original problem (I always link issues and fixes across repos, but I missed it here).

@kairos-io/maintainers does anyone remember what problem this was causing?

In any case, the ID we assign now is correct, but I don't think we anticipated the effect on existing home directories.

@sdwilsh
Contributor

sdwilsh commented Aug 10, 2024

I just got bitten by this too. Luckily, I had local access to the box with a password, so I could fix the permissions and SSH in again.

@sdwilsh
Contributor

sdwilsh commented Aug 11, 2024

As a workaround, you can add this to your config to make sure you don't lose SSH access when you apply the 3.1.0 upgrade:

fs:
  - name: Ensure kairos owns files in its user directory
    commands:
      - chown -R kairos:kairos ~kairos

@Itxaka
Member

Itxaka commented Aug 11, 2024

The issue comes from the fact that we now ignore the nobody user when calculating the UID, so we no longer get those absurdly high UIDs. Unfortunately, existing installs will hit this.
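The jump from 65535 to 1001 is consistent with an allocator that picks one more than the highest existing UID: counting nobody (65534 by the usual systemd convention) yields 65535, while ignoring it yields the next ID after the regular users. The sketch below only illustrates that assumption; it is not the code from mauromorales/xpasswd or mudler/entities, and next_uid and the sample passwd lines (including the UID 1000 user) are hypothetical.

```shell
#!/bin/sh
# Hypothetical "highest UID + 1" allocator, for illustration only.
# Reads passwd-format lines on stdin; pass "skip-nobody" to ignore UID 65534.
next_uid() {
  awk -F: -v mode="${1:-}" '
    $3 >= 1000 && !(mode == "skip-nobody" && $3 == 65534) {
      if (max == "" || $3 + 0 > max + 0) max = $3
    }
    END { print (max == "" ? 1000 : max + 1) }
  '
}

sample='root:x:0:0:root:/root:/bin/bash
ubuntu:x:1000:1000::/home/ubuntu:/bin/bash
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin'

printf '%s\n' "$sample" | next_uid               # 65535 (nobody counted)
printf '%s\n' "$sample" | next_uid skip-nobody   # 1001
```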

Unfortunately, this will affect all upgrades, whether they use the kairos user or not :(

Because the files are persistent but the user is not, the files will always end up with a different user ID after the upgrade.

I have no idea how we can easily work around this for all users.

I'll add a note in the release notes and send a comment in the channel about this.

Thanks for the report @robarnold and the workaround @sdwilsh !

@sdwilsh
Contributor

sdwilsh commented Aug 11, 2024

Yeah, I run a lot of my pods with the nobody user, not realizing it would have the same permissions as the kairos user, so I'm happy about the change! It was just a surprise to lose SSH access.

@mudler mudler mentioned this issue Aug 12, 2024
@jimmykarily jimmykarily removed the triage Add this label to issues that should be triaged and prioretized in the next planning call label Aug 12, 2024
@jimmykarily jimmykarily moved this to In Progress 🏃 in 🧙Issue tracking board Aug 12, 2024
@Itxaka
Member

Itxaka commented Aug 12, 2024

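We could probably add a yip config file that runs in initramfs.after and loops over the /home dirs like so, in order to re-fix the home directories:

for dir in /home/*; do
  u=$(basename "$dir")
  # skip entries under /home that don't belong to an existing user
  id "$u" >/dev/null 2>&1 || continue
  chown -R "$(id -u "$u")":"$(id -g "$u")" "$dir"
done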

sdwilsh added a commit to sdwilsh/ansible-playbooks that referenced this issue Aug 13, 2024
@santhoshdaivajna
Contributor

santhoshdaivajna commented Aug 14, 2024

I'm assuming this would impact Kairos 3.0.x. For users on Kairos 3.0.x, if there's a recommendation of things to do before or after the upgrade, please share.

@jimmykarily jimmykarily self-assigned this Aug 14, 2024
@jimmykarily
Contributor

I reproduced it locally. I will now try a fix (starting with Itxaka's proposal).

@jimmykarily
Contributor

I created an upgrade image with this Dockerfile:

FROM quay.io/kairos/ubuntu:24.04-core-amd64-generic-v3.1.1

RUN cat <<EOF >> /system/oem/01_fix_home_dir_permissions.yaml
  name: "Fix home directory permissions (kairos issue #2797)"
  stages:
    initramfs.after:
      - name: "Fix permissions"
        commands:
          - |
            # Iterate over users in /etc/passwd and chown their directories
            awk -F: '$3 >= 1000 && $6 ~ /^\/home\// {print $1, $6}' /etc/passwd | while read -r user homedir; do
                if [ -d "$homedir" ]; then  # Check if the home directory exists
                    echo "Changing ownership of $homedir to $user"
                    chown -R "$user":"$user" "$homedir"
                else
                    echo "Directory $homedir does not exist for user $user"
                fi
            done
EOF

and it kind of works:

kairos@localhost:~$ ls -liah .
total 20K
524905 drwxr-xr-x. 4 kairos kairos 4.0K Aug 14 09:44 .
524320 drwxr-xr-x. 4 root   root   4.0K Jun  5 02:06 ..
524921 -rw-------. 1  65535  65535   16 Aug 14 09:44 .bash_history
524915 drwx------. 2  65535  65535 4.0K Aug 14 09:40 .cache
524911 drwx------. 2 kairos kairos 4.0K Aug 14 09:38 .ssh
524917 -rw-r--r--. 1  65535  65535    0 Aug 14 09:41 .sudo_as_admin_successful
kairos@localhost:~$ ls -liah .ssh/
total 12K
524911 drwx------. 2 kairos kairos 4.0K Aug 14 09:38 .
524905 drwxr-xr-x. 4 kairos kairos 4.0K Aug 14 09:44 ..
524912 -rw-------. 1  65535  65535  382 Aug 14 09:38 authorized_keys

The home directory and some of its contents (e.g. the .ssh dir) have the correct owner, but others didn't change. Maybe initramfs.after kicked in too early and those files weren't there yet? But still, why do they have the old owner?

@jimmykarily
Contributor

Never mind: cat <<EOF should have been cat <<'EOF', because with an unquoted delimiter the shell expands $1, $6, etc. inside the script before writing the file. I will give it one more try.
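The difference between the two heredoc forms can be reproduced in isolation. In this sketch (the function names are made up for the example), neither function receives arguments, so in the unquoted case the shell replaces $1 and $6 with empty strings before cat ever runs, which is what happened to the awk program in the Dockerfile above.

```shell
#!/bin/sh
# Unquoted delimiter: the shell expands $1 and $6 inside the body.
unquoted() {
  cat <<EOF
{print $1, $6}
EOF
}

# Quoted delimiter: the body is passed through literally.
quoted() {
  cat <<'EOF'
{print $1, $6}
EOF
}

unquoted   # prints: {print , }
quoted     # prints: {print $1, $6}
```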

@jimmykarily
Contributor

All fine now:

kairos@localhost:~$ id
uid=1001(kairos) gid=1001(kairos) groups=1001(kairos),900(admin) context=system_u:system_r:kernel_t:s0
kairos@localhost:~$ ls -liah
total 20K
131689 drwxr-xr-x. 4 kairos kairos 4.0K Aug 14 10:24 .
131104 drwxr-xr-x. 4 root   root   4.0K Jun  5 02:06 ..
131703 -rw-------. 1 kairos kairos  157 Aug 14 10:24 .bash_history
131699 drwx------. 2 kairos kairos 4.0K Aug 14 10:21 .cache
131695 drwx------. 2 kairos kairos 4.0K Aug 14 10:20 .ssh
131701 -rw-r--r--. 1 kairos kairos    0 Aug 14 10:22 .sudo_as_admin_successful
kairos@localhost:~$ ls -liah .ssh/
total 12K
131695 drwx------. 2 kairos kairos 4.0K Aug 14 10:20 .
131689 drwxr-xr-x. 4 kairos kairos 4.0K Aug 14 10:24 ..
131696 -rw-------. 1 kairos kairos  382 Aug 14 10:20 authorized_keys

I'll prepare a PR for the packages repo

jimmykarily added a commit to kairos-io/packages that referenced this issue Aug 14, 2024
Fixes kairos-io/kairos#2797

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
jimmykarily added a commit to kairos-io/packages that referenced this issue Aug 26, 2024
Fixes kairos-io/kairos#2797

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
@github-project-automation github-project-automation bot moved this from Under review 🔍 to Done ✅ in 🧙Issue tracking board Aug 26, 2024
mauromorales added a commit that referenced this issue Aug 29, 2024
Focuses on these fixes:

- Kairos user ids change on upgrade, breaking ssh login #2797
- Long duration hang during boot #2802
- Support TRIM in LUKS partitions mounted by Kairos #2693
@paddy-hack
Contributor

Hate to say this, but this issue has not been fixed 😞
I booted into recovery mode and added a password just so I could log in again and check what was going on.

After upgrading to v3.1.2 (using debian:bookworm-core-amd64-generic-v3.1.2), I found a situation very similar to what @jimmykarily reported above, for my custom user's account.

janitor@localhost:~$ ls -liahR .
.:
total 16K
43385501 drwxr-xr-x 3 janitor janitor 4.0K Sep  9 05:14 .
43384860 drwxr-xr-x 4 root    root    4.0K Mar 29 17:20 ..
43385509 -rw------- 1   65536   65536  131 Sep  9 05:15 .bash_history
43385502 drwx------ 2 janitor janitor 4.0K Sep  9 04:31 .ssh
43385510 -rw-r--r-- 1   65536   65536    0 Sep  9 05:14 .sudo_as_admin_successful

./.ssh:
total 12K
43385502 drwx------ 2 janitor janitor 4.0K Sep  9 04:31 .
43385501 drwxr-xr-x 3 janitor janitor 4.0K Sep  9 05:14 ..
43385503 -rw------- 1   65536   65536   81 Sep  9 04:31 authorized_keys

Curiously, for the kairos user, I find

kairos@localhost:~$ ls -liahR .
.:
total 16K
43385422 drwxr-xr-x 3 kairos kairos 4.0K Sep  9 05:52 .
43384860 drwxr-xr-x 4 root   root   4.0K Mar 29 17:20 ..
43385535 -rw------- 1 kairos kairos  150 Sep  9 05:52 .bash_history
43385524 drwx------ 2 kairos kairos 4.0K Sep  9 05:43 .ssh
43385533 -rw-r--r-- 1 kairos kairos    0 Sep  9 05:44 .sudo_as_admin_successful

./.ssh:
total 8.0K
43385524 drwx------ 2 kairos kairos 4.0K Sep  9 05:43 .
43385422 drwxr-xr-x 3 kairos kairos 4.0K Sep  9 05:52 ..

Note how .bash_history and .sudo_as_admin_successful have been chowned for the kairos user but not for my custom janitor user.

If I run the snippet that is supposed to have fixed this (as the kairos user), I get

Changing ownership of /home/kairos to kairos
Changing ownership of /home/janitor to janitor
chown: changing ownership of '/home/janitor/.bash_history': Operation not permitted
chown: cannot read directory '/home/janitor/.ssh': Permission denied
chown: changing ownership of '/home/janitor/.sudo_as_admin_successful': Operation not permitted
chown: changing ownership of '/home/janitor': Operation not permitted

I wonder whether that snippet ran with sufficient privileges to chown the home directories of all users with an ID of 1000 or larger. Actually, I wonder whether it ran at all, because I cannot find anything under /var/log/ that hints at it.

Anyway, I recursively chowned my custom user's home directory, removed the password, and rebooted, and now all is well again, for me at least 💦

@jimmykarily jimmykarily moved this from Done ✅ to In Progress 🏃 in 🧙Issue tracking board Sep 9, 2024
@jimmykarily
Copy link
Contributor

@paddy-hack thanks for letting us know. I think the logs would appear somewhere in the output of cat /run/immucore/*.log (because it's the initramfs stage). I didn't think of testing with an additional custom user; I only checked that the default "kairos" home directory was fixed. I need to check again.

@paddy-hack
Contributor

Thanks for the pointer. I just took a peek, and here's the scoop:

2024-09-09T06:38:09Z INF Running stage: initramfs.after

2024-09-09T06:38:09Z INF Processing stage step 'Fix permissions'. ( commands: 1, files: 0, ... )
2024-09-09T06:38:09Z WRN (conditional) Skip 'Skipping stage (if statement error: failed to run [ -e /sbin/openrc ]: exit status 1)' stage name: Enable serial login for alpine
2024-09-09T06:38:09Z INF Command output: Changing ownership of /home/kairos to kairos

2024-09-09T06:38:09Z WRN (conditional) Skip 'Skipping stage (if statement error: failed to run [[ $(kairos-agent state get kairos.flavor) =~ ^ubuntu ]]: exit status 127)' stage name: setupcon initramfs.after ubuntu
2024-09-09T06:38:09Z INF Done executing stage 'initramfs.after'

Makes me think custom user accounts are created after this step.
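The log above fits with how the fix selects users: the loop only sees accounts that already exist in /etc/passwd when initramfs.after runs. The sketch below reuses the same awk filter on two hypothetical /etc/passwd snapshots (the select_users name and the UIDs are made up for illustration). It also shows that nobody, despite its high UID, is skipped because its home directory is not under /home/.

```shell
#!/bin/sh
# Same selection as the awk filter in the fix: users with UID >= 1000 whose
# home directory is under /home/. Reads passwd-format lines on stdin.
select_users() {
  awk -F: '$3 >= 1000 && $6 ~ /^\/home\// {print $1, $6}'
}

# Hypothetical /etc/passwd at initramfs.after time: janitor does not exist yet.
early='root:x:0:0:root:/root:/bin/bash
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
kairos:x:1001:1001::/home/kairos:/bin/bash'

printf '%s\n' "$early" | select_users
# kairos /home/kairos

# Once the custom user has been created (UID 1002 is made up for the example):
printf '%s\n' "$early" 'janitor:x:1002:1002::/home/janitor:/bin/bash' | select_users
# kairos /home/kairos
# janitor /home/janitor
```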

@jimmykarily jimmykarily removed their assignment Sep 9, 2024
@Itxaka
Copy link
Member

Itxaka commented Sep 9, 2024

Indeed, if a user is created in a later stage, this won't work as expected. I know that with the interactive installer, the user is created in the network stage IF the user has an SSH key attached from GitHub (as we need the network to fetch it), so we may need to run this workaround in both the boot stage AND the network stage :(

@Itxaka Itxaka reopened this Sep 9, 2024
@github-project-automation github-project-automation bot moved this from Todo 🖊 to Under review 🔍 in 🧙Issue tracking board Sep 9, 2024
@jimmykarily jimmykarily self-assigned this Sep 9, 2024
@paddy-hack
Contributor

FTR, my custom user's account was configured during installation with a literal public SSH key. No need for network to create the account.

@jimmykarily
Contributor

> FTR, my custom user's account was configured during installation with a literal public SSH key. No need for network to create the account.

We opted for the network stage because it triggers last. It shouldn't matter how the user is created, as long as it exists when the script runs.
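Since the fix may need to run in more than one stage, as discussed above, an idempotent variant may help: it re-owns a home directory only when something inside it does not match the user's current UID, and it uses the numeric UID/GID from the passwd entry instead of assuming a group named after the user. The function name, its parameters, and the reliance on GNU find's -quit are assumptions for illustration, not the fix that landed in kairos-io/packages. It must run as root, which is also why running the original snippet as the kairos user failed for /home/janitor.

```shell
#!/bin/sh
# Hypothetical idempotent home-directory fix, safe to run in several stages.
# Arguments (all optional):
#   $1  passwd file           (default /etc/passwd)
#   $2  home directory prefix (default /home/)
#   $3  minimum UID           (default 1000)
fix_home_owners() {
  passwd_file=${1:-/etc/passwd}
  prefix=${2:-/home/}
  min_uid=${3:-1000}
  awk -F: -v prefix="$prefix" -v minuid="$min_uid" \
    '$3 >= minuid && index($6, prefix) == 1 {print $1, $3, $4, $6}' "$passwd_file" |
  while read -r user uid gid homedir; do
    [ -d "$homedir" ] || continue
    # GNU find: print the first entry not owned by $uid, then stop
    if [ -z "$(find "$homedir" ! -uid "$uid" -print -quit)" ]; then
      echo "ok: $homedir ($user)"
    else
      echo "fixing: $homedir -> $uid:$gid ($user)"
      # Numeric IDs avoid assuming a group named after the user
      chown -R "$uid:$gid" "$homedir"
    fi
  done
}
```

Calling fix_home_owners with no arguments uses /etc/passwd, /home/ and a minimum UID of 1000, matching the filter in the original snippet; directories that are already correct are only reported, not touched.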

Projects
Archived in project

7 participants