Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locksmithd will reboot outside of reboot windows, if no semaphore was acquired before #407

Closed
tobgu opened this issue May 31, 2021 · 5 comments · Fixed by flatcar-archive/coreos-overlay#1161
Labels
good first issue Get started with Flatcar contribution with this issue. kind/bug Something isn't working

Comments

@tobgu
Copy link

tobgu commented May 31, 2021

Description

This ticket mirrors coreos/bugs#1886 which has been left to fade away.

Impact

Unexpected reboots and updates of machines outside of designated reboot windows.

Expected behavior

Once locksmithd gets the semaphore it should check again that we're still within the update window. If not, no update should take place and the semaphore should be released again.

@tormath1
Copy link
Contributor

tormath1 commented Jun 1, 2021

Hi, thanks for backporting this issue.

Here a few steps to reproduce locally the behavior:

  • Download an "old" release wget https://alpha.release.flatcar-linux.net/amd64-usr/2879.0.0/flatcar_production_qemu_image.img.bz2

Write ignition to setup a local etcd-member, configure locksmithd to use etcd

{"ignition": { "version": "2.0.0" },
  "systemd": {
    "units": [
    {
	"name": "update-engine.service",
	"enable": false
	
    },
      {
        "name": "etcd-member.service",
        "dropins": [{
          "name": "environment.conf",
          "contents": "[Unit]\n[Service]\nEnvironment=ETCD_ADVERTISE_CLIENT_URLS=http://127.0.0.1:2379\nEnvironment=ETCD_LISTEN_CLIENT_URLS=http://127.0.0.1:2379"
        }]
      },
      {
        "name": "locksmithd.service",
        "enable": true,
        "dropins": [{
          "name": "environment.conf",
          "contents": "[Unit]\nAfter=etcd-member.service\nRequires=etcd-member.service\n[Service]\nEnvironment=LOCKSMITHD_ENDPOINT=http://localhost:2379\nEnvironment=LOCKSMITHD_REBOOT_WINDOW_START=03:00\nEnvironment=LOCKSMITHD_REBOOT_WINDOW_LENGTH=2m"
        }]
      }
    ]
  },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/coreos/update.conf",
        "contents": { "source": "data:,REBOOT_STRATEGY=etcd-lock%0A" },
        "mode": 420
      }
    ]
  }
}

N.B: Notice the LOCKSMITHD_REBOOT_WINDOW_START=03:00 and LOCKSMITHD_REBOOT_WINDOW_LENGTH=2m.

  • Start Flatcar local instance: ./flatcar_production_qemu.sh -i ignition.json

Once on the instance, remove any available token to "emulate" the issue:

locksmithctl -endpoint http://localhost:2379 set-max 0
  • Download any available update
update_engine_client -update
  • Tweak the system time to be in the locksmithd window update
sudo timedatectl set-ntp false
sudo timedatectl set-time "2021-06-02 02:59:40"
sudo systemctl restart locksmithd
  • Assert that locksmithd fails to acquire a token:
journalctl -u locksmithd.service -f
Jun 02 03:00:12 localhost locksmithd[5049]: Failed to acquire lock: semaphore is at 0. Retrying in 10s.
Jun 02 03:00:22 localhost locksmithd[5049]: Failed to acquire lock: semaphore is at 0. Retrying in 20s.
  • Once out of the window update, release a token and assert that locksmith is triggering an update, even if you're outside the update window
locksmithctl -endpoint http://localhost:2379 set-max 1
...
Jun 02 03:10:22 localhost locksmithd[5049]: Logins detected, delaying reboot for 5 minutes.
...

Looking quickly at the code, we could add a "sleep" inside this loop:
https://github.com/kinvolk/locksmith/blob/3f36f3bfd10d3e684ab0c50eedacb885e1a827b6/locksmithctl/daemon.go#L141-L153

Using the same logic as here:
https://github.com/kinvolk/locksmith/blob/3f36f3bfd10d3e684ab0c50eedacb885e1a827b6/locksmithctl/daemon.go#L367-L374

@sayanchowdhury sayanchowdhury added the kind/bug Something isn't working label Jun 1, 2021
@tobgu
Copy link
Author

tobgu commented Jun 6, 2021

Thanks for looking into this!

Yes, in case you fail to get the semaphore you can probably sleep until the update period starts again followed by a reset of the backoff interval (if sleeptime > 0 as in the example you linked).

@tormath1
Copy link
Contributor

tormath1 commented Jun 7, 2021

Hi @tobgu , would you be interested by doing the implementation yourself then ? :)

@tormath1 tormath1 added the good first issue Get started with Flatcar contribution with this issue. label Jun 7, 2021
@aniruddha2000
Copy link
Contributor

@tormath1 I want to work on this issue

@tobgu
Copy link
Author

tobgu commented Jun 11, 2021

@aniruddha2000 If you want to pick it up I'm happy with that. I would expect the code change to be pretty minimal based on the above suggestion but verifying it in any meaningful way would probably be the biggest effort for me.

@tormath1 tormath1 moved this from In Progress to Ready for review in Flatcar Container Linux Releases Planning Aug 2, 2021
tormath1 pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Aug 3, 2021
closes flatcar/Flatcar#407

Signed-off-by: Mathieu Tortuyaux <mathieu@kinvolk.io>
@sayanchowdhury sayanchowdhury moved this from Ready for review to In Progress in Flatcar Container Linux Releases Planning Aug 3, 2021
tormath1 pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Aug 26, 2021
closes flatcar/Flatcar#407

Signed-off-by: Mathieu Tortuyaux <mathieu@kinvolk.io>
tormath1 pushed a commit to flatcar-archive/coreos-overlay that referenced this issue Aug 27, 2021
closes flatcar/Flatcar#407

Signed-off-by: Mathieu Tortuyaux <mathieu@kinvolk.io>
@tormath1 tormath1 moved this from In Progress to Ready for review in Flatcar Container Linux Releases Planning Aug 27, 2021
@tormath1 tormath1 moved this from Ready for review to Ready to Release - 2021-08-30 in Flatcar Container Linux Releases Planning Aug 27, 2021
@vbatts vbatts removed this from Released - 2021-08-30 in Flatcar Container Linux Releases Planning Sep 10, 2021
t-lo pushed a commit to flatcar/scripts that referenced this issue Apr 17, 2023
closes flatcar/Flatcar#407

Signed-off-by: Mathieu Tortuyaux <mathieu@kinvolk.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Get started with Flatcar contribution with this issue. kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants