Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataSourceMAAS timeout delays the boot of the machine by ~2 minutes #5064

Closed
r00ta opened this issue Mar 15, 2024 · 4 comments
Closed

DataSourceMAAS timeout delays the boot of the machine by ~2 minutes #5064

r00ta opened this issue Mar 15, 2024 · 4 comments
Labels
bug Something isn't working correctly priority Fix soon

Comments

@r00ta
Copy link
Contributor

r00ta commented Mar 15, 2024

Bug report

On every MAAS version when I deploy ubuntu 22.04 (https://images.maas.io/ephemeral-v3/stable/jammy/amd64/20240301/) the os is properly installed, but when the machine reboots there is a delay in the startup and looking at the journalctl logs it's possible to see a timeout of ~2 minutes for the DataSourceMAAS that failed to make some requests to the rack.
The issue here is the delay of ~2 minutes in the boot process.

Steps to reproduce the problem

Just deploy any ubuntu 22.04 machine with MAAS

Environment details

  • Cloud-init version: 23.4.4-0
  • Operating System Distribution: ubuntu0~22.04.1
  • Cloud provider, platform or installer type: MAAS

cloud-init logs

Mar 15 17:19:32 ubuntu cloud-init[1058]: Cloud-init v. 23.4.4-0ubuntu0~22.04.1 running 'init-local' at Fri, 15 Mar 2024 17:19:32 +0000. Up 7.26 seconds.
Mar 15 17:19:32 ubuntu cloud-init[1058]: 2024-03-15 17:19:32,925 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "cloudinit", "timestamp": 1710523172.9016404}. This was caused by: HTTPCo>
Mar 15 17:19:32 ubuntu cloud-init[1058]: 2024-03-15 17:19:32,926 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/check-cache", "description": "no cache found", "event_type": "finish", "origin": "cloudinit", "timestamp": 1710523172.9023874, "result": "SUCCESS"}. This was caused by: HTTPCon>
Mar 15 17:19:32 ubuntu cloud-init[1058]: 2024-03-15 17:19:32,927 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/search-MAAS", "description": "searching for local data from DataSourceMAAS", "event_type": "start", "origin": "cloudinit", "timestamp": 1710523172.9124043}. This was caused by:>
Mar 15 17:20:02 ubuntu systemd[1]: systemd-fsckd.service: Deactivated successfully.
Mar 15 17:21:39 ubuntu cloud-init[1058]: 2024-03-15 17:21:39,146 - url_helper.py[ERROR]: Timed out, no response from urls: ['http://172.0.2.14:5248/MAAS/metadata/2012-03-01/meta-data/instance-id']
Mar 15 17:21:39 ubuntu cloud-init[1058]: 2024-03-15 17:21:39,147 - DataSourceMAAS.py[CRITICAL]: Giving up on md from ['http://172.0.2.14:5248/MAAS/metadata/2012-03-01/meta-data/instance-id'] after 126 seconds
Mar 15 17:21:39 ubuntu cloud-init[1058]: 2024-03-15 17:21:39,150 - handlers.py[WARNING]: Failed posting event: {"name": "init-local/search-MAAS", "description": "no local data found from DataSourceMAAS", "event_type": "finish", "origin": "cloudinit", "timestamp": 1710523299.1476657, "result": "SUCCESS"}. Th>
Mar 15 17:21:39 ubuntu kernel: mlx5_core 0000:04:00.0 ens3f0: renamed from ens3f0np0
Mar 15 17:21:39 ubuntu systemd-udevd[1065]: Using default interface naming scheme 'v249'.
Mar 15 17:21:39 ubuntu kernel: mlx5_core 0000:04:00.1 ens3f1: renamed from ens3f1np1
Mar 15 17:21:39 ubuntu systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Mar 15 17:21:39 ubuntu systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Mar 15 17:21:39 ubuntu systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Mar 15 17:21:39 ubuntu cloud-init[1058]: 2024-03-15 17:21:39,555 - handlers.py[WARNING]: Failed posting event: {"name": "init-local", "description": "searching for local datasources", "event_type": "finish", "origin": "cloudinit", "timestamp": 1710523299.5538688, "result": "SUCCESS"}. This was caused by: HT>
Mar 15 17:21:39 ubuntu cloud-init[1058]: 2024-03-15 17:21:39,556 - handlers.py[WARNING]: Multiple consecutive failures in WebHookHandler. Cancelling all queued events.
Mar 15 17:21:39 ubuntu systemd[1]: Finished Initial cloud-init job (pre-networking).
Mar 15 17:21:39 ubuntu systemd[1]: Reached target Preparation for Network.
Mar 15 17:21:39 ubuntu systemd[1]: Starting Network Configuration...
Mar 15 17:21:39 ubuntu systemd-networkd[1082]: lo: Link UP
Mar 15 17:21:39 ubuntu systemd-networkd[1082]: lo: Gained carrier
@r00ta r00ta added bug Something isn't working correctly new An issue that still needs triage labels Mar 15, 2024
@r00ta
Copy link
Contributor Author

r00ta commented Mar 15, 2024

Link to MAAS upstream bug https://bugs.launchpad.net/maas/+bug/2057763

@holmanb holmanb added the 24.1 label Mar 18, 2024
blackboxsw added a commit to blackboxsw/cloud-init that referenced this issue Mar 18, 2024
Incorrectly registering DataSourceMAAS for init-local DEP_FILESYSTEM
inadvertently short-circuits the _get_dat checks on network setup
from initramfs. This results in a 2 minute timeout on physical system
reboots. Correct this registration to do the proper checks and avoid
trying to perform GETs against the IMDS in init-local timeframe when
we know network is not yet available.

Fixes canonicalGH-5064
@blackboxsw
Copy link
Collaborator

Thank you @r00ta for reflecting the bug here. I've tried setting a custom image streams to http://images.maas.io/ephemeral-v3/candidate in an attempt to reproduce this issue. I'm unable to reproduce this issue correctly due to my own inability on a maas local VM in multipass to get to a commissioning success state so I haven't seen this yet. But, from logs it looks like this timeout is due to the MAAS trying to run it's full get_data against the IMDS url in init-local timeframe before networking is up on the physical machine. That shouldn't generally happen unless initramfs provided network config to the booted machine, which I was told only occurs in Ephemeral boots not physical system reboots.

This leads me to discovering a misconfiguration. The DatasourceMAASLocal class should have been registered to DEP_FILESYSTEM(init-local boot stage). But it looks like DataSourceMAAS (the init-network stage datasource) was registered instead. This is likely causing that 2 minute timeout you are seeing on reboot.

blackboxsw added a commit to blackboxsw/cloud-init that referenced this issue Mar 19, 2024
Incorrectly registering DataSourceMAAS for init-local DEP_FILESYSTEM
inadvertently short-circuits the _get_data checks performed by
DataSourceMAASLocal for presence of network config from initramfs.
This results in a 2 minute timeout on physical system
reboots because the network stage DataSourceMAAS expects all system
networking is already configured and the IMDS to bebe accessible.

Correct this registration to do the proper initramfs checks and avoid
trying to perform GETs against the IMDS in init-local timeframe when
we know network is not yet available.

Fixes canonicalGH-5064
@blackboxsw
Copy link
Collaborator

Thank you for filing this bug @r00ta. I have uploaded a test package 23.4.4 to this PPA https://launchpad.net/~chad.smith/+archive/ubuntu/maas-test. I believe correcting the registration of DataSourceMAASLocal in init-local timeframe should solve this timeout problem. But, I'd need a test run of a modified maas ephemeral image to confirm and I don't have a reproducer handy. Would someone be able to test this to confirm 2 things:

  1. That we no longer have a 2 minute timeout across physical node reboot and
  2. No initramfs applicable config is seen in /var/log/cloud-init.log on physical system reboot

@blackboxsw blackboxsw added priority Fix soon and removed new An issue that still needs triage labels Mar 19, 2024
TheRealFalcon pushed a commit to TheRealFalcon/cloud-init that referenced this issue Mar 19, 2024
…nonical#5068)

Incorrectly registering DataSourceMAAS for init-local DEP_FILESYSTEM
inadvertently short-circuits the _get_data checks performed by
DataSourceMAASLocal for presence of network config from initramfs.
This results in a 2 minute timeout on physical system
reboots because the network stage DataSourceMAAS expects all system
networking is already configured and the IMDS to be accessible.

Correct this registration to do the proper initramfs checks and avoid
trying to perform GETs against the IMDS in init-local timeframe when
we know network is not yet available.

Fixes canonicalGH-5064
LP: #2057763
@bjornt
Copy link

bjornt commented Mar 21, 2024

Thank you for filing this bug @r00ta. I have uploaded a test package 23.4.4 to this PPA https://launchpad.net/~chad.smith/+archive/ubuntu/maas-test. I believe correcting the registration of DataSourceMAASLocal in init-local timeframe should solve this timeout problem. But, I'd need a test run of a modified maas ephemeral image to confirm and I don't have a reproducer handy. Would someone be able to test this to confirm 2 things:

1. That we no longer have a 2 minute timeout across physical node reboot and

2. `No initramfs applicable config` is seen in /var/log/cloud-init.log on physical system reboot

I will try this later.

For now, what I've done is to try to deploy a custom, non-Ubuntu, image with MAAS and the proposed fixed. It doesn't only time out, it seems to fail to configure networking altogether for some reason.

This is the cloud-init.log failing to deploy rhel9.3 with cloud-init 24.1, with the proposed fix applied: https://paste.ubuntu.com/p/GP3nd5zmTw/

For reference, this is the cloud-init.log from a deployment of rhel9.3, with cloud-init 23.1: https://paste.ubuntu.com/p/6CnPcdHk7r/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly priority Fix soon
Projects
None yet
Development

No branches or pull requests

4 participants