
openstack: metadata fetcher may stop retrying before network comes up #1081

Closed
ITD27M01 opened this issue Aug 31, 2020 · 14 comments · Fixed by #1098

Comments

@ITD27M01

ITD27M01 commented Aug 31, 2020

Bug

Operating System Version

fedora-coreos-32.20200809.3.0-openstack.x86_64

Ignition Version

2.6.0

Environment

OpenStack cloud and Fedora Coreos images.

Expected Behavior

Ignition should depend explicitly on the completion of network configuration.

Actual Behavior

In high-load cloud environments, the DHCP agents can be started and configured with some delay, and this delay is not predictable. Currently, Ignition tries 10 times within 30 seconds to get the config from the metadata service, and fails with an error if the instance hasn't gotten an IP yet:

https://gist.github.com/ITD27M01/81409d4d72ee442c4787e39d9d15f2a9

[   67.235413] ignition[498]: neither config drive nor metadata service were available in time. Continuing without a config...

You can see that Ignition failed, but NetworkManager gets an IP a while later.

I understand that this is difficult to reproduce in a devstack environment, but we often experience such errors in our large cloud. As with cloud-init, there should be an explicit dependency of Ignition on NetworkManager or NetworkManager-wait-online.service, because Ignition is racing with network setup, and no heuristic will help predict and determine the right timeout value.
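
For illustration, the failure mode boils down to a bounded retry loop along the lines of the following Go sketch (hypothetical code, not Ignition's actual implementation; the attempt count, interval, client timeout, and error message are assumptions modeled on the log above). If the DHCP lease arrives after the loop's budget is spent, the fetch gives up even though the network comes up moments later:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetries polls the metadata URL a fixed number of times and
// then gives up, mirroring the bounded behavior described above.
func fetchWithRetries(url string, attempts int, interval time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // metadata service reachable
		}
		// The NIC may still be waiting on a DHCP lease; sleep and retry.
		time.Sleep(interval)
	}
	return errors.New("neither config drive nor metadata service were available in time")
}

func main() {
	// Roughly 10 tries over ~30s: a lease that arrives at t=40s loses the race.
	err := fetchWithRetries("http://169.254.169.254/openstack/latest/user_data", 10, 3*time.Second)
	fmt.Println(err)
}
```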

Reproduction Steps

  1. Start Fedora CoreOS provision on OpenStack
  2. See that Ignition and NetworkManager work in parallel.

Other Information

The same behavior can be reproduced for RHCOS:

https://gist.github.com/ITD27M01/2755c8b474ac1ffe4e39627386d0f3bd

An example of the race with a 0.1 s delay (for FCOS):

[screenshot: console log showing the Ignition/NetworkManager race]

@cgwalters
Member

cgwalters commented Aug 31, 2020

Right, the big mess here is that ignition/afterburn wants to support OpenStack instances without a metadata service too (which apparently exist).

I am not really sure how we get out of this without creating a fully distinct openstack-with-metadata image. Alternatively, perhaps there is a way we can support admins setting an image property via Glance that we can detect somehow?

@ITD27M01
Author

ITD27M01 commented Aug 31, 2020

Is this somehow solved in cloud-init?

For operators, it would be good to provide some way to configure the timeouts if the exact dependency cannot be implemented at this point, in the same way that we can use coreos-firstboot-network to adjust network settings.

@lucab changed the title from "Ignition and NetworkManager race condition" to "openstack: metadata fetcher may stop retrying before network comes up" on Sep 15, 2020
@lucab
Contributor

lucab commented Sep 15, 2020

For reference, the same issue was reported at openshift/os#380.

There, the initial reaction from Andrew was:

There's a command-line flag, -fetchTimeout, that we could bump (or conditionally bump on different platforms using systemd drop-ins), which is probably what we want.
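
For concreteness, a per-platform bump via a systemd drop-in might look like the sketch below. Every name in it is an assumption for illustration (the unit name, binary path, and the flag's value format), and since Ignition's fetch stages run in the initramfs, a real override would have to be baked into the initrd rather than placed on the root filesystem:

```ini
# ignition-fetch.service.d/10-bump-timeout.conf
# Hypothetical sketch: unit name, binary path, and value format are
# assumptions, not a documented interface.
[Service]
ExecStart=
ExecStart=/usr/bin/ignition -fetchTimeout 300 -platform openstack -stage fetch
```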

@bgilbert
Contributor

The -fetchTimeout option only affects the HTTP timeout, which isn't the issue here, so I'm not sure Andrew's response was on point.

Are we actually constrained from setting an arbitrarily long timeout on OpenStack? I think a config drive should still appear if no userdata is specified, and a metadata service will presumably return 404 in that case. So the only case where the timeout should fire is an OpenStack cloud that implements neither service, but such a cloud isn't suitable for running Ignition anyway.

@cgwalters
Member

For e.g. AWS, we never time out, right? A quick glance at git annotate says we should ask @crawford why he picked 30s in af1a535.

So the only case where the timeout should fire is an OpenStack cloud that implements neither service, but such a cloud isn't suitable for running Ignition anyway.

Yep agree.

@crawford
Contributor

there should be an explicit dependency of ignition from network-manager or NetworkManager-wait-online.service

This isn't going to work, because not all Ignition configs require the network in order to be fetched or evaluated (and machines legitimately might never reach network-online.target).

Right, the big mess here is ignition/afterburn wants to support OpenStack instances without a metadata service too (which apparently exist).

This is correct. It takes a variable amount of time for the OS to probe attached config-drives and/or contact the network metadata service (assuming either one exists!). The trade-off is between supporting slow environments like this ("slow" being defined as anything that takes longer than my arbitrarily-chosen thirty-second timeout) and forcing users to wait needlessly to boot in instances where they know that no config will be present.

The ideal solution would be one where we somehow know whether or not to wait for a config (e.g. a distinct openstack-with-metadata image or some signal from OpenStack itself). Realistically though, I don't think many folks are booting FCOS without an Ignition config (do we support the old coreos.autologin kernel argument?) and nobody is booting RHCOS without one. This might be a good time to rip off the band-aid and require that FCOS boot with an Ignition config. That would allow us to wait indefinitely and sidestep the problem.

@cgwalters
Member

Realistically though, I don't think many folks are booting FCOS without an Ignition config (do we support the old coreos.autologin kernel argument?)

This is coreos/fedora-coreos-tracker#279 - basically with FCOS we do support this because you can use the provider-injected SSH keys.

and nobody is booting RHCOS without one.

Agree.

This might be a good time to rip off the band-aid and require that FCOS boot with an Ignition config.

That said, aren't we conflating "no configuration" with "no metadata service"? Not providing a config should result in the metadata service replying successfully with an empty file, not failing to be present, right?

@jlebon
Member

jlebon commented Sep 15, 2020

I think a config drive should still appear if no userdata is specified

Is that really the case? If so, then yeah this exactly matches the medium vs media semantics I was going for in the QEMU case in #928.

So if indeed we're guaranteed (within our supported use cases) that either the config drive or the metadata server is present, I think we could remove timeouts entirely for OpenStack, matching other clouds where we already do this.

@cgwalters
Copy link
Member

Is that really the case? If so, then yeah this exactly matches the medium vs media semantics I was going for in the QEMU case in #928.

I'm not sure; but we need to nail this down.

@crawford
Contributor

It sounds like there is agreement that booting FCOS on OpenStack without an Ignition config, config-drive, or metadata service is very much an edge case. Since there are so few users of that flow, we should be okay increasing the timeout from thirty seconds, notwithstanding the broader changes around medium vs. media.

@bgilbert
Contributor

bgilbert commented Sep 15, 2020

I think a config drive should still appear if no userdata is specified

Is that really the case?

We should test, but the config drive has a lot of other metadata, so I presume so.

So if indeed we're guaranteed (within our supported use cases) that either the config drive or the metadata server is present, I think we could remove timeouts entirely for OpenStack, matching other clouds where we already do this.

I was originally going to argue for a long timeout, perhaps 10-15 minutes, as a safety valve for broken environments. But @jlebon convinced me OOB that we should behave the same as other platforms and block indefinitely until we learn whether we have userdata or not.

The current logic will need some cleanup though. @jlebon and I turned up a few things:

  • In fetch-offline we only return ErrNeedNet after timing out the config-drive fetcher, which currently adds 30 seconds of startup overhead when there's no config-drive. We think the right fix is to immediately return ErrNeedNet in fetch-offline on OpenStack. We could set a shorter timeout (avoid starting network if the config drive appears within 2-5 seconds) but we'd end up sometimes starting network on a heavily loaded system, and the inconsistency seems worse than always doing it. @jlebon is going to submit a fix for this one.
  • If the config drive has no userdata, we correctly treat that as an authoritative result. But if the metadata service returns 404, this is converted to ErrNotFound and we continue waiting for the config drive. We should treat ErrNotFound as authoritative as well.
  • If all config sources fail, we continue to block until the timeout is reached. We should e.g. send errors through a channel to the caller and stop waiting once we've received N of them (see the sketch after this list).
  • If the config drive succeeds, we leak the metadata service goroutine. The right fix is to plumb Context into the fetcher, but that'd be a lot of work across all supported URL schemes. The leak seems reasonably harmless so there doesn't seem to be any urgency to fixing it.
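
Taken together, the second and third items suggest a shape roughly like the following Go sketch (illustrative only, under assumed names such as fetcher and errNotFound; this is not Ignition's actual code). Every source reports back on a channel; an authoritative answer, either a config or a definite "no userdata", wins immediately; and once all sources have reported hard failures we stop waiting instead of running out the clock:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for an authoritative "source reachable, but no
// userdata" result, e.g. a 404 from the metadata service.
var errNotFound = errors.New("config not found")

type fetcher func(context.Context) ([]byte, error)

type result struct {
	cfg []byte
	err error
}

func fetchConfig(ctx context.Context, sources []fetcher) ([]byte, error) {
	// Buffered so losing goroutines can send and exit instead of leaking.
	ch := make(chan result, len(sources))
	for _, src := range sources {
		go func(f fetcher) {
			cfg, err := f(ctx)
			ch <- result{cfg, err} // errors flow back through the channel
		}(src)
	}
	for pending := len(sources); pending > 0; pending-- {
		select {
		case r := <-ch:
			if r.err == nil || errors.Is(r.err, errNotFound) {
				return r.cfg, r.err // success, or authoritative "no config"
			}
			// Hard failure: keep waiting for the remaining sources.
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.New("all config sources failed")
}

func main() {
	configDrive := func(ctx context.Context) ([]byte, error) {
		return nil, errors.New("no config drive attached") // hard failure
	}
	metadataService := func(ctx context.Context) ([]byte, error) {
		time.Sleep(100 * time.Millisecond) // simulated network delay
		return []byte(`{"ignition":{"version":"3.0.0"}}`), nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	cfg, err := fetchConfig(ctx, []fetcher{configDrive, metadataService})
	fmt.Println(string(cfg), err)
}
```

Passing Context into each fetcher is also the direction the last bullet points to; in this toy version the buffered channel merely keeps losing goroutines from blocking forever.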

jlebon added a commit to jlebon/ignition that referenced this issue Sep 16, 2020
This is a follow-up to (and revert of) coreos#1057.

On OpenStack, we don't actually know if we're fetching from the
config-drive or from the metadata server.

In theory, if it's from the config-drive, we don't strictly need
networking, and so `fetch-offline` would work. The issue is that in the
more common case of the metadata server, `fetch-offline` will still wait
the full 30s for the config-drive to also time out, only to then have to
bring up networking and run the `fetch` stage.

Instead, let's just accept the brokenness of the OpenStack provider and
declare it as always requiring networking.

For more information, see:
coreos#1081 (comment)
@arithx
Contributor

arithx commented Sep 22, 2020

  • In fetch-offline we only return ErrNeedNet after timing out the config-drive fetcher, which currently adds 30 seconds of startup overhead when there's no config-drive. We think the right fix is to immediately return ErrNeedNet in fetch-offline on OpenStack. We could set a shorter timeout (avoid starting network if the config drive appears within 2-5 seconds) but we'd end up sometimes starting network on a heavily loaded system, and the inconsistency seems worse than always doing it. @jlebon is going to submit a fix for this one.

Fixed in #1094

  • If the config drive has no userdata, we correctly treat that as an authoritative result. But if the metadata service returns 404, this is converted to ErrNotFound and we continue waiting for the config drive. We should treat ErrNotFound as authoritative as well.

Fixed in #1095

  • If all config sources fail, we continue to block until the timeout is reached. We should e.g. send errors through a channel to the caller and stop waiting once we've received N of them.

Fixed in #1095

And finally, the removal of the timeout is pending in #1098.

andymcc added a commit to andymcc/installer that referenced this issue Sep 28, 2020
@nashford77

Is there an image version I can download that works with CoreOS? I seem to be having this issue with all versions.

@jlebon
Member

jlebon commented Mar 14, 2022

@nashford77 Can you open an issue on the tracker instead, describing the exact FCOS version used and the console logs?

This issue was closed.