
openstack: metadata fetcher may stop retrying before network comes up #1081

Closed
ITD27M01 opened this issue Aug 31, 2020 · 14 comments · Fixed by #1098

Comments

@ITD27M01

ITD27M01 commented Aug 31, 2020

Bug

Operating System Version

fedora-coreos-32.20200809.3.0-openstack.x86_64

Ignition Version

2.6.0

Environment

OpenStack cloud and Fedora Coreos images.

Expected Behavior

Ignition should depend explicitly on the completion of network configuration.

Actual Behavior

In high-load cloud environments, the DHCP agents can be started and configured with some delay, and this delay is not predictable. Currently, Ignition tries 10 times within 30 seconds to get the config from the metadata service, and fails with an error if the instance hasn't gotten an IP yet:

https://gist.github.com/ITD27M01/81409d4d72ee442c4787e39d9d15f2a9

[   67.235413] ignition[498]: neither config drive nor metadata service were available in time. Continuing without a config...

You can see that Ignition failed, but NetworkManager gets an IP a while later.

I understand that this is difficult to reproduce in a devstack environment, but we often experience such errors in our large cloud. As with cloud-init, there should be an explicit dependency of Ignition on NetworkManager or NetworkManager-wait-online.service, because Ignition is racing with network setup, and no heuristic will help predict and determine the right timeout value.
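
For illustration, the failure mode boils down to a bounded retry loop along the lines of the following Go sketch (hypothetical code, not Ignition's actual implementation; the attempt count, interval, client timeout, and error message are assumptions modeled on the log above). If the DHCP lease arrives after the loop's budget is spent, the fetch gives up even though the network comes up moments later:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetries polls the metadata URL a fixed number of times and
// then gives up, mirroring the bounded behavior described above.
func fetchWithRetries(url string, attempts int, interval time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // metadata service reachable
		}
		// The NIC may still be waiting on a DHCP lease; sleep and retry.
		time.Sleep(interval)
	}
	return errors.New("neither config drive nor metadata service were available in time")
}

func main() {
	// Roughly 10 tries over ~30s: a lease that arrives at t=40s loses the race.
	err := fetchWithRetries("http://169.254.169.254/openstack/latest/user_data", 10, 3*time.Second)
	fmt.Println(err)
}
```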

Reproduction Steps

  1. Start Fedora CoreOS provision on OpenStack
  2. See that Ignition and NetworkManager work in parallel.

Other Information

The same behavior can be reproduced for RHCOS:

https://gist.github.com/ITD27M01/2755c8b474ac1ffe4e39627386d0f3bd

An example of the race with a 0.1 s delay (for FCOS):

[screenshot: console log showing the Ignition/NetworkManager race]

@cgwalters
Member

cgwalters commented Aug 31, 2020

Right, the big mess here is that ignition/afterburn wants to support OpenStack instances without a metadata service too (which apparently exist).

I am not really sure how we get out of this without creating a fully distinct openstack-with-metadata image. Alternatively, perhaps there is a way we can support admins setting an image property via Glance that we can detect somehow?

@ITD27M01
Author

ITD27M01 commented Aug 31, 2020

Is this somehow solved in cloud-init?

For operators, it would be good to provide some way to configure the timeouts if the exact dependency cannot be implemented at this point, in the same way that we can use coreos-firstboot-network to adjust network settings.

@lucab changed the title from "Ignition and NetworkManager race condition" to "openstack: metadata fetcher may stop retrying before network comes up" on Sep 15, 2020
@lucab
Contributor

lucab commented Sep 15, 2020

For reference, the same issue was reported at openshift/os#380.

There, the initial reaction from Andrew was:

There's a command-line flag, -fetchTimeout, that we could bump (or conditionally bump on different platforms using systemd drop-ins), which is probably what we want.
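
For concreteness, a per-platform bump via a systemd drop-in might look like the sketch below. Every name in it is an assumption for illustration (the unit name, binary path, and the flag's value format), and since Ignition's fetch stages run in the initramfs, a real override would have to be baked into the initrd rather than placed on the root filesystem:

```ini
# ignition-fetch.service.d/10-bump-timeout.conf
# Hypothetical sketch: unit name, binary path, and value format are
# assumptions, not a documented interface.
[Service]
ExecStart=
ExecStart=/usr/bin/ignition -fetchTimeout 300 -platform openstack -stage fetch
```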

@bgilbert
Contributor

The -fetchTimeout option only affects the HTTP timeout, which isn't the issue here, so I'm not sure Andrew's response was on point.

Are we actually constrained from setting an arbitrarily long timeout on OpenStack? I think a config drive should still appear if no userdata is specified, and a metadata service will presumably return 404 in that case. So the only case where the timeout should fire is an OpenStack cloud that implements neither service, but such a cloud isn't suitable for running Ignition anyway.

@cgwalters
Member

For e.g. AWS, we never time out, right? A quick glance at git annotate says we should ask @crawford why he picked 30s in af1a535.

So the only case where the timeout should fire is an OpenStack cloud that implements neither service, but such a cloud isn't suitable for running Ignition anyway.

Yep agree.

@crawford
Contributor

there should be an explicit dependency of ignition from network-manager or NetworkManager-wait-online.service

This isn't going to work, because not all Ignition configs require the network in order to be fetched or evaluated (and machines legitimately might never reach network-online.target).

Right, the big mess here is ignition/afterburn wants to support OpenStack instances without a metadata service too (which apparently exist).

This is correct. It takes a variable amount of time for the OS to probe attached config-drives and/or contact the network metadata service (assuming either one exists!). The trade-off is between supporting slow environments like this ("slow" being defined as anything that takes longer than my arbitrarily-chosen thirty-second timeout) and forcing users to wait needlessly to boot in instances where they know that no config will be present.

The ideal solution would be one where we somehow know whether or not to wait for a config (e.g. a distinct openstack-with-metadata image or some signal from OpenStack itself). Realistically though, I don't think many folks are booting FCOS without an Ignition config (do we support the old coreos.autologin kernel argument?) and nobody is booting RHCOS without one. This might be a good time to rip off the band-aid and require that FCOS boot with an Ignition config. That would allow us to wait indefinitely and sidestep the problem.

@cgwalters
Member

Realistically though, I don't think many folks are booting FCOS without an Ignition config (do we support the old coreos.autologin kernel argument?)

This is coreos/fedora-coreos-tracker#279 - basically with FCOS we do support this because you can use the provider-injected SSH keys.

and nobody is booting RHCOS without one.

Agree.

This might be a good time to rip off the band-aid and require that FCOS boot with an Ignition config.

That said, aren't we conflating "no configuration" with "no metadata service"? Not providing a config should result in the metadata service replying successfully with an empty file, not failing to be present, right?

@jlebon
Member

jlebon commented Sep 15, 2020

I think a config drive should still appear if no userdata is specified

Is that really the case? If so, then yeah this exactly matches the medium vs media semantics I was going for in the QEMU case in #928.

So if indeed we're guaranteed (within our supported use cases) that either the config drive or the metadata server is present, I think we could remove timeouts entirely for OpenStack, matching other clouds where we already do this.

@cgwalters
Copy link
Member

Is that really the case? If so, then yeah this exactly matches the medium vs media semantics I was going for in the QEMU case in #928.

I'm not sure; but we need to nail this down.

@crawford
Contributor

It sounds like there is agreement that booting FCOS on OpenStack without an Ignition config, config-drive, or metadata service is very much an edge case. Since there are so few users of that flow, we should be okay increasing the timeout from thirty seconds, notwithstanding the broader changes around medium vs. media.

@bgilbert
Contributor

bgilbert commented Sep 15, 2020

I think a config drive should still appear if no userdata is specified

Is that really the case?

We should test, but the config drive has a lot of other metadata, so I presume so.

So if indeed we're guaranteed (within our supported use cases) that either the config drive or the metadata server is present, I think we could remove timeouts entirely for OpenStack, matching other clouds where we already do this.

I was originally going to argue for a long timeout, perhaps 10-15 minutes, as a safety valve for broken environments. But @jlebon convinced me OOB that we should behave the same as other platforms and block indefinitely until we learn whether we have userdata or not.

The current logic will need some cleanup though. @jlebon and I turned up a few things:

  • In fetch-offline we only return ErrNeedNet after timing out the config-drive fetcher, which currently adds 30 seconds of startup overhead when there's no config-drive. We think the right fix is to immediately return ErrNeedNet in fetch-offline on OpenStack. We could set a shorter timeout (avoid starting network if the config drive appears within 2-5 seconds) but we'd end up sometimes starting network on a heavily loaded system, and the inconsistency seems worse than always doing it. @jlebon is going to submit a fix for this one.
  • If the config drive has no userdata, we correctly treat that as an authoritative result. But if the metadata service returns 404, this is converted to ErrNotFound and we continue waiting for the config drive. We should treat ErrNotFound as authoritative as well.
  • If all config sources fail, we continue to block until the timeout is reached. We should e.g. send errors through a channel to the caller and stop waiting once we've received N of them (see the sketch after this list).
  • If the config drive succeeds, we leak the metadata service goroutine. The right fix is to plumb Context into the fetcher, but that'd be a lot of work across all supported URL schemes. The leak seems reasonably harmless so there doesn't seem to be any urgency to fixing it.
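
Taken together, the second and third items suggest a shape roughly like the following Go sketch (illustrative only, under assumed names such as fetcher and errNotFound; this is not Ignition's actual code). Every source reports back on a channel; an authoritative answer, either a config or a definite "no userdata", wins immediately; and once all sources have reported hard failures we stop waiting instead of running out the clock:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for an authoritative "source reachable, but no
// userdata" result, e.g. a 404 from the metadata service.
var errNotFound = errors.New("config not found")

type fetcher func(context.Context) ([]byte, error)

type result struct {
	cfg []byte
	err error
}

func fetchConfig(ctx context.Context, sources []fetcher) ([]byte, error) {
	// Buffered so losing goroutines can send and exit instead of leaking.
	ch := make(chan result, len(sources))
	for _, src := range sources {
		go func(f fetcher) {
			cfg, err := f(ctx)
			ch <- result{cfg, err} // errors flow back through the channel
		}(src)
	}
	for pending := len(sources); pending > 0; pending-- {
		select {
		case r := <-ch:
			if r.err == nil || errors.Is(r.err, errNotFound) {
				return r.cfg, r.err // success, or authoritative "no config"
			}
			// Hard failure: keep waiting for the remaining sources.
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.New("all config sources failed")
}

func main() {
	configDrive := func(ctx context.Context) ([]byte, error) {
		return nil, errors.New("no config drive attached") // hard failure
	}
	metadataService := func(ctx context.Context) ([]byte, error) {
		time.Sleep(100 * time.Millisecond) // simulated network delay
		return []byte(`{"ignition":{"version":"3.0.0"}}`), nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	cfg, err := fetchConfig(ctx, []fetcher{configDrive, metadataService})
	fmt.Println(string(cfg), err)
}
```

Passing Context into each fetcher is also the direction the last bullet points to; in this toy version the buffered channel merely keeps losing goroutines from blocking forever.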

jlebon added a commit to jlebon/ignition that referenced this issue Sep 16, 2020
This is a follow-up to (and revert of) coreos#1057.

On OpenStack, we don't actually know if we're fetching from the
config-drive or from the metadata server.

In theory, if it's from the config-drive, we don't strictly need
networking, and so `fetch-offline` would work. The issue is that in the
more common case of the metadata server, `fetch-offline` will still wait
the full 30s for the config-drive to also time out, only to then have to
bring up networking and run the `fetch` stage.

Instead, let's just accept the brokenness of the OpenStack provider and
declare it as always requiring networking.

For more information, see:
coreos#1081 (comment)
@arithx
Contributor

arithx commented Sep 22, 2020

  • In fetch-offline we only return ErrNeedNet after timing out the config-drive fetcher, which currently adds 30 seconds of startup overhead when there's no config-drive. We think the right fix is to immediately return ErrNeedNet in fetch-offline on OpenStack. We could set a shorter timeout (avoid starting network if the config drive appears within 2-5 seconds) but we'd end up sometimes starting network on a heavily loaded system, and the inconsistency seems worse than always doing it. @jlebon is going to submit a fix for this one.

Fixed in #1094

  • If the config drive has no userdata, we correctly treat that as an authoritative result. But if the metadata service returns 404, this is converted to ErrNotFound and we continue waiting for the config drive. We should treat ErrNotFound as authoritative as well.

Fixed in #1095

  • If all config sources fail, we continue to block until the timeout is reached. We should e.g. send errors through a channel to the caller and stop waiting once we've received N of them.

Fixed in #1095

And finally, the removal of the timeout is pending in #1098.

andymcc added a commit to andymcc/installer that referenced this issue Sep 28, 2020
@nashford77

Is there an image version I can download that works with CoreOS? I seem to be having this issue with all versions.

@jlebon
Member

jlebon commented Mar 14, 2022

@nashford77 Can you open an issue on the tracker instead, describing the exact FCOS version used and the console logs?

This issue was closed.