openstack: metadata fetcher may stop retrying before network comes up #1081
Right, the big mess here is that Ignition/Afterburn wants to support OpenStack instances without a metadata service too (which apparently exist). I'm not really sure how we get out of this without creating a fully distinct
Is this somehow solved in cloud-init? If the exact dependency cannot be implemented at this point, it would be good to give operators some way to configure the timeouts, just as we can use coreos-firstboot-network to adjust network settings.
For reference, the same issue was reported at openshift/os#380. There, the initial reaction from Andrew was:
Are we actually constrained from setting an arbitrarily long timeout on OpenStack? I think a config drive should still appear if no userdata is specified, and a metadata service will presumably return 404 in that case. So the only case where the timeout should fire is an OpenStack cloud that implements neither service, but such a cloud isn't suitable for running Ignition anyway.
For e.g. AWS, we never time out, right? A quick glance at git annotate says we should ask @crawford why he picked 30s in af1a535.
Yep, agree.
This isn't going to work, because not all Ignition configs require the network to be fetched or evaluated (and machines legitimately might never reach network-online.target).
This is correct. It takes a variable amount of time for the OS to probe attached config drives and/or contact the network metadata service (assuming either one exists!). The trade-off is between supporting slow environments like this ("slow" being defined as anything that takes longer than my arbitrarily-chosen thirty-second timeout) and forcing users to needlessly wait to boot in instances where they know that no config will be present. The ideal solution would be one where we somehow know whether or not to wait for a config (e.g. a distinct
This is coreos/fedora-coreos-tracker#279 - basically with FCOS we do support this because you can use the provider-injected SSH keys.
Agree.
That said, aren't we conflating "no configuration" with "no metadata service"? Not providing a config should result in the metadata service replying successfully with an empty file, not failing to be present, right?
Is that really the case? If so, then yeah, this exactly matches the medium vs media semantics I was going for in the QEMU case in #928. So if indeed we're guaranteed (within our supported use cases) that either the config drive or the metadata server is present, I think we could remove timeouts entirely for OpenStack, matching other clouds where we already do this.
I'm not sure, but we need to nail this down.
It sounds like there is agreement that booting FCOS on OpenStack without an Ignition config, config drive, or metadata service is very much an edge case. Since there are so few users of that flow, we should be okay to increase the timeout from thirty seconds, notwithstanding the broader changes around medium vs media.
We should test, but the config drive has a lot of other metadata, so I presume so.
I was originally going to argue for a long timeout, perhaps 10-15 minutes, as a safety valve for broken environments. But @jlebon convinced me OOB that we should behave the same as other platforms and block indefinitely until we learn whether we have userdata or not. The current logic will need some cleanup though. @jlebon and I turned up a few things:
This is a follow up to (and revert of) coreos#1057. On OpenStack, we don't actually know if we're fetching from the config-drive or from the metadata server. In theory, if it's from the config-drive, we don't strictly need networking, and so `fetch-offline` would work. The issue is that in the more common case of the metadata server, `fetch-offline` will still wait the full 30s for the config-drive to also time out only to then have to bring up networking and run the `fetch` stage. Instead, let's just accept the brokenness of the OpenStack provider and declare it as always requiring networking. For more information, see: coreos#1081 (comment)
Fixed in #1094
Fixed in #1095
Fixed in #1095. And finally, the drop of the timeout is pending in #1098.
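The behavior agreed on above — keep retrying indefinitely while the network is still coming up, but stop as soon as the metadata service gives a definitive answer (a config, or a 404 meaning "no userdata") — can be sketched in Python. Ignition itself is written in Go; the `fetch_config` helper, the exception types, and the backoff values here are illustrative assumptions, not Ignition's actual API:

```python
import time

def fetch_config(fetch, initial_delay=1.0, max_delay=30.0):
    """Retry fetch() indefinitely until it yields a definitive answer.

    fetch() is assumed to return the config bytes on success, raise
    LookupError for a definitive "no userdata" answer (e.g. HTTP 404),
    or raise OSError while the network is not up yet (connection
    refused / timeout). Only the last case is retried.
    """
    delay = initial_delay
    while True:
        try:
            return fetch()
        except LookupError:
            # The metadata service answered: there is simply no userdata.
            return None
        except OSError:
            # Network not ready yet: back off (capped) and retry forever.
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

The key design point from the discussion is that there is no overall deadline: a slow DHCP agent only delays boot, it never causes a spurious provisioning failure.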
This fixes numerous bugs, such as: - https://bugzilla.redhat.com/show_bug.cgi?id=1879690 - coreos/ignition#1081 - https://bugzilla.redhat.com/show_bug.cgi?id=1875567
Mirroring openshift#4206 this bumps the version of RHCOS for MA to mirror the version from openshift#4206. This fixes numerous bugs, such as: - https://bugzilla.redhat.com/show_bug.cgi?id=1879690 - coreos/ignition#1081 - https://bugzilla.redhat.com/show_bug.cgi?id=1875567
Is there an image version I can download that works? I seem to be having this issue with all versions.
@nashford77 Can you open an issue on the tracker instead, describing the exact FCOS version used and the console logs?
Bug
Operating System Version
fedora-coreos-32.20200809.3.0-openstack.x86_64
Ignition Version
2.6.0
Environment
OpenStack cloud and Fedora Coreos images.
Expected Behavior
An explicit dependency of Ignition on the completion of network configuration.
Actual Behavior
In high-load cloud environments, the DHCP agents can be started and configured with some delay, and this delay isn't predictable. Currently, Ignition tries 10 times within 30 seconds to get the config from the metadata service and fails with an error if the instance hasn't gotten an IP yet:
https://gist.github.com/ITD27M01/81409d4d72ee442c4787e39d9d15f2a9
You can see that Ignition failed, but NM gets an IP after a while.
I understand that this is difficult to reproduce in a devstack environment, but we often see such errors in our large cloud. As with cloud-init, there should be an explicit dependency of Ignition on NetworkManager or NetworkManager-wait-online.service, because it is racing with network setup, and no heuristic will help predict or determine the right timeout value.
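The ordering the reporter suggests is the kind of thing systemd drop-ins express. A minimal sketch of such a drop-in is below; the unit name `ignition-fetch.service` and the path are illustrative assumptions (Ignition's fetch stage actually runs in the initramfs, where the ordering would have to target the initrd network units instead, and as noted above this approach breaks configs that never need the network):

```ini
# Hypothetical drop-in: /etc/systemd/system/ignition-fetch.service.d/10-wait-online.conf
# Order the fetch stage after the network is fully up.
[Unit]
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service
```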
Reproduction Steps
Other Information
The same behavior can be reproduced for RHCOS:
https://gist.github.com/ITD27M01/2755c8b474ac1ffe4e39627386d0f3bd
An example race with a 0.1 s delay (for FCOS):