-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add oracle ipv6 single stack imds functionality #5785
add oracle ipv6 single stack imds functionality #5785
Conversation
3d1941e
to
f207b2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a full review, but I'm commenting on what I think are the big issues to be addressed.
I think you may be trying to make the EphemeralIPNetwork
class do more than it needs to. The original purpose of the connectivity url was to see if we could connect to the IMDS BEFORE setting up our ephemeral connection. If we have connectivity due to something outside of cloud-init (i.e., klibc in initramfs on OCI), then cloud-init doesn't need to do the ephemeral network setup. However, you've modified it to attempt to setup the v6 address, and then check for connectivity afterwards. I don't think there's value added by adding the connectivity check after. If we've setup a connection (either ipv4, ipv6, or both), we can assume we have the connectivity we need.
Additionally, the context manager in EphemeralIPNetwork
is now attempting to signal to the caller (via ipv6_reached_at_url
) if it should be using ipv6 or not. This isn't really the job of EphemeralIPNetwork though, and it's currently preventing v4 from being setup if v6 works, and preventing v6 from being setup if no v6 callback was provided. This is a fairly large change in behavior that would break EC2 as implemented.
I think that the only changes that were needed to ephemeral.py
are around allowing adding support for multiple connectivity urls and updating EphemeralIPNetwork
to do the connectivity check early in the context manager (see my inline comment for more details).
16d2927
to
042cca9
Compare
12c26f4
to
697b4bd
Compare
b3f8d71
to
ccdaf55
Compare
@TheRealFalcon all changes have been made to allow for a cleaner happy eyeballs ish approach for ephemeral networking. ready for re-review! |
9b25497
to
b47f284
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a number of inline comments, but nothing super major this time.
I haven't looked at tests yet, but I also remember you saying you didn't plan on updating tests until the approach is solid.
Also, can you remind me what local testing has looked like so far? Were you able to test on both native and paravirtualized instances? Are both ipv6-only and dual stack both currently supported? Were there cases that didn't include a pre-existing initramfs network?
cloudinit/net/ephemeral.py
Outdated
][0] | ||
return headers | ||
|
||
def _perform_connectivity_check( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have two functions doing the same thing. Any reason we can't merge them? This one could just become a generic function in this file and we could remove the other one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheRealFalcon I understand the sentiment but I feel like they are slightly different functions. Previously, I tried using the existing has_url_connectivity
and it did not seem to fit my needs. Would like to discuss this further with you.
cloudinit/net/ephemeral.py
Outdated
url_data["url"] for url_data in self.connectivity_urls_data | ||
], | ||
headers_cb=self._headers_cb, | ||
timeout=0.1, # keep really short for quick failure path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The connectivity_url structure can specify a timeout. We should use that if specified. Also, 0.1
feels a bit short here. I don't think it's uncommon that a standard GET request can take longer than that, even over link local.
cloudinit/net/ephemeral.py
Outdated
@@ -456,5 +465,135 @@ def __enter__(self): | |||
raise exceptions[0] | |||
return self | |||
|
|||
def _do_ipv4( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could use a slightly more descriptive name here. Something like _obtain_ephemeral_v4_address
maybe? Same for the _do_ipv6
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... _do_ipv4 and _do_ipv6 definitely leave something to be desired... lol
cloudinit/net/ephemeral.py
Outdated
@@ -456,5 +465,135 @@ def __enter__(self): | |||
raise exceptions[0] | |||
return self | |||
|
|||
def _do_ipv4( | |||
self, ephemeral_obtained, exceptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These parameters don't seem necessary. Mutable parameters make the state harder to follow, and in this case since we're returning them too, there doesn't seem to be a need to pass anything. Both v4 and v6 functions can instead return their ephemeral_obtained
value along with the exception (if it exists), and the caller should be able to know what to do with them.
cloudinit/net/ephemeral.py
Outdated
exceptions.append(e) | ||
return ephemeral_obtained, exceptions | ||
|
||
def _do_ipv6( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems really similar to the _do_ipv4
method...so much so that I think they could really be one method that takes an option telling them whether to do ipv4 vs ipv6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm interesting. I do agree. will have to look at the ramifications of doing so for unit test mocking.
start_time = time.monotonic() | ||
instance_url, instance_response = wait_for_url( | ||
urls, | ||
url_that_worked, instance_response = wait_for_url( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we want our happy eyeballs, not in the latter call. We should use a connect_synchronously=False
here along with no specified sleep time.
Once we know this worked, we can use the url_that_worked
to know which url to use for all future calls.
Also, nit, but url_that_worked
implies that the others the others don't. Let's see if we can come up with a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh shoot good catch on the happy eyeballs.
Also, I agree that url_that_worked is not great but am unsure if i can come up with something that much better. i shall see.
if not instance_url: | ||
LOG.warning("Failed to fetch IMDS metadata!") | ||
return None | ||
if not url_that_worked: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do an if not...
along with an else
. If we swap the blocks we can have if url_that_worked:
with an else
instead.
self.metadata_address = _get_versioned_metadata_base_url( | ||
url=url_that_worked | ||
) | ||
if _is_ipv4_metadata_url(self.metadata_address): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these logs are redundant. Didn't we already log our IMDS address in read_opc_metadata()
?
`fetch_vnics_data` is True, else None | ||
or None if fetching metadata failed | ||
|
||
A tuple containing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for defining the OpcMetadata named tuple was to simplify what this function returns. If we're returning a bunch of things at once, the other options are kind of ugly.
E.g.,
(
version,
instance_data,
vnics_data
) = read_opc_metadata(...)
or
metadata = read_opc_metadata(...)
...
data = self._crawled_metadata = metadata[1]
self._vnics_data = fetched_metadata[2]
A named tuple simplifies the typing as well as the call and usage semantics.
All that to say, instead of changing the return type, can we add what you want to return into the named tuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah wait a minute... good point. 🤦 will do.
for index, vnic_dict in enumerate(vnics_data): | ||
is_primary = set_primary and index == 0 | ||
mac_address = vnic_dict["macAddr"].lower() | ||
is_ipv6_only = vnic_dict.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is dual-stack not an option? The either/or here feels odd to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dual stack IS an option. but if it is dual stack, the primary network will be ipv4. i.e. it is either ipv6 only (ipv6 single stack) OR it is not ipv6 only (either ipv4 single stack or ipv4 and ipv6 dual stack).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
responding to @TheRealFalcon comments. will begin making changes and applying feedback now.
cloudinit/net/ephemeral.py
Outdated
self.connectivity_urls_data = connectivity_urls_data | ||
|
||
# will be updated by the context manager | ||
self.imds_reached_at_url: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. this was still left over from when this value was set from the _do_ipv4 and _do_ipv6 helper functions!
cloudinit/net/ephemeral.py
Outdated
self.distro, | ||
self.interface, | ||
) | ||
self.imds_reached_at_url = self._perform_connectivity_check() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion! definitely more readable.
cloudinit/net/ephemeral.py
Outdated
@@ -456,5 +465,135 @@ def __enter__(self): | |||
raise exceptions[0] | |||
return self | |||
|
|||
def _do_ipv4( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... _do_ipv4 and _do_ipv6 definitely leave something to be desired... lol
cloudinit/net/ephemeral.py
Outdated
][0] | ||
return headers | ||
|
||
def _perform_connectivity_check( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheRealFalcon I understand the sentiment but I feel like they are slightly different functions. Previously, I tried using the existing has_url_connectivity
and it did not seem to fit my needs. Would like to discuss this further with you.
cloudinit/net/ephemeral.py
Outdated
exceptions.append(e) | ||
return ephemeral_obtained, exceptions | ||
|
||
def _do_ipv6( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm interesting. I do agree. will have to look at the ramifications of doing so for unit test mocking.
ipv4=True, | ||
connectivity_urls_data=CONNECTIVITY_URLS_DATA, | ||
) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheRealFalcon hm..... good point. might need to discuss this a little further with you. at the very least logging a warning would be important. This is meant to catch any of the exceptions purposefully raised by EphemeralIPNetwork, which should be ProcessExecutionError
and NoDHCPLeaseError
. So maybe just catching for those would be best.
for index, vnic_dict in enumerate(vnics_data): | ||
is_primary = set_primary and index == 0 | ||
mac_address = vnic_dict["macAddr"].lower() | ||
is_ipv6_only = vnic_dict.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dual stack IS an option. but if it is dual stack, the primary network will be ipv4. i.e. it is either ipv6 only (ipv6 single stack) OR it is not ipv6 only (either ipv4 single stack or ipv4 and ipv6 dual stack).
`fetch_vnics_data` is True, else None | ||
or None if fetching metadata failed | ||
|
||
A tuple containing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah wait a minute... good point. 🤦 will do.
start_time = time.monotonic() | ||
instance_url, instance_response = wait_for_url( | ||
urls, | ||
url_that_worked, instance_response = wait_for_url( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh shoot good catch on the happy eyeballs.
Also, I agree that url_that_worked is not great but am unsure if i can come up with something that much better. i shall see.
@holmanb , do you mind reviewing the changes to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-dubs , I'm trying to give feedback quickly, so here's a few more comments I had
cloudinit/net/ephemeral.py
Outdated
if _check_connectivity_to_imds(self.connectivity_urls_data): | ||
LOG.debug("We already have connectivity to IMDS, skipping DHCP.") | ||
return self | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the if
condition returned, we can remove this else
entirely.
cloudinit/net/ephemeral.py
Outdated
) | ||
# short-circuit if we already have connectivity to IMDS | ||
if _check_connectivity_to_imds(self.connectivity_urls_data): | ||
LOG.debug("We already have connectivity to IMDS, skipping DHCP.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that the log on line 302 logs the address used. We should do that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the _check_connectivity_to_imds already logs the url so it felt redundant but might as well add since we are already logging something anyways.
|
||
OpcMetadata = namedtuple("OpcMetadata", "version instance_data vnics_data") | ||
ReadOpcMetadataResponse = namedtuple( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're updaing this anyway and you're adding typing, we should probably convert it to the newer style:
class ReadOpcMetadataResponse(NamedTuple):
version: int
instance_data: Dict[str, Any]
vnics_data: Optional[Dict[str, Any]]
imds_url_used: str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm definitely down. But, while we're at it, shouldn't we just do a dataclass then? since that is python 3.7+ and we are on 3.8 now? I much prefer that over a named tuple tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I figured it's easier to still keep the same data types, but I'm not against it. I think I still slightly prefer named tuples though if there's no need for a class. They're immutible, hashable, and implemented in c so theoretically faster. Why would you rather use dataclasses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough! I'll keep it as a named tuple then. It's just personal preference for data classes. But for something this simple, I agree named tuples fit the job better.
|
||
V2_HEADERS = {"Authorization": "Bearer Oracle"} | ||
|
||
CONNECTIVITY_URLS_DATA = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're leaning towards keeping less running during import time, just because it increases the load time of cloud-init. If it's also used in unit tests, it's fine as-is...though for unit tests, I also like to hard code the full URLs to ensure we're actually hitting the endpoints that we expect.
cloudinit/net/ephemeral.py
Outdated
exceptions.append(ipv6_exception) | ||
|
||
# need to set this if we only have ipv6 ephemeral network | ||
if not self.ipv4 or not ipv4_ephemeral_obtained: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this point this should be if ipv6_ephemeral_obtained and not ipv4_ephemeral_obtained
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmmm... fair enough. i shall try that out.
Also, let's not forget to remove the original function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of my comments
"headers": V2_HEADERS, | ||
}, | ||
nic_name = net.find_fallback_nic() | ||
# try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented code here and a few lines lower.
] | ||
|
||
url_that_worked = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line shouldn't be needed.
@@ -815,16 +815,19 @@ def test_multiple_files(self): | |||
|
|||
|
|||
@pytest.mark.usefixtures("disable_netdev_info") | |||
@mock.patch("cloudinit.net.ephemeral._check_connectivity_to_imds") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the changes in this file necessary? What was wrong with the responses
code?
# assert r.ok | ||
# sleep for 10s to allow cloud-init clean to run | ||
# oracle takes > 10s to boot so this is not slowing us down | ||
time.sleep(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove the reboot
part of cloud-init clean
, client.restart()
should work here and take care of any waiting necessary.
@@ -0,0 +1,186 @@ | |||
import time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is a great start, but I think we also need some testing beyond IMDS reachability. In particular:
- No long waits or timeouts when fetching the data
- Network rendered properly
- No errors or warnings in the logs
I think it's important to verify these things on both native and paravirtualized instances. I also realize some of this isn't really viable using our current integration testing tools and that this functionality isn't widely available yet, but I do think that means we also need some manual checks in the meantime.
a4a904a
to
bcbc950
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-dubs I just gave a first pass on the ephemeral code. No major objections, just a minor comment and a question.
the following keys: | ||
- "url" (str): The URL to check connectivity for. | ||
- "headers" (dict, optional): Headers to include in the request. | ||
- "timeout" (int, optional): Timeout for the request in seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this key used in any callsite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the timeouts are used to find the max timeout value specified and then they are passed to wait for url
d71305a
to
0b2890a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-dubs I think we should remove the dead code (either in a seperate commit in this PR or in a followup PR), and you mentioned having testing data to share. Once we have this in place I'm happy to merge this.
self.distro = distro | ||
self.interface_addrs_before_dhcp = netinfo.netdev_info() | ||
|
||
def __enter__(self): | ||
"""Setup sandboxed dhcp context, unless connectivity_url can already be | ||
reached.""" | ||
if self.connectivity_url_data: | ||
if net.has_url_connectivity(self.connectivity_url_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is now dead code. Can you please remove it in a separate commit or a followup PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things look good to me, but I also want to see the testing results before we merge. Thanks for all the work here!
70247b1
to
ebea3f7
Compare
…onnectivity_to_imds()
ebea3f7
to
efe56a1
Compare
Proposed Commit Message
N/A
We rebasing 😎
Additional Context
This is to enable future functionality on the Oracle Cloud and will have no immediate ramifications or generally available use cases for the meantime.
My personal acceptance criteria for this MP is that I maintain or improve code coverage for any code I touch, which in this case is
cloudinit/sources/DataSourceOracle.py
andcloudinit/net/ephemeral.py
. For the ephemeral networking changes I added, I made sure that I added sufficient unit tests so that theEphemeralIPNetwork
class and subsequent helper functions have 100% code and branch coverage. For the Oracle DS, I improved code coverage from 91% to 94% while adding nearly 40 statements and over 20 branches that would need covered.Test Steps
These changes were manually tested on both jammy (22.04) and noble (24.04), on Oracle Platform Ubuntu images for testing ipv4 and dual stack, and then on custom built images for ipv6 single stack testing. For this manual validation, the following things were checked:
The new oracle single-stack ipv6 integration tests was run against an existing instance with the following command:
Testing Backwards compatability:
Integration testing:
Ran full suite of integration tests against oracle using my custom
ezdeb
package and 5 tests failed, but they were all explainable / expected / unrelated to my changes.On focal (20.04), 11 tests failed due to running 24.X on focal using my
ezdeb
package.On noble (24.04), only 5 tests failed. And they occurred both with and without my custom
ezdeb
package. so I feel confident in saying that no integration tests have been broken by my changes.the tests that failed:
Merge type