
Add Azurestack Support #561

Merged (2 commits) on Apr 8, 2021
Conversation

@patrickdillon (Contributor) commented Mar 8, 2021

Updated as this is no longer WIP.

This PR:

  • adds a bare-bones azurestack provider
  • copies all of the logic from the azure provider
  • adjusts the logic behind hostname fetching to query the identity metadata service (not the new instance metadata endpoint)
  • removes support for attributes (reusing the azure provider's logic resulted in incorrect IP addresses)

I tested this and it pulled unique hostnames for a bootstrap node and 3 masters. The SSH keys also appear in the login screen.

Note: because this uses the identity metadata service, this will work in any Azure Stack environment.
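
For readers skimming the diff, here is a minimal, self-contained sketch of the hostname path described above. It is not the PR's exact code: `InstanceIdentity`, the stubbed `fetch_identity()`, and the sample vm name are illustrative stand-ins for whatever the provider actually deserializes from the identity metadata service.

    use anyhow::Result;

    /// Illustrative stand-in for the identity metadata response; only the
    /// vm name matters for hostname purposes.
    struct InstanceIdentity {
        vm_name: String,
    }

    struct AzureStack;

    impl AzureStack {
        /// Stub: the real provider performs an HTTP query against the
        /// identity metadata service here and deserializes the reply.
        fn fetch_identity() -> Result<InstanceIdentity> {
            Ok(InstanceIdentity {
                vm_name: "ash-bootstrap".to_string(),
            })
        }

        /// The vm name reported by the identity endpoint doubles as the
        /// hostname, similar to what the WALinuxAgent does.
        fn hostname() -> Result<Option<String>> {
            let identity = AzureStack::fetch_identity()?;
            Ok(Some(identity.vm_name))
        }
    }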

Further TODOs for this PR:

  • Remove changes to dracut/30afterburn/afterburn-hostname.service
  • Go through all of the logic copied from the azure provider, determine whether it is necessary, and strip out anything we don't need
  • Determine whether we want to query both the compute endpoint and the identity endpoint, falling back to one if the other doesn't work (a rough sketch of that fallback follows this list)
  • Clean up commits (better messages, squash FIXUPs)
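
On the fallback question, a purely hypothetical sketch (reusing the stub types from the sketch above; `fetch_compute_metadata()` and `ComputeMetadata` are invented for illustration and do not exist in this PR):

    /// Hypothetical shape of a compute instance metadata response.
    struct ComputeMetadata {
        name: String,
    }

    impl AzureStack {
        /// Hypothetical: stands in for a query against the compute endpoint,
        /// which was observed to be unreachable before check-in.
        fn fetch_compute_metadata() -> Result<ComputeMetadata> {
            anyhow::bail!("compute endpoint not reachable yet")
        }

        /// Prefer the compute endpoint, fall back to the identity endpoint.
        fn vm_name() -> Result<String> {
            match AzureStack::fetch_compute_metadata() {
                Ok(compute) => Ok(compute.name),
                Err(_) => Ok(AzureStack::fetch_identity()?.vm_name),
            }
        }
    }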

Regarding the compute endpoint:
From a running VM (in the lab environment) I am able to query it:

[core@ash-bootstrap ~]$ curl -H Metadata:true --noproxy "*" http://169.254.169.254/metadata/instance?api-version=2019-03-11 | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1484  100  1484    0     0   9160      0 --:--:-- --:--:-- --:--:--  9217
{
  "compute": {
    "azEnvironment": "AzureStack",

But when I hit that same endpoint during Afterburn, I get an error. Because I was able to get the vm name from the identity endpoint, I prioritized that; as future work we could add more support for the instance metadata service.

@patrickdillon (Contributor, Author):

/hold
WIP
/cc @jlebon

Review thread on dracut/30afterburn/afterburn-hostname.service (resolved)
Review comment on the new file's copyright header:

@@ -0,0 +1,90 @@
// Copyright 2017 CoreOS, Inc.

Member: 2021

This came up in discussion in another project as well. For this project at least, we should probably be able to drop all these blocks and just rely on the Cargo.toml license field. (To be clear, I mean this as a follow-up, not something you need to do here.)

@jlebon (Member) commented Mar 8, 2021

But when I hit that same endpoint from ignition, I get a 400. Any ideas?

You mean when running Afterburn from the initrd, right? Hmm, not sure offhand. Can you paste the logs?

@patrickdillon (Contributor, Author) commented Mar 8, 2021

You mean when running Afterburn from the initrd, right?

Yes, afterburn! I keep mistakenly saying "from ignition".

Hmm, not sure offhand. Can you paste the logs?

Looks like I did not save the logs. I remember there was minimal info in the error, just a 400 status code. I will run it again to capture them. Let me know if you have any suggestions for getting more detailed debug output.

@jlebon (Member) commented Mar 8, 2021

I remember there was minimal info in the error, with a 400 status code.

That's odd, I don't know why it would give a 400 in the initrd, but work in the real root. Hmm, wonder if it wants the node to check in first?

If you have interactive access to the console BTW, you can try booting with rd.break (which itself might not be possible), and then run commands in the initrd shell. (E.g. I'd start with doing the same curl command to make sure that it's not a difference between how curl and afterburn do the request, then try doing a checkin first, etc...)

@patrickdillon (Contributor, Author):

/retest

Comment on lines 310 to 311
let instance_metadata = AzureStack::fetch_identity();
Ok(Some(instance_metadata.unwrap().vm_name))
Member:

The linter is indirectly saying to use Err instead of unwrap(), e.g.:

Suggested change:

- let instance_metadata = AzureStack::fetch_identity();
- Ok(Some(instance_metadata.unwrap().vm_name))
+ let instance_metadata = AzureStack::fetch_identity()
+     .context("fetching instance identity")?;
+ Ok(Some(instance_metadata.vm_name))

Member:

The fetch_identity function looks good. The context stuff is like a human-friendly backtrace where you can add more error-prefixing as the error bubbles up the stack to make it easier to tell where the error came from. It's up to you when to add a .context(...)? vs just ?. If you think e.g. it's repetitive with the context already added by the callee, you can omit it in the caller. In this case for example, yeah it's probably superfluous given that my suggestion is more or less repeating the context provided by fetch_identity, and our caller ultimately adds the context that the failure happened while fetching the hostname:

        // write hostname if configured to do so
        self.hostname_file
            .map_or(Ok(()), |x| metadata.write_hostname(x))
            .context("writing hostname")?;

So you can probably just swap out .context("fetching instance identity")? with just ? in my suggestion above.
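
So the two lines in question end up as simply (sketch of the final form, per the exchange above):

    // fetch_identity() already attaches its own error context, and the caller
    // wraps the whole lookup in "writing hostname", so a bare `?` suffices.
    let instance_metadata = AzureStack::fetch_identity()?;
    Ok(Some(instance_metadata.vm_name))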

@patrickdillon force-pushed the azurestack branch 2 times, most recently from 5d8f3f2 to ccaaa2d on March 25, 2021 20:13
@patrickdillon (Contributor, Author):

/retest

@patrickdillon changed the title from "WIP Add Azurestack Support" to "Add Azurestack Support" on Mar 25, 2021
@patrickdillon (Contributor, Author):

@jlebon All work on this is done and tested. Can you give it a more thorough review or suggest how to get more eyes on it?

I have logs from testing master and bootstrap nodes if that will help.

Not sure why the continuous-integration/jenkins/pr-merge test is failing. Perhaps this is a test infra problem? I cannot replicate it locally. Also, can you tell me how to retrigger this test? I can't figure out a way; do I have to push a new empty commit?

@patrickdillon (Contributor, Author) commented Mar 25, 2021

The merge test is failing with:

[2021-03-25T20:16:35.302Z] + make
[2021-03-25T20:16:35.302Z] sed -e 's,@DEFAULT_INSTANCE@,'core',' < systemd/afterburn-sshkeys@.service.in > systemd/afterburn-sshkeys@.service.tmp && mv systemd/afterburn-sshkeys@.service.tmp systemd/afterburn-sshkeys@.service
[2021-03-25T20:16:35.302Z] cargo build --release
[2021-03-25T20:16:35.302Z] make: cargo: No such file or directory
[2021-03-25T20:16:35.302Z] make: *** [Makefile:23: all] Error 127

script returned exit code 2

This looks like a problem running cargo, i.e. a problem with the test infrastructure itself rather than with the PR.

@jlebon (Member) commented Mar 26, 2021

Yeah, CI was hitting issues yesterday. It should be fine now, so I've restarted another run.

@jlebon (Member) commented Mar 26, 2021

So in a first pass, there seems to be a lot of duplication between azure and azurestack. From what I can tell, all the crypto and goalstate stuff is exactly the same. We should try to dedupe this. Hmm, how about re-arranging things like this:

src/providers/azure/mod.rs
src/providers/azure/goalstate.rs
src/providers/azure/crypto/...
src/providers/azure/azure/mod.rs
src/providers/azure/azure/mock_test.rs
src/providers/azure/azurestack/mod.rs
src/providers/azure/azurestack/mock_test.rs

So then the top-level providers::azure module would just re-export the azure and azurestack (using pub mod) so that they could be used in the top-level match in metadata.rs. WDYT?
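
For concreteness, a sketch of what the parent module file could look like under that arrangement (the merged version ended up naming the parent module microsoft rather than azure, per the later comment; the exact item list here is illustrative):

    // src/providers/azure/mod.rs (sketch): the shared goalstate and crypto
    // code stays internal to the parent module, while the two platform
    // submodules are re-exported for the top-level match in metadata.rs.
    mod crypto;
    mod goalstate;

    pub mod azure;
    pub mod azurestack;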

Two review threads on src/providers/azurestack/mod.rs (outdated, resolved)
Commit message: Prep for adding an `azurestack` provider which shares a lot with `azure`.
@jlebon (Member) commented Mar 31, 2021

OK, updated this based on #561 (comment). The only change was that I made the top-level module microsoft instead of azure to make it less confusing (because otherwise we'd have azure::azure).

Still need to test this.

Commit message:

Adds support and tests for the Azure Stack platform. Sets the hostname
based on the vm name. Obtains SSH keys. Checks in to finish the boot
process. Unlike Azure, it does not offer any attributes. Follows the
same pattern as Azure, with some notable differences:

The hostname is obtained by parsing the vm name from the identity
metadata service, similar to the WALinuxAgent. Early attempts were
made to obtain the vm name from the compute instance metadata service,
which was available in a pre-release lab environment, but that
endpoint was not reachable before check-in. Considering the vm name
was available from the identity endpoint, accessing the instance
metadata service can be saved for future work.

SSH key functionality is the same as Azure.

Regarding attributes, obtaining the IP addresses in the same manner as
Azure led to incorrect values, so they were removed, as was vmsize for
simplicity. Therefore no attributes are available.

Check-in is the same process as for Azure.

Co-authored-by: Ben Howard <ben.howard@redhat.com>
Co-authored-by: Jonathan Lebon <jonathan@jlebon.com>
@jlebon (Member) commented Mar 31, 2021

Now with authorship actually preserved! :)

@lucab (Contributor) commented Apr 1, 2021

@jlebon there still seems to be quite a bit of overlap between the azure and azurestack code (at least in constants/macros/helpers/tests). Is the plan to keep polishing here, or is that for later?

@ashcrow (Member) commented Apr 6, 2021

Other than the possible overlap of constants, etc., is there anything else holding this PR up?

@jlebon (Member) commented Apr 6, 2021

@jlebon there seems to be still quite a bit of overlap between azure and azurestack code (at least in constants/macros/helpers/tests). Is the plan to keep polishing here, or is that for later?

Other than possible overlap of constants/etc.. is there anything else holding this PR up?

To answer both of you at the same time: this PR is being tested in ASH, which is painful to do. If testing comes back green, my suggestion is to get this in as is and tweak things in follow-ups.

@patrickdillon (Contributor, Author):

I tested this in Azure Stack and everything looked good. I did not test SSH key fetching, which I am not familiar with, but I'm happy to help further.

@jlebon (Member) left a review:

I've also finally been able to successfully test this in ASH, including SSH key fetching (which matters more for FCOS):

[root@jlebon-vm-bootstrap ~]# journalctl -u afterburn-sshkeys@core.service
-- Logs begin at Thu 2021-04-08 01:43:58 UTC, end at Thu 2021-04-08 03:27:48 UTC. --
Apr 08 02:00:22 jlebon-vm-bootstrap systemd[1]: Starting Afterburn (SSH Keys)...
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.046 WARN Failed to get fabric address from DHCP: maximum number of retries (60) reached
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.046 INFO Using fallback address
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.055 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.062 INFO Fetch successful
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.062 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.219 INFO Fetch successful
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.270 INFO Fetching http://168.63.129.16/machine/70f676d8-fb84-4921-9716-96cc629c05a7/...
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: Apr 08 02:00:51.577 INFO Fetch successful
Apr 08 02:00:51 jlebon-vm-bootstrap afterburn[1929]: wrote ssh authorized keys file for user: core
Apr 08 02:00:51 jlebon-vm-bootstrap systemd[1]: afterburn-sshkeys@core.service: Succeeded.
Apr 08 02:00:51 jlebon-vm-bootstrap systemd[1]: Started Afterburn (SSH Keys).
[root@jlebon-vm-bootstrap ~]# cat /home/core/.ssh/authorized_keys.d/afterburn
ssh-rsa AAA...

So my vote is to get this in and we can do more cleanups afterwards. @lucab Would appreciate a stamp to sanity-check there isn't anything majorly wrong.

@jlebon (Member) commented Apr 9, 2021

One thing worth noting is that in the base case, ASH support also requires coreos/fedora-coreos-config#935 so that networking is available in the initrd. For OKD/OCP it works without that, because we use pointer Ignition configs, so Ignition itself needs networking and requests it.
