[WIP] Improve networking options for libvirtd target #922

Closed
wants to merge 12 commits into from

Conversation

mbrgm
Member

@mbrgm mbrgm commented Apr 13, 2018

This builds upon #824 and adds the possibility to connect libvirt guests via bridged networking. It also adds the qemu guest agent to guests, which could be helpful for #881.

There's still work to be done, especially on documentation, but also on checking whether this introduces regressions and on keeping backwards compatibility with the existing deployment.libvirtd.networks option.

However, it already works quite well (at least as far as I tested for bridged networking).

/cc @erosennin @teto

erosennin and others added 12 commits April 9, 2018 12:26
Otherwise, deployments with multiple VMs try to write to the same image.

Also, do the temp_image_path calculation only once.

`MachineState.run_command()` passes SSH flags to `self.ssh.run_command()`.
However, `self.get_ssh_flags()` is already registered as a `ssh_flag_fun` in the
class `__init__()` function, so `ssh_util.SSH` already uses it to get the flags
when initiating a connection. This led to the SSH flags being duplicated, which
causes an error for some flags (e.g. the `-J` flag, which can only be specified once).

Offers better separation, especially when additional features will be added.

This helps in situations when there's no network connectivity to the guest, e.g.
when the hypervisor host can be reached via a VPN, but the guest host cannot.

This mainly adds driver support for virtio drivers to the initial image of the
guest provisioning.

This is a WIP!

- Replaced `deployment.libvirtd.networks` option with a submodule to allow not
  only (libvirt) network names, but other networking types as well.
- Domain XML was adjusted accordingly to incorporate the parameters from the new
  `networks` submodule.
- Added the qemu guest agent to guests to allow for out-of-band
  communication (no need for network connectivity) with the hypervisor.
- Guest IP (for provisioning after guest has started) is no longer determined by
  waiting for the guest to get a DHCP lease in the hypervisor libvirt network.
  If the guest has a static IP, it won't ask for a DHCP lease. Also, for bridged
  networking, we probably will not have access to the DHCP server.
- Instead, the address of the first interface is retrieved from libvirt using
  the `VIR_DOMAIN_INTERFACE_ADDRESSES_SRC_AGENT` method, which can now be done
  because of the newly added qemu guest agent.
type = types.listOf (types.submodule (import ./network-options.nix {
inherit lib;
}));
default = [];
Member

  `default = [{ source = "default"; type= "bridge"; }];` might be best for backwards compatibility.

-    k.get("value")
-    for k in x.findall("attr[@name='networks']/list/string")]
+    LibvirtdNetwork.from_xml(n)
+    for n in x.findall("attr[@name='networks']/list/*")]
+assert len(self.networks) > 0
Member

With the current default (an empty list), we hit this assert.

@teto
Member

teto commented Apr 16, 2018

+∞ thanks.
It will be meaningful to review once #824 gets merged. I believe the qemu-agent might be best upstreamed to nixpkgs (let me know if you need help with that, see NixOS/nixpkgs#34722).

Just a quick tip for other testers: you can set up your networks with

    deployment.libvirtd.networks = [
      { source = "default"; type = "virtual"; }
    ];

In fact, you have to define at least one network, otherwise you'll hit the assert len(self.networks) > 0.
I am going to test it today. Thanks once again.
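
For anyone hitting the assert with an existing deployment, one way to keep the old behaviour would be to normalize the option before it reaches the backend. This is only a minimal sketch: `normalize_networks` and `DEFAULT_NETWORK` are hypothetical names that do not exist in this PR, and it assumes that legacy entries are bare libvirt network names that should map to the `virtual` type.

```python
# Hypothetical helper, not part of this PR. It accepts both the legacy list of
# libvirt network names and the new list of submodule-style dicts.
DEFAULT_NETWORK = {"source": "default", "type": "virtual"}


def normalize_networks(networks):
    """Return a non-empty list of {"source": ..., "type": ...} dicts."""
    if not networks:
        # Assumption: fall back to the libvirt network named "default"
        # instead of failing an assert on an empty list.
        return [dict(DEFAULT_NETWORK)]

    normalized = []
    for entry in networks:
        if isinstance(entry, str):
            # Legacy form: a bare libvirt network name.
            normalized.append({"source": entry, "type": "virtual"})
        else:
            normalized.append({"source": entry["source"], "type": entry["type"]})
    return normalized


print(normalize_networks([]))
print(normalize_networks(["default", {"source": "br0", "type": "bridge"}]))
```

With something like this in place, an empty or string-based `networks` value would still produce at least one network definition.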

@teto
Member

teto commented Apr 16, 2018

Did you run into this problem?

client> starting...
server> starting...
libvirt:  error : Cannot get interface MTU on 'default': No such device
libvirt:  error : Cannot get interface MTU on 'default': No such device
error: Multiple exceptions (2): 
  * client: Cannot get interface MTU on 'default': No such device
  * server: Cannot get interface MTU on 'default': No such device

I have tried adding <mtu size='9000'/> everywhere without much success (libvirt's 'default' network uses a bridge named virbr0). Investigating.

EDIT: Found the problem, see the review.

@mbrgm
Member Author

mbrgm commented Apr 16, 2018

@teto No I'm sorry... in my setup everything works just fine with the current state. You might search the qemu sources for this error message and try to work your way backwards to the cause.

RE reviewing this PR: As you suggested, I'm going to do more work once #824 is merged, so I can be sure I don't have to rebase several times.

maybe_mac(n),
' <source network="{0}"/>',
' <interface type="{interface_type}">',
' <source {interface_type}="{source}"/>',
' </interface>',
Member

I have problems when configuring bridges, such as: Cannot get interface MTU on 'default': No such device
As a quick hack, since I work with bridges, I more or less reverted the fix with this hybrid change:

       def iface(n):
           return "\n".join([
               '    <interface type="network">',
               '      <source network="{0}"/>',
               '    </interface>',
               ]).format(n.source)

which works

Member

Never mind, I used the wrong type. I should have used virtual instead (bridge seems to expect the host bridge name, while virtual expects the libvirt network name, IIUC). I have reverted my changes.
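
To make the difference between the two types concrete, here is a small sketch of how the `<interface>` element could be generated for each. It is not the code from this PR: `interface_xml` and `LIBVIRT_INTERFACE_TYPE` are hypothetical names, and the mapping of the option's `virtual` type to libvirt's `<interface type="network">` is an assumption based on the reverted snippet above.

```python
import xml.etree.ElementTree as ET

# Assumption: the option's "virtual" type corresponds to libvirt's
# <interface type="network">, and "bridge" to <interface type="bridge">.
LIBVIRT_INTERFACE_TYPE = {"virtual": "network", "bridge": "bridge"}


def interface_xml(source, net_type):
    """Build an <interface> element for a libvirt domain definition."""
    libvirt_type = LIBVIRT_INTERFACE_TYPE[net_type]
    iface = ET.Element("interface", {"type": libvirt_type})
    # The <source> attribute name matches the interface type:
    # network="<libvirt network name>" or bridge="<host bridge device>".
    ET.SubElement(iface, "source", {libvirt_type: source})
    return ET.tostring(iface, encoding="unicode")


# A libvirt network is referenced by its name, a bridge by its host device.
print(interface_xml("default", "virtual"))
print(interface_xml("virbr0", "bridge"))
```

Passing the libvirt network name `default` with type `bridge` would make libvirt look for a host device called `default`, which matches the "No such device" error reported earlier.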

@teto
Member

teto commented Apr 16, 2018

I found a fix thanks ;)

@mbrgm
Member Author

mbrgm commented Apr 16, 2018

@teto Can you post this fix here?

@mbrgm
Member Author

mbrgm commented Apr 16, 2018

@teto By "I believe the qemu-agent might be best upstreamed to nixpkgs", do you mean it should be added as a NixOS module?

@teto
Member

teto commented Apr 16, 2018

Basically, I reverted some of your changes when generating the interface XML:

       def iface(n):
           return "\n".join([
               '    <interface type="network">',
               '      <source network="{0}"/>',
               '    </interface>',
               ]).format(n.source)

~~I guess you are using the virtual type while I use bridge? Thus it's not a proper fix; it might break things on your end.~~ I should have used virtual: fixed.

When booting I also get

libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected

This might be related to my iface change. I wonder how you set up your VM. Could you share your deployment.libvirtd.networks and nixpkgs version, please?

systemd.services.qemu-guest-agent = {
description = "QEMU Guest Agent";
bindsTo = [ "dev-virtio\\x2dports-org.qemu.guest_agent.0.device" ];
after = [ "dev-virtio\\x2dports-org.qemu.guest_agent.0.device" ];
Member

For some reason this creates a problem with my config

@teto
Member

teto commented Apr 18, 2018

I found the culprit: for some reason the agent won't work, mostly because of

      bindsTo = [ "dev-virtio\\x2dports-org.qemu.guest_agent.0.device" ];
      after = [ "dev-virtio\\x2dports-org.qemu.guest_agent.0.device" ];

I've opened a PR here NixOS/nixpkgs#39099 and with it I am able to make use of your PR \o/
(My messy branch https://github.com/teto/nixops/tree/qemu_agent)
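
For readers puzzled by the unit name: `dev-virtio\x2dports-org.qemu.guest_agent.0.device` is the systemd-escaped form of the device path `/dev/virtio-ports/org.qemu.guest_agent.0` (the backslash is doubled in the Nix string). The sketch below reimplements the escaping rules as I understand them, in simplified form; `systemd_escape_path` is a hypothetical helper, and the authoritative tool is `systemd-escape --path`.

```python
def systemd_escape_path(path):
    """Simplified version of the escaping done by `systemd-escape --path`."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-"  # the root directory is a special case
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            # Path separators become dashes.
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            # ASCII alphanumerics, ":", "_" and non-leading "." are kept.
            out.append(ch)
        else:
            # Everything else, including "-" itself, becomes \xNN.
            out.append("\\x%02x" % byte)
    return "".join(out)


unit = systemd_escape_path("/dev/virtio-ports/org.qemu.guest_agent.0") + ".device"
print(unit)
```

The service can only start once that device unit shows up in the guest, which requires the virtio channel for the agent to be present in the domain definition.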


addrs = first_iface.get('addrs', [])

return addrs[0]['addr']
Member

  * client: 'NoneType' object has no attribute '__getitem__'
  * server: 'NoneType' object has no attribute '__getitem__'

Sometimes addrs[0] won't have 'addr' (if the DHCP request was not received), which generates the error above. I guess this should be return addrs[0].get('addr', None).
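
A more defensive lookup could look like the sketch below. It is not the code from this PR: `first_address` is a hypothetical helper, and it assumes the agent result is a dict keyed by interface name whose entries carry `hwaddr` and an `addrs` list that may be missing, `None`, or empty while the guest has no address yet.

```python
def first_address(ifaces, mac):
    """Return the first address of the interface with the given MAC, or None."""
    for iface in ifaces.values():
        if iface.get("hwaddr") != mac:
            continue
        # "addrs" may be missing, None, or empty while the guest has no address.
        for entry in iface.get("addrs") or []:
            addr = (entry or {}).get("addr")
            if addr:
                return addr
    return None


# Sample data in the assumed shape; the MAC and IP values are made up.
ifaces = {
    "lo": {"hwaddr": "00:00:00:00:00:00", "addrs": [{"addr": "127.0.0.1"}]},
    "eth0": {"hwaddr": "52:54:00:12:34:56", "addrs": None},
}
print(first_address(ifaces, "52:54:00:12:34:56"))  # no address yet

ifaces["eth0"]["addrs"] = [{"addr": "192.168.122.45"}]
print(first_address(ifaces, "52:54:00:12:34:56"))
```

Returning `None` instead of raising lets the caller keep polling until the guest has obtained an address.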

@Izorkin

Izorkin commented Apr 26, 2018

Is it possible to use virtio by default? That is, change <target dev="hda"/> to <target dev="hda" bus="virtio"/> and change the network type to virtio?

@Izorkin

Izorkin commented Apr 27, 2018

How can I fix this error?

nixops deploy -d test1
Traceback (most recent call last):
  File "/home/user/works/src_nix/nixops/scripts/nixops", line 984, in <module> args.op()
  File "/home/user/works/src_nix/nixops/scripts/nixops", line 406, in op_deploy max_concurrent_activate=args.max_concurrent_activate)
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 1051, in deploy self.run_with_notify('deploy', lambda: self._deploy(**kwargs))
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 1040, in run_with_notify  f()
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 1051, in <lambda> self.run_with_notify('deploy', lambda: self._deploy(**kwargs))
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 900, in _deploy self.evaluate_active(include, exclude, kill_obsolete)
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 862, in evaluate_active self.evaluate()
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 360, in evaluate defn = _create_definition(x, cfg, cfg["targetEnv"])
  File "/home/user/works/src_nix/nixops/nixops/deployment.py", line 1249, in _create_definition return cls(xml, config)
  File "/home/user/works/src_nix/nixops/nixops/backends/libvirtd.py", line 71, in __init__ assert len(self.networks) > 0
AssertionError

@teto
Member

teto commented May 10, 2018

NixOS/nixpkgs#39099 (comment) got merged.
@Izorkin you can check out some changes (mixed with other things) at https://github.com/teto/nixops/tree/qemu_agent

@Izorkin

Izorkin commented May 11, 2018

@teto Thanks. Bridge mode works.
How do I set a custom MAC address for the bridge with the config option deployment.libvirtd.networks.bridge.mac?
The guest agent does not work:

Connection to 192.168.0.139 closed by remote host.
test01> waiting for the machine to finish rebooting...parsing IP
libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
........................................................................................

<link xlink:href="https://libvirt.org/storage.html">storage pool</link> which
usually corresponds to the <filename>/var/lib/libvirt/images</filename>
directory. You can choose another storage pool with the
<code>deployment.libvirtd.storagePool</code> option:
Contributor

When trying out this PR, I got:

libvirt: Storage Driver error : Storage pool not found: no storage pool with matching name 'default'

This looks like this: simon3z/virt-deploy#8

And indeed for me:

% virsh pool-list 
 Name                 State      Autostart 
-------------------------------------------

Can we make this work also without the default storage pool or do we need to instruct the user to set this default storage pool up?

Contributor

@nh2 nh2 left a comment

Also when trying this out, I get this:

node-3..> uploading disk image...
node-3..> starting...
libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
node-3..> .libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
node-3..> .libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected
node-3..> .libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not connected

Looking at the machine with a shell in virt-manager, qemu-guest-agent.service doesn't seem to exist in systemd yet (I think only the base image is running at that point).

Not sure if I'm using it wrong.

@@ -334,7 +334,8 @@ def run_command(self, command, **kwargs):
         # mainly operating in a chroot environment.
         if self.state == self.RESCUE:
             command = "export LANG= LC_ALL= LC_TIME=; " + command
-        return self.ssh.run_command(command, self.get_ssh_flags(), **kwargs)
+
+        return self.ssh.run_command(command, **kwargs)
Contributor

Why is self.get_ssh_flags() removed here?

Member Author

See the commit message for 574ba39. Although this is quick-and-dirty WIP code, I always try to write comprehensible commit messages, so git blame can help people understand changes.

MachineState.run_command() passes SSH flags to self.ssh.run_command().
However, self.get_ssh_flags() is already registered as a ssh_flag_fun in the
class __init__() function, so ssh_util.SSH already uses it to get the flags
when initiating a connection. This led to the SSH flags being duplicated, which
causes an error for some flags (e.g. the -J flag, which can only be specified once).

Contributor

My bad -- I didn't pay enough attention. Great commit message!
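
The duplication described in that commit message can be reproduced with a toy model. This is only a simplified stand-in, not the real `ssh_util.SSH`: the class, `register_flag_fun`, `build_command`, and the host names `hypervisor` and `root@guest` are all hypothetical.

```python
class SSH:
    """Simplified stand-in for an SSH helper that collects flags via callbacks."""

    def __init__(self):
        self._flag_funs = []

    def register_flag_fun(self, fun):
        self._flag_funs.append(fun)

    def build_command(self, command, extra_flags=()):
        flags = []
        # Flags from every registered callback are always included...
        for fun in self._flag_funs:
            flags.extend(fun())
        # ...so passing the same flags again as extra_flags duplicates them.
        flags.extend(extra_flags)
        return ["ssh"] + flags + ["root@guest", command]


def get_ssh_flags():
    return ["-J", "hypervisor"]


ssh = SSH()
ssh.register_flag_fun(get_ssh_flags)

duplicated = ssh.build_command("true", get_ssh_flags())  # before the change
fixed = ssh.build_command("true")                        # after the change
print(duplicated.count("-J"), fixed.count("-J"))  # prints "2 1"
```

Dropping the explicit argument, as the diff above does, leaves the registered callback as the single source of flags.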

 return super_flags + ["-o", "StrictHostKeyChecking=no",
-                      "-i", self.get_ssh_private_key_file()]
+                      "-i", self.get_ssh_private_key_file(),
+                      "-J", jumphost]
Contributor

For me, jumphost seems to be "", so I get Invalid -J argument.
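
A possible guard, sketched under the assumption that an empty or missing jump host should simply mean "connect directly". `ssh_flags` is a hypothetical free function standing in for the method in the diff, and the key path and host name are made up.

```python
def ssh_flags(private_key_file, jumphost=None):
    """Build SSH flags, adding -J only when a jump host is actually set."""
    flags = ["-o", "StrictHostKeyChecking=no", "-i", private_key_file]
    if jumphost:
        # An empty string would make ssh fail with "Invalid -J argument".
        flags += ["-J", jumphost]
    return flags


print(ssh_flags("/tmp/id_example", ""))
print(ssh_flags("/tmp/id_example", "root@hypervisor"))
```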

first_iface = next(v for k, v in ifaces.iteritems()
if v.get('hwaddr', None) == first_iface_mac)

addrs = first_iface.get('addrs', [])
Contributor

What is the point behind this logic?

You're saying "if .addrs doesn't exist, default to []", and then below immediately access [0] on it, which will fail if the default you have given actually happens (because [][0] can't work).

Member Author

@mbrgm mbrgm Jul 15, 2018

Yes, you're absolutely right. See my other comment for an explanation: this PR is just a quick-and-dirty POC.

@mbrgm
Member Author

mbrgm commented Jul 15, 2018

@nh2 As the title says, this is WIP code, which I only pushed after multiple requests from people really wanting to see it. Neither did I put much work in it (currently, I'm not even using this, as my time for NixOS is fairly limited), nor is it any better than quick-and-dirty WIP code.

I tested whether this could work the way I imagined, created a POC for my extremely limited use case, and pushed it so that some people could take a look. If you want to improve it, feel free to build upon it -- but as I said, I currently have no time to work on NixOS, so I won't be able to incorporate your suggestions.

@mbrgm
Member Author

mbrgm commented Jul 15, 2018

@nh2 Not wanting to be rude... this PR has just developed its own dynamics, while I got quite short on time for bringing this from POC to production-ready in the meantime ;-). So if anyone wants to move this forward and doesn't want to wait until I'm able to get back to it -- feel free :-).

@nh2
Contributor

nh2 commented Jul 16, 2018

@mbrgm No problem! I think it's great that you share your WIP code, that's exactly how it should be done. I'm testing it because I hope it'll give me some hints on how to get around https://github.com/NixOS/nixops/issues/973.

My comments carry no negative connotation, I'm just writing down what I find to work / not work to save myself or others some time for later :)

@sorki
Member

sorki commented Aug 28, 2018

I've added some commits on top of the networking part:

  • I'm able to use services.qemuGuest from NixOS
  • suppression of libvirt errors printed to stdout
  • reworked the IP lookup a bit (to exclude loopback, APIPA and link-local v6 addresses)

This is pretty good work overall and not at all an "extremely limited use case" :)

vpsfreecz/nixops@vpsadminos...vpsfreecz:libvirt
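
The kind of address filtering mentioned above can be sketched with Python's standard `ipaddress` module. This is not the code from the linked branch: `is_usable` and `pick_address` are hypothetical names, and the sample addresses are made up.

```python
import ipaddress


def is_usable(addr):
    """True unless the address is loopback or link-local.

    For IPv4, link-local is 169.254.0.0/16 (APIPA); for IPv6 it is fe80::/10.
    """
    ip = ipaddress.ip_address(addr)
    return not (ip.is_loopback or ip.is_link_local)


def pick_address(addrs):
    """Return the first usable address from a list of strings, or None."""
    for addr in addrs:
        if is_usable(addr):
            return addr
    return None


candidates = ["127.0.0.1", "169.254.10.2", "fe80::5054:ff:fe12:3456", "192.168.122.45"]
print(pick_address(candidates))
```

Filtering this way avoids provisioning against an address that is only reachable from inside the guest or its local link.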

@teto
Member

teto commented Jul 17, 2019

@sorki The underlying PR this was based on just got merged. I would be interested in seeing your PR merged too!

teto added a commit to teto/nixops-libvirtd that referenced this pull request Sep 13, 2019
An update of NixOS/nixops#922

teto added a commit to teto/nixops-libvirtd that referenced this pull request Oct 17, 2019
An update of NixOS/nixops#922

@grahamc
Member

grahamc commented Mar 26, 2020

Hello!

Thank you for this PR.

In the past several months, some major changes have taken place in
NixOps:

  1. Backends have been removed, preferring a plugin-based architecture.
    Here are some of them:

  2. NixOps Core has been updated to be Python 3 only, and at the
    same time, MyPy type hints have been added and are now strictly
    required during CI.

This is all culminating in what I hope will be a NixOps 2.0
release. There is a tracking issue for that: #1242. It is possible that
more core changes will be made to NixOps for this release, with a
focus on simplifying NixOps core and making it easier to use and work
on.

My hope is that by adding types and more thorough automated testing,
it will be easier for contributors to make improvements, and for
contributions like this one to merge in the future.

However, because of the major changes, it has become likely that this
PR cannot merge right now as it is. The backlog of now-unmergeable PRs
makes it hard to see which ones are being kept up to date.

If you would like to see this merge, please bring it up to date with
master and reopen it. If the or mypy type checking fails, please
correct any issues and then reopen it. I will be looking primarily at
open PRs whose tests are all green.

Thank you again for the work you've done here, I am sorry to be
closing it now.

Graham

@grahamc grahamc closed this Mar 26, 2020