Merge branch 'nfsd-5.8' of git://linux-nfs.org/~cel/cel-2.6 into for-5.8-incoming

Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
J. Bruce Fields committed May 21, 2020
2 parents 746c623 + f245397 commit 6670ee2
Showing 655 changed files with 7,395 additions and 3,526 deletions.
14 changes: 14 additions & 0 deletions Documentation/core-api/printk-formats.rst
@@ -112,6 +112,20 @@ used when printing stack backtraces. The specifier takes into
consideration the effect of compiler optimisations which may occur
when tail-calls are used and marked with the noreturn GCC attribute.

Probed Pointers from BPF / tracing
----------------------------------

::

%pks kernel string
%pus user string

The ``k`` and ``u`` specifiers are used for printing previously probed memory,
either kernel memory (k) or user memory (u). The subsequent ``s`` specifier
results in printing a string. When used directly with regular vsnprintf(), the
(k) and (u) annotations are ignored; however, when used from BPF's
bpf_trace_printk(), for example, the memory being pointed to is read without
faulting.
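
As an illustration only (not part of this patch set), a minimal BPF program
could use ``%pus`` to print a user-space string. The tracepoint name and the
context-structure layout below are assumptions that should be verified against
the running kernel, and a v5.8+ kernel is assumed so that bpf_trace_printk()
accepts these specifiers::

    // SPDX-License-Identifier: GPL-2.0
    /* Hypothetical example: print the user-supplied filename passed to
     * openat(2) using %pus (a user string, probed without faulting). */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    /* Assumed layout of the syscalls:sys_enter_openat tracepoint context;
     * check events/syscalls/sys_enter_openat/format under tracefs. */
    struct sys_enter_openat_args {
            unsigned long long unused;
            long syscall_nr;
            long dfd;
            const char *filename;   /* points into user memory */
            long flags;
            long mode;
    };

    SEC("tracepoint/syscalls/sys_enter_openat")
    int trace_openat(struct sys_enter_openat_args *ctx)
    {
            /* %pus probes user memory; %pks would be used for a string
             * living in kernel memory instead. */
            bpf_printk("openat filename: %pus", ctx->filename);
            return 0;
    }

The formatted output appears in the usual bpf_trace_printk() destination,
``/sys/kernel/debug/tracing/trace_pipe``.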

Kernel Pointers
---------------

3 changes: 2 additions & 1 deletion Documentation/devicetree/bindings/dma/fsl-edma.txt
@@ -10,7 +10,8 @@ Required properties:
- compatible :
- "fsl,vf610-edma" for eDMA used similar to that on Vybrid vf610 SoC
- "fsl,imx7ulp-edma" for eDMA2 used similar to that on i.mx7ulp
- "fsl,fsl,ls1028a-edma" for eDMA used similar to that on Vybrid vf610 SoC
- "fsl,ls1028a-edma" followed by "fsl,vf610-edma" for eDMA used on the
LS1028A SoC.
- reg : Specifies base physical address(s) and size of the eDMA registers.
The 1st region is eDMA control register's address and size.
The 2nd and the 3rd regions are programmable channel multiplexing
4 changes: 2 additions & 2 deletions Documentation/networking/devlink/ice.rst
@@ -61,8 +61,8 @@ The ``ice`` driver reports the following versions
- running
- ICE OS Default Package
- The name of the DDP package that is active in the device. The DDP
package is loaded by the driver during initialization. Each varation
of DDP package shall have a unique name.
package is loaded by the driver during initialization. Each
variation of the DDP package has a unique name.
* - ``fw.app``
- running
- 1.3.1.0
37 changes: 30 additions & 7 deletions Documentation/usb/raw-gadget.rst
@@ -27,9 +27,8 @@ differences are:
3. Raw Gadget provides a way to select a UDC device/driver to bind to,
while GadgetFS currently binds to the first available UDC.

4. Raw Gadget uses predictable endpoint names (handles) across different
UDCs (as long as UDCs have enough endpoints of each required transfer
type).
4. Raw Gadget explicitly exposes information about endpoints addresses and
capabilities allowing a user to write UDC-agnostic gadgets.

5. Raw Gadget has an ioctl-based interface instead of a filesystem-based one.

@@ -50,12 +49,36 @@ The typical usage of Raw Gadget looks like:
Raw Gadget and react to those depending on what kind of USB device
needs to be emulated.

Note that some UDC drivers have fixed addresses assigned to endpoints, and
therefore arbitrary endpoint addresses cannot be used in the descriptors.
Nevertheless, Raw Gadget provides a UDC-agnostic way to write USB gadgets.
Once a USB_RAW_EVENT_CONNECT event is received via USB_RAW_IOCTL_EVENT_FETCH,
the USB_RAW_IOCTL_EPS_INFO ioctl can be used to find out information about
the endpoints that the UDC driver has. Based on that information, the user
must choose the UDC endpoints to use for the gadget being emulated and
properly assign addresses in the endpoint descriptors, as sketched below.
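
As an illustration (not part of the kernel documentation, and the field names
should be double-checked against ``include/uapi/linux/usb/raw_gadget.h``), a
gadget could pick a bulk IN endpoint roughly as follows, where ``fd`` is the
open ``/dev/raw-gadget`` file descriptor and a CONNECT event has already been
fetched::

    /* Sketch: ask the UDC which endpoints it offers and pick one capable
     * of bulk IN transfers; the returned address is what the gadget then
     * puts into its endpoint descriptor. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/usb/raw_gadget.h>

    static int find_bulk_in_ep(int fd)
    {
            struct usb_raw_eps_info info;
            int i, num;

            memset(&info, 0, sizeof(info));
            /* Returns the number of endpoints reported by the UDC driver. */
            num = ioctl(fd, USB_RAW_IOCTL_EPS_INFO, &info);
            if (num < 0)
                    return -1;
            for (i = 0; i < num; i++) {
                    if (info.eps[i].caps.type_bulk && info.eps[i].caps.dir_in) {
                            printf("using ep '%s', addr %u\n",
                                   (char *)info.eps[i].name, info.eps[i].addr);
                            return (int)info.eps[i].addr;
                    }
            }
            return -1;      /* no suitable endpoint on this UDC */
    }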

You can find usage examples (along with a test suite) here:

https://github.com/xairy/raw-gadget

Internal details
~~~~~~~~~~~~~~~~

Currently every endpoint read/write ioctl submits a USB request and waits
until its completion. This is the desired mode for coverage-guided fuzzing
(as we'd like all USB request processing to happen during the lifetime of a
syscall), and must be kept in the implementation. (This might be slow for
real-world applications, hence the O_NONBLOCK improvement suggested below.)

Potential future improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Implement ioctls for setting/clearing halt status on endpoints.

- Reporting more events (suspend, resume, etc.) through
USB_RAW_IOCTL_EVENT_FETCH.
- Report more events (suspend, resume, etc.) through USB_RAW_IOCTL_EVENT_FETCH.

- Support O_NONBLOCK I/O.

- Support USB 3 features (accept SS endpoint companion descriptor when
enabling endpoints; allow providing stream_id for bulk transfers).

- Support ISO transfer features (expose frame_number for completed requests).
2 changes: 2 additions & 0 deletions Documentation/virt/kvm/index.rst
@@ -28,3 +28,5 @@ KVM
arm/index

devices/index

running-nested-guests
276 changes: 276 additions & 0 deletions Documentation/virt/kvm/running-nested-guests.rst
@@ -0,0 +1,276 @@
==============================
Running nested guests with KVM
==============================

A nested guest is a guest that runs inside another guest (which may itself
use KVM or a different hypervisor). The straightforward example is a KVM
guest that in turn runs on a KVM guest (the rest of this document is built
on this example)::

.----------------. .----------------.
| | | |
| L2 | | L2 |
| (Nested Guest) | | (Nested Guest) |
| | | |
|----------------'--'----------------|
| |
| L1 (Guest Hypervisor) |
| KVM (/dev/kvm) |
| |
.------------------------------------------------------.
| L0 (Host Hypervisor) |
| KVM (/dev/kvm) |
|------------------------------------------------------|
| Hardware (with virtualization extensions) |
'------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1, this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
s390x, ppc64 and other architectures are likely to have
a different design for nesting.

For example, s390x always has an LPAR (LogicalPARtition)
hypervisor running on bare metal, adding another layer and
resulting in at least four levels in a nested setup — L0 (bare
metal, running the LPAR hypervisor), L1 (host hypervisor), L2
(guest hypervisor), L3 (nested guest).

This document will stick with the three-level terminology (L0,
L1, and L2) for all architectures; and will largely focus on
x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
systems (OSes). Instead of renting multiple VMs from a Cloud
Provider, using nested KVM lets you rent a large enough "guest
hypervisor" (level-1 guest). This in turn allows you to create
multiple nested guests (level-2 guests), running different OSes, on
which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.19 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD. (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.19, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, you can add it in a config file, as
shown below:

1. On the bare metal host (L0), list the kernel modules and ensure that
the KVM modules are loaded::

$ lsmod | grep -i kvm
kvm_intel 133627 0
kvm 435079 1 kvm_intel

2. Show information for the ``kvm_intel`` module::

$ modinfo kvm_intel | grep -i nested
parm: nested:bool

3. For the nested KVM configuration to persist across reboots, place the
below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
doesn't exist)::

$ cat /etc/modprobe.d/kvm_intel.conf
options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

$ sudo rmmod kvm-intel
$ sudo modprobe kvm-intel

5. Verify if the ``nested`` parameter for KVM is enabled::

$ cat /sys/module/kvm_intel/parameters/nested
Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.
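
As an example (the config file name below is just a convention, and on AMD
the parameter is an integer, so it reads back as ``1``)::

    $ cat /etc/modprobe.d/kvm_amd.conf
    options kvm-amd nested=1

    $ cat /sys/module/kvm_amd/parameters/nested
    1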


Additional nested-related kernel parameters (x86)
-------------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor or
newer, which has more recent hardware virtualization extensions), the
following additional features will also be enabled by default on your bare
metal host (L0): "Shadow VMCS (Virtual Machine Control Structure)" and APIC
virtualization. Parameters for Intel hosts::

$ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
Y

$ cat /sys/module/kvm_intel/parameters/enable_apicv
Y

$ cat /sys/module/kvm_intel/parameters/ept
Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slower,
ensure the above are enabled (particularly
``enable_shadow_vmcs`` and ``ept``).


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

$ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or, for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

$ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

The guest hypervisor will then be capable of running a nested guest with
accelerated KVM.


Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
s390x::

$ rmmod kvm
$ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
with the ``nested`` parameter; i.e. to be able to enable
``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
feature — with QEMU, this can be done by using "host passthrough"
(via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

$ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.
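
With libvirt, for instance, such a migration could be triggered along these
lines (the domain name ``l1-guest`` and the destination URI are placeholders)::

    $ virsh migrate --live --persistent l1-guest qemu+ssh://dest-host/system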

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.
Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems but may fail once guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
bare metal host.

- Migrating a nested guest (L2) to a bare metal host.

Reporting bugs from nested setups
-----------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in tedious back-n-forth between the bug
reporter and the bug fixer.

- Mention that you are in a "nested" setup. If you are running any kind
of "nesting" at all, say so. Unfortunately, this needs to be called
out because when reporting bugs, people tend to forget to even
*mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM. Sometimes people do not
have KVM enabled for their guest hypervisor (L1), which results in them
running with pure emulation (what QEMU calls "TCG") while believing they
are running nested KVM. This confuses "nested virt" (which could also mean
QEMU on KVM) with "nested KVM" (KVM on KVM). A quick check is sketched
below.
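
A quick way to check this from inside the L1 guest (the exact commands are
only a suggestion)::

    $ lsmod | grep kvm     # are the KVM modules loaded in L1?
    $ ls -l /dev/kvm       # is the KVM device node present in L1?

Without ``/dev/kvm`` in L1, QEMU can only use TCG there; starting QEMU with
``-accel kvm`` (or ``-enable-kvm``) makes such a misconfiguration fail
loudly instead.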

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

- Kernel, libvirt, and QEMU version from L0

- Kernel, libvirt and QEMU version from L1

- QEMU command-line of L1 -- when using libvirt, you'll find it here:
``/var/log/libvirt/qemu/instance.log``

- QEMU command-line of L2 -- as above, when using libvirt, get the
complete libvirt-generated QEMU command-line

- ``cat /proc/cpuinfo`` from L0

- ``cat /proc/cpuinfo`` from L1

- ``lscpu`` from L0

- ``lscpu`` from L1

- Full ``dmesg`` output from L0

- Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both the below commands, ``x86info`` and ``dmidecode``, should be
available on most Linux distributions with the same name:

- Output of: ``x86info -a`` from L0

- Output of: ``x86info -a`` from L1

- Output of: ``dmidecode`` from L0

- Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the following is also
recommended:

- ``/proc/sysinfo`` from L1; this will also include the info from L0