Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "4.11 is going to be a relatively large release for KVM, with a little
  over 200 commits and noteworthy changes for most architectures.

  ARM:
   - GICv3 save/restore
   - cache flushing fixes
   - working MSI injection for GICv3 ITS
   - physical timer emulation

  MIPS:
   - various improvements under the hood
   - support for SMP guests
   - a large rewrite of MMU emulation. KVM MIPS can now use MMU
     notifiers to support copy-on-write, KSM, idle page tracking,
     swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
     also supported, so that writes to some memory regions can be
     treated as MMIO. The new MMU also paves the way for hardware
     virtualization support.

  PPC:
   - support for POWER9 using the radix-tree MMU for host and guest
   - resizable hashed page table
   - bugfixes.

  s390:
   - expose more features to the guest
   - more SIMD extensions
   - instruction execution protection
   - ESOP2

  x86:
   - improved hashing in the MMU
   - faster PageLRU tracking for Intel CPUs without EPT A/D bits
   - some refactoring of nested VMX entry/exit code, preparing for live
     migration support of nested hypervisors
   - expose yet another AVX512 CPUID bit
   - host-to-guest PTP support
   - refactoring of interrupt injection, with some optimizations thrown
     in and some duct tape removed.
   - remove lazy FPU handling
   - optimizations of user-mode exits
   - optimizations of vcpu_is_preempted() for KVM guests

  generic:
   - alternative signaling mechanism that doesn't pound on
     tsk->sighand->siglock"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
  x86/paravirt: Change vcp_is_preempted() arg type to long
  KVM: VMX: use correct vmcs_read/write for guest segment selector/base
  x86/kvm/vmx: Defer TR reload after VM exit
  x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
  x86/kvm/vmx: Simplify segment_base()
  x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
  x86/kvm/vmx: Don't fetch the TSS base from the GDT
  x86/asm: Define the kernel TSS limit in a macro
  kvm: fix page struct leak in handle_vmon
  KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
  KVM: Return an error code only as a constant in kvm_get_dirty_log()
  KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
  KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
  KVM: x86: remove code for lazy FPU handling
  KVM: race-free exit from KVM_RUN without POSIX signals
  KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
  KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
  KVM: Support vCPU-based gfn->hva cache
  KVM: use separate generations for each address space
  ...
torvalds committed Feb 23, 2017
2 parents 5066e4a + dd0fd8b commit fd7e9a8
Showing 110 changed files with 7,277 additions and 2,968 deletions.
138 changes: 128 additions & 10 deletions Documentation/virtual/kvm/api.txt
@@ -2061,6 +2061,8 @@ registers, find a list below:
MIPS | KVM_REG_MIPS_LO | 64
MIPS | KVM_REG_MIPS_PC | 64
MIPS | KVM_REG_MIPS_CP0_INDEX | 32
MIPS | KVM_REG_MIPS_CP0_ENTRYLO0 | 64
MIPS | KVM_REG_MIPS_CP0_ENTRYLO1 | 64
MIPS | KVM_REG_MIPS_CP0_CONTEXT | 64
MIPS | KVM_REG_MIPS_CP0_USERLOCAL | 64
MIPS | KVM_REG_MIPS_CP0_PAGEMASK | 32
@@ -2071,9 +2073,11 @@ registers, find a list below:
MIPS | KVM_REG_MIPS_CP0_ENTRYHI | 64
MIPS | KVM_REG_MIPS_CP0_COMPARE | 32
MIPS | KVM_REG_MIPS_CP0_STATUS | 32
MIPS | KVM_REG_MIPS_CP0_INTCTL | 32
MIPS | KVM_REG_MIPS_CP0_CAUSE | 32
MIPS | KVM_REG_MIPS_CP0_EPC | 64
MIPS | KVM_REG_MIPS_CP0_PRID | 32
MIPS | KVM_REG_MIPS_CP0_EBASE | 64
MIPS | KVM_REG_MIPS_CP0_CONFIG | 32
MIPS | KVM_REG_MIPS_CP0_CONFIG1 | 32
MIPS | KVM_REG_MIPS_CP0_CONFIG2 | 32
@@ -2148,6 +2152,12 @@ patterns depending on whether they're 32-bit or 64-bit registers:
0x7020 0000 0001 00 <reg:5> <sel:3> (32-bit)
0x7030 0000 0001 00 <reg:5> <sel:3> (64-bit)

Note: KVM_REG_MIPS_CP0_ENTRYLO0 and KVM_REG_MIPS_CP0_ENTRYLO1 are the MIPS64
versions of the EntryLo registers regardless of the word size of the host
hardware, host kernel, guest, and whether XPA is present in the guest, i.e.
with the RI and XI bits (if they exist) in bits 63 and 62 respectively, and
the PFNX field starting at bit 30.
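
[Editor's illustration, not part of api.txt: a sketch of reading EntryLo0
through KVM_GET_ONE_REG. Only the KVM_* names and the id layout come from
the text above; vcpu_fd and read_entrylo0() are hypothetical.]

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Build a 64-bit MIPS CP0 one-reg id from the pattern above:
 * 0x7030 0000 0001 00 <reg:5> <sel:3>.  EntryLo0 is reg 2, sel 0. */
static int read_entrylo0(int vcpu_fd, __u64 *val)
{
	struct kvm_one_reg reg = {
		.id   = 0x7030000000010000ULL | (2 << 3) | 0,
		.addr = (__u64)(unsigned long)val,
	};

	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}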

MIPS KVM control registers (see above) have the following id bit patterns:
0x7030 0000 0002 <reg:16>

@@ -2443,18 +2453,20 @@ are, it will do nothing and return an EBUSY error.
The parameter is a pointer to a 32-bit unsigned integer variable
containing the order (log base 2) of the desired size of the hash
table, which must be between 18 and 46. On successful return from the
-ioctl, it will have been updated with the order of the hash table that
-was allocated.
+ioctl, the value will not be changed by the kernel.

If no hash table has been allocated when any vcpu is asked to run
(with the KVM_RUN ioctl), the host kernel will allocate a
default-sized hash table (16 MB).

If this ioctl is called when a hash table has already been allocated,
-the kernel will clear out the existing hash table (zero all HPTEs) and
-return the hash table order in the parameter. (If the guest is using
-the virtualized real-mode area (VRMA) facility, the kernel will
-re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
+with a different order from the existing hash table, the existing hash
+table will be freed and a new one allocated. If this ioctl is called
+when a hash table has already been allocated of the same order as
+specified, the kernel will clear out the existing hash table (zero all
+HPTEs). In either case, if the guest is using the virtualized
+real-mode area (VRMA) facility, the kernel will re-create the VRMA
+HPTEs on the next KVM_RUN of any vcpu.
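
[Editor's sketch, assuming this section describes the KVM_PPC_ALLOCATE_HTAB
vm ioctl, which the surrounding api.txt context suggests; vm_fd is assumed
to be an open VM file descriptor.]

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Sketch: allocate a 2^24-byte (16 MB) HPT before the first KVM_RUN. */
static int allocate_htab(int vm_fd)
{
	__u32 order = 24;	/* log2 of the HPT size; must be 18..46 */

	if (ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, &order) < 0) {
		perror("KVM_PPC_ALLOCATE_HTAB");
		return -1;
	}
	return 0;
}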

4.77 KVM_S390_INTERRUPT

@@ -3177,7 +3189,7 @@ of IOMMU pages.

The rest of functionality is identical to KVM_CREATE_SPAPR_TCE.

-4.98 KVM_REINJECT_CONTROL
+4.99 KVM_REINJECT_CONTROL

Capability: KVM_CAP_REINJECT_CONTROL
Architectures: x86
@@ -3201,7 +3213,7 @@ struct kvm_reinject_control {
pit_reinject = 0 (!reinject mode) is recommended, unless running an old
operating system that uses the PIT for timing (e.g. Linux 2.4.x).

-4.99 KVM_PPC_CONFIGURE_V3_MMU
+4.100 KVM_PPC_CONFIGURE_V3_MMU

Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
Architectures: ppc
@@ -3232,7 +3244,7 @@ process table, which is in the guest's space. This field is formatted
as the second doubleword of the partition table entry, as defined in
the Power ISA V3.00, Book III section 5.7.6.1.

-4.100 KVM_PPC_GET_RMMU_INFO
+4.101 KVM_PPC_GET_RMMU_INFO

Capability: KVM_CAP_PPC_RADIX_MMU
Architectures: ppc
@@ -3266,6 +3278,101 @@ The ap_encodings gives the supported page sizes and their AP field
encodings, encoded with the AP value in the top 3 bits and the log
base 2 of the page size in the bottom 6 bits.

4.102 KVM_PPC_RESIZE_HPT_PREPARE

Capability: KVM_CAP_SPAPR_RESIZE_HPT
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_ppc_resize_hpt (in)
Returns: 0 on successful completion,
>0 if a new HPT is being prepared, the value is an estimated
number of milliseconds until preparation is complete
         -EFAULT if struct kvm_ppc_resize_hpt cannot be read,
-EINVAL if the supplied shift or flags are invalid
-ENOMEM if unable to allocate the new HPT
-ENOSPC if there was a hash collision when moving existing
HPT entries to the new HPT
-EIO on other error conditions

Used to implement the PAPR extension for runtime resizing of a guest's
Hashed Page Table (HPT). Specifically this starts, stops or monitors
the preparation of a new potential HPT for the guest, essentially
implementing the H_RESIZE_HPT_PREPARE hypercall.

If called with shift > 0 when there is no pending HPT for the guest,
this begins preparation of a new pending HPT of size 2^(shift) bytes.
It then returns a positive integer with the estimated number of
milliseconds until preparation is complete.

If called when there is a pending HPT whose size does not match that
requested in the parameters, it discards the existing pending HPT and
creates a new one as above.

If called when there is a pending HPT of the size requested, this ioctl will:
* If preparation of the pending HPT is already complete, return 0
* If preparation of the pending HPT has failed, return an error
code, then discard the pending HPT.
* If preparation of the pending HPT is still in progress, return an
estimated number of milliseconds until preparation is complete.

If called with shift == 0, it discards any currently pending HPT and
returns 0 (i.e. cancels any in-progress preparation).

flags is reserved for future expansion; currently, setting any bit in
flags will result in -EINVAL.

Normally this will be called repeatedly with the same parameters until
it returns <= 0. The first call will initiate preparation; subsequent
calls will monitor preparation until it completes or fails.

struct kvm_ppc_resize_hpt {
	__u64 flags;
	__u32 shift;
	__u32 pad;
};
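
[A non-normative sketch of the polling loop described above; vm_fd and the
target shift are assumptions of this illustration.]

#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Sketch: prepare a new HPT and wait until it is ready. */
static int prepare_hpt(int vm_fd, __u32 shift)
{
	struct kvm_ppc_resize_hpt rhpt = { .flags = 0, .shift = shift };
	int ret;

	do {
		ret = ioctl(vm_fd, KVM_PPC_RESIZE_HPT_PREPARE, &rhpt);
		if (ret > 0)
			usleep(ret * 1000);	/* estimated ms until ready */
	} while (ret > 0);

	return ret;	/* 0: prepared; -1: failed, see errno */
}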

4.103 KVM_PPC_RESIZE_HPT_COMMIT

Capability: KVM_CAP_SPAPR_RESIZE_HPT
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_ppc_resize_hpt (in)
Returns: 0 on successful completion,
         -EFAULT if struct kvm_ppc_resize_hpt cannot be read,
-EINVAL if the supplied shift or flags are invalid
         -ENXIO if there is no pending HPT, or the pending HPT doesn't
have the requested size
-EBUSY if the pending HPT is not fully prepared
-ENOSPC if there was a hash collision when moving existing
HPT entries to the new HPT
-EIO on other error conditions

Used to implement the PAPR extension for runtime resizing of a guest's
Hashed Page Table (HPT). Specifically this requests that the guest be
transferred to working with the new HPT, essentially implementing the
H_RESIZE_HPT_COMMIT hypercall.

This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
returned 0 with the same parameters. In other cases
KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
-EBUSY, though others may be possible if the preparation was started,
but failed).

This will have undefined effects on the guest if it has not already
placed itself in a quiescent state where no vcpu will make MMU-enabled
memory accesses.

On successful completion, the pending HPT will become the guest's active
HPT and the previous HPT will be discarded.

On failure, the guest will still be operating on its previous HPT.

struct kvm_ppc_resize_hpt {
	__u64 flags;
	__u32 shift;
	__u32 pad;
};
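
[Continuing the sketch above: once KVM_PPC_RESIZE_HPT_PREPARE has returned
0 for a given shift, the commit step might look like the following.
Illustrative only; prepare_hpt() is the hypothetical helper from the
previous example.]

/* Sketch: full resize flow -- prepare, then commit with the same shift. */
static int resize_hpt(int vm_fd, __u32 shift)
{
	struct kvm_ppc_resize_hpt rhpt = { .flags = 0, .shift = shift };

	if (prepare_hpt(vm_fd, shift) < 0)
		return -1;	/* guest keeps its previous HPT */
	/* Guest must be quiescent (no MMU-enabled accesses) past here. */
	return ioctl(vm_fd, KVM_PPC_RESIZE_HPT_COMMIT, &rhpt);
}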

5. The kvm_run structure
------------------------

@@ -3282,7 +3389,18 @@ struct kvm_run {
Request that KVM_RUN return when it becomes possible to inject external
interrupts into the guest. Useful in conjunction with KVM_INTERRUPT.

-	__u8 padding1[7];
+	__u8 immediate_exit;

This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN
exits immediately, returning -EINTR. In the common scenario where a
signal is used to "kick" a VCPU out of KVM_RUN, this field can be used
to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
Rather than blocking the signal outside KVM_RUN, userspace can set up
a signal handler that sets run->immediate_exit to a non-zero value.

This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.
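
[Editor's sketch of the pattern just described; vcpu_fd, the mmap'ed run
structure, the choice of SIGUSR1, and handle_kick() are all assumptions,
not part of the API.]

static struct kvm_run *run;	/* mmap'ed from vcpu_fd elsewhere */

static void kick_handler(int sig)
{
	run->immediate_exit = 1;
}

	/* In the vcpu thread, with SIGUSR1 left unblocked: */
	signal(SIGUSR1, kick_handler);
	for (;;) {
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno == EINTR) {
			/* Kicked: either the signal interrupted KVM_RUN, or
			 * immediate_exit caught it before the vcpu entered.
			 * Clear the flag only after consuming the request. */
			handle_kick();
			run->immediate_exit = 0;
			continue;
		}
		/* otherwise dispatch on run->exit_reason ... */
	}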

+	__u8 padding1[6];

/* out */
__u32 exit_reason;
11 changes: 8 additions & 3 deletions Documentation/virtual/kvm/devices/arm-vgic-v3.txt
@@ -118,7 +118,7 @@ Groups:
-EBUSY: One or more VCPUs are running


-KVM_DEV_ARM_VGIC_CPU_SYSREGS
+KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
Attributes:
The attr field of kvm_device_attr encodes two values:
bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 |
@@ -139,13 +139,15 @@ Groups:
All system regs accessed through this API are (rw, 64-bit) and
kvm_device_attr.addr points to a __u64 value.

-KVM_DEV_ARM_VGIC_CPU_SYSREGS accesses the CPU interface registers for the
+KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the
CPU specified by the mpidr field.

CPU interface registers access is not implemented for AArch32 mode.
Error -ENXIO is returned when accessed in AArch32 mode.
Errors:
  -ENXIO: Getting or setting this register is not yet supported
  -EBUSY: VCPU is running
-  -EINVAL: Invalid mpidr supplied
+  -EINVAL: Invalid mpidr or register value supplied


KVM_DEV_ARM_VGIC_GRP_NR_IRQS
@@ -204,3 +206,6 @@ Groups:
architecture defined MPIDR, and the field is encoded as follows:
| 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
| Aff3 | Aff2 | Aff1 | Aff0 |
Errors:
  -EINVAL: vINTID is not a multiple of 32 or the info field is not
           VGIC_LEVEL_INFO_LINE_LEVEL
35 changes: 35 additions & 0 deletions Documentation/virtual/kvm/hypercalls.txt
@@ -81,3 +81,38 @@ the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
is used in the hypercall for future use.


6. KVM_HC_CLOCK_PAIRING
------------------------
Architecture: x86
Status: active
Purpose: Hypercall used to synchronize host and guest clocks.
Usage:

a0: guest physical address where host copies
"struct kvm_clock_offset" structure.

a1: clock_type; currently only KVM_CLOCK_PAIRING_WALLCLOCK (0)
is supported (corresponding to the host's CLOCK_REALTIME clock).

struct kvm_clock_pairing {
	__s64 sec;
	__s64 nsec;
	__u64 tsc;
	__u32 flags;
	__u32 pad[9];
};

Where:
* sec: seconds from clock_type clock.
* nsec: nanoseconds from clock_type clock.
* tsc: guest TSC value used to calculate sec/nsec pair
* flags: flags, unused (0) at the moment.

The hypercall lets a guest compute a precise timestamp across
host and guest. The guest can use the returned TSC value to
compute CLOCK_REALTIME for its clock at the same instant.

Returns KVM_EOPNOTSUPP if the host does not use the TSC clocksource,
or if the clock type is different from KVM_CLOCK_PAIRING_WALLCLOCK.
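
[Guest-kernel-side sketch of using this hypercall. kvm_hypercall2() and
rdtsc() exist in the x86 guest code; tsc_to_ns() stands in for the guest's
own TSC-to-nanoseconds scaling and is hypothetical.]

	struct kvm_clock_pairing pair;
	long ret;

	ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, __pa(&pair),
			     KVM_CLOCK_PAIRING_WALLCLOCK);
	if (ret == 0) {
		/* Host CLOCK_REALTIME at the instant the host read 'tsc',
		 * extrapolated to "now" using the guest's TSC. */
		u64 now_ns = pair.sec * NSEC_PER_SEC + pair.nsec +
			     tsc_to_ns(rdtsc() - pair.tsc);
		/* now_ns is CLOCK_REALTIME in nanoseconds */
	}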
31 changes: 27 additions & 4 deletions Documentation/virtual/kvm/locking.txt
@@ -26,9 +26,16 @@ sections.
Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast only if the
-shadow page table is present and it is caused by write-protect, that means
-we just need change the W bit of the spte.
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
+restore the saved R/X bits. This is described in more detail later below.
+
+2. Write-Protection: The SPTE is present and the fault is
+caused by write-protect. That means we just need to change the W bit of the
+spte.

What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
SPTE_MMU_WRITEABLE bit on the spte:
@@ -38,7 +45,8 @@ SPTE_MMU_WRITEABLE bit on the spte:
page write-protection.

On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, this
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
+restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
is safe because any change to these bits can be detected by cmpxchg.

But we need to carefully check these cases:
@@ -142,6 +150,21 @@ Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, so the race caused by fast page fault can be avoided.
See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
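
[A simplified sketch of the scheme; editor's illustration. The bit
positions and helper names below are made up to show the idea and are not
the kernel's actual constants.]

#include <stdbool.h>
#include <stdint.h>

#define RWX_MASK	0x7ULL			/* EPT R=bit0, W=bit1, X=bit2 */
#define RX_MASK		0x5ULL			/* just R and X */
#define SPECIAL_BIT	(1ULL << 62)		/* ignored bit used as a tag */
#define SAVED_SHIFT	52			/* stash R/X in ignored bits */

/* Clear RWX so the PTE faults, saving R/X and tagging the PTE. */
static uint64_t mark_for_access_track(uint64_t spte)
{
	uint64_t saved = (spte & RX_MASK) << SAVED_SHIFT;

	return (spte & ~RWX_MASK & ~(RX_MASK << SAVED_SHIFT)) |
	       saved | SPECIAL_BIT;
}

/* Fast page fault path: rebuild the present PTE; installed with cmpxchg
 * so a racing update is detected and the fault simply retries. */
static uint64_t restore_access_track(uint64_t spte, bool write_fault)
{
	uint64_t rx = (spte >> SAVED_SHIFT) & RX_MASK;

	spte &= ~(SPECIAL_BIT | (RX_MASK << SAVED_SHIFT));
	spte |= rx;
	if (write_fault)
		spte |= 0x2ULL;		/* W only set on a write access */
	return spte;
}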

3. Reference
------------

3 changes: 0 additions & 3 deletions arch/arm/include/asm/kvm_host.h
@@ -60,9 +60,6 @@ struct kvm_arch {
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;

-	/* Timer */
-	struct arch_timer_kvm timer;
-
/*
* Anything that is not used directly from assembly code goes
* here.
12 changes: 2 additions & 10 deletions arch/arm/include/asm/kvm_mmu.h
@@ -129,8 +129,7 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)

static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
kvm_pfn_t pfn,
-					       unsigned long size,
-					       bool ipa_uncached)
+					       unsigned long size)
{
/*
* If we are going to insert an instruction page and the icache is
Expand All @@ -150,18 +149,12 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
* and iterate over the range.
*/

-	bool need_flush = !vcpu_has_cache_enabled(vcpu) || ipa_uncached;
-
VM_BUG_ON(size & ~PAGE_MASK);

-	if (!need_flush && !icache_is_pipt())
-		goto vipt_cache;
-
while (size) {
void *va = kmap_atomic_pfn(pfn);

-		if (need_flush)
-			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+		kvm_flush_dcache_to_poc(va, PAGE_SIZE);

if (icache_is_pipt())
__cpuc_coherent_user_range((unsigned long)va,
@@ -173,7 +166,6 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
kunmap_atomic(va);
}

-vipt_cache:
if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
/* any kind of VIPT cache */
__flush_icache_all();