This simple manual explains some parts of KSM, and also describes how we use multiple EPT pointers and Virtualization Exception (#VE) handling.
Since #VE and VMFUNC are now optional and will not be enabled unless the CPU support it, you can now test under VMs with emulation for VMFUNC.
You may want to disable SECONDARY_EXEC_DESC_TABLE_EXITING in vcpu.c in secondary controls, otherwise it makes WinDBG go maniac. I have not investigated the root cause, but it keeps loading GDT and LDT all the time, which is _insane_.
The main initialization code is in ksm.c which manages all-cpus {,de}initialization, however, vcpu.c is to handle per-cpu initialization and is called from ksm.c on a high-level understanding, then exit.c handles events from the guest kernel.
Note: there are more things that happen during this transition, but to simplify things, only a few stuff is explained.
Note: Most ksm prefixed functions are either defined in ksm.c or ksm.h or main_KERNELNAME.c, all of vcpu prefixed functions are defined in vcpu.c and exit.c, vmx prefixed functions are mostly inline assembly (vmx.h in that case) or compiler-intrinsics or defined in assembler.
This is the flow:
driver entry (any process context) -> ksm_init() (later on or in another context): ksm_subvert() -> for each cpu: call __ksm_init_cpu __ksm_init_cpu(): set feature control MSR appropriately call vcpu_init(current cpu ptr) vcpu_init(vcpu) -> allocate needed stuff return __ksm_init_cpu (contd.): call __vmx_vminit(current cpu ptr) __vmx_vminit(vcpu) -> save guest state call vcpu_run(vcpu, rsp, guest_start_pointer) vcpu_run(vcpu, stack_ptr, guest_start_point) -> enter vmx root mode (load vmxon pointer) load vmcs clear vmcs launch state set cr0 fixed bits set cr4 fixed bits (VMXE) ... /* write important fields to VM Control structure: */ vmcs_write() vmcs_write() .... if not vmlaunch(): print error else /* CPU already jumped to start point */ endif __vmx_vminit(vcpu) (contd.) guest_start_point: restore guest state (registers incl rflags, etc.) return to __ksm_init_cpu __ksm_init_cpu (contd.): if __vmx_vminit failed: throw error remove CR4.VMXE bit return error else return 0 /* succeeded */ endif ksm_subvert() (contd.): return whatever___ksm_init_cpu_returned ... During a guest event (e.g. CPUID execution, etc.), this is what happens:
CPU: save guest state load host state (rsp, fs, gs, ...) jump to host RIP (__vmx_entrypoint) KSM: __vmx_entrypoint: /* Note: The guest registers are still untouched at this point! so we can save them and write to them if needed. */ push guest registers if not vcpu_handle_exit(regs) then jump do_vmx_off else pop guest registers vmresume if fail: jump handle_fail endif endif do_vmx_off: pop guest registers vmxoff if fail: jump handle_fail endif /* Now we're off VMX root mode and preparing to return to normal mode, aka no guest-host barrier. */ restore guest stack pointer set guest rflags /* important to do this after restoring the stack pointer and not before, because this may cause interrupts to be re-enabled... */ jump to guest defined RIP /* last guest RIP + last instruction length. */ handle_fail: push guest registers push guest flags call vcpu_handle_fail() /* should not return */ do_hlt: /* incase vcpu_handle_fail() somehow returned... */ hlt jump do_hlt CPU: /* Note: We assume all these came from us (the host, or root mode), in the other case where these instructions are not executed in root mode, the CPU will either: 1) VM exit to root mode if it's inside non-root mode (i.e. virtualized) 2) throw exception if non-root mode. */ if did_vmresume: /* if vmresume was executed */ if not check_host_state_fields or not check_guest_state_fields: set_eflags_to_indicate_failure advance_instruction_pointer return to caller endif /* check if some stuff need to be done on vm-entry (e.g. MSR load or exceptions): */ if msr_entry_fields: load msrs endif save_host_state_fields (very unlikely to be updated...) load guest_state_fields set instruction_pointer to guest_start /* defined by us in that case */ if exception_queued: throw exception else: jump to guest_start endif elif did_vmxoff: /* if vmxoff was executed */ if not do_sanity_checks: set_eflags_to_indicate_failure advance instruction_pointer return to caller endif turn off root mode (i.e. set VMXON pointer to none) set vmcs pointer to 0 set_eflags_to_indicate_success advance instruction_pointer return to caller endif
You can probably tell from that that the execution is now split into 2 things and that we pretty much "kicked" the kernel (Linux or Windows) out of the physical CPU, and took over, and each time they execute something that we want to monitor, the physical processor does the so-called "VM exit" which makes it enter a "supervision" mode then we can decide what to do with that event, there are some events that are forced to do a VM-exit ("unconditional vm-exit") such as execution of CPUID, VMX instructions, etc, which we don't have control over, so when the CPU encounters such instructions, it will exit to us and then we can emulate the instruction.
To control which events do a VM-exit, the processor offers 3 main "VM control structure" fields, which are (see vmx.h for a full list of controls):
Primary processor control (stuff like enabling MSR/IO controls, cr3-load-exiting, cr3-store-exiting, etc.)
Secondary processor control (which must be activated by setting a bit inside primary, and offers stuff like enabling Extended Page Tables, Virtual Processor ID (for cache control), and even descriptor table-exiting like when they load/store GDT/IDT/TR/LDT, etc.)
Pin based control (External interrupts, posted interrupts, preemption timer, virtual non-maskable interrupt). This basically controls interrupt delivery, the "External interrupts" bit means that the processor will exit each time it's delivering an external interrupt (e.g. Mouse, keyboard, etc. Things that are attached to the Local APIC basically, ...)
Note: Those are the "main" controls, there are also other fields that can conditionally cause vm-exits.
Since there is no bits that controls cr0/cr4 stores/loads in those "main" controls, the processor offers 4 fields that control access to those:
- CR0_READ_SHADOW (If a bit is not set in this variable, then the guest kernel won't see it visible.)
- CR4_READ_SHADOW (Same as CR0, we shadow the VMXE bit, so that they can't easily know that we have VMX mode on, and so that we can emulate VMX mode for them if needed.)
- CR0_GUEST_HOST_MASK (If a bit is set in this variable, the processor causes a VM-exit when they try to set that specific bit in their CR0)
- CR4_GUEST_HOST_MASK (Same as CR0, we set the VMXE bit here, so that we can know when they tried to set it and emulate VMX if needed.)
The following control fields are also useful:
- EXCEPTION_BITMAP - The bit index is the exception vector in the IDT
- MSR_BITMAP - This is described in more detail in ksm.c, see init_msr_bitmap. Controls when the processor does VM exits for an MSR read/write.
- IO_BITMAP_A - I/O ports (low part from 0 up to 7FFF), controls when an in/out instruction for a specific port will cause a VM exit.
- IO_BITMAP_B - I/O ports (high part from 8000 to FFFF), ^^^^.
There are also other control fields, but these are mainly not used. The rest of the fields are mostly guest and host setup. It's rather better if you look at vcpu_run() in vcpu.c, that way you can get a "realistic" view of things.
EPT (Extended Page Tables) also called SLAT (Second Level Address Translation) is used to control guest address translation but on the physical level.
Without EPT, the processor normally goes through translating a virtual address to its backing physical address, with EPT, the processor adds another level of translation, which translates the physical address (now called "guest physical address" or GPA) to "host physical address" (HPA).
Normally, the base PML4 table is stored in CR3, which the processor uses to translate a virtual address to it's backing physical address, EPT is very similar, with a configured EPT pointer (EPTP) which also contains the PML4 table, the processor uses this table to translate the GPA to HPA.
Here's an example of what happens during both phases:
#define MAX_PHYS 36 /* physical addr width */ #define PAGE_SHIFT 12 /* bits 0:11 are the offset. */ #define PA_MASK ((1 << MAX_PHYS) - 1) #define PAGE_PA_MASK (PA_MASK << PAGE_SHIFT); /* bits 12:47 of an entry is the Page physical address. */ #define ENTRY_COUNT 512 /* per table */ #define ENTRY_MASK (ENTRY_COUNT - 1) #define pdpt_index(a) (a >> 39) & ENTRY_MASK /* bits 38:30 */ #define pdt_index(a) (a >> 30) & ENTRY_MASK /* bits 29:21 */ #define pt_index(a) (a >> 21) & ENTRY_MASK /* bits 20:12 */ #define page_index(a) (a >> 12) & ENTRY_MASK /* bits 11:0 */ /* First level: Translate GVA to GPA: */ gva = some_arbitrary_value; pml4 = VA_OF(CR3 & PAGE_PA_MASK); pdpt = VA_OF(pml4[pdpt_index(gva)] & PAGE_PA_MASK); pdt = VA_OF(pdpt[pdt_index(gva)] & PAGE_PA_MASK); pt = VA_OF(pdt[pt_index(gva)] & PAGE_PA_MASK); page = pt[page_index(gva)]; gpa = page & PAGE_PA_MASK; /* We now have GPA, and we know it's valid! (assume so) */ /* Second level: Translate GPA to HPA */ eptp = read_eptp_from_current_vmcs; pml4 = VA_OF(eptp & PAGE_PA_MASK); pdpt = VA_OF(pml4[pdpt_index(gpa)] & PAGE_PA_MASK); pdt = VA_OF(pdpt[pdt_index(gpa)] & PAGE_PA_MASK); pt = VA_OF(pdt[pt_index(gpa)] & PAGE_PA_MASK); page = pt[page_index(gpa)]; hpa = page & PAGE_PA_MASK;Pretty much repeating ourselves, but this is basically what happens. An example for this kind of use is executable page hooking which is described in more detail in page.c (also below).
Just like page faults, EPT has "EPT violation" and "EPT misconfig", in the latter case, it can happen when an unsupported bit is set (e.g. a reserved bit is set somewhere), in the former case, it can happen when for example an access bit is not there (e.g. trying to execute but there is no execute access given.)
The traditional EPT violation handling is via the VM exit path, but modern processors (starting off Intel Broadwell) supports a new IDT exception called "Virtualization Exceptions" and that is defined at vector 20 in the IDT. When set and the relevant bits in VMCS are also set, the processor will throw exceptions to that vector instead of causing a VM exit, but under certain conditions it will take the vm-exit path instead, see notes below.
To simplify things, the following terms are used as an abbreviation:
- Host - refers to the VMM (Virtual Machine Monitor) aka VMX root mode
- Guest or Kernel - refers to the running guest kernel (i.e. Windows or Linux)
Some things need to be used with extra care especially inside Host as this is a sensitive mode and things may go unexpected if used improperly.
- The timestamp counter does not _pause_ during entry to Host, so things like APIC timer can fire on next guest entry (vmresume).
- Interrupts are disabled. On entry to __vmx_entrypoint, the CPU had already disabled interrupts ("host eflags"). So, addresses referenced inside root mode should be physically contiguous, otherwise if you enable interrupts by yourself, you might cause havoc if a preemption happens.
- Calling a Kernel function inside the Host can be dangerous, especially because the Host stack is different, so any kind of stack probing functions will most likely fail.
- Single stepping vmresume or vmlaunch is invaluable, the debugger will never give you back control, for obvious reasons. If you want that behavior, then rather set a breakpoint on whatever vcpu->ip is set to.
- If the processor does not support Virtualization Exceptions, the VM exit path will be taken instead (Note that the VM exit path is _always_ handled).
- If the processor does not support VMFUNC, it's emulated via VMCALL instead.
- Virtualization Exceptions (#VE) will not occur if:
- The processor is delivering another exception
- The except_mask inside ve_except_info is set to non-zero value.
Some notes on Guest:
- VMFUNC does not have CPL checks, that means a user-space program can execute it.
- The virtual processor ID (VPID) cannot be 0 since the Host already uses that one, so we use the current processor number is + 1. VPIDs are used to control processor cache.
- By enabling the descriptor table exiting bit in processor secondary control, we can easily establish this
- On initial startup, we allocate a completely new IDT base and copy the current one in use to it (also save the old one)
- When a VM-exit occurs with an EXIT_REASON_GDT_IDT_ACCESS, we simply just give them the cached one (on sidt) (or on lidt), we copy the new one's contents, discarding the hooked entries we know about, thus not letting them know about our stuff.
- vcpu.c: in setup_vmcs() where we initially setup the VMCS fields, we then set the relevant fields (VE_INFO_ADDRESS, EPTP_LIST_ADDRESS, VM_FUNCTION_CTL) and enable relevant bits VE and VMFUNC in secondary processor control.
- vmx.asm (or vmx.S for GCC): which contains the #VE handler (__ept_violation) then does the usual interrupt handling and then calls the C handler __ept_handle_violation (vcpu.c).
- vcpu.c: in __ept_handle_violation (#VE handler not VM-exit), usually the processor will do the #VE handler instead of the VM-exit route, but sometimes it won't do so if it's delivering another exception. This is very rare.
- vcpu.c: while handling the violation via #VE, we call vmfunc only when we detect that the faulting address is one of
- our interest (e.g. a hooked page), then we determine which EPTP we want and execute VMFUNC with that EPTP index.
To avoid a lot of violations, we just mark the page as execute only and replace the _final_ page frame number so that it just goes straight ahead to our trampoline Since we use 3 EPT pointers, and since the page needs to be read and written to sometimes (e.g. patchguard verification), we also need to catch RW access to the page and then switch the EPTP appropriately according to the access. In that case we switch over to EPTP_RWHOOK to allow RW access only! The third pointer is used for when we need to call the original function. The third pointer has execute only access rights to the page with the sane page frame number.
- Port mm.h functions (mm_alloc_page, __mm_free_page, mm_alloc_pool, etc.)
- Port resubv.c (not really needed) for re-virtualization on S1-3 or S4 state (commenting it out is OK).
- Port hotplug.c for cpu hotplug callbacks
- Write module for initialization
- Port print.c for printing interface (Some kernels may not require it)
- Port vmx.S for the assembly based stuff, please use macros for calling conventions, etc.
Hopefully didn't miss something important, but these are definitely the mains.