This allows the kvm mmu to perform sleepy operations, such as memory
allocation.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Current kvm disables preemption while the new virtualization registers are
in use. This of course is not very good for latency sensitive workloads (one
use of virtualization is to offload user interface and other latency
insensitive stuff to a container, so that it is easier to analyze the
remaining workload). This patch re-enables preemption for kvm; preemption
is now only disabled when switching the registers in and out, and during
the switch to guest mode and back.
Contains fixes from Shaohua Li <shaohua.li@intel.com>.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add the hypercall number to kvm_run and initialize it. This changes the ABI,
but as this particular ABI was unusable before this no users are affected.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Put cpu feature detecting part in hardware_setup, and stored the vmcs
condition in global variable for further check.
[glommer: fix for some i386-only machines not supporting CR8 load/store
exiting]
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch converts the vcpus array in "struct kvm" to a pointer
array, and changes the "vcpu_create" and "vcpu_setup" hooks into one
"vcpu_create" call which does the allocation and initialization of the
vcpu (calling back into the kvm_vcpu_init core helper).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
struct kvm_vcpu has vmx-specific members; remove them to a private structure.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
load_pdptrs can be handed an invalid cr3, and it should not oops.
This can happen because we injected #gp in set_cr3() after we set
vcpu->cr3 to the invalid value, or from kvm_vcpu_ioctl_set_sregs(), or
memory configuration changes after the guest did set_cr3().
We should also copy the pdpte array once, before checking and
assigning, otherwise an SMP guest can potentially alter the values
between the check and the set.
Finally one nitpick: ret = 1 should be done as late as possible: this
allows GCC to check for unset "ret" should the function change in
future.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The writeback fixes (02c03a326a) let
some dead code in the cmpxchg instruction emulation. Remove it.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch mainly imports some constants and rename two exist constants
of vmcs according to IA32 SDM.
It also adds two constants to indicate Lock bit and Enable bit in
MSR_IA32_FEATURE_CONTROL, and replace the hardcode _5_ with these two
bits.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
gfn_to_page might sleep with swap support. Move it out of the kmap calls.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
vmx_cpu_run doesn't handle error correctly and kvm_mmu_reload might
sleep with mutex changes, so I move it above.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Right now, the bug is harmless as we never emulate one-byte 0xb6 or 0xb7.
But things may change.
Noted by the mysterious Gabriel C.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Intel manual (and KVM definition) say the TPR is 4 bits wide. Also fix
CR8_RESEVED_BITS typo.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Creating one's own BITMAP macro seems suboptimal: if we use manual
arithmetic in the one place exposed to userspace, we can use standard
macros elsewhere.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
On this machine (Intel), writing to the CR4 bits 0x00000800 and
0x00001000 cause a GPF. The Intel manual is a little unclear, but
AFIACT they're reserved, too.
Also fix spelling of CR4_RESEVED_BITS.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kernel now has asm/cpu-features.h: use those macros instead of inventing
our own.
Also spell out definition of CR3_RESEVED_BITS, fix spelling and
tighten it for the non-PAE case.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The kernel now has asm/cpu-features.h: use those macros instead of
inventing our own.
Also spell out definition of CR0_RESEVED_BITS (no code change) and fix typo.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
I have shied away from touching x86_emulate.c (it could definitely use
some love, but it is forked from the Xen code, and it would be more
productive to cross-merge fixes).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add string pio write support to support some version of Windows.
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds a `vcpu_id' field in `struct vcpu', so we can
differentiate BSP and APs without pointer comparison or arithmetic.
Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
*nopage() in kvm_main.c should only store the type of mmap() fault if
the pointers are not NULL. This patch fixes the problem.
Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
What guest drivers?
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A guest context switch to an uncached cr3 can require allocation of
shadow pages, but we only recycle shadow pages in kvm_mmu_page_fault().
Move shadow page recycling to mmu_topup_memory_caches(), which is called
from both the page fault handler and from guest cr3 reload.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When taking a cpu down, we need to hardware_disable() it.
Unfortunately, the CPU_DYING notifier is called with interrupts
disabled, which means we can't use smp_call_function_single().
Fortunately, the CPU_DYING notifier is always called on the dying cpu,
so we don't need to use the function at all and can simply call
hardware_disable() directly.
Tested-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
More fallout from the writeback fixes: debug register transfer
instructions do their own writeback and thus need to disable the general
writeback mechanism.
This fixes oopses and some guest failures on AMD machines (the Intel
variant decodes the instruction in hardware and thus does not need
emulation).
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0x0f 0x01 instructions (ie lgdt, lidt, smsw, lmsw and invlpg) does
not use writeback. This patch set no_wb=1 when emulating those
instructions.
This fixes a regression booting the FreeBSD kernel on AMD.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Testing the wrong bit caused kvm not to disable nx on the guest when it is
disabled on the host (an mmu optimization relies on the nx bits being the
same in the guest and host).
This allows Windows to boot when nx is disabled on te host (e.g. when
host pae is disabled).
Signed-off-by: Avi Kivity <avi@qumranet.com>
This reverts commit a3c870bdce. While it
does save useless updates, it (probably) defeats the fork detector, causing
a massive performance loss.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We add the kvm to the vm_list before initializing the vcpu mutexes,
which can be mutex_trylock()'ed by decache_vcpus_on_cpu().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Writes that are contiguous in virtual memory may not be contiguous in
physical memory; so split writes that straddle a page boundary.
Thanks to Aurelien for reporting the bug, patient testing, and a fix
to this very patch.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
__free_page() wants a struct page, not a virtual address.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvm mmu uses page->private on shadow page tables; so does slub, and
an oops result. Fix by allocating regular pages for shadows instead of
using slub.
Tested-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allow real-mode emulation of rdmsr and wrmsr. This allows smp Windows to
boot, presumably for its sipi trampoline.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The memory slot management functions were oriented against vcpu 0, where
they should be kvm-wide. This causes hangs starting X on guest smp.
Fix by making the functions (and resultant tail in the mmu) non-vcpu-specific.
Unfortunately this reduces the efficiency of the mmu object cache a bit. We
may have to revisit this later.
Signed-off-by: Avi Kivity <avi@qumranet.com>
We need to distinguish between large page shadows which have the nx bit set
and those which don't. The problem shows up when booting a newer smp Linux
kernel, where the trampoline page (which is in real mode, which uses the
same shadow pages as large pages) is using the same mapping as a kernel data
page, which is mapped using nx, causing kvm to spin on that page.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.
This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Currently, CONFIG_X86_CMPXCHG64 both enables boot-time checking of
the cmpxchg64b feature and enables compilation of the set_64bit() family.
Since the option is dependent on PAE, and since KVM depends on set_64bit(),
this effectively disables KVM on i386 nopae.
Simplify by removing the config option altogether: the boot check is made
dependent on CONFIG_X86_PAE directly, and the set_64bit() family is exposed
without constraints. It is up to users to check for the feature flag (KVM
does not as virtualiation extensions imply its existence).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only at the CPU_DYING stage can we be sure that no user process will
be scheduled onto the cpu and oops when trying to use virtualization
extensions.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The hotplug IPIs can be called from the cpu on which we are currently
running on, so use on_cpu(). Similarly, drop on_each_cpu() for the
suspend/resume callbacks, as we're in atomic context here and only one
cpu is up anyway.
Signed-off-by: Avi Kivity <avi@qumranet.com>
By keeping track of which cpus have virtualization enabled, we
prevent double-enable or double-disable during hotplug, which is a
very fatal oops.
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm uses a pseudo filesystem, kvmfs, to generate inodes, a job that the
new anonymous inodes source does much better.
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>