Commit Graph

5836 Commits

Author SHA1 Message Date
Andi Kleen
b276268631 x86, mce, cmci: factor out threshold interrupt handler
Impact: cleanup; preparation for feature

The mce_amd_64 code has an own private MC threshold vector with an own
interrupt handler. Since Intel needs a similar handler
it makes sense to share the vector because both can not
be active at the same time.

I factored the common APIC handler code into a separate file which can
be used by both the Intel or AMD MC code.

This is needed for the next patch which adds an Intel specific
CMCI handler.

This patch should be a nop for AMD, it just moves some code
around.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24 13:24:42 -08:00
Andi Kleen
41fdff322e x86, mce, cmci: export MAX_NR_BANKS
Impact: Cleanup (code movement)

Move MAX_NR_BANKS into mce.h because it's needed there
for followup patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-24 13:24:42 -08:00
H. Peter Anvin
dc731ca609 Merge branch 'x86/urgent' into x86/mce2 2009-02-23 14:05:56 -08:00
H. Peter Anvin
ec5b3d3243 x86, mce: remove invalid __cpuinit/__cpuexit annotations
Impact: Bug fix when CPU hotplug is disabled

Correct the following broken __cpuinit/__cpuexit annotations:

- mce_cpu_features() is called from mce_resume(), and so cannot be
  __cpuinit.
- mce_disable_cpu() and mce_reenable_cpu() are called from
  mce_cpu_callback(), and so cannot be __cpuexit().

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-23 14:01:04 -08:00
Rafael J. Wysocki
770824bdc4 PM: Split up sysdev_[suspend|resume] from device_power_[down|up]
Move the sysdev_suspend/resume from the callee to the callers, with
no real change in semantics, so that we can rework the disabling of
interrupts during suspend/hibernation.

This is based on an earlier patch from Linus.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-22 10:33:44 -08:00
Linus Torvalds
936577c61d x86: Add IRQF_TIMER to legacy x86 timer interrupt descriptors
Right now nobody cares, but the suspend/resume code will eventually want
to suspend device interrupts without suspending the timer, and will
depend on this flag to know.

The modern x86 timer infrastructure uses the local APIC timers and never
shows up as a device interrupt at all, so it isn't affected and doesn't
need any of this.

Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-22 10:27:49 -08:00
Jiri Slaby
6defa2fe20 x86_64: Fix S3 fail path
As acpi_enter_sleep_state can fail, take this into account in
do_suspend_lowlevel and don't return to the do_suspend_lowlevel's
caller. This would break (currently) fpu status and preempt count.

Technically, this means use `call' instead of `jmp' and `jmp' to
the `resume_point' after the `call' (i.e. if
acpi_enter_sleep_state returns=fails). `resume_point' will handle
the restore of fpu and preempt count gracefully.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-02-21 21:58:18 -05:00
Jiri Slaby
e6bd6760c9 x86_64: acpi/wakeup_64 cleanup
- remove %ds re-set, it's already set in wakeup_long64
- remove double labels and alignment (ENTRY already adds both)
- use meaningful resume point labelname
- skip alignment while jumping from wakeup_long64 to the resume point
- remove .size, .type and unused labels
[v2]
- added ENDPROCs

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-02-21 21:58:18 -05:00
H. Peter Anvin
cc3ca22063 x86, mce: remove incorrect __cpuinit for mce_cpu_features()
Impact: Bug fix on UP

Checkin 6ec68bff3c:
    x86, mce: reinitialize per cpu features on resume

introduced a call to mce_cpu_features() in the resume path, in order
for the MCE machinery to get properly reinitialized after a resume.
However, this function (and its successors) was flagged __cpuinit,
which becomes __init on UP configurations (on SMP suspend/resume
requires CPU hotplug and so this would not be seen.)

Remove the offending __cpuinit annotations for mce_cpu_features() and
its successor functions.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-20 23:40:40 -08:00
Ingo Molnar
07a66d7c53 x86: use the right protections for split-up pagetables
Steven Rostedt found a bug in where in his modified kernel
ftrace was unable to modify the kernel text, due to the PMD
itself having been marked read-only as well in
split_large_page().

The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.

The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the
pagetable entry is a set of all ptes and hence cannot
'clone' any single protection attribute - the ptes can be any
mixture of protections.

With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.

Also update the comment.

This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.

[ Huang Ying also experienced problems in this area when writing
  the EFI code, but the real bug in split_large_page() was not
  realized back then. ]

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-20 08:35:03 +01:00
Alok N Kataria
48ffc70b67 x86, vmi: TSC going backwards check in vmi clocksource
Impact: fix time warps under vmware

Similar to the check for TSC going backwards in the TSC clocksource,
we also need this check for VMI clocksource.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
2009-02-20 07:53:08 +01:00
H. Peter Anvin
f6d1826dfa x86, mce: use %ll instead of %L for 64-bit numbers
Impact: Cleanup

The standard spelling of a printf pattern for long long is "ll", not
"L", which is for long double.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19 15:44:58 -08:00
Andi Kleen
b79109c3bb x86, mce: separate correct machine check poller and fatal exception handler
Impact: cleanup, performance enhancement

The machine check poller is diverging more and more from the fatal
exception handler. Instead of adding more special cases separate the code
paths completely. The corrected poll path is actually quite simple,
and this doesn't result in much code duplication.

This makes both handlers much easier to read and results in
cleaner code flow.  The exception handler now only needs to care
about uncorrected errors, which also simplifies the handling of multiple
errors. The corrected poller also now always runs in standard interrupt
context and does not need to do anything special to handle NMI context.

Minor behaviour changes:
- MCG status is now not cleared on polling.
- Only the banks which had corrected errors get cleared on polling
- The exception handler only clears banks with errors now

v2: Forward port to new patch order. Add "uc" argument.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19 14:52:20 -08:00
Andi Kleen
b5f2fa4ea0 x86, mce: factor out duplicated struct mce setup into one function
Impact: cleanup

This merely factors out duplicated code to set up
the initial struct mce state into a single function.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19 14:51:39 -08:00
Andi Kleen
0d7482e3d7 x86, mce: implement dynamic machine check banks support
Impact: cleanup; making code future proof; memory saving on small systems

This patch replaces the hardcoded max number of machine check banks with 
dynamic allocation depending on what the CPU reports. The sysfs
data structures and the banks array are dynamically allocated.

There is still a hard bank limit (128) because the mcelog protocol uses
banks >= 128 as pseudo banks to escape other events. But we expect
that 128 banks is beyond any reasonable CPU for now.

This supersedes an earlier patch by Venki, but it solves the problem
more completely by making the limit fully dynamic (up to the 128
boundary).

This saves some memory on machines with less than 6 banks because
they won't need sysdevs for unused ones and also allows to 
use sysfs to control these banks on possible future CPUs with
more than 6 banks.

This is an updated patch addressing Venki's comments.  I also added in
another patch from Thomas which fixed the error allocation path (that
patch was previously separated)

Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19 14:50:58 -08:00
Andi Kleen
e35849e910 x86, mce: enable machine checks in 64-bit defconfig
Impact: Low priority fix

The 32-bit defconfig already had it enabled. And it's a pretty
fundamental feature, so better enable it on 64 bits too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2009-02-19 14:48:55 -08:00
Linus Torvalds
bcf8951fc2 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
  x86, mce: use force_sig_info to kill process in machine check
  x86, mce: reinitialize per cpu features on resume
  x86, rcu: fix strange load average and ksoftirqd behavior
2009-02-19 09:14:35 -08:00
KAMEZAWA Hiroyuki
f2dbcfa738 mm: clean up for early_pfn_to_nid()
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:

	BUG_ON(page_zone(start_page) != page_zone(end_page));

Once I knew this is what was happening, I added some annotations:

	if (unlikely(page_zone(start_page) != page_zone(end_page))) {
		printk(KERN_ERR "move_freepages: Bogus zones: "
		       "start_page[%p] end_page[%p] zone[%p]\n",
		       start_page, end_page, zone);
		printk(KERN_ERR "move_freepages: "
		       "start_zone[%p] end_zone[%p]\n",
		       page_zone(start_page), page_zone(end_page));
		printk(KERN_ERR "move_freepages: "
		       "start_pfn[0x%lx] end_pfn[0x%lx]\n",
		       page_to_pfn(start_page), page_to_pfn(end_page));
		printk(KERN_ERR "move_freepages: "
		       "start_nid[%d] end_nid[%d]\n",
		       page_to_nid(start_page), page_to_nid(end_page));
 ...

And here's what I got:

	move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
	move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
	move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
	move_freepages: start_nid[1] end_nid[0]

My memory layout on this box is:

[    0.000000] Zone PFN ranges:
[    0.000000]   Normal   0x00000000 -> 0x0081ff5d
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[8] active PFN ranges
[    0.000000]     0: 0x00000000 -> 0x00020000
[    0.000000]     1: 0x00800000 -> 0x0081f7ff
[    0.000000]     1: 0x0081f800 -> 0x0081fe50
[    0.000000]     1: 0x0081fed1 -> 0x0081fed8
[    0.000000]     1: 0x0081feda -> 0x0081fedb
[    0.000000]     1: 0x0081fedd -> 0x0081fee5
[    0.000000]     1: 0x0081fee7 -> 0x0081ff51
[    0.000000]     1: 0x0081ff59 -> 0x0081ff5d

So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.

This patch:

Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
 I think it makes fix-for-memmap-init not easy.

This patch moves all declaration to include/linux/mm.h

After this,
  if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
     -> Use static definition in include/linux/mm.h
  else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
     -> Use generic definition in mm/page_alloc.c
  else
     -> per-arch back end function will be called.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:55 -08:00
Huang Ying
ef41df4344 x86, mce: fix a race condition in mce_read()
Impact: bugfix

Considering the situation as follow:

before: mcelog.next == 1, mcelog.entry[0].finished = 1

+--------------------------------------------------------------------------
R                   W1                  W2                  W3

read mcelog.next (1)
                    mcelog.next++ (2)
                    (working on entry 1,
                    finished == 0)

mcelog.next = 0
                                        mcelog.next++ (1)
                                        (working on entry 0)
                                                           mcelog.next++ (2)
                                                           (working on entry 1)
                        <----------------- race ---------------->
                    (done on entry 1,
                    finished = 1)
                                                           (done on entry 1,
                                                           finished = 1)

To fix the race condition, a cmpxchg loop is added to mce_read() to
ensure no new MCE record can be added between mcelog.next reading and
mcelog.next = 0.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:33:05 -08:00
Andi Kleen
d6b75584a3 x86, mce: disable machine checks on offlined CPUs
Impact: Lower priority bug fix

Offlined CPUs could still get machine checks, but the machine check handler
cannot handle them properly, leading to an unconditional crash. Disable
machine checks on CPUs that are going down.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:56 -08:00
Andi Kleen
5b4408fdaa x86, mce: don't set up mce sysdev devices with mce=off
Impact: bug fix, in this case the resume handler shouldn't run which
	avoids incorrectly reenabling machine checks on resume

When MCEs are completely disabled on the command line don't set
up the sysdev devices for them either.

Includes a comment fix from Thomas Gleixner.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:50 -08:00
Andi Kleen
52d168e28b x86, mce: switch machine check polling to per CPU timer
Impact: Higher priority bug fix

The machine check poller runs a single timer and then broadcasted an
IPI to all CPUs to check them. This leads to unnecessary
synchronization between CPUs. The original CPU running the timer has
to wait potentially a long time for all other CPUs answering. This is
also real time unfriendly and in general inefficient.

This was especially a problem on systems with a lot of events where
the poller run with a higher frequency after processing some events.
There could be more and more CPU time wasted with this, to
the point of significantly slowing down machines.

The machine check polling is actually fully independent per CPU, so
there's no reason to not just do this all with per CPU timers.  This
patch implements that.

Also switch the poller also to use standard timers instead of work
queues. It was using work queues to be able to execute a user program
on a event, but mce_notify_user() handles this case now with a
separate callback. So instead always run the poll code in in a
standard per CPU timer, which means that in the common case of not
having to execute a trigger there will be less overhead.

This allows to clean up the initialization significantly, because
standard timers are already up when machine checks get init'ed.  No
multiple initialization functions.

Thanks to Thomas Gleixner for some help.

Cc: thockin@google.com
v2: Use del_timer_sync() on cpu shutdown and don't try to handle
migrated timers.
v3: Add WARN_ON for timer running on unexpected CPU

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:44 -08:00
Andi Kleen
9bd9840580 x86, mce: always use separate work queue to run trigger
Impact: Needed for bug fix in next patch

This relaxes the requirement that mce_notify_user has to run in process
context. Useful for future changes, but also leads to cleaner
behaviour now. Now instead mce_notify_user can be called directly
from interrupt (but not NMI) context.

The work queue only uses a single global work struct, which can be done safely
because it is always free to reuse before the trigger function is executed.
This way no events can be lost.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:41 -08:00
Andi Kleen
123aa76ec0 x86, mce: don't disable machine checks during code patching
Impact: low priority bug fix

This removes part of a a patch I added myself some time ago. After some
consideration the patch was a bad idea. In particular it stopped machine check
exceptions during code patching.

To quote the comment:

        * MCEs only happen when something got corrupted and in this
        * case we must do something about the corruption.
        * Ignoring it is worse than a unlikely patching race.
        * Also machine checks tend to be broadcast and if one CPU
        * goes into machine check the others follow quickly, so we don't
        * expect a machine check to cause undue problems during to code
        * patching.

So undo the machine check related parts of
8f4e956b31 NMIs are still disabled.

This only removes code, the only additions are a new comment.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:38 -08:00
Andi Kleen
973a2dd1d5 x86, mce: disable machine checks on suspend
Impact: Bug fix

During suspend it is not reliable to process machine check
exceptions, because CPUs disappear but can still get machine check
broadcasts.  Also the system is slightly more likely to
machine check them, but the handler is typically not a position
to handle them in a meaningfull way.

So disable them during suspend and enable them during resume.

Also make sure they are always disabled on hot-unplugged CPUs.

This new code assumes that suspend always hotunplugs all
non BP CPUs.

v2: Remove the WARN_ONs Thomas objected to.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:32:14 -08:00
Andi Kleen
07db1c140e x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
Impact: Bugfix

The ifdef for the apic clear on shutdown for the 64bit intel thermal
vector was incorrect and never triggered. Fix that.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:24:34 -08:00
Andi Kleen
380851bc6b x86, mce: use force_sig_info to kill process in machine check
Impact: bug fix (with tolerant == 3)

do_exit cannot be called directly from the exception handler because
it can sleep and the exception handler runs on the exception stack.
Use force_sig() instead.

Based on a earlier patch by Ying Huang who debugged the problem.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:24:31 -08:00
Andi Kleen
6ec68bff3c x86, mce: reinitialize per cpu features on resume
Impact: Bug fix

This fixes a long standing bug in the machine check code. On resume the
boot CPU wouldn't get its vendor specific state like thermal handling
reinitialized. This means the boot cpu wouldn't ever get any thermal
events reported again.

Call the respective initialization functions on resume

v2: Remove ancient init because they don't have a resume device anyways.
    Pointed out by Thomas Gleixner.
v3: Now fix the Subject too to reflect v2 change

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-02-17 15:24:28 -08:00
Linus Torvalds
f8effd1a4a Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  doc: mmiotrace.txt, buffer size control change
  trace: mmiotrace to the tracer menu in Kconfig
  mmiotrace: count events lost due to not recording
2009-02-17 14:29:15 -08:00
Linus Torvalds
35010334aa Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, vm86: fix preemption bug
  x86, olpc: fix model detection without OFW
  x86, hpet: fix for LS21 + HPET = boot hang
  x86: CPA avoid repeated lazy mmu flush
  x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
  x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
  x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
  x86/cpa: make sure cpa is safe to call in lazy mmu mode
  x86, ptrace, mm: fix double-free on race
2009-02-17 14:27:39 -08:00
Linus Torvalds
39a65762d4 Merge branch 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Flush volatile msrs before emulating rdmsr
  KVM: Fix assigned devices circular locking dependency
  KVM: x86: fix LAPIC pending count calculation
  KVM: Fix INTx for device assignment
  KVM: MMU: Map device MMIO as UC in EPT
  KVM: x86: disable kvmclock on non constant TSC hosts
  KVM: PIT: fix i8254 pending count read
  KVM: Fix racy in kvm_free_assigned_irq
  KVM: Add kvm_arch_sync_events to sync with asynchronize events
  KVM: mmu_notifiers release method
  KVM: Avoid using CONFIG_ in userspace visible headers
  KVM: ia64: fix fp fault/trap handler
2009-02-17 14:04:32 -08:00
Paul E. McKenney
bf51935f3e x86, rcu: fix strange load average and ksoftirqd behavior
Damien Wyart reported high ksoftirqd CPU usage (20%) on an
otherwise idle system.

The function-graph trace Damien provided:

>   799.521187 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521371 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521555 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521738 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.521934 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522068 |   1)  ksoftir-2324  |               |                rcu_check_callbacks() {
>   799.522208 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522392 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522575 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522759 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.522956 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523074 |   1)  ksoftir-2324  |               |                  rcu_check_callbacks() {
>   799.523214 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523397 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523579 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523762 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.523960 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524079 |   1)  ksoftir-2324  |               |                  rcu_check_callbacks() {
>   799.524220 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524403 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524587 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
>   799.524770 |   1)    <idle>-0    |               |  rcu_check_callbacks() {
> [ . . . ]

Shows rcu_check_callbacks() being invoked way too often. It should be called
once per jiffy, and here it is called no less than 22 times in about
3.5 milliseconds, meaning one call every 160 microseconds or so.

Why do we need to call rcu_pending() and rcu_check_callbacks() from the
idle loop of 32-bit x86, especially given that no other architecture does
this?

The following patch removes the call to rcu_pending() and
rcu_check_callbacks() from the x86 32-bit idle loop in order to
reduce the softirq load on idle systems.

Reported-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-17 22:47:45 +01:00
Rusty Russell
a0abd520fd cpumask: fix powernow-k8: partial revert of 2fdf66b491
Impact: fix powernow-k8 when acpi=off (or other error).

There was a spurious change introduced into powernow-k8 in this patch:
so that we try to "restore" the cpus_allowed we never saved.  We revert
that file.

See lkml "[PATCH] x86/powernow: fix cpus_allowed brokage when
acpi=off" from Yinghai for the bug report.

Cc: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
2009-02-16 17:31:59 +10:30
Pekka Paalanen
6bc5c366b1 trace: mmiotrace to the tracer menu in Kconfig
Impact: cosmetic change in Kconfig menu layout

This patch was originally suggested by Peter Zijlstra, but seems it
was forgotten.

CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable
directly under the Kernel hacking / debugging menu in the kernel
configuration system. They were present only for x86 and x86_64.

Other tracers that use the ftrace tracing framework are in their own
sub-menu. This patch moves the mmiotrace configuration options there.
Since the Kconfig file, where the tracer menu is, is not architecture
specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by
x86/x86_64. CONFIG_MMIOTRACE now depends on it.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 20:03:28 +01:00
Thomas Gleixner
be716615fe x86, vm86: fix preemption bug
Commit 3d2a71a596 ("x86, traps: converge
do_debug handlers") changed the preemption disable logic of do_debug()
so vm86_handle_trap() is called with preemption disabled resulting in:

 BUG: sleeping function called from invalid context at include/linux/kernel.h:155
 in_atomic(): 1, irqs_disabled(): 0, pid: 3005, name: dosemu.bin
 Pid: 3005, comm: dosemu.bin Tainted: G        W  2.6.29-rc1 #51
 Call Trace:
  [<c050d669>] copy_to_user+0x33/0x108
  [<c04181f4>] save_v86_state+0x65/0x149
  [<c0418531>] handle_vm86_trap+0x20/0x8f
  [<c064e345>] do_debug+0x15b/0x1a4
  [<c064df1f>] debug_stack_correct+0x27/0x2c
  [<c040365b>] sysenter_do_call+0x12/0x2f
 BUG: scheduling while atomic: dosemu.bin/3005/0x10000001

Restore the original calling convention and reenable preemption before
calling handle_vm86_trap().

Reported-by: Michal Suchanek <hramrach@centrum.cz>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 10:46:13 +01:00
Avi Kivity
516a1a7e9d KVM: VMX: Flush volatile msrs before emulating rdmsr
Some msrs (notable MSR_KERNEL_GS_BASE) are held in the processor registers
and need to be flushed to the vcpu struture before they can be read.

This fixes cygwin longjmp() failure on Windows x64.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:39 +02:00
Marcelo Tosatti
b682b814e3 KVM: x86: fix LAPIC pending count calculation
Simplify LAPIC TMCCT calculation by using hrtimer provided
function to query remaining time until expiration.

Fixes host hang with nested ESX.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:38 +02:00
Sheng Yang
2aaf69dcee KVM: MMU: Map device MMIO as UC in EPT
Software are not allow to access device MMIO using cacheable memory type, the
patch limit MMIO region with UC and WC(guest can select WC using PAT and
PCD/PWT).

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:37 +02:00
Marcelo Tosatti
abe6655dd6 KVM: x86: disable kvmclock on non constant TSC hosts
This is better.

Currently, this code path is posing us big troubles,
and we won't have a decent patch in time. So, temporarily
disable it.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Marcelo Tosatti
d2a8284e8f KVM: PIT: fix i8254 pending count read
count_load_time assignment is bogus: its supposed to contain what it
means, not the expiration time.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Sheng Yang
ba4cef31d5 KVM: Fix racy in kvm_free_assigned_irq
In the past, kvm_get_kvm() and kvm_put_kvm() was called in assigned device irq
handler and interrupt_work, in order to prevent cancel_work_sync() in
kvm_free_assigned_irq got a illegal state when waiting for interrupt_work done.
But it's tricky and still got two problems:

1. A bug ignored two conditions that cancel_work_sync() would return true result
in a additional kvm_put_kvm().

2. If interrupt type is MSI, we would got a window between cancel_work_sync()
and free_irq(), which interrupt would be injected again...

This patch discard the reference count used for irq handler and interrupt_work,
and ensure the legal state by moving the free function at the very beginning of
kvm_destroy_vm(). And the patch fix the second bug by disable irq before
cancel_work_sync(), which may result in nested disable of irq but OK for we are
going to free it.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Sheng Yang
ad8ba2cd44 KVM: Add kvm_arch_sync_events to sync with asynchronize events
kvm_arch_sync_events is introduced to quiet down all other events may happen
contemporary with VM destroy process, like IRQ handler and work struct for
assigned device.

For kvm_arch_sync_events is called at the very beginning of kvm_destroy_vm(), so
the state of KVM here is legal and can provide a environment to quiet down other
events.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:36 +02:00
Avi Kivity
7a0eb1960e KVM: Avoid using CONFIG_ in userspace visible headers
Kconfig symbols are not available in userspace, and are not stripped by
headers-install.  Avoid their use by adding #defines in <asm/kvm.h> to
suit each architecture.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-02-15 02:47:35 +02:00
Chris Ball
e49590b6dd x86, olpc: fix model detection without OFW
Impact: fix "garbled display, laptop is unusable" bug

Commit e51a1ac2df ("x86, olpc: fix endian
bug in openfirmware workaround") breaks model comparison on OLPC; the value
0xc2 needs to be scaled up by olpc_board().

The pre-patch version was wrong, but accidentally worked anyway
(big-endian 0xc2 is big enough to satisfy all other board revisions,
but little endian 0xc2 is not).

Signed-off-by: Chris Ball <cjb@laptop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andres Salomon <dilinger@queued.net>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:05:25 +01:00
john stultz
b13e24644c x86, hpet: fix for LS21 + HPET = boot hang
Between 2.6.23 and 2.6.24-rc1 a change was made that broke IBM LS21
systems that had the HPET enabled in the BIOS, resulting in boot hangs
for x86_64.

Specifically commit b8ce335906, which
merges the i386 and x86_64 HPET code.

Prior to this commit, when we setup the HPET timers in x86_64, we did
the following:

	hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
                    HPET_TN_32BIT, HPET_T0_CFG);

However after the i386/x86_64 HPET merge, we do the following:

	cfg = hpet_readl(HPET_Tn_CFG(timer));
	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC |
			HPET_TN_SETVAL | HPET_TN_32BIT;
	hpet_writel(cfg, HPET_Tn_CFG(timer));

However on LS21s with HPET enabled in the BIOS, the HPET_T0_CFG register
boots with Level triggered interrupts (HPET_TN_LEVEL) enabled. This
causes the periodic interrupt to be not so periodic, and that results in
the boot time hang I reported earlier in the delay calibration.

My fix: Always disable HPET_TN_LEVEL when setting up periodic mode.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13 09:15:46 +01:00
Thomas Gleixner
7ad9de6ac8 x86: CPA avoid repeated lazy mmu flush
Impact: Flush the lazy MMU only once

Pending mmu updates only need to be flushed once to bring the
in-memory pagetable state up to date.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-02-12 23:11:58 +01:00
Thomas Gleixner
34b0900d32 x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
Impact: Catch cases where lazy MMU state is active in a preemtible context

arch_flush_lazy_mmu_cpu() has been changed to disable preemption so
the checks in enter/leave will never trigger. Put the preemtible()
check into arch_flush_lazy_mmu_cpu() to catch such cases.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-02-12 23:11:58 +01:00
Jeremy Fitzhardinge
d85cf93da6 x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
Impact: avoid access to percpu vars in preempible context

They are intended to be used whenever there's the possibility
that there's some stale state which is going to be overwritten
with a queued update, or to force a state change when we may be
in lazy mode.  Either way, we could end up calling it with
preemption enabled, so wrap the functions in their own little
preempt-disable section so they can be safely called in any
context (though preemption should never be enabled if we're actually
in a lazy state).

(Move out of line to avoid #include dependencies.)
    
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-02-12 23:11:58 +01:00
Suresh Siddha
be03d9e802 x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
Jeff Mahoney reported:

> With Suse's hwinfo tool, on -tip:
> WARNING: at arch/x86/mm/pat.c:637 reserve_pfn_range+0x5b/0x26d()

reserve_pfn_range() is not tracking the memory range below 1MB
as non-RAM and as such is inconsistent with similar checks in
reserve_memtype() and free_memtype()

Rename the pagerange_is_ram() to pat_pagerange_is_ram() and add the
"track legacy 1MB region as non RAM" condition.

And also, fix reserve_pfn_range() to return -EINVAL, when the pfn
range is RAM. This is to be consistent with this API design.

Reported-and-tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12 08:27:27 +01:00
Jeremy Fitzhardinge
4f06b0436b x86/cpa: make sure cpa is safe to call in lazy mmu mode
Impact: fix race leading to crash under KVM and Xen

The CPA code may be called while we're in lazy mmu update mode - for
example, when using DEBUG_PAGE_ALLOC and doing a slab allocation
in an interrupt handler which interrupted a lazy mmu update.  In this
case, the in-memory pagetable state may be out of date due to pending
queued updates.  We need to flush any pending updates before inspecting
the page table.  Similarly, we must explicitly flush any modifications
CPA may have made (which comes down to flushing queued operations when
flushing the TLB).

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12 08:27:26 +01:00