Impact: clean up
We can autodetect those system that need cluster apic, and update genapic
accordingly.
We can also remove wakeup.h for e7000, because it's default one is now
the same as overall default mach_wakecpu.h
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Change order of storing to match the sigcontext_ia32.
And add casting to make this code same as arch/x86/kernel/signal_32.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
No need to use temporary variable.
Also rename the variable same as arch/x86/kernel/signal_32.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: WANG Cong <wangcong@zeuux.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
No need to use temporary variable in this case.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen the reach of the low-memory-protect DMI quirk
Phoenix BIOSes variously identify their vendor as "Phoenix Technologies,
LTD" or "Phoenix Technologies LTD" (without the comma.)
This patch makes the identification string in the bad_bios_dmi_table
more general (following a suggestion by Ingo Molnar), so that both
versions are handled.
Again, the patched file compiles cleanly and the patch has been tested
successfully on my machine.
Signed-off-by: Philipp Kohlbecher <xt28@gmx.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: makes device isolation the default for AMD IOMMU
Some device drivers showed double-free bugs of DMA memory while testing
them with AMD IOMMU. If all devices share the same protection domain
this can lead to data corruption and data loss. Prevent this by putting
each device into its own protection domain per default.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
this compiler warning:
arch/x86/kernel/ds.c: In function 'ds_request':
arch/x86/kernel/ds.c:368: warning: 'context' may be used uninitialized in this function
Shows that the code flow in ds_request() is buggy - it goes into
the unlock+release-context path even when the context is not allocated
yet.
First allocate the context, then do the other checks.
Also, take care with GFP allocations under the ds_lock spinlock.
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fixes korg bugzilla 11980
A kernel for a 64bit x86 system should always contain the swiotlb code
in case it is booted on a machine without any hardware IOMMU supported
by the kernel and more than 4GB of RAM. This patch changes Kconfig to
always compile swiotlb into the kernel for x86_64.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: help to find the better depth of trace
We decided to arbitrary define the depth of function return trace as
"20". Perhaps this is not enough. To help finding an optimal depth, we
measure now the overrun: the number of functions that have been missed
for the current thread. By default this is not displayed, we have to
do set a particular flag on the return tracer: echo overrun >
/debug/tracing/trace_options And the overrun will be printed on the
right.
As the trace shows below, the current 20 depth is not enough.
update_wall_time+0x37f/0x8c0 -> update_xtime_cache (345 ns) (Overruns: 2838)
update_wall_time+0x384/0x8c0 -> clocksource_get_next (1141 ns) (Overruns: 2838)
do_timer+0x23/0x100 -> update_wall_time (3882 ns) (Overruns: 2838)
tick_do_update_jiffies64+0xbf/0x160 -> do_timer (5339 ns) (Overruns: 2838)
tick_sched_timer+0x6a/0xf0 -> tick_do_update_jiffies64 (7209 ns) (Overruns: 2838)
vgacon_set_cursor_size+0x98/0x120 -> native_io_delay (2613 ns) (Overruns: 274)
vgacon_cursor+0x16e/0x1d0 -> vgacon_set_cursor_size (33151 ns) (Overruns: 274)
set_cursor+0x5f/0x80 -> vgacon_cursor (36432 ns) (Overruns: 274)
con_flush_chars+0x34/0x40 -> set_cursor (38790 ns) (Overruns: 274)
release_console_sem+0x1ec/0x230 -> up (721 ns) (Overruns: 274)
release_console_sem+0x225/0x230 -> wake_up_klogd (316 ns) (Overruns: 274)
con_flush_chars+0x39/0x40 -> release_console_sem (2996 ns) (Overruns: 274)
con_write+0x22/0x30 -> con_flush_chars (46067 ns) (Overruns: 274)
n_tty_write+0x1cc/0x360 -> con_write (292670 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2a/0x90 -> native_apic_mem_write (330 ns) (Overruns: 274)
irq_enter+0x17/0x70 -> idle_cpu (413 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2f/0x90 -> irq_enter (1525 ns) (Overruns: 274)
ktime_get_ts+0x40/0x70 -> getnstimeofday (465 ns) (Overruns: 274)
ktime_get_ts+0x60/0x70 -> set_normalized_timespec (436 ns) (Overruns: 274)
ktime_get+0x16/0x30 -> ktime_get_ts (2501 ns) (Overruns: 274)
hrtimer_interrupt+0x77/0x1a0 -> ktime_get (3439 ns) (Overruns: 274)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wakeup_secondary_cpu with hotplug
We can not put that into x86_quirks, because that is __initdata.
So try to move that to genapic, and add update_genapic in x86_quirks.
later we even could use that stub to:
1. autodetect CONFIG_ES7000_CLUSTERED_APIC
2. more correct inquire_remote_apic with apic_verbosity setting.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix incorrectly marked unstable TSC clock
Patch (commit 0d12cdd "sched: improve sched_clock() performance") has
a regression on one of the test systems here.
With the patch, I see:
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 28 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
Whereas, without the patch syncs pass fine on all CPUs:
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Due to this, TSC is marked unstable, when it is not actually unstable.
This is because syncs in check_tsc_wrap() goes away due to this commit.
As per the discussion on this thread, correct way to fix this is to add
explicit syncs as below?
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
reset_value was changed from long to u64 in commit
b991702884 (oprofile: Implement Intel
architectural perfmon support)
But dynamic allocation of this array use a wrong type (long instead of
u64)
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Impact: fix secondary-CPU wakeup/init path with numaq and es7000
While looking at wakeup_secondary_cpu for WAKE_SECONDARY_VIA_NMI:
|#ifdef WAKE_SECONDARY_VIA_NMI
|/*
| * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
| * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
| * won't ... remember to clear down the APIC, etc later.
| */
|static int __devinit
|wakeup_secondary_cpu(int logical_apicid, unsigned long start_eip)
|{
| unsigned long send_status, accept_status = 0;
| int maxlvt;
|...
| if (APIC_INTEGRATED(apic_version[phys_apicid])) {
| maxlvt = lapic_get_maxlvt();
I noticed that there is no warning about undefined phys_apicid...
because WAKE_SECONDARY_VIA_NMI and WAKE_SECONDARY_VIA_INIT can not be
defined at the same time. So NUMAQ is using wrong wakeup_secondary_cpu.
WAKE_SECONDARY_VIA_NMI, WAKE_SECONDARY_VIA_INIT and
WAKE_SECONDARY_VIA_MIP are variants of a weird and fragile
preprocessor-driven "HAL" mechanisms to specify the kind of secondary-CPU
wakeup strategy a given x86 kernel will use.
The vast majority of systems want to use INIT for secondary wakeup - NUMAQ
uses an NMI, (old-style-) ES7000 uses 'MIP' (a firmware driven in-memory
flag to let secondaries continue).
So convert these mechanisms to x86_quirks and add a
->wakeup_secondary_cpu() method to specify the rare exception
to the sane default.
Extend genapic accordingly as well, for 32-bit.
While looking further, I noticed that functions in wakecup.h for numaq
and es7000 are different to the default in mach_wakecpu.h - but smpboot.c
will only use default mach_wakecpu.h with smphook.h.
So we need to add mach_wakecpu.h for mach_generic, to properly support
numaq and es7000, and vectorize the following SMP init methods:
int trampoline_phys_low;
int trampoline_phys_high;
void (*wait_for_init_deassert)(atomic_t *deassert);
void (*smp_callin_clear_local_apic)(void);
void (*store_NMI_vector)(unsigned short *high, unsigned short *low);
void (*restore_NMI_vector)(unsigned short *high, unsigned short *low);
void (*inquire_remote_apic)(int apicid);
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
All blame goes to: color white,red "[^[:graph:]]+$"
in .nanorc ;).
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix es7000 build
CC arch/x86/kernel/es7000_32.o
arch/x86/kernel/es7000_32.c: In function find_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:255: error: implicit declaration of function acpi_get_table_with_size
arch/x86/kernel/es7000_32.c:261: error: implicit declaration of function early_acpi_os_unmap_memory
arch/x86/kernel/es7000_32.c: In function unmap_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:277: error: implicit declaration of function __acpi_unmap_table
make[1]: *** [arch/x86/kernel/es7000_32.o] Error 1
we applied one patch out of order...
| commit a73aaedd95
| Author: Yinghai Lu <yhlu.kernel@gmail.com>
| Date: Sun Sep 14 02:33:14 2008 -0700
|
| x86: check dsdt before find oem table for es7000, v2
|
| v2: use __acpi_unmap_table()
that patch need:
x86: use early_ioremap in __acpi_map_table
x86: always explicitly map acpi memory
acpi: remove final __acpi_map_table mapping before setting acpi_gbl_permanent_mmap
acpi/x86: introduce __apci_map_table, v4
submitted to the ACPI tree but not upstream yet.
fix it until those patches applied, need to revert this one
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a problem where ds_request() returned an error without releasing the
ds lock.
Reported-by: Stephane Eranian <eranian@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the support for dynamic tracing on the function return tracer.
The whole difference with normal dynamic function tracing is that we don't need
to hook on a particular callback. The only pro that we want is to nop or set
dynamically the calls to ftrace_caller (which is ftrace_return_caller here).
Some security checks ensure that we are not trying to launch dynamic tracing for
return tracing while normal function tracing is already running.
An example of trace with getnstimeofday set as a filter:
ktime_get_ts+0x22/0x50 -> getnstimeofday (2283 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1396 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1825 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1426 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1524 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1434 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1502 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1404 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1397 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1051 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1314 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1344 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1163 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1390 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1374 ns)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix possible race condition in ftrace function return tracer
This fixes a possible race condition if index incrementation
is not immediately flushed in memory.
Thanks for Andi Kleen and Steven Rostedt for pointing out this issue
and give me this solution.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow archs more flexibility on dynamic ftrace implementations
Dynamic ftrace has largly been developed on x86. Since x86 does not
have the same limitations as other architectures, the ftrace interaction
between the generic code and the architecture specific code was not
flexible enough to handle some of the issues that other architectures
have.
Most notably, module trampolines. Due to the limited branch distance
that archs make in calling kernel core code from modules, the module
load code must create a trampoline to jump to what will make the
larger jump into core kernel code.
The problem arises when this happens to a call to mcount. Ftrace checks
all code before modifying it and makes sure the current code is what
it expects. Right now, there is not enough information to handle modifying
module trampolines.
This patch changes the API between generic dynamic ftrace code and
the arch dependent code. There is now two functions for modifying code:
ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into
a nop, where the original text is calling addr. (mod is the
module struct if called by module init)
ftrace_make_caller(rec, addr) - convert the code rec->ip that should
be a nop into a caller to addr.
The record "rec" now has a new field called "arch" where the architecture
can add any special attributes to each call site record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e51af66308, which was
wrongly hoovered up and submitted about a month after a better fix had
already been merged.
The better fix is commit cbda1ba898
("PCI/iommu: blacklist DMAR on Intel G31/G33 chipsets"), where we do
this blacklisting based on the DMI identification for the offending
motherboard, since sometimes this chipset (or at least a chipset with
the same PCI ID) apparently _does_ actually have an IOMMU.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: Fix interrupt via the apicinterrupt macro
Checkin 939b787130 changed the
"interrupt" macro, but the "interrupt" macro is also invoked
indirectly from the "apicinterrupt" macro.
The "apicinterrupt" macro probably should have its own collection of
systematic stubs for the same reason the main IRQ code does; as is it
is a huge amount of replicated code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Conflicts:
security/keys/internal.h
security/keys/process_keys.c
security/keys/request_key.c
Fixed conflicts above by using the non 'tsk' versions.
Signed-off-by: James Morris <jmorris@namei.org>
Make execve() take advantage of copy-on-write credentials, allowing it to set
up the credentials in advance, and then commit the whole lot after the point
of no return.
This patch and the preceding patches have been tested with the LTP SELinux
testsuite.
This patch makes several logical sets of alteration:
(1) execve().
The credential bits from struct linux_binprm are, for the most part,
replaced with a single credentials pointer (bprm->cred). This means that
all the creds can be calculated in advance and then applied at the point
of no return with no possibility of failure.
I would like to replace bprm->cap_effective with:
cap_isclear(bprm->cap_effective)
but this seems impossible due to special behaviour for processes of pid 1
(they always retain their parent's capability masks where normally they'd
be changed - see cap_bprm_set_creds()).
The following sequence of events now happens:
(a) At the start of do_execve, the current task's cred_exec_mutex is
locked to prevent PTRACE_ATTACH from obsoleting the calculation of
creds that we make.
(a) prepare_exec_creds() is then called to make a copy of the current
task's credentials and prepare it. This copy is then assigned to
bprm->cred.
This renders security_bprm_alloc() and security_bprm_free()
unnecessary, and so they've been removed.
(b) The determination of unsafe execution is now performed immediately
after (a) rather than later on in the code. The result is stored in
bprm->unsafe for future reference.
(c) prepare_binprm() is called, possibly multiple times.
(i) This applies the result of set[ug]id binaries to the new creds
attached to bprm->cred. Personality bit clearance is recorded,
but now deferred on the basis that the exec procedure may yet
fail.
(ii) This then calls the new security_bprm_set_creds(). This should
calculate the new LSM and capability credentials into *bprm->cred.
This folds together security_bprm_set() and parts of
security_bprm_apply_creds() (these two have been removed).
Anything that might fail must be done at this point.
(iii) bprm->cred_prepared is set to 1.
bprm->cred_prepared is 0 on the first pass of the security
calculations, and 1 on all subsequent passes. This allows SELinux
in (ii) to base its calculations only on the initial script and
not on the interpreter.
(d) flush_old_exec() is called to commit the task to execution. This
performs the following steps with regard to credentials:
(i) Clear pdeath_signal and set dumpable on certain circumstances that
may not be covered by commit_creds().
(ii) Clear any bits in current->personality that were deferred from
(c.i).
(e) install_exec_creds() [compute_creds() as was] is called to install the
new credentials. This performs the following steps with regard to
credentials:
(i) Calls security_bprm_committing_creds() to apply any security
requirements, such as flushing unauthorised files in SELinux, that
must be done before the credentials are changed.
This is made up of bits of security_bprm_apply_creds() and
security_bprm_post_apply_creds(), both of which have been removed.
This function is not allowed to fail; anything that might fail
must have been done in (c.ii).
(ii) Calls commit_creds() to apply the new credentials in a single
assignment (more or less). Possibly pdeath_signal and dumpable
should be part of struct creds.
(iii) Unlocks the task's cred_replace_mutex, thus allowing
PTRACE_ATTACH to take place.
(iv) Clears The bprm->cred pointer as the credentials it was holding
are now immutable.
(v) Calls security_bprm_committed_creds() to apply any security
alterations that must be done after the creds have been changed.
SELinux uses this to flush signals and signal handlers.
(f) If an error occurs before (d.i), bprm_free() will call abort_creds()
to destroy the proposed new credentials and will then unlock
cred_replace_mutex. No changes to the credentials will have been
made.
(2) LSM interface.
A number of functions have been changed, added or removed:
(*) security_bprm_alloc(), ->bprm_alloc_security()
(*) security_bprm_free(), ->bprm_free_security()
Removed in favour of preparing new credentials and modifying those.
(*) security_bprm_apply_creds(), ->bprm_apply_creds()
(*) security_bprm_post_apply_creds(), ->bprm_post_apply_creds()
Removed; split between security_bprm_set_creds(),
security_bprm_committing_creds() and security_bprm_committed_creds().
(*) security_bprm_set(), ->bprm_set_security()
Removed; folded into security_bprm_set_creds().
(*) security_bprm_set_creds(), ->bprm_set_creds()
New. The new credentials in bprm->creds should be checked and set up
as appropriate. bprm->cred_prepared is 0 on the first call, 1 on the
second and subsequent calls.
(*) security_bprm_committing_creds(), ->bprm_committing_creds()
(*) security_bprm_committed_creds(), ->bprm_committed_creds()
New. Apply the security effects of the new credentials. This
includes closing unauthorised files in SELinux. This function may not
fail. When the former is called, the creds haven't yet been applied
to the process; when the latter is called, they have.
The former may access bprm->cred, the latter may not.
(3) SELinux.
SELinux has a number of changes, in addition to those to support the LSM
interface changes mentioned above:
(a) The bprm_security_struct struct has been removed in favour of using
the credentials-under-construction approach.
(c) flush_unauthorized_files() now takes a cred pointer and passes it on
to inode_has_perm(), file_has_perm() and dentry_open().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: James Morris <jmorris@namei.org>
My previous patch to make CONFIG_NUMA on x86_32 depend on BROKEN
turned out to be unnecessary, after all, since the source of the
hibernation vs CONFIG_NUMA problem turned out to be the fact that
we didn't take the NUMA KVA remapping into account in the
hibernation code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crash during hibernation on 32-bit NUMA
The NUMA code on x86_32 creates special memory mapping that allows
each node's pgdat to be located in this node's memory. For this
purpose it allocates a memory area at the end of each node's memory
and maps this area so that it is accessible with virtual addresses
belonging to low memory. As a result, if there is high memory,
these NUMA-allocated areas are physically located in high memory,
although they are mapped to low memory addresses.
Our hibernation code does not take that into account and for this
reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
with high memory present. Fix this by adding a special mapping for
the NUMA-allocated memory areas to the temporary page tables created
during the last phase of resume.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Optimize a bit the function return tracer
This patch changes the calling convention of prepare_ftrace_return to
pass its arguments by register. This will optimize it a bit and
prepare it to support dynamic tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove spinlocks and irq disabling in function return tracer.
I've tried to figure out all of the race condition that could happen
when the tracer pushes or pops a return address trace to/from the
current thread_info.
Theory:
_ One thread can only execute on one cpu at a time. So this code
doesn't need to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an
interrupt. If an interrupt is raised, it will increase the index of
the return stack storage and then execute until the end of the
tracing to finally free the index it used. We don't need to disable
irqs.
This is theorical. In practice, I've tested it with a two-core SMP and
had no problem at all. Perhaps -tip testing could confirm it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: name change of unlikely tracer and profiler
Ingo Molnar suggested changing the config from UNLIKELY_PROFILE
to BRANCH_PROFILING. I never did like the "unlikely" name so I
went one step farther, and renamed all the unlikely configurations
to a "BRANCH" variant.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add debug check
If none of the perfctrs are free when calculating cpu_khz we default to
using ctr 3 (ie, we just choose 3). This may lead to an incorrect tsc
freq value which can cause the system to be unstable.
To aid in future debugging, WARN the user of a potential problem.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
KVM: Fix pit memory leak if unable to allocate irq source id
KVM: ia64: fix vmm_spin_{un}lock for !CONFIG_SMP
KVM: VMX: Set IGMT bit in EPT entry
KVM: Require the PCI subsystem
x86: KVM guest: fix section mismatch warning in kvmclock.c
KVM: ia64: Use guest signal mask when blocking
KVM: MMU: increase per-vcpu rmap cache alloc size
Older versions of gas don't implement the C-style != operator, they
instead want the Pascal-style <> operator. Change != to <> so we
don't break compilation with those old versions of gas.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (47 commits)
ACPI: pci_link: remove acpi_irq_balance_set() interface
fujitsu-laptop: Add DMI callback for Lifebook S6420
ACPI: EC: Don't do transaction from GPE handler in poll mode.
ACPI: EC: lower interrupt storm treshold
ACPICA: Use spinlock for acpi_{en|dis}able_gpe
ACPI: EC: restart failed command
ACPI: EC: wait for last write gpe
ACPI: EC: make kernel messages more useful when GPE storm is detected
ACPI: EC: revert msleep patch
thinkpad_acpi: fingers off backlight if video.ko is serving this functionality
sony-laptop: fingers off backlight if video.ko is serving this functionality
msi-laptop: fingers off backlight if video.ko is serving this functionality
fujitsu-laptop: fingers off backlight if video.ko is serving this functionality
eeepc-laptop: fingers off backlight if video.ko is serving this functionality
compal: fingers off backlight if video.ko is serving this functionality
asus-acpi: fingers off backlight if video.ko is serving this functionality
Acer-WMI: fingers off backlight if video.ko is serving this functionality
ACPI video: if no ACPI backlight support, use vendor drivers
ACPI: video: Ignore devices that aren't present in hardware
Delete an unwanted return statement at evgpe.c
...
Impact: make nmi_shootdown_cpus() callable from preemptible context
We need to know on which CPU we are running on, and we don't want to be
preempted while doing this.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen nmi_shootdown_cpus() availability
The X86_LOCAL_APIC #ifdef was for kdump. For !SMP, the function simply
does nothing.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make nmi_shootdown_cpus() available to the rest of the x86 platform
Now nmi_shootdown_cpus() is ready to be used by non-kdump code also.
Move it to reboot.c.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make API available to the rest of x86 platform code
Add prototype to asm/reboot.h.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend nmi_shootdown_cpus() with a callback
The reboot code will use a different function on crash_nmi_callback().
Adding a function pointer parameter to nmi_shootdown_cpus() for that.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
For the kdump-specific code that was living on nmi_shootdown_cpus().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This variable will be moved to non-kdump-specific code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The NMI CPU-halting code will be used on non-kdump cases, also
(e.g. emergency_reboot when virtualization is enabled).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bootup crash
the branch tracer missed arch/x86/vdso/vclock_gettime.c from
disabling tracing, which caused such bootup crashes:
[ 201.840097] init[1]: segfault at 7fffed3fe7c0 ip 00007fffed3fea2e sp 000077
also clean up the ugly ifdefs in arch/x86/kernel/vsyscall_64.c by
creating DISABLE_UNLIKELY_PROFILE facility for code to turn off
instrumentation on a per file basis.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new unlikely/likely profiler
Andrew Morton recently suggested having an in-kernel way to profile
likely and unlikely macros. This patch achieves that goal.
When configured, every(*) likely and unlikely macro gets a counter attached
to it. When the condition is hit, the hit and misses of that condition
are recorded. These numbers can later be retrieved by:
/debugfs/tracing/profile_likely - All likely markers
/debugfs/tracing/profile_unlikely - All unlikely markers.
# cat /debug/tracing/profile_unlikely | head
correct incorrect % Function File Line
------- --------- - -------- ---- ----
2167 0 0 do_arch_prctl process_64.c 832
0 0 0 do_arch_prctl process_64.c 804
2670 0 0 IS_ERR err.h 34
71230 5693 7 __switch_to process_64.c 673
76919 0 0 __switch_to process_64.c 639
43184 33743 43 __switch_to process_64.c 624
12740 64181 83 __switch_to process_64.c 594
12740 64174 83 __switch_to process_64.c 590
# cat /debug/tracing/profile_unlikely | \
awk '{ if ($3 > 25) print $0; }' |head -20
44963 35259 43 __switch_to process_64.c 624
12762 67454 84 __switch_to process_64.c 594
12762 67447 84 __switch_to process_64.c 590
1478 595 28 syscall_get_error syscall.h 51
0 2821 100 syscall_trace_leave ptrace.c 1567
0 1 100 native_smp_prepare_cpus smpboot.c 1237
86338 265881 75 calc_delta_fair sched_fair.c 408
210410 108540 34 calc_delta_mine sched.c 1267
0 54550 100 sched_info_queued sched_stats.h 222
51899 66435 56 pick_next_task_fair sched_fair.c 1422
6 10 62 yield_task_fair sched_fair.c 982
7325 2692 26 rt_policy sched.c 144
0 1270 100 pre_schedule_rt sched_rt.c 1261
1268 48073 97 pick_next_task_rt sched_rt.c 884
0 45181 100 sched_info_dequeued sched_stats.h 177
0 15 100 sched_move_task sched.c 8700
0 15 100 sched_move_task sched.c 8690
53167 33217 38 schedule sched.c 4457
0 80208 100 sched_info_switch sched_stats.h 270
30585 49631 61 context_switch sched.c 2619
# cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }'
39900 36577 47 pick_next_task sched.c 4397
20824 15233 42 switch_mm mmu_context_64.h 18
0 7 100 __cancel_work_timer workqueue.c 560
617 66484 99 clocksource_adjust timekeeping.c 456
0 346340 100 audit_syscall_exit auditsc.c 1570
38 347350 99 audit_get_context auditsc.c 732
0 345244 100 audit_syscall_entry auditsc.c 1541
38 1017 96 audit_free auditsc.c 1446
0 1090 100 audit_alloc auditsc.c 862
2618 1090 29 audit_alloc auditsc.c 858
0 6 100 move_masked_irq migration.c 9
1 198 99 probe_sched_wakeup trace_sched_switch.c 58
2 2 50 probe_wakeup trace_sched_wakeup.c 227
0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144
4514 2090 31 __grab_cache_page filemap.c 2149
12882 228786 94 mapping_unevictable pagemap.h 50
4 11 73 __flush_cpu_slab slub.c 1466
627757 330451 34 slab_free slub.c 1731
2959 61245 95 dentry_lru_del_init dcache.c 153
946 1217 56 load_elf_binary binfmt_elf.c 904
102 82 44 disk_put_part genhd.h 206
1 1 50 dst_gc_task dst.c 82
0 19 100 tcp_mss_split_point tcp_output.c 1126
As you can see by the above, there's a bit of work to do in rethinking
the use of some unlikelys and likelys. Note: the unlikely case had 71 hits
that were more than 25%.
Note: After submitting my first version of this patch, Andrew Morton
showed me a version written by Daniel Walker, where I picked up
the following ideas from:
1) Using __builtin_constant_p to avoid profiling fixed values.
2) Using __FILE__ instead of instruction pointers.
3) Using the preprocessor to stop all profiling of likely
annotations from vsyscall_64.c.
Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar
for their feed back on this patch.
(*) Not ever unlikely is recorded, those that are used by vsyscalls
(a few of them) had to have profiling disabled.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This removes the acpi_irq_balance_set() interface from the PCI
interrupt link driver.
x86 used acpi_irq_balance_set() to tell the PCI interrupt link
driver to configure links to minimize IRQ sharing. But the link
driver can easily figure out whether to turn on IRQ balancing
based on the IRQ model (PIC/IOAPIC/etc), so we can get rid of
that external interface.
It's better for the driver to figure this out at init-time. If
we set it externally via the x86 code, the interface reduces
modularity, and we depend on the fact that acpi_process_madt()
happens before we process the kernel command line.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Changes reboot behavior.
If port CF9 seems to be safe to touch, attempt it before trying the
keyboard controller. Port CF9 is not available on all chipsets (a
significant but decreasing number of modern chipsets don't implement
it), but port CF9 itself should in general be safe to poke (no ill
effects if unimplemented) on any system which has PCI Configuration
Method #1 or #2, as it falls inside the PCI configuration port range
in both cases. No chipset without PCI is known to have port CF9,
either, although an explicit "pci=bios" would mean we miss this and
therefore don't use port CF9. An explicit "reboot=pci" can be used to
force the use of port CF9.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move the IRQ stub generation to assembly to simplify it and for
consistency with 32 bits. Doing it in a C file with asm() statements
doesn't help clarity, and it prevents some optimizations.
Shrink the IRQ stubs down to just over four bytes per (we fit seven
into a 32-byte chunk.) This shrinks the total icache consumption of
the IRQ stubs down to an even kilobyte, if all of them are in active
use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips.
To get the most effect, cache-align the IRQ stubs.
This makes the 64-bit code match changes already done to the 32-bit
code, and should open up irqinit*.c for unification.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Shrink the IRQ stubs on 32 bits down to just over four bytes per (we
fit seven into a 32-byte chunk.) This shrinks the total icache
consumption of the IRQ stubs down to an even kilobyte, if all of them
are in active use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips (the exception being
486/Elan/Geode which have only 16-byte cachelines, but are unlikely to
have too many interrupt sources.)
To get the most effect, cache-align the IRQ stubs.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Don't generate interrupt stubs for interrupt vectors below
FIRST_EXTERNAL_VECTOR, and make the table of interrupt vectors
(interrupt[]) __initconst. Both of these changes both conserve memory
and improve consistency with 64 bits.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There is a potential issue that, when guest using pagetable without vmexit when
EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
which would be inconsistent with host side and would cause host MCE due to
inconsistent cache attribute.
The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
memory type to protect host (notice that all memory mapped by KVM should be WB).
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
WARNING: arch/x86/kernel/built-in.o(.text+0x1722c): Section mismatch
in reference from the function kvm_setup_secondary_clock() to the
function .devinit.text:setup_secondary_APIC_clock()
The function kvm_setup_secondary_clock() references
the function __devinit setup_secondary_APIC_clock().
This is often because kvm_setup_secondary_clock lacks a __devinit
annotation or the annotation of setup_secondary_APIC_clock is wrong.
Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context
nohz: disable tick_nohz_kick_tick() for now
irq: call __irq_enter() before calling the tick_idle_check
x86: HPET: enter hpet_interrupt_handler with interrupts disabled
x86: HPET: read from HPET_Tn_CMP() not HPET_T0_CMP
x86: HPET: convert WARN_ON to WARN_ON_ONCE
The page fault path can use two rmap_desc structures, if:
- walk_addr's dirty pte update allocates one rmap_desc.
- mmu_lock is dropped, sptes are zapped resulting in rmap_desc being
freed.
- fetch->mmu_set_spte allocates another rmap_desc.
Increase to 4 for safety.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: documentation update
Chris Snook pointed out that it's Core i7, not Core 7i.
Reported-by: Chris Snook <csnook@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid spurious lapic timer events on shutdown
The apic timer might be close to firing when it is shutdown. We can
not really disable the timer - we just mask the interrupt. That way we
can get an extra interrupt when it is reenabled. Set the counter to
max on shutdown to avoid this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: really halt all CPUs on halt
Function machine_halt (resp. native_machine_halt) is empty for x86
architectures. When command 'halt -f' is invoked, the message "System
halted." is displayed but this is not really true because all CPUs are
still running.
There are also similar inconsistencies for other arches (some uses
power-off for halt or forever-loop with IRQs enabled/disabled).
IMO there should be used the same approach for all architectures OR
what does the message "System halted" really mean?
This patch fixes it for x86.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build/boot fix for x86/Voyager
This change:
| commit 3d44223327
| Author: Jens Axboe <jens.axboe@oracle.com>
| Date: Thu Jun 26 11:21:34 2008 +0200
|
| Add generic helpers for arch IPI function calls
didn't wire up the voyager smp call function correctly, so do that
here. Also make CONFIG_USE_GENERIC_SMP_HELPERS a def_bool y again,
since we now use the generic helpers for every x86 architecture.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <Jens.Axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/ftrace.c: In function 'ftrace_return_to_handler':
arch/x86/kernel/ftrace.c:112: error: implicit declaration of function 'cpu_clock'
cpu_clock() is implicitly included via a number of ways, but its real
location is sched.h. (Build failure is triggerable if enough other
kernel components are turned off.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix double entry creation in /proc
There is a collision between two UV functions:
both uv_ptc_init() and gru_proc_init() try to make /proc/sgi_uv
So move it's creation to a single place: uv_system_init()
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/ftrace.c: Assembler messages:
arch/x86/kernel/ftrace.c:140: Error: missing ')'
arch/x86/kernel/ftrace.c:140: Error: junk `(%ebp))' after expression
arch/x86/kernel/ftrace.c:141: Error: missing ')'
arch/x86/kernel/ftrace.c:141: Error: junk `(%ebp))' after expression
the [parent_replaced] is used in an =rm fashion, so that constraint
is correct in isolation - but [parent_old] aliases register %0 and uses
it in an addressing mode that is only valid with registers - so change
the constraint from =rm to =r.
This fixes the build failure.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add infrastructure for function-return tracing
Add low level support for ftrace return tracing.
This plug-in stores return addresses on the thread_info structure of
the current task.
The index of the current return address is initialized when the task
is the first one (init) and when a process forks (the child). It is
not needed when a task does a sys_execve because after this syscall,
it still needs to return on the kernel functions it called.
Note that the code of return_to_handler has been suggested by Steven
Rostedt as almost all of the ideas of improvements in this V3.
For purpose of security, arch/x86/kernel/process_32.c is not traced
because __switch_to() changes the current task during its execution.
That could cause inconsistency in the stored return address of this
function even if I didn't have any crash after testing with tracing on
this function enabled.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, change .config option name
We had this ugly config name for a long time for hysteric raisons.
Rename it to a saner name.
We still cannot get rid of it completely, until /proc/<pid>/stack
usage replaces WCHAN usage for good.
We'll be able to do that in the v2.6.29/v2.6.30 timeframe.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While investigating the failure of hibernation on 32-bit x86 with
CONFIG_NUMA set, as described in this message
http://marc.info/?l=linux-kernel&m=122634118116226&w=4
I asked some people for help and I was told that it wasn't really
worth the effort, because CONFIG_NUMA was generally broken on 32-bit
x86 systems and it shouldn't be used in such configs. For this
reason, make CONFIG_NUMA depend on BROKEN instead of EXPERIMENTAL on
x86-32.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pavel Machek <pavel@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some functions that may be called from this handler require that
interrupts are disabled. Also, combining IRQF_DISABLED and
IRQF_SHARED does not reliably disable interrupts in a handler, so
remove IRQF_SHARED from the irq flags (this irq is not shared anyway).
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Cc: "Will Newton" <will.newton@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In hpet_next_event() we check that the value we just wrote to
HPET_Tn_CMP(timer) has reached the chip. Currently, we're checking that
the value we wrote to HPET_Tn_CMP(timer) is in HPET_T0_CMP, which, if
timer is anything other than timer 0, is likely to fail.
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It is possible to flood the console with call traces if the WARN_ON
condition is true because of the frequency with which this function is
called.
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup
It saves us some source lines and shift the code a bit righter.
And a multiline comment style is fixed too :-)
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
lapic_timer_setup is self-protected with local_irq_save/restore
no need to use them in caller and levt is the per-cpu variable so
no concurrent access from another cpu.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen BTS/PEBS ptrace enablement to more CPU models
Move BTS initialisation out of an #ifdef CONFIG_X86_64 guard.
Assume core2 BTS and DS layout for future models of family 6 processors.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Arches that have their own irq_regs definition are expected to
define ARCH_HAS_OWN_IRQ_REGS or else a generic (unused) set
will also be defined in lib/irq_regs.c
Sparse noticed the unused generic one had no prototype:
lib/irq_regs.c:15:1: warning: symbol 'per_cpu____irq_regs' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
setup_ioapic_dest() is called after the non boot cpus have been
brought up. It sets the irq affinity of all already configured
interrupts to all cpus and ignores affinity settings which were
done by the early bootup code.
If the IRQ_NO_BALANCING or IRQ_AFFINITY_SET flags are set then use the
affinity mask from the irq descriptor and not TARGET_CPUS.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove unused variable
I forgot to remove the now unused "cycles_t cycles" parameter from
vget_cycles() - which triggers build warnings as tsc.h is included
in a number of files.
Remove it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
a new file was accidentally added to include/asm-x86;
move it to the new arch/x86/include/asm location
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Impact: cleanup
Move rdtsc_barrier() use to vsyscall_64.c where it's relied on,
and point out its role in the context of its use.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: optimize sched_clock() a bit
sched: improve sched_clock() performance
sched_clock() uses cycles_2_ns() needlessly - which is an irq-disabling
variant of __cycles_2_ns().
Most of the time sched_clock() is called with irqs disabled already.
The few places that call it with irqs enabled need to be updated.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
in scheduler-intense workloads native_read_tsc() overhead accounts for
20% of the system overhead:
659567 system_call 41222.9375
686796 schedule 435.7843
718382 __switch_to 665.1685
823875 switch_mm 4526.7857
1883122 native_read_tsc 55385.9412
9761990 total 2.8468
this is large part due to the rdtsc_barrier() that is done before
and after reading the TSC.
But sched_clock() is not a precise clock in the GTOD sense, using such
barriers is completely pointless. So remove the barriers and only use
them in vget_cycles().
This improves lat_ctx performance by about 5%.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, xen: fix use of pgd_page now that it really does return a page
Fix the counter overflow check for CPUs with counter width > 32
I had a similar change in a different patch that I didn't submit
and I didn't notice the problem earlier because it was always
tested together.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Xen requires that all mappings of pagetable pages are read-only, so
that they can't be updated illegally. As a result, if a page is being
turned into a pagetable page, we need to make sure all its mappings
are RO.
If the page had been used for ioremap or vmalloc, it may still have
left over mappings as a result of not having been lazily unmapped.
This change makes sure we explicitly mop them all up before pinning
the page.
Unlike aliases created by kmap, the there can be vmalloc aliases even
for non-high pages, so we must do the flush unconditionally.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "x86: default to reboot via ACPI"
x86: align DirectMap in /proc/meminfo
AMD IOMMU: fix lazy IO/TLB flushing in unmap path
x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
x86: remove VISWS and PARAVIRT around NR_IRQS puzzle
x86: mention ACPI in top-level Kconfig menu
x86: size NR_IRQS on 32-bit systems the same way as 64-bit
x86: don't allow nr_irqs > NR_IRQS
x86/docs: remove noirqbalance param docs
x86: don't use tsc_khz to calculate lpj if notsc is passed
x86, voyager: fix smp_intr_init() compile breakage
AMD IOMMU: fix detection of NP capable IOMMUs
Impact: fix 32-bit Xen guest boot crash
On 32-bit PAE, pud_page, for no good reason, didn't really return a
struct page *. Since Jan Beulich's fix "i386/PAE: fix pud_page()",
pud_page does return a struct page *.
Because PAE has 3 pagetable levels, the pud level is folded into the
pgd level, so pgd_page() is the same as pud_page(), and now returns
a struct page *. Update the xen/mmu.c code which uses pgd_page()
accordingly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enable the wchan config menu item for now on x86-64 arch? This will
at least allow people to enable/disable frame pointers on scheduler
functions.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit c7ffa6c262.
the assumptio of this change was that this would not break
any existing machine. Andrey Borzenkov reported troubles with
the ACPI reboot method: the system would hang on reboot, necessiating
a power cycle. Probably more systems are affected as well.
Also, there are patches queued up for v2.6.29 to disable virtualization
on emergency_restart() - which was the original motivation of
this change.
Reported-by: Andrey Borzenkov <arvidjaar@mail.ru>
Bisected-by: Andrey Borzenkov <arvidjaar@mail.ru>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: right-align /proc/meminfo consistent with other fields
When the split-LRU patches added Inactive(anon) and Inactive(file) lines
to /proc/meminfo, all counts were moved two columns rightwards to fit in.
Now move x86's DirectMap lines two columns rightwards to line up.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Lazy flushing needs to take care of the unmap path too which is not yet
implemented and leads to stale IO/TLB entries. This is fixed by this
patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: clarify/update CONFIG_NUMA text
CONFIG_NUMA description talk about a bit old thing.
So, following changes are better.
o CONFIG_NUMA is no longer EXPERIMENTAL
o Opteron is not the only processor of NUMA topology on x86_64 no longer,
but also Intel Core7i has it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix rare x2apic hang
On x86, x2apic mode accesses for sending IPI's don't have serializing
semantics. If the IPI receivner refers(in lock-free fashion) to some
memory setup by the sender, the need for smp_mb() before sending the
IPI becomes critical in x2apic mode.
Add the smp_mb() in native_flush_tlb_others() before sending the IPI.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix warning message when PARAVIRT is set in config
Remove stale #ifdef components from our IRQ sizing logic.
x86/Voyager is the only holdout.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clarify menuconfig text
Mention ACPI in the top-level menu to give a clue as to where
it lives. This matches what ia64 does.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
set fpstate field of signal context at setup_sigcontext().
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: quick start and stop of function tracer
This patch adds a way to disable the function tracer quickly without
the need to run kstop_machine. It adds a new variable called
function_trace_stop which will stop the calls to functions from mcount
when set. This is just an on/off switch and does not handle recursion
like preempt_disable().
It's main purpose is to help other tracers/debuggers start and stop tracing
fuctions without the need to call kstop_machine.
The config option HAVE_FUNCTION_TRACE_MCOUNT_TEST is added for archs
that implement the testing of the function_trace_stop in the mcount
arch dependent code. Otherwise, the test is done in the C code.
x86 is the only arch at the moment that supports this.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make NR_IRQS big enough for system with lots of apic/pins
If lots of IO_APIC's are there (or can be there), size the same way
as 64-bit, depending on MAX_IO_APICS and NR_CPUS.
This fixes the boot problem reported by Ben Hutchings on a 32-bit
server with 5 IO-APICs and 240 IO-APIC pins.
Signed-off-by: Yinghai <yinghai@kernel.org>
Tested-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot hang on 32-bit systems with more than 224 IO-APIC pins
On some 32-bit systems with a lot of IO-APICs probe_nr_irqs() can
return a value larger than NR_IRQS. This will lead to probe_irq_on()
overrunning the irq_desc array.
I hit this when running net-next-2.6 (close to 2.6.28-rc3) on a
Supermicro dual Xeon system. NR_IRQS is 224 but probe_nr_irqs() detects
5 IOAPICs and returns 240. Here are the log messages:
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
Tue Nov 4 16:53:47 2008 IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x02] address[0xfec81000] gsi_base[24])
Tue Nov 4 16:53:47 2008 IOAPIC[1]: apic_id 2, version 32, address 0xfec81000, GSI 24-47
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x03] address[0xfec81400] gsi_base[48])
Tue Nov 4 16:53:47 2008 IOAPIC[2]: apic_id 3, version 32, address 0xfec81400, GSI 48-71
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x04] address[0xfec82000] gsi_base[72])
Tue Nov 4 16:53:47 2008 IOAPIC[3]: apic_id 4, version 32, address 0xfec82000, GSI 72-95
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x05] address[0xfec82400] gsi_base[96])
Tue Nov 4 16:53:47 2008 IOAPIC[4]: apic_id 5, version 32, address 0xfec82400, GSI 96-119
Tue Nov 4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Tue Nov 4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Tue Nov 4 16:53:47 2008 Enabling APIC mode: Flat. Using 5 I/O APICs
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Minor optimization.
Implement change_bit with immediate bit count as "lock xorb". This is
similar to "lock orb" and "lock andb" for set_bit and clear_bit
functions.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.
Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)
lat_ctx on a NUMA box is particularly happy about this change:
before:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.60
| 2 5.70
after:
| phoenix:~/l> ./lat_ctx -s 0 2
| "size=0k ovr=2.65
| 2 2.07
a 2.75x speedup.
pipe-test is similarly happy about it too:
| phoenix:~/sched-tests> ./pipe-test
| 18.26 usecs/loop.
| 14.70 usecs/loop.
| 14.38 usecs/loop.
| 10.55 usecs/loop. # +WAKE_AFFINE on domain0+domain1
| 8.63 usecs/loop.
| 8.59 usecs/loop.
| 9.03 usecs/loop.
| 8.94 usecs/loop.
| 8.96 usecs/loop.
| 8.63 usecs/loop.
Also:
- disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
- enable SD_WAKE_BALANCE on SMP domains
Sysbench+postgresql improves all around the board, quite significantly:
.28-rc3-11474e2c .28-rc3-11474e2c-tune
-------------------------------------------------
1: 571 688 +17.08%
2: 1236 1206 -2.55%
4: 2381 2642 +9.89%
8: 4958 5164 +3.99%
16: 9580 9574 -0.07%
32: 7128 8118 +12.20%
64: 7342 8266 +11.18%
128: 7342 8064 +8.95%
256: 7519 7884 +4.62%
512: 7350 7731 +4.93%
-------------------------------------------------
SUM: 55412 59341 +6.62%
So it's a win both for the runup portion, the peak area and the tail.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Should permit VMware detection on older platforms where the
vendor is changed. Could theoretically cause a regression if some
weird serial number scheme contains the string "VMware" by pure
chance. Seems unlikely, especially with the mixed case.
In some user configured cases, VMware may choose not to put a VMware specific
DMI string, but the product serial key is always there and is VMware specific.
Add a interface to check the serial key, when checking for VMware in the DMI
information.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: do not do function-tracing in the early-printk code
this is useful when earlyprintk=vga,keep is used to debug tracer
plugins.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix udelay when "notsc" boot parameter is passed
With notsc passed on commandline, tsc may not be used for
udelays, make sure that we do not use tsc_khz to calculate
the lpj value in such cases.
Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix possible failure to calibrate the TSC on Vmware near 4 GHz
The current version of the code to get the tsc frequency from
the VMware hypervisor, will be broken on processor with frequency
(4G-1) HZ, because on such processors eax will have UINT_MAX
and that would be legitimate.
We instead check that EBX did change to decide if we were able to
read the frequency from the hypervisor.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'io-mappings-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
io mapping: clean up #ifdefs
io mapping: improve documentation
i915: use io-mapping interfaces instead of a variety of mapping kludges
resources: add io-mapping functions to dynamically map large device apertures
x86: add iomap_atomic*()/iounmap_atomic() on 32-bit using fixmaps
Impact: cleanup
clean up ifdefs: change #ifdef CONFIG_X86_32/64 to
CONFIG_HAVE_ATOMIC_IOMAP.
flip around the #ifdef sections to clean up the structure.
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix for non-ftrace architectures
Not all archs implement ftrace, and therefore do not have an asm/ftrace.h.
This patch corrects the problem.
The ftrace_nmi_enter/exit now must be defined for all archs that implement
dynamic ftrace. Currently, only x86 does.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86/Voyager build
Looks like this became static on the rest of x86. Fix it up by adding
an external definition to mach-voyager/setup.c
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enable CONFIG_MEMORY_HOTREMOVE feature on x86. (default-off)
Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels.
This patch makes it possible to configure the memory hotremove
functionality into the x86 kernel as well.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Changes timekeeping on Vmware (or with tsc=reliable).
This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag.
We add a tsc=reliable commandline option to enable this.
This enables legacy hardware without HPET, LAPIC, or ACPI timers
to enter high-resolution timer mode.
Along with that have extended this to be used in virtualization environement
too. Now we also set this flag if the X86_FEATURE_TSC_RELIABLE bit is set.
This is important since there is a wrap-around problem with the acpi_pm timer.
The acpi_pm counter is just 24bits and this can overflow in ~4 seconds. With
the NO_HZ kernels in virtualized environment, there can be situations when
the guest is descheduled for longer duration, as a result we may miss the wrap
of the acpi counter. When TSC is used as a clocksource and acpi_pm timer is
being used as the watchdog clocksource this error in acpi_pm results in TSC
being marked as unstable, and essentially results in time dropping in chunks
of 4 seconds whenever this wrap is missed. Since the virtualized TSC is
reliable on VMware, we should always use the TSCs clocksource on VMware, so
we skip the verfication at runtime, by checking for the feature bit.
Since we reset the flag for mgeode systems too, i have combined
the mgeode case with the feature bit check.
Signed-off-by: Jeff Hansen <jhansen@cardaccess-inc.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
Use the synthetic TSC_RELIABLE bit to workaround virtualization anomalies.
Virtual TSCs can be kept nearly in sync, but because the virtual TSC
offset is set by software, it's not perfect. So, the TSC
synchronization test can fail. Even then the TSC can be used as a
clocksource since the VMware platform exports a reliable TSC to the
guest for timekeeping purposes. Use this bit to check if we need to
skip the TSC sync checks.
Along with this also set the CONSTANT_TSC bit when on VMware, since we
still want to use TSC as clocksource on VM running over hardware which
has unsynchronized TSC's (opteron's), since the hypervisor will take
care of providing consistent TSC to the guest.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
v3->v2 : Abstract the hypervisor detection and feature (tsc_freq) request
behind a hypervisor.c file
v2->v1 : Add a x86_hyper_vendor field to the cpuinfo_x86 structure.
This avoids multiple calls to the hypervisor detection function.
This patch adds function to detect if we are running under VMware.
The current way to check if we are on VMware is following,
# check if "hypervisor present bit" is set, if so read the 0x40000000
cpuid leaf and check for "VMwareVMware" signature.
# if the above fails, check the DMI vendors name for "VMware" string
if we find one we query the VMware hypervisor port to check if we are
under VMware.
The DMI + "VMware hypervisor port check" is needed for older VMware products,
which don't implement the hypervisor signature cpuid leaf.
Also note that since we are checking for the DMI signature the hypervisor
port should never be accessed on native hardware.
This patch also adds a hypervisor_get_tsc_freq function, instead of
calibrating the frequency which can be error prone in virtualized
environment, we ask the hypervisor for it. We get the frequency from
the hypervisor by accessing the hypervisor port if we are running on VMware.
Other hypervisors too can add code to the generic routine to get frequency on
their platform.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix AMDC1E and XTOPOLOGY conflict in cpufeature
x86: build fix
This makes the late e820 resources use 'insert_resource_expand_to_fit()'
instead of doing a 'reserve_region_with_split()', and also avoids
marking them as IORESOURCE_BUSY.
This results in us being perfectly happy to use pre-existing PCI
resources even if they were marked as being in a reserved region, while
still avoiding any _new_ allocations in the reserved regions. It also
makes for a simpler and more accurate resource tree.
Example resource allocation from Jonathan Corbet, who has firmware that
has an e820 reserved entry that covered a big range (e0000000-fed003ff),
and that had various PCI resources in it set up by firmware.
With old kernels, the reserved range would force us to re-allocate all
pre-existing PCI resources, and his reserved range would end up looking
like this:
e0000000-fed003ff : reserved
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
where only the pre-allocated special regions (IOAPIC and HPET) were kept
around.
With 2.6.28-rc2, which uses 'reserve_region_with_split()', Jonathan's
resource tree looked like this:
e0000000-fe7fffff : reserved
fe800000-fe8fffff : PCI Bus 0000:01
fe800000-fe8fffff : reserved
fe900000-fe9d9aff : reserved
fe9d9b00-fe9d9bff : 0000:00:1f.3
fe9d9b00-fe9d9bff : reserved
fe9d9c00-fe9d9fff : 0000:00:1a.7
fe9d9c00-fe9d9fff : reserved
fe9da000-fe9dafff : 0000:00:03.3
fe9da000-fe9dafff : reserved
fe9db000-fe9dbfff : 0000:00:19.0
fe9db000-fe9dbfff : reserved
fe9dc000-fe9dffff : 0000:00:1b.0
fe9dc000-fe9dffff : reserved
fe9e0000-fe9fffff : 0000:00:19.0
fe9e0000-fe9fffff : reserved
fea00000-fea7ffff : 0000:00:02.0
fea00000-fea7ffff : reserved
fea80000-feafffff : 0000:00:02.1
fea80000-feafffff : reserved
feb00000-febfffff : 0000:00:02.0
feb00000-febfffff : reserved
fec00000-fed003ff : reserved
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
and because the reserved entry had been split and moved into the
individual resources, and because it used the IORESOURCE_BUSY flag, the
drivers that actually wanted to _use_ those resources couldn't actually
attach to them:
e1000e 0000:00:19.0: BAR 0: can't reserve mem region [0xfe9e0000-0xfe9fffff]
HDA Intel 0000:00:1b.0: BAR 0: can't reserve mem region [0xfe9dc000-0xfe9dffff]
with this patch, the resource tree instead becomes
e0000000-fed003ff : reserved
fe800000-fe8fffff : PCI Bus 0000:01
fe9d9b00-fe9d9bff : 0000:00:1f.3
fe9d9c00-fe9d9fff : 0000:00:1a.7
fe9d9c00-fe9d9fff : ehci_hcd
fe9da000-fe9dafff : 0000:00:03.3
fe9db000-fe9dbfff : 0000:00:19.0
fe9db000-fe9dbfff : e1000e
fe9dc000-fe9dffff : 0000:00:1b.0
fe9dc000-fe9dffff : ICH HD audio
fe9e0000-fe9fffff : 0000:00:19.0
fe9e0000-fe9fffff : e1000e
fea00000-fea7ffff : 0000:00:02.0
fea80000-feafffff : 0000:00:02.1
feb00000-febfffff : 0000:00:02.0
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
ie the one reserved region now ends up surrounding all the PCI resources
that were allocated inside of it by firmware, and because it is not
marked BUSY, drivers have no problem attaching to the pre-allocated
resources.
Reported-and-tested-by: Jonathan Corbet <corbet@lwn.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robert Hancock <hancockr@shaw.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: None, bit reservation only
Add a synthetic TSC_RELIABLE feature bit which will be used to mark
TSC as reliable so that we could skip all the runtime checks for
TSC stablity, which have false positives in virtual environment.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: simplify implementation, cleanup
If !(pgd_val(*pgd) & _PAGE_PRESENT) in PAE mode, we need not get value of
pmd_table again.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch cleans up the NMI safe code for dynamic ftrace as suggested
by Andrew Morton.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce new APIs, separate kmap code from CONFIG_HIGHMEM
This takes the code used for CONFIG_HIGHMEM memory mappings except that
it's designed for dynamic IO resource mapping.
These fixmaps are available even with CONFIG_HIGHMEM turned off.
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the kexec bootstrap code implementation from assembly to C
This patch transforms the kexec page tables setup code from assembler
code to C code in machine_kexec_prepare. This improves readability and
reduces code line number.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save .text size when kexec is built in but not loaded
This patch adds an architecture specific struct kimage_arch into
struct kimage. The pointers to page table pages used by kexec are
added to struct kimage_arch. The page tables pages are dynamically
allocated in machine_kexec_prepare instead of statically from BSS
segment. This will save up to 20k memory when kexec image is not
loaded.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save kernel .text by loosening kexec page alignment
This patch removes PAGE_SIZE alignment from relocate_kernel(). Before
kexec jump patches are merged, control page is mapped to
relocate_kernel in kexec page tables, so relocate_kernel must be
PAGE_SIZE aligned. Now, control page is mapped to identity mapped
address, so relocate_kernel need not to be PAGE_SIZE aligned any
more. This can reduce a few KB from kernel text segement.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on certain UP configs
fix:
arch/x86/kernel/cpu/common.c: In function 'cpu_init':
arch/x86/kernel/cpu/common.c:1141: error: 'boot_cpu_id' undeclared (first use in this function)
arch/x86/kernel/cpu/common.c:1141: error: (Each undeclared identifier is reported only once
arch/x86/kernel/cpu/common.c:1141: error: for each function it appears in.)
Pull in asm/smp.h on UP, so that we get the definition of
boot_cpu_id.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
lguest: fix irq vectors.
lguest: fix early_ioremap.
lguest: fix example launcher compile after moved asm-x86 dir.
do_IRQ: cannot handle IRQ -1 vector 0x20 cpu 0
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/irq_32.c:219!
We're not ISA: we have a 1:1 mapping from vectors to irqs.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
dmi_scan_machine breaks under lguest:
lguest: unhandled trap 14 at 0xc04edeae (0xffa00000)
This is because we use current_cr3 for the read_cr3() paravirt
function, and it isn't set until the first cr3 change. We got away
with it until this happened.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
fix:
arch/x86/kernel/cpu/common.c: In function 'early_identify_cpu':
arch/x86/kernel/cpu/common.c:553: error: 'struct cpuinfo_x86' has no member named 'cpu_index'
as cpu_index is only available on SMP.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix /proc/cpuinfo output on x86/Voyager
Ever since
| commit 92cb7612ae
| Author: Mike Travis <travis@sgi.com>
| Date: Fri Oct 19 20:35:04 2007 +0200
|
| x86: convert cpuinfo_x86 array to a per_cpu array
We've had an extra field in cpuinfo_x86 which is cpu_index.
Unfortunately, voyager has never initialised this, although the only
noticeable impact seems to be that /proc/cpuinfo shows all zeros for
the processor ids.
Anyway, fix this by initialising the boot CPU properly and setting the
index when the secondaries update.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on x86/Voyager
Given commits like this:
| Author: Suresh Siddha <suresh.b.siddha@intel.com>
| Date: Tue Jul 29 10:29:19 2008 -0700
|
| x86, xsave: enable xsave/xrstor on cpus with xsave support
Which deliberately expose boot cpu dependence to pieces of the system,
I think it's time to explicitly have a variable for it to prevent this
continual misassumption that the boot CPU is zero.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow /dev/mem mmaps on non-PAT CPUs/platforms
Fix mmap to /dev/mem when CONFIG_X86_PAT is off and CONFIG_STRICT_DEVMEM is
off
mmap to /dev/mem on kernel memory has been failing since the
introduction of PAT (CONFIG_STRICT_DEVMEM=n case). Seems like
the check to avoid cache aliasing with PAT is kicking in even
when PAT is disabled. The bug seems to have crept in 2.6.26.
This patch makes sure that the mmap to regular
kernel memory succeeds if CONFIG_STRICT_DEVMEM=n and
PAT is disabled, and the checks to avoid cache aliasing
still happens if PAT is enabled.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Tested-by: Tim Sirianni <tim@scalemp.com>
Cc: <stable@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build failure on x86/Voyager
Before:
| commit 329513a35d
| Author: Yinghai Lu <yhlu.kernel@gmail.com>
| Date: Wed Jul 2 18:54:40 2008 -0700
|
| x86: move prefill_possible_map calling early
prefill_possible_mask() was hidden under CONFIG_HOTPLUG_CPU rendering
it invisitble to voyager. Since this commit it's exposed, but not
provided by the voyager subarch, so add a dummy stub to fix the link
breakage.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86/Voyager boot
CONFIG_SMP is used for features which work on *all* x86 boxes.
CONFIG_X86_SMP is used for standard PC like x86 boxes (for things like
multi core and apics)
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: boot up secondary CPUs as well on x86/Voyager systems
This commit:
| commit 3e9704739d
| Author: Glauber Costa <gcosta@redhat.com>
| Date: Wed May 28 13:01:54 2008 -0300
|
| x86: boot secondary cpus through initial_code
removed the use of initialize_secondary. However, it didn't update
voyager, so the secondary cpus no longer boot. Fix this by adding the
initial_code switch to voyager as well.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add more debug info to /debugfs/tracing/dyn_ftrace_total_info
This patch adds dynamic ftrace NMI update statistics to the
/debugfs/tracing/dyn_ftrace_total_info stat file.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crashes that can occur in NMI handlers, if their code is modified
Modifying code is something that needs special care. On SMP boxes,
if code that is being modified is also being executed on another CPU,
that CPU will have undefined results.
The dynamic ftrace uses kstop_machine to make the system act like a
uniprocessor system. But this does not address NMIs, that can still
run on other CPUs.
One approach to handle this is to make all code that are used by NMIs
not be traced. But NMIs can call notifiers that spread throughout the
kernel and this will be very hard to maintain, and the chance of missing
a function is very high.
The approach that this patch takes is to have the NMIs modify the code
if the modification is taking place. The way this works is that just
writing to code executing on another CPU is not harmful if what is
written is the same as what exists.
Two buffers are used: an IP buffer and a "code" buffer.
The steps that the patcher takes are:
1) Put in the instruction pointer into the IP buffer
and the new code into the "code" buffer.
2) Set a flag that says we are modifying code
3) Wait for any running NMIs to finish.
4) Write the code
5) clear the flag.
6) Wait for any running NMIs to finish.
If an NMI is executed, it will also write the pending code.
Multiple writes are OK, because what is being written is the same.
Then the patcher must wait for all running NMIs to finish before
going to the next line that must be patched.
This is basically the RCU approach to code modification.
Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
for his guidence on what is safe and what is not.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: include file dependency cleanup
Fix compile errors of files that include asm/uv/uv_hub.h but do
not include linux/timer.h.
[ such files are not mainline right now. ]
Signed-of-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce nmi_watchdog=lapic and nmi_watchdog=ioapic aliases
Add sensible names as "lapic" and "ioapic" to
nmi_watchdog boot parameter. Sometimes it is not
that easy to recall what exactly nmi_watchdog=1
does mean so we allow the using of symbolic names here.
Old numeric values remain valid.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To the unsuspecting user it is quite annoying that this broken and
inconsistent with x86-64 definition still exists.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove incorrect WARN_ON(1)
Gets rid of dmesg spam created during physical memory hot-add which
will very likely confuse users. The change removes what appears to
be debugging code which I assume was unintentionally included in:
x86: arch/x86/mm/init_64.c printk fixes
commit 10f22dde55
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: some new sparse warnings in e820.c etc, but no functional change.
As with regular ioremap, iounmap etc, annotate with __iomem.
Fixes the following sparse warnings, will produce some new ones
elsewhere in arch/x86 that will get worked out over time.
arch/x86/mm/ioremap.c:402:9: warning: cast removes address space of expression
arch/x86/mm/ioremap.c:406:10: warning: cast adds address space to expression (<asn:2>)
arch/x86/mm/ioremap.c:782:19: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
ftrace: fix current_tracer error return
tracing: fix a build error on alpha
ftrace: use a real variable for ftrace_nop in x86
tracing/ftrace: make boot tracer select the sched_switch tracer
tracepoint: check if the probe has been registered
asm-generic: define DIE_OOPS in asm-generic
trace: fix printk warning for u64
ftrace: warning in kernel/trace/ftrace.c
ftrace: fix build failure
ftrace, powerpc, sparc64, x86: remove notrace from arch ftrace file
ftrace: remove ftrace hash
ftrace: remove mcount set
ftrace: remove daemon
ftrace: disable dynamic ftrace for all archs that use daemon
ftrace: add ftrace warn on to disable ftrace
ftrace: only have ftrace_kill atomic
ftrace: use probe_kernel
ftrace: comment arch ftrace code
ftrace: return error on failed modified text.
ftrace: dynamic ftrace process only text section
...
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: fix irqs on/off ip tracing
lockdep: minor fix for debug_show_all_locks()
x86: restore the old swiotlb alloc_coherent behavior
x86: use GFP_DMA for 24bit coherent_dma_mask
swiotlb: remove panic for alloc_coherent failure
xen: compilation fix of drivers/xen/events.c on IA64
xen: portability clean up and some minor clean up for xencomm.c
xen: don't reload cr3 on suspend
kernel/resource: fix reserve_region_with_split() section mismatch
printk: remove unused code from kernel/printk.c
Impact: fix AMD Family 11h boot hangs / USB device problems
The AMD Fam11h CPUs have a K8 northbridge. This northbridge is different
from other family's because it lacks GART support (as I just learned).
But the kernel implicitly expects a GART if it finds an AMD northbridge.
Fix this by removing the Fam11h northbridge id from the scan list of K8
northbridges. This patch also changes the message in the GART driver
about missing K8 northbridges to tell that the GART is missing which is
the correct information in this case.
Reported-by: Jouni Malinen <jkmalinen@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so users are not confused with memhole causing big total ram
we don't need to worry about 32 bit, because memhole is always
above max_low_pfn.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix kdump crash on 32-bit sparsemem kernels
Since linux-2.6.27, kdump has failed on i386 sparsemem kernel.
1st-kernel gets a panic just before switching to 2nd-kernel.
The cause is that a kernel accesses invalid mem_section by
page_to_pfn(image->swap_page) at machine_kexec().
image->swap_page is allocated if kexec for hibernation, but
it is not allocated if kdump. So if kdump, a kernel should
not access the mem_section corresponding to image->swap_page.
The attached patch fixes this invalid access.
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: kexec-ml <kexec@lists.infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The rdmsr instruction(et al) for i386 and x86-64 are semantically same.
The only difference is how gcc interpret constraint "A" for these targets.
Signed-off-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
APIC_DEBUG is always 2.
need to update inquire_remote_apic to check apic_verbosity with
it instead.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Improve the help text of the X86_PTRACE_BTS config.
Make X86_DS invisible and depend on X86_PTRACE_BTS.
Reported-by: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo said mtrr_cleanup() is big and ugly.
so break it up into more functions and make it more readable.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Every call of kvm_set_irq() should offer an irq_source_id, which is
allocated by kvm_request_irq_source_id(). Based on irq_source_id, we
identify the irq source and implement logical OR for shared level
interrupts.
The allocated irq_source_id can be freed by kvm_free_irq_source_id().
Currently, we support at most sizeof(unsigned long) different irq sources.
[Amit: - rebase to kvm.git HEAD
- move definition of KVM_USERSPACE_IRQ_SOURCE_ID to common file
- move kvm_request_irq_source_id to the update_irq ioctl]
[Xiantao: - Add kvm/ia64 stuff and make it work for kvm/ia64 guests]
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The pvmmu TLB flush handler should request a root sync, similarly to
a native read-write CR3.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: fix crash with memory hotplug
Shuahua Li found:
| I just did some experiments on a desktop for memory hotplug and this bug
| triggered a crash in my test.
|
| Yinghai's suggestion also fixed the bug.
We don't need to round it, just remove that extra -1
Signed-off-by: Yinghai <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Change UV heartbeat function to use idle_cpu to determine cpu's
"idleness". Realign uv_hub definitions.
Signed-of-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
As promised, now that dumpstack_32 and dumpstack_64 have so many bits
in common, we should merge the in-sync bits into a common file, to
prevent them from diverging again.
This patch removes bits which are common between dumpstack_32.c and
dumpstack_64.c and places them in a common dumpstack.c which is built
for both 32 and 64 bit arches.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/dumpstack.c | 319 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/dumpstack.h | 39 +++++
arch/x86/kernel/dumpstack_32.c | 294 -------------------------------------
arch/x86/kernel/dumpstack_64.c | 285 ------------------------------------
5 files changed, 363 insertions(+), 576 deletions(-)
Impact: change NMI watchdog detection and disabling sequence
Currently, if the NMI watchdog fails using IOAPIC method, it'll only disable
interrupts on 8259 if the timer is passing thru it. This patch disables
NMI delivery on LINT0 if the NMI watchdog initial test fails, just for safety.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change/improve the way /proc/sys/kernel/nmi_watchdog works
This patch adds support to enable/disable IOAPIC NMI watchdog in runtime via
procfs.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
now that the code is moved and converted to a work queue,
there's some minor cleanups that can be done.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the implementation of the debug feature
the periodic corruption checks are better off run from a work queue; there's
nothing time critical about them and this way the amount of
interrupt-context work is reduced.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The corruption check code is rather sizable and it's likely to grow over
time when we add checks for more types of corruptions (there's a few
candidates in kerneloops.org that I want to add checks for)... so lets move
it to its own file
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Before moving the code to it's own file, fix some style issues
in the corruption check code.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid section mismatch warning, clean up
The dynamic ftrace determines which nop is safe to use at start up.
When it finds a safe nop for patching, it sets a pointer called ftrace_nop
to point to the code. All call sites are then patched to this nop.
Later, when tracing is turned on, this ftrace_nop variable is again used
to compare the location to make sure it is a nop before we update it to
an mcount call. If this fails just once, a warning is printed and ftrace
is disabled.
Rakib Mullick noted that the code that sets up the nop is a .init section
where as the nop itself is in the .text section. This is needed because
the nop is used later on after boot up. The problem is that the test of the
nop jumps back to the setup code and causes a "section mismatch" warning.
Rakib first recommended to convert the nop to .init.text, but as stated
above, this would fail since that text is used later.
The real solution is to extend Rabik's patch, and to make the ftrace_nop
into an array, and just save the code from the assembly to this array.
Now the section can stay as an init section, and we have a nop to use
later on.
Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
CONFIG_IRQBALANCE was removed in commit 8b8e8c1bf; this ifdef was still
around.
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: on SGI UV platforms, fix boot crash
UV initialization is currently called too late to call alloc_bootmem_pages().
The current sequence is:
start_kernel()
mem_init()
free_all_bootmem() <--- discard of bootmem
rest_init()
kernel_init()
smp_prepare_cpus()
native_smp_prepare_cpus()
uv_system_init() <--- uses alloc_bootmem_pages()
It should be calling kmalloc().
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix guest kernel boot crash on certain configs
Recent i686 2.6.27 kernels with a certain amount of memory (between
736 and 855MB) have a problem booting under a hypervisor that supports
batched mprotect (this includes the RHEL-5 Xen hypervisor as well as
any 3.3 or later Xen hypervisor).
The problem ends up being that xen_ptep_modify_prot_commit() is using
virt_to_machine to calculate which pfn to update. However, this only
works for pages that are in the p2m list, and the pages coming from
change_pte_range() in mm/mprotect.c are kmap_atomic pages. Because of
this, we can run into the situation where the lookup in the p2m table
returns an INVALID_MFN, which we then try to pass to the hypervisor,
which then (correctly) denies the request to a totally bogus pfn.
The right thing to do is to use arbitrary_virt_to_machine, so that we
can be sure we are modifying the right pfn. This unfortunately
introduces a performance penalty because of a full page-table-walk,
but we can avoid that penalty for pages in the p2m list by checking if
virt_addr_valid is true, and if so, just doing the lookup in the p2m
table.
The attached patch implements this, and allows my 2.6.27 i686 based
guest with 768MB of memory to boot on a RHEL-5 hypervisor again.
Thanks to Jeremy for the suggestions about how to fix this particular
issue.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We can remove MMIOTRACE_HOOKS and replace it with just MMIOTRACE.
MMIOTRACE_HOOKS is a remnant from the time when I thought that
something else could also use the kmmio facilities.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: start per CPU heartbeat LED timers on SGI UV systems
The SGI UV system has no LEDS but uses one of the system controller
regs to indicate the online internal state of the cpu. There is a
heartbeat bit indicating that the cpu is responding to interrupts,
and an idle bit indicating whether the cpu is idle when the heartbeat
interrupt occurs. The current period is one second.
When a cpu panics, an error code is written by BIOS to this same reg.
This patchset provides the following:
* x86_64: Add base functionality for writing to the specific SCIR's
for each cpu.
* heartbeat: Invert "heartbeat" bit to indicate the cpu is
"interruptible". If the current thread is the idle thread,
then indicate system is "idle".
* if hotplug enabled, all bits are set (0xff) when the cpu is disabled.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, no functionality changed
ia32_setup_rt_frame() has a duplicated code block labelled
"Make -mregparm=3 work" for setting up the register parameters
to the user-mode signal handler.
This is harmless but ugly. Remove the redundant assignments.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Thu, Oct 23, 2008 at 04:09:52PM -0700, Alexander Beregalov wrote:
> arch/x86/kernel/built-in.o: In function `iommu_setup':
> pci-dma.c:(.init.text+0x36ad): undefined reference to `forbid_dac'
> pci-dma.c:(.init.text+0x36cc): undefined reference to `forbid_dac'
> pci-dma.c:(.init.text+0x3711): undefined reference to `forbid_dac
This patch partially reverts a patch to add IOMMU support to ia64. The
forbid_dac variable was incorrectly moved to quirks.c, which isn't built
when PCI is disabled.
Tested-by: "Alexander Beregalov" <a.beregalov@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This restores the old swiotlb alloc_coherent behavior (before the
alloc_coherent rewrite):
http://lkml.org/lkml/2008/8/12/200
The old alloc_coherent avoids GFP_DMA allocation first and if the
allocated address is not fit for the device's coherent_dma_mask, then
dma_alloc_coherent does GFP_DMA allocation. If it fails,
alloc_coherent calls swiotlb_alloc_coherent (in short, we rarely used
swiotlb_alloc_coherent).
After the alloc_coherent rewrite, dma_alloc_coherent
(include/asm-x86/dma-mapping.h) directly calls swiotlb_alloc_coherent.
It means that we possibly can't handle a device having dma_masks >
24bit < 32bits since swiotlb_alloc_coherent doesn't have the above
GFP_DMA retry mechanism.
This patch fixes x86's swiotlb alloc_coherent to use the GFP_DMA retry
mechanism, which dma_generic_alloc_coherent() provides now
(pci-nommu.c and GART IOMMU driver also use
dma_generic_alloc_coherent).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
dma_alloc_coherent (include/asm-x86/dma-mapping.h) avoids GFP_DMA
allocation first and if the allocated address is not fit for the
device's coherent_dma_mask, then dma_alloc_coherent does GFP_DMA
allocation. This is because dma_alloc_coherent avoids precious GFP_DMA
zone if possible. This is also how the old dma_alloc_coherent
(arch/x86/kernel/pci-dma.c) works.
However, if the coherent_dma_mask of a device is 24bit, there is no
point to go into the above GFP_DMA retry mechanism. We had better use
GFP_DMA in the first place.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix section mismatch warning - apic_x2apic_phys
x86: fix section mismatch warning - apic_x2apic_cluster
x86: fix section mismatch warning - apic_x2apic_uv_x
x86: fix section mismatch warning - apic_physflat
x86: fix section mismatch warning - apic_flat
x86: memtest fix use of reserve_early()
x86 syscall.h: fix argument order
x86/tlb_uv: remove strange mc146818rtc include
x86: remove redundant KERN_DEBUG on pr_debug
x86: do_boot_cpu - check if we have ESR register
x86: MAINTAINERS change for AMD microcode patch loader
x86/proc: fix /proc/cpuinfo cpu offline bug
x86: call dmi-quirks for HP Laptops after early-quirks are executed
x86, kexec: fix hang on i386 when panic occurs while console_sem is held
MCE: Don't run 32bit machine checks with interrupts on
x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: make variables static
* 'v28-range-hrtimers-for-linus-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (37 commits)
hrtimers: add missing docbook comments to struct hrtimer
hrtimers: simplify hrtimer_peek_ahead_timers()
hrtimers: fix docbook comments
DECLARE_PER_CPU needs linux/percpu.h
hrtimers: fix typo
rangetimers: fix the bug reported by Ingo for real
rangetimer: fix BUG_ON reported by Ingo
rangetimer: fix x86 build failure for the !HRTIMERS case
select: fix alpha OSF wrapper
select: fix alpha OSF wrapper
hrtimer: peek at the timer queue just before going idle
hrtimer: make the futex() system call use the per process slack value
hrtimer: make the nanosleep() syscall use the per process slack
hrtimer: fix signed/unsigned bug in slack estimator
hrtimer: show the timer ranges in /proc/timer_list
hrtimer: incorporate feedback from Peter Zijlstra
hrtimer: add a hrtimer_start_range() function
hrtimer: another build fix
hrtimer: fix build bug found by Ingo
hrtimer: make select() and poll() use the hrtimer range feature
...
* 'x86/um-header' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
x86: canonicalize remaining header guards
x86: drop double underscores from header guards
x86: Fix ASM_X86__ header guards
x86, um: get rid of uml-config.h
x86, um: get rid of arch/um/Kconfig.arch
x86, um: get rid of arch/um/os symlink
x86, um: get rid of excessive includes of uml-config.h
x86, um: get rid of header symlinks
x86, um: merge Kconfig.i386 and Kconfig.x86_64
x86, um: get rid of sysdep symlink
x86, um: trim the junk from uml ptrace-*.h
x86, um: take vm-flags.h to sysdep
x86, um: get rid of uml asm/arch
x86, um: get rid of uml highmem.h
x86, um: get rid of uml unistd.h
x86, um: get rid of system.h -> system.h include
x86, um: uml atomic.h is not needed anymore
x86, um: untangle uml ldt.h
x86, um: get rid of more uml asm/arch uses
x86, um: remove dead header (uml module-generic.h; never used these days)
...
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (123 commits)
dock: make dock driver not a module
ACPI: fix ia64 build warning
ACPI: hack around sysfs warning with link order
ACPI suspend: fix build warning when CONFIG_ACPI_SLEEP=n
intel_menlo: fix build warning
panasonic-laptop: fix build
ACPICA: Update version to 20080926
ACPICA: Add support for zero-length buffer-to-string conversions
ACPICA: New: Validation for predefined ACPI methods/objects
ACPICA: Fix for implicit return compatibility
ACPICA: Fixed a couple memory leaks associated with "implicit return"
ACPICA: Optimize buffer allocation procedure
ACPICA: Fix possible memory leak, error exit path
ACPICA: Fix fault after mem allocation failure in AML parser
ACPICA: Remove unused ACPI register bit definition
ACPICA: Update version to 20080829
ACPICA: Fix possible memory leak in acpi_ns_get_external_pathname
ACPICA: Cleanup for internal Reference Object
ACPICA: Update comments - no functional changes
ACPICA: Update for Reference ACPI_OPERAND_OBJECT
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile: (21 commits)
OProfile: Fix buffer synchronization for IBS
oprofile: hotplug cpu fix
oprofile: fixing whitespaces in arch/x86/oprofile/*
oprofile: fixing whitespaces in arch/x86/oprofile/*
oprofile: fixing whitespaces in drivers/oprofile/*
x86/oprofile: add the logic for enabling additional IBS bits
x86/oprofile: reordering functions in nmi_int.c
x86/oprofile: removing unused function parameter in add_ibs_begin()
oprofile: more whitespace fixes
oprofile: whitespace fixes
OProfile: Rename IBS sysfs dir into "ibs_op"
OProfile: Rework string handling in setup_ibs_files()
OProfile: Rework oprofile_add_ibs_sample() function
oprofile: discover counters for op ppro too
oprofile: Implement Intel architectural perfmon support
oprofile: Don't report Nehalem as core_2
oprofile: drop const in num counters field
Revert "Oprofile Multiplexing Patch"
x86, oprofile: BUG: using smp_processor_id() in preemptible code
x86/oprofile: fix on_each_cpu build error
...
Manually fixed trivial conflicts in
drivers/oprofile/{cpu_buffer.c,event_buffer.h}
Impact: clarify Kconfig help text
The OPTIMIZE_INLINING help currently says "The gcc 4.x series have a
rewritten inlining algorithm and disabling this option will generate a
smaller kernel there."
This contradicts other parts of the help text and my own tests:
5463127 2008-10-11 19:51 vmlinux.no-opt
5456152 2008-10-11 19:56 vmlinux.opt
Reword text to say that enabling OPTIMIZE_INLINING will lead to smaller
kernels with gcc 4.x or later.
Signed-off-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.infradead.org/iommu-2.6:
Admit to maintaining VT-d, for my sins.
dmar: fix uninitialised 'ret' variable in dmar_parse_dev()
intel-iommu: use coherent_dma_mask in alloc_coherent
amd_iommu: fix nasty bug that caused ILLEGAL_DEVICE_TABLE_ENTRY errors
intel-iommu: IA64 support
dmar: remove the quirk which disables dma-remapping when intr-remapping enabled
dmar: Use queued invalidation interface for IOTLB and context invalidation
dmar: context cache and IOTLB invalidation using queued invalidation
dmar: use spin_lock_irqsave() in qi_submit_sync()
The entire file of ftrace.c in the arch code needs to be marked
as notrace. It is much cleaner to do this from the Makefile with
CFLAGS_REMOVE_ftrace.o.
[ powerpc already had this in its Makefile. ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The arch dependent function ftrace_mcount_set was only used by the daemon
start up code. This patch removes it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Morton suggested using the proper API for reading and writing
kernel areas that might fault.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add comments to explain what is happening in the x86 arch ftrace code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Have the ftrace_modify_code return error values:
-EFAULT on error of reading the address
-EINVAL if what is read does not match what it expected
-EPERM if the write fails to update after a successful match.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Canonicalize a few remaining header guards, with the exception for
those which are still in subarchitecture directories.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Drop double underscores from header guards in arch/x86/include. They
are used inconsistently, and are not necessary.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Change header guards named "ASM_X86__*" to "_ASM_X86_*" since:
a. the double underscore is ugly and pointless.
b. no leading underscore violates namespace constraints.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xc008): Section mismatch in reference from the variable apic_x2apic_phys to the function .init.text:x2apic_acpi_madt_oem_check()
The variable apic_x2apic_phys references
the function __init x2apic_acpi_madt_oem_check()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xbf88): Section mismatch in reference from the variable apic_x2apic_cluster to the function .init.text:x2apic_acpi_madt_oem_check()
The variable apic_x2apic_cluster references
the function __init x2apic_acpi_madt_oem_check()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xbf08): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .init.text:uv_acpi_madt_oem_check()
The variable apic_x2apic_uv_x references
the function __init uv_acpi_madt_oem_check()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xbe88): Section mismatch in reference from the variable apic_physflat to the function .init.text:physflat_acpi_madt_oem_check()
The variable apic_physflat references
the function __init physflat_acpi_madt_oem_check()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xbe08): Section mismatch in reference from the variable apic_flat to the function .init.text:flat_acpi_madt_oem_check()
The variable apic_flat references
the function __init flat_acpi_madt_oem_check()
This is harmless, because the .acpi_madt_oem_check is only called
during init time. But we keep the function pointer around in a .data
function pointer template, so it's better we do not keep that stale
- so mark this function non-__init.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hi all,
Wrong usage of 2nd parameter in reserve_early call.
66/75: reserve_early(start_bad, last_bad - start_bad, "BAD RAM");
^^^^^^^^^^^^^^^^^^^^
The correct way is to use 'end' address and not 'size'.
As a bonus a fix to the printk format.
Signed-off-by: Daniele Calore <orkaan@orkaan.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For some reason tlb_uv was including linux/mc146818rtc.h. It really
just needs linux/seq_file.h
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citix.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix APIC IRQ irregularities on certain older boxes
We should touch the APIC ESR register if only we have it.
The patch fixes the problem mentioned by Max Kellermann:
http://lkml.org/lkml/2008/10/17/147
Bisected-by: Max Kellermann <mk@cm4all.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
[ mingo@elte.hu: build fix ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix missing CPUs in /proc/cpuinfo after CPU hotunplug/hotreplug
In my test, I found that if a cpu has been offline,
the next cpus may not be shown in the /proc/cpuinfo.
if one read() cannot consume the whole /proc/cpuinfo,
c_start() will be called again in the next read() calls.
And *pos has been increased by 1 by the caller(seq_read()).
if this time the cpu#*pos is offline, c_start() will return
NULL, and the next cpus can not be shown.
this fix use next_cpu_nr(*pos - 1, cpu_online_map) to
search the next unshown cpu.
the most easy way to reproduce this bug:
1) offline cpu#1 (cpu#0 is online)
2) dd ibs=2 if=/proc/cpuinfo
the result is that only cpu#0 is shown.
cpu#2 and cpu#3 .... cannot be shown in /proc/cpuinfo
it's bug.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make warning message disappear - functionality unchanged
Problems with bogus IRQ0 override of those laptops should be fixed
with commits
x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: SB450: skip IRQ0 override if it is not routed to INT2 of IOAPIC
that introduce early-quirks based on chipset configuration.
For further information, see
http://bugzilla.kernel.org/show_bug.cgi?id=11516
Instead of removing the related dmi-quirks completely we'd like to
keep them for (at least) one kernel version -- to double-check whether
the early-quirks really took effect. But the dmi-quirks need to be
called after early-quirks are executed. With this patch calling
sequence for dmi-quriks is changed as follows:
acpi_boot_table_init() (dmi-quirks)
...
early_quirks() (detect bogus IRQ0 override)
...
acpi_boot_init() (late dmi-quirks and setup IO APIC)
Note: Plan is to remove the "late dmi-quirks" with next kernel version.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make i386's die() equal to x86_64's version.
Whitespace-only changes on x86_64, to make it equal to i386's
version. (user_mode and user_mode_vm are equal on x86_64.)
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use oops_begin and oops_end in die_nmi.
Whitespace-only changes on x86_64, to make it equal to i386's
version.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
oops_begin/oops_end should always be used in pairs. On x86_64
oops_begin increments die_nest_count, and oops_end decrements
die_nest_count. Doing this makes oops_begin and oops_end equal
to the x86_64 versions.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Always call oops_exit from oops_end, even if signr==0.
Also, move add_taint(TAINT_DIE) from __die to oops_end
on x86_64 and interchange two lines to make oops_end
more similar to the i386-version.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
oops_end is preceded by either a call to __die, or a conditional
call to crash_kexec. Move the conditional call to crash_kexec
from the end of __die to the start of oops_end and remove
the superfluous call to crash_kexec in die_nmi.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change oops_end such that signr=0 signals that do_exit
is not to be called.
Currently, each use of __die is soon followed by a call
to oops_end and 'regs' is set to NULL if oops_end is expected
not to call do_exit. Change all such pairs to set signr=0
instead. On x86_64 oops_end is used 'bare' in die_nmi; use
signr=0 instead of regs=NULL there, too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
crash_kexec should not be called with console_sem held. Move
the call before bust_spinlocks(0) in oops_end to avoid the
problem.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: "Neil Horman" <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's a corner case in 32 bit x86 kdump at the moment. When the box
panics via nmi, we call bust_spinlocks(1) to disable sensitivity to the
console_sem (allowing us to print to the console in all cases), but we don't
call crash_kexec, until after we call bust_spinlocks(0), which re-enables
console_sem sensitivity.
The result is that, if we get an nmi while the console_sem is held and
kdump is configured, and we try to print something to the console during
kdump shutdown (which we often do) we deadlock the box. The fix is to
simply do what 64 bit die_nmi does which is to not call bust_spinlocks(0)
until after we call crash_kexec.
Patch below tested successfully by me.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
DIRECT_GBPAGES was under DEBUG_KERNEL && EXPERIMENTAL and disabled by default.
Turn it on by default and put it under EMBEDDED.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Running machine checks with interrupt on is a extremly bad idea. The machine
check handler only runs when the system is broken and needs to finish
as quickly as possible.
Remove the respective bogus post 2.6.27 regression and call
the machine check vector directly again.
This removes only code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[Cherry-picked from x86/mce]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: fix hung bootup and other misbehavior on certain laptops
On some more HP laptops BIOS reports an IRQ0 override
but the SB600 chipset is configured such that timer
interrupts go to INT0 of IOAPIC.
Check IRQ0 routing and if it is routed to INT0 of IOAPIC skip the
timer override.
See following bug reports:
http://bugzilla.kernel.org/show_bug.cgi?id=11715http://bugzilla.kernel.org/show_bug.cgi?id=11516
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use consistent names for region size and conherence id on x86 and ia64.
The SGI xp drivers are used on both ia64 and x86. Using the same
names (sn_coherency_id, sn_region_size) simplies the driver code.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These variables are only used in their source files, so make them static.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Intel 7300 Memory Controller supports dynamic throttling of memory which can
be used to save power when system is idle. This driver does the memory
throttling when all CPUs are idle on such a system.
Refer to "Intel 7300 Memory Controller Hub (MCH)" datasheet
for the config space description.
Signed-off-by: Andy Henroid <andrew.d.henroid@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
needed if the i7300_idle driver is to be modular.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Len Brown <len.brown@intel.com>
Fix off-by-one in for_each_irq_desc_reverse().
Impact is near zero in practice, because nothing substantial wants to
iterate down to IRQ#0 - but fix it nevertheless.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (41 commits)
PCI: fix pci_ioremap_bar() on s390
PCI: fix AER capability check
PCI: use pci_find_ext_capability everywhere
PCI: remove #ifdef DEBUG around dev_dbg call
PCI hotplug: fix get_##name return value problem
PCI: document the pcie_aspm kernel parameter
PCI: introduce an pci_ioremap(pdev, barnr) function
powerpc/PCI: Add legacy PCI access via sysfs
PCI: Add ability to mmap legacy_io on some platforms
PCI: probing debug message uniformization
PCI: support PCIe ARI capability
PCI: centralize the capabilities code in probe.c
PCI: centralize the capabilities code in pci-sysfs.c
PCI: fix 64-vbit prefetchable memory resource BARs
PCI: replace cfg space size (256/4096) by macros.
PCI: use resource_size() everywhere.
PCI: use same arg names in PCI_VDEVICE comment
PCI hotplug: rpaphp: make debug var unique
PCI: use %pF instead of print_fn_descriptor_symbol() in quirks.c
PCI: fix hotplug get_##name return value problem
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 ACPI: fix breakage of resume on 64-bit UP systems with SMP kernel
Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL
This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu
and x86/uv.
The sparseirq branch is just preliminary groundwork: no sparse IRQs are
actually implemented by this tree anymore - just the new APIs are added
while keeping the old way intact as well (the new APIs map 1:1 to
irq_desc[]). The 'real' sparse IRQ support will then be a relatively
small patch ontop of this - with a v2.6.29 merge target.
* 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits)
genirq: improve include files
intr_remapping: fix typo
io_apic: make irq_mis_count available on 64-bit too
genirq: fix name space collisions of nr_irqs in arch/*
genirq: fix name space collision of nr_irqs in autoprobe.c
genirq: use iterators for irq_desc loops
proc: fixup irq iterator
genirq: add reverse iterator for irq_desc
x86: move ack_bad_irq() to irq.c
x86: unify show_interrupts() and proc helpers
x86: cleanup show_interrupts
genirq: cleanup the sparseirq modifications
genirq: remove artifacts from sparseirq removal
genirq: revert dynarray
genirq: remove irq_to_desc_alloc
genirq: remove sparse irq code
genirq: use inline function for irq_to_desc
genirq: consolidate nr_irqs and for_each_irq_desc()
x86: remove sparse irq from Kconfig
genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n
...
Update assorted email addresses and related info to point
to a single current, valid address.
additionally
- trivial CREDITS entry updates. (Not that this file means much any more)
- remove arjans dead redhat.com address from powernow driver
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generated 'capflags.c' file wasn't properly ignored, and the list of
files in scripts/basic/ wasn't up-to-date.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch updates the Intel Ibex Peak (PCH) LPC and SMBus Controller
DeviceIDs.
The LPC Controller ID is set by Firmware within the range of
0x3b00-3b1f. This range is included in pci_ids.h using min and max
values, and irq.c now has code to handle the range (in lieu of 32
additions to a SWITCH statement).
The SMBus Controller ID is a fixed-value and will not change.
Signed-off-by: Seth Heasley <seth.heasley@intel.com>
Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Use "[%04x:%04x]" for PCI vendor/device IDs to follow the format
used by lspci(8).
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Due to confusion between the ftrace infrastructure and the gcc profiling
tracer "ftrace", this patch renames the config options from FTRACE to
FUNCTION_TRACER. The other two names that are offspring from FTRACE
DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD will stay the same.
This patch was generated mostly by script, and partially by hand.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In ftrace, logic is defined in the WARN_ON_ONCE, which can become a
nop with some configs. This patch fixes it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change various rtc related code to use the new bcd2bin/bin2bcd functions
instead of the obsolete BCD_TO_BIN/BIN_TO_BCD/BCD2BIN/BIN2BCD macros.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o elfcorehdr_addr is used by not only the code under CONFIG_PROC_VMCORE
but also by the code which is not inside CONFIG_PROC_VMCORE. For
example, is_kdump_kernel() is used by powerpc code to determine if
kernel is booting after a panic then use previous kernel's TCE table.
So even if CONFIG_PROC_VMCORE is not set in second kernel, one should be
able to correctly determine that we are booting after a panic and setup
calgary iommu accordingly.
o So remove the assumption that elfcorehdr_addr is under
CONFIG_PROC_VMCORE.
o Move definition of elfcorehdr_addr to arch dependent crash files.
(Unfortunately crash dump does not have an arch independent file
otherwise that would have been the best place).
o kexec.c is not the right place as one can Have CRASH_DUMP enabled in
second kernel without KEXEC being enabled.
o I don't see sh setup code parsing the command line for
elfcorehdr_addr. I am wondering how does vmcore interface work on sh.
Anyway, I am atleast defining elfcoredhr_addr so that compilation is not
broken on sh.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Simon Horman <horms@verge.net.au>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements a new freezer subsystem in the control groups
framework. It provides a way to stop and resume execution of all tasks in
a cgroup by writing in the cgroup filesystem.
The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing "FROZEN" to the state file will freeze all tasks
in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in
the cgroup. Reading will return the current state.
* Examples of usage :
# mkdir /containers/freezer
# mount -t cgroup -ofreezer freezer /containers
# mkdir /containers/0
# echo $some_pid > /containers/0/tasks
to get status of the freezer subsystem :
# cat /containers/0/freezer.state
RUNNING
to freeze all tasks in the container :
# echo FROZEN > /containers/0/freezer.state
# cat /containers/0/freezer.state
FREEZING
# cat /containers/0/freezer.state
FROZEN
to unfreeze all tasks in the container :
# echo RUNNING > /containers/0/freezer.state
# cat /containers/0/freezer.state
RUNNING
This is the basic mechanism which should do the right thing for user space
task in a simple scenario.
It's important to note that freezing can be incomplete. In that case we
return EBUSY. This means that some tasks in the cgroup are busy doing
something that prevents us from completely freezing the cgroup at this
time. After EBUSY, the cgroup will remain partially frozen -- reflected
by freezer.state reporting "FREEZING" when read. The state will remain
"FREEZING" until one of these things happens:
1) Userspace cancels the freezing operation by writing "RUNNING" to
the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
the freezer.state file (writing "FREEZING" is not legal
and returns EIO)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: export thaw_process]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are on 64-bit so better use u64 instead of u32 to deal with
addresses:
static void __init iommu_set_device_table(struct amd_iommu *iommu)
{
u64 entry;
...
entry = virt_to_phys(amd_iommu_dev_table);
...
(I am wondering why gcc 4.2.x did not warn about the assignment
between u32 and unsigned long.)
Cc: iommu@lists.linux-foundation.org
Cc: stable@kernel.org
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The current Intel IOMMU code assumes that both host page size and Intel
IOMMU page size are 4KiB. The first patch supports variable page size.
This provides support for IA64 which has multiple page sizes.
This patch also adds some other code hooks for IA64 platform including
DMAR_OPERATION_TIMEOUT definition.
[dwmw2: some cleanup]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The driver would like to map IO space directly for copying data in when
appropriate, to avoid CPU cache flushing for streaming writes.
kmap_atomic_pfn lets us avoid IPIs associated with ioremap for this process.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
x86 ACPI: Fix breakage of resume on 64-bit UP systems with SMP kernel
We are now using per CPU GDT tables in head_64.S and the original
early_gdt_descr.address is invalidated after boot by
setup_per_cpu_areas(). This breaks resume from suspend to RAM on
x86_64 UP systems using SMP kernels, because this part of head_64.S
is also executed during the resume and the invalid GDT address
causes the system to crash. It doesn't break on 'true' SMP systems,
because early_gdt_descr.address is modified every time
native_cpu_up() runs. However, during resume it should point to the
GDT of the boot CPU rather than to another CPU's GDT.
For this reason, during suspend to RAM always make
early_gdt_descr.address point to the boot CPU's GDT.
This fixes http://bugzilla.kernel.org/show_bug.cgi?id=11568, which
is a regression from 2.6.26.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-and-tested-by: Andy Wettstein <ajw1980@gmail.com>
* 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (134 commits)
KVM: ia64: Add intel iommu support for guests.
KVM: ia64: add directed mmio range support for kvm guests
KVM: ia64: Make pmt table be able to hold physical mmio entries.
KVM: Move irqchip_in_kernel() from ioapic.h to irq.h
KVM: Separate irq ack notification out of arch/x86/kvm/irq.c
KVM: Change is_mmio_pfn to kvm_is_mmio_pfn, and make it common for all archs
KVM: Move device assignment logic to common code
KVM: Device Assignment: Move vtd.c from arch/x86/kvm/ to virt/kvm/
KVM: VMX: enable invlpg exiting if EPT is disabled
KVM: x86: Silence various LAPIC-related host kernel messages
KVM: Device Assignment: Map mmio pages into VT-d page table
KVM: PIC: enhance IPI avoidance
KVM: MMU: add "oos_shadow" parameter to disable oos
KVM: MMU: speed up mmu_unsync_walk
KVM: MMU: out of sync shadow core
KVM: MMU: mmu_convert_notrap helper
KVM: MMU: awareness of new kvm_mmu_zap_page behaviour
KVM: MMU: mmu_parent_walk
KVM: x86: trap invlpg
KVM: MMU: sync roots on mmu reload
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix compat-vdso
x86/mm: unify init task OOM handling
x86/mm: do not trigger a kernel warning if user-space disables interrupts and generates a page fault
On some more HP laptops BIOS reports an IRQ0 override
but the SB600 chipset is configured such that timer
interrupts go to INT0 of IOAPIC.
Check IRQ0 routing and if it is routed to INT0 of IOAPIC skip the
timer override.
http://bugzilla.kernel.org/show_bug.cgi?id=11715http://bugzilla.kernel.org/show_bug.cgi?id=11516
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits)
UIO: Fix mapping of logical and virtual memory
UIO: add automata sercos3 pci card support
UIO: Change driver name of uio_pdrv
UIO: Add alignment warnings for uio-mem
Driver core: add bus_sort_breadthfirst() function
NET: convert the phy_device file to use bus_find_device_by_name
kobject: Cleanup kobject_rename and !CONFIG_SYSFS
kobject: Fix kobject_rename and !CONFIG_SYSFS
sysfs: Make dir and name args to sysfs_notify() const
platform: add new device registration helper
sysfs: use ilookup5() instead of ilookup5_nowait()
PNP: create device attributes via default device attributes
Driver core: make bus_find_device_by_name() more robust
usb: turn dev_warn+WARN_ON combos into dev_WARN
debug: use dev_WARN() rather than WARN_ON() in device_pm_add()
debug: Introduce a dev_WARN() function
sysfs: fix deadlock
device model: Do a quickcheck for driver binding before doing an expensive check
Driver core: Fix cleanup in device_create_vargs().
Driver core: Clarify device cleanup.
...
Currently, it is possible to set a graphics VESA mode at boot time via the
vga= parameter even when no framebuffer driver supporting this is
configured. This could lead to the system booting with a black screen,
without a usable console.
Fix this problem by only allowing to set graphics modes at boot time if a
supporting framebuffer driver is configured.
Signed-off-by: Michal Januszewski <spock@gentoo.org>
Acked-by: Krzysztof Helt <krzysztof.h1@wp.pl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches re-introduces the iommu_num_pages function so that
it can be used by each architecture specific IOMMU implementations. The
series also changes IOMMU implementations for X86, Alpha, PowerPC and
UltraSparc. The other implementations are not yet changed because the
modifications required are not obvious and I can't test them on real
hardware.
This patch:
This is a preparation patch for introducing a generic iommu_num_pages function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nothing arch specific in get/settimeofday. The details of the timeval
conversion varied a little from arch to arch, but all with the same
results.
Also add an extern declaration for sys_tz to linux/time.h because externs
in .c files are fowned upon. I'll kill the externs in various other files
in a sparate patch.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ]
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct stat / compat_stat is the same on all architectures, so
cp_compat_stat should be, too.
Turns out it is, except that various architectures have slightly and some
high2lowuid/high2lowgid or the direct assignment instead of the
SET_UID/SET_GID that expands to the correct one anyway.
This patch replaces the arch-specific cp_compat_stat implementations with
a common one based on the x86-64 one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ]
Acked-by: Kyle McMartin <kyle@mcmartin.ca> [ parisc bits ]
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's somewhat unlikely that it happens, but right now a race window
between interrupts or machine checks or oopses could corrupt the tainted
bitmap because it is modified in a non atomic fashion.
Convert the taint variable to an unsigned long and use only atomic bit
operations on it.
Unfortunately this means the intvec sysctl functions cannot be used on it
anymore.
It turned out the taint sysctl handler could actually be simplified a bit
(since it only increases capabilities) so this patch actually removes
code.
[akpm@linux-foundation.org: remove unneeded include]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using "def_bool n" is pointless, simply using bool here appears more
appropriate.
Further, retaining such options that don't have a prompt and aren't
selected by anything seems also at least questionable.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print the name of the last-accessed sysfs file when we oops, to help track
down oopses which occur in sysfs store/read handlers. Because these oopses
tend to not leave any trace of the offending code in the stack traces.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
show_interrupts() and proc helpers are basically the same for
32 and 64 bit. Move them to a shared source file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The sparseirq patches introduced some more ugliness in show_interrupts().
Clean it up all together and make the code easier to read by splitting out
the "tail" function which prints the special interrupts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This code is not ready, but we need to rip it out instead of rebasing
as we would lose the APIC/IO_APIC unification otherwise.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use APIC_DIVISOR being set to 16 for both 32/64bit
mode. To escape APIC timer underflow during calibration
set it to the maximum possible value.
Also typo error (CONFG instead of proper CONFIG) fixed.
The error was caught by Venkatesh Pallipadi, thanks a lot Venkatesh!
See details on http://lkml.org/lkml/2008/10/9/425
Reported-by: Venkatesh Pallipad <venkatesh.pallipadi@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create /sys/firmware/sgi_uv sysfs entries for partition_id and coherence_id.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a bios call to return partitioning related info.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the EFI callback function and associated wrapper code.
Initialize SAL system table entry info at boot time.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Look for a UV entry in the EFI tables.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/uv_irq.c: In function 'uv_ack_apic':
arch/x86/kernel/uv_irq.c:26: error: implicit declaration of function 'ack_APIC_irq'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide a means for UV interrupt MMRs to be setup with the message to be sent
when an MSI is raised.
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the below compile warnings due to recent HPET MSI changes
arch/x86/kernel/hpet.c:48: warning: 'hpet_devs' defined but not used
arch/x86/kernel/hpet.c:50: warning: 'per_cpu__cpu_hpet_dev' defined but not used
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Arjan van de Ven noticed that we changed IRQ numbers from decimal
to hex in /proc/interrupts - that can break user-space utilities
like irqbalanced.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix problem caused by reordering of the calls to uv_cpu_init() &
uv_system_init. Originally, uv_cpu_init() was called AFTER uv_system_init.
This order was recently broken as a side-effect of other patches.
With this patch, initialization of cpu 0 is now done by the system_init
call.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Interrupt remapping could lead to NULL dereference in case of
kzalloc failed and memory leak in other way. So fix the
both cases.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we don't have CONFIG_X86_PM_TIMER=y compiler warns about
unused variables. Move PM timer based calibration into a
separate function and make the code cleaner and the compiler
happy as well.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On 82489DX we don't have ESR register so we should not
write it.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace __DO_ACTION macro with io_apic_modify_irq function.
This allow us to 'grep' definitions being hided by
__DO_ACTION macro:
__unmask_IO_APIC_irq
__mask_IO_APIC_irq
__mask_and_edge_IO_APIC_irq
__unmask_and_level_IO_APIC_irq
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>