Impact: new feature
Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.
To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.
When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
this also makes the IRQ descriptors NUMA-local (to the site that calls
request_irq()).
This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix trampoline sizing bug, save space
While debugging a suspend-to-RAM related issue it occured to me that
if the trampoline code had grown past 4 KB, we would have been
allocating too little memory for it, since the 4 KB size of the
trampoline is hardcoded into arch/x86/kernel/e820.c . Change that
by making the kernel compute the trampoline size and allocate as much
memory as necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Just come across this when booting on an old hw..
Looks somewhat ugly, that single missing space ;)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot crash with numcpus=0 on certain systems
Fix early exception in __get_smp_config with nosmp.
Bail out early when there is no MP table.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The access to the iommu->need_sync member needs to be protected by the
iommu->lock. Otherwise this is a possible race condition. Fix it with
this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In some rare cases a request can arrive an IOMMU with its originial
requestor id even it is aliased. Handle this by setting the device table
entry to the same protection domain for the original and the aliased
requestor id.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: remove stale IOTLB entries
In the non-default nofullflush case the GART is only flushed when
next_bit wraps around. But it can happen that an unmap operation unmaps
memory which is behind the current next_bit location. If these addresses
are reused it may result in stale GART IO/TLB entries. Fix this by
setting the GART next_bit always behind an unmapped location.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Import: robustness checks
Add more checks in the function graph code to detect errors and
perhaps print out better information if a bug happens.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature, let entry function decide to trace or not
This patch lets the graph tracer entry function decide if the tracing
should be done at the end as well. This requires all function graph
entry functions return 1 if it should trace, or 0 if the return should
not be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: better dumpstack output
I noticed in my crash dumps and even in the stack tracer that a
lot of functions listed in the stack trace are simply
return_to_handler which is ftrace graphs way to insert its own
call into the return of a function.
But we lose out where the actually function was called from.
This patch adds in hooks to the dumpstack mechanism that detects
this and finds the real function to print. Both are printed to
let the user know that a hook is still in place.
This does give a funny side effect in the stack tracer output:
Depth Size Location (80 entries)
----- ---- --------
0) 4144 48 save_stack_trace+0x2f/0x4d
1) 4096 128 ftrace_call+0x5/0x2b
2) 3968 16 mempool_alloc_slab+0x16/0x18
3) 3952 384 return_to_handler+0x0/0x73
4) 3568 -240 stack_trace_call+0x11d/0x209
5) 3808 144 return_to_handler+0x0/0x73
6) 3664 -128 mempool_alloc+0x4d/0xfe
7) 3792 128 return_to_handler+0x0/0x73
8) 3664 -32 scsi_sg_alloc+0x48/0x4a [scsi_mod]
As you can see, the real functions are now negative. This is due
to them not being found inside the stack.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new ftrace_graph_stop function
While developing more features of function graph, I hit a bug that
caused the WARN_ON to trigger in the prepare_ftrace_return function.
Well, it was hard for me to find out that was happening because the
bug would not print, it would just cause a hard lockup or reboot.
The reason is that it is not safe to call printk from this function.
Looking further, I also found that it calls unregister_ftrace_graph,
which grabs a mutex and calls kstop machine. This would definitely
lock the box up if it were to trigger.
This patch adds a fast and safe ftrace_graph_stop() which will
stop the function tracer. Then it is safe to call the WARN ON.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: consistency change for function graph
This patch makes function graph record the mcount caller address
the same way the function tracer does.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
There exists macros for x86 asm to handle x86_64 and i386.
This patch updates function graph asm to use them.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge x86/dumpstack into tracing/ftrace because upcoming ftrace changes
depend on cleanups already in x86/dumpstack.
Also merge to latest upstream -rc.
Impact: remove stale IOTLB entries
In the non-default nofullflush case the GART is only flushed when
next_bit wraps around. But it can happen that an unmap operation unmaps
memory which is behind the current next_bit location. If these addresses
are reused it may result in stale GART IO/TLB entries. Fix this by
setting the GART next_bit always behind an unmapped location.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove CONFIG_APM_REAL_MODE_POWER_OFF like CONFIG_APM_POWER_OFF which
has been done for linux-2.2.14pre8 (http://lkml.org/lkml/1999/11/23/3).
Re-introducing CONFIG_APM_POWER_OFF got nack-ed. Stephen didn't bother
to remove CONFIG_APM_REAL_MODE_POWER_OFF, let's get rid of it now.
Reference: http://lkml.org/lkml/2008/5/7/97
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend and enable the function graph tracer to 64-bit x86
This patch implements the support for function graph tracer under x86-64.
Both static and dynamic tracing are supported.
This causes some small CPP conditional asm on arch/x86/kernel/ftrace.c I
wanted to use probe_kernel_read/write to make the return address
saving/patching code more generic but it causes tracing recursion.
That would be perhaps useful to implement a notrace version of these
function for other archs ports.
Note that arch/x86/process_64.c is not traced, as in X86-32. I first
thought __switch_to() was responsible of crashes during tracing because I
believed current task were changed inside but that's actually not the
case (actually yes, but not the "current" pointer).
So I will have to investigate to find the functions that harm here, to
enable tracing of the other functions inside (but there is no issue at
this time, while process_64.c stays out of -pg flags).
A little possible race condition is fixed inside this patch too. When the
tracer allocate a return stack dynamically, the current depth is not
initialized before but after. An interrupt could occur at this time and,
after seeing that the return stack is allocated, the tracer could try to
trace it with a random uninitialized depth. It's a prevention, even if I
hadn't problems with it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq.h: fix missing/extra kernel-doc
genirq: __irq_set_trigger: change pr_warning to pr_debug
irq: fix typo
x86: apic honour irq affinity which was set in early boot
genirq: fix the affinity setting in setup_irq
genirq: keep affinities set from userspace across free/request_irq()
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: always define DECLARE_PCI_UNMAP* macros
x86: fixup config space size of CPU functions for AMD family 11h
x86, bts: fix wrmsr and spinlock over kmalloc
x86, pebs: fix PEBS record size configuration
x86, bts: turn macro into static inline function
x86, bts: exclude ds.c from build when disabled
arch/x86/kernel/pci-calgary_64.c: change simple_strtol to simple_strtoul
x86: use limited register constraint for setnz
xen: pin correct PGD on suspend
x86: revert irq number limitation
x86: fixing __cpuinit/__init tangle, xsave_cntxt_init()
x86: fix __cpuinit/__init tangle in init_thread_xstate()
oprofile: fix an overflow in ppro code
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] powernow-k8: ignore out-of-range PstateStatus value
[CPUFREQ] Documentation: Add Blackfin to list of supported processors
Impact: update comment
Clarify that too small aperture is valid reason for this code.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
If we take a closer look at the rff_trace/rff_action ret_from_fork code,
we have to realize that it does all the wrong things: for example it
checks the TIF flag - while later on jumping back to the ret-from-syscall
path - duplicating the check needlessly.
But checking for _TIF_SYSCALL_TRACE is completely unnecessary here because
we clear that flag for every freshly forked task. So the whole "tracing"
code here, for which there is a out of line jump optimization that makes
it even harder to read, is in reality completely dead code ...
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
We merge this branch because x86/debug touches code that we started
cleaning up in x86/irq. The two branches started out independent,
but as unexpected amount of activity went into x86/irq, they became
dependent. Resolve that by this cross-merge.
child_rip is called not by its name but indirectly
rather so make it global and aligned.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
entry_32.S is now the only user of KPROBE_ENTRY / KPROBE_END,
treewide. This patch reorders entry_64.S and explicitly generates
a separate section for functions that need the protection. The
generated code before and after the patch is equal.
The KPROBE_ENTRY and KPROBE_END macro's are removed too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up assembly macros and annotations - with some object impact
entry_64.S is the only user of KPROBE_ENTRY / KPROBE_END on
x86_64. This patch reorders entry_64.S and explicitly generates
a separate section for functions that need the protection. The
generated code before and after the patch is equal.
Implicitly changing sections in assembly files makes it more
difficult to follow why the assembler is doing certain things.
For example,
.p2align 5
KPROBE_ENTRY(...)
was not doing what you would expect. Other section changes
(__ex_table, .fixup, .init.rodata) are done explicitly already.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make global variables and a function static
Fix following sparse warnings:
arch/x86/kernel/microcode_core.c:102:22: warning: symbol
'microcode_ops' was not declared. Should it be static?
arch/x86/kernel/microcode_core.c:206:24: warning: symbol
'microcode_pdev' was not declared. Should it be static?
arch/x86/kernel/microcode_core.c:322:6: warning: symbol
'microcode_update_cpu' was not declared. Should it be static?
arch/x86/kernel/microcode_intel.c:468:22: warning: symbol
'microcode_intel_ops' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new "power-tracer" ftrace plugin
This patch adds a C/P-state ftrace plugin that will generate
detailed statistics about the C/P-states that are being used,
so that we can look at detailed decisions that the C/P-state
code is making, rather than the too high level "average"
that we have today.
An example way of using this is:
mount -t debugfs none /sys/kernel/debug
echo cstate > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace | perl scripts/trace/cstate.pl > out.svg
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: more efficient code for ftrace graph tracer
This patch uses the dynamic patching, when available, to patch
the function graph code into the kernel.
This patch will ease the way for letting both function tracing
and function graph tracing run together.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature
This patch sets a C-like output for the function graph tracing.
For this aim, we now call two handler for each function: one on the entry
and one other on return. This way we can draw a well-ordered call stack.
The pid of the previous trace is loosely stored to be compared against
the one of the current trace to see if there were a context switch.
Without this little feature, the call tree would seem broken at
some locations.
We could use the sched_tracer to capture these sched_events but this
way of processing is much more simpler.
2 spaces have been chosen for indentation to fit the screen while deep
calls. The time of execution in nanosecs is printed just after closed
braces, it seems more easy this way to find the corresponding function.
If the time was printed as a first column, it would be not so easy to
find the corresponding function if it is called on a deep depth.
I plan to output the return value but on 32 bits CPU, the return value
can be 32 or 64, and its difficult to guess on which case we are.
I don't know what would be the better solution on X86-32: only print
eax (low-part) or even edx (high-part).
Actually it's thee same problem when a function return a 8 bits value, the
high part of eax could contain junk values...
Here is an example of trace:
sys_read() {
fget_light() {
} 526
vfs_read() {
rw_verify_area() {
security_file_permission() {
cap_file_permission() {
} 519
} 1564
} 2640
do_sync_read() {
pipe_read() {
__might_sleep() {
} 511
pipe_wait() {
prepare_to_wait() {
} 760
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 504
} 1587
clear_buddies() {
} 512
add_cfs_task_weight() {
} 519
update_min_vruntime() {
} 511
} 5602
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 496
} 1631
clear_buddies() {
} 496
update_min_vruntime() {
} 527
} 4580
hrtick_update() {
hrtick_start_fair() {
} 488
} 1489
} 13700
} 14949
} 16016
msecs_to_jiffies() {
} 496
put_prev_task_fair() {
} 504
pick_next_task_fair() {
} 489
pick_next_task_rt() {
} 496
pick_next_task_fair() {
} 489
pick_next_task_idle() {
} 489
------------8<---------- thread 4 ------------8<----------
finish_task_switch() {
} 1203
do_softirq() {
__do_softirq() {
__local_bh_disable() {
} 669
rcu_process_callbacks() {
__rcu_process_callbacks() {
cpu_quiet() {
rcu_start_batch() {
} 503
} 1647
} 3128
__rcu_process_callbacks() {
} 542
} 5362
_local_bh_enable() {
} 587
} 8880
} 9986
kthread_should_stop() {
} 669
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
calc_delta_mine() {
} 511
update_min_vruntime() {
} 511
} 2813
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch changes the name of the "return function tracer" into
function-graph-tracer which is a more suitable name for a tracing
which makes one able to retrieve the ordered call stack during
the code flow.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A workaround for AMD CPU family 11h erratum 311 might cause that the
P-state Status Register shows a "current P-state" which is larger than
the "current P-state limit" in P-state Current Limit Register. For the
wrong P-state value there is no ACPI _PSS object defined and
powernow-k8/cpufreq can't determine the proper CPU frequency for that
state.
As a consequence this can cause a panic during boot (potentially with
all recent kernel versions -- at least I have reproduced it with
various 2.6.27 kernels and with the current .28 series), as an
example:
powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82 processors (2 \
)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
BUG: unable to handle kernel paging request at ffff88086e7528b8
IP: [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
PGD 202063 PUD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.28-rc3-dirty #16
RIP: 0010:[<ffffffff80486361>] [<ffffffff80486361>] cpufreq_stats_update+0x4a/0\
f
Synaptics claims to have extended capabilities, but I'm not able to read them.<6\
6
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88006e7528c0
RDX: 00000000ffffffff RSI: ffff88006e54af00 RDI: ffffffff808f056c
RBP: 00000000fffee697 R08: 0000000000000003 R09: ffff88006e73f080
R10: 0000000000000001 R11: 00000000002191c0 R12: ffff88006fb83c10
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88006fb50740(0000) knlGS:0000000000000000
Unable to initialize Synaptics hardware.
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88086e7528b8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88006fb82000, task ffff88006fb816d0)
Stack:
ffff88006e74da50 0000000000000000 ffff88006e54af00 ffffffff804863c7
ffff88006e74da50 0000000000000000 00000000ffffffff 0000000000000000
ffff88006fb83c10 ffffffff8024b46c ffffffff808f0560 ffff88006fb83c10
Call Trace:
[<ffffffff804863c7>] ? cpufreq_stat_notifier_trans+0x51/0x83
[<ffffffff8024b46c>] ? notifier_call_chain+0x29/0x4c
[<ffffffff8024b561>] ? __srcu_notifier_call_chain+0x46/0x61
[<ffffffff8048496d>] ? cpufreq_notify_transition+0x93/0xa9
[<ffffffff8021ab8d>] ? powernowk8_target+0x1e8/0x5f3
[<ffffffff80486687>] ? cpufreq_governor_performance+0x1b/0x20
[<ffffffff80484886>] ? __cpufreq_governor+0x71/0xa8
[<ffffffff80484b21>] ? __cpufreq_set_policy+0x101/0x13e
[<ffffffff80485bcd>] ? cpufreq_add_dev+0x3f0/0x4cd
[<ffffffff8048577a>] ? handle_update+0x0/0x8
[<ffffffff803c2062>] ? sysdev_driver_register+0xb6/0x10d
[<ffffffff8056592c>] ? powernowk8_init+0x0/0x7e
[<ffffffff8048604c>] ? cpufreq_register_driver+0x8f/0x140
[<ffffffff80209056>] ? _stext+0x56/0x14f
[<ffffffff802c2234>] ? proc_register+0x122/0x17d
[<ffffffff802c23a0>] ? create_proc_entry+0x73/0x8a
[<ffffffff8025c259>] ? register_irq_proc+0x92/0xaa
[<ffffffff8025c2c8>] ? init_irq_proc+0x57/0x69
[<ffffffff807fc85f>] ? kernel_init+0x116/0x169
[<ffffffff8020cc79>] ? child_rip+0xa/0x11
[<ffffffff807fc749>] ? kernel_init+0x0/0x169
[<ffffffff8020cc6f>] ? child_rip+0x0/0x11
Code: 05 c5 83 36 00 48 c7 c2 48 5d 86 80 48 8b 04 d8 48 8b 40 08 48 8b 34 02 48\
RIP [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
RSP <ffff88006fb83b20>
CR2: ffff88086e7528b8
---[ end trace 0678bac75e67a2f7 ]---
Kernel panic - not syncing: Attempted to kill init!
In short, aftereffect of the wrong P-state is that
cpufreq_stats_update() uses "-1" as index for some array in
cpufreq_stats_update (unsigned int cpu)
{
...
if (stat->time_in_state)
stat->time_in_state[stat->last_index] =
cputime64_add(stat->time_in_state[stat->last_index],
cputime_sub(cur_time, stat->last_time));
...
}
Fortunately, the wrong P-state value is returned only if the core is
in P-state 0. This fix solves the problem by detecting the
out-of-range P-state, ignoring it, and using "0" instead.
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: restructure DS memory allocation to be done by the usage site of DS
Require pre-allocated buffers in ds.h.
Move the BTS buffer allocation for ptrace into ptrace.c.
The pointer to the allocated buffer is stored in the traced task's
task_struct together with the handle returned by ds_request_bts().
Removes memory accounting code.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: generalize the DS code to shared buffers
Change the in-kernel ds.h interface to identify the tracer via a
handle returned on ds_request_~().
Tracers used to be identified via their task_struct.
The changes are required to allow DS to be shared between different
tasks, which is needed for perfmon2 and for ftrace.
For ptrace, the handle is stored in the traced task's task_struct.
This should probably go into a (arch-specific) ptrace context some
time.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sleeping-with-spinlock-held bugs/crashes
- Turn a wrmsr to write the DS_AREA MSR into a wrmsrl.
- Use irqsave variants of spinlocks.
- Do not allocate memory while holding spinlocks.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix DS hw enablement on 64-bit x86
Fix the PEBS record size in the DS configuration.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the CONFIG guard from the .c file into the makefile.
Reported-by: Andi Kleen <andi-suse@firstfloor.org>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix theoretical option string parsing overflow
Since bridge is unsigned, it would seem better to use simple_strtoul that
simple_strtol.
A simplified version of the semantic patch that makes this change is as
follows: (http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r2@
long e;
position p;
@@
e = simple_strtol@p(...)
@@
position p != r2.p;
type T;
T e;
@@
e =
- simple_strtol@p
+ simple_strtoul
(...)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: muli@il.ibm.com
Cc: jdmason@kudzu.us
Cc: discuss@x86-64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build warning
this warning:
arch/x86/kernel/hpet.c:36: warning: ‘hpet_num_timers’ defined but not used
Triggers because hpet_num_timers is unused in the !CONFIG_PCI_MSI case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reverts sequence of reboot fallbacks
Checkin 14d7ca5c57 changed the default
reboot method to "pci", a.k.a. port CF9. Unfortunately this has been
shown to cause lockups on at least two systems for which REBOOT_KBD
worked, both Thinkpads with Intel chipsets. Checkin
3889d0cea2 reverted the default, but did
not revert the fallback chain. This checkin reverts the fallback
chain; port CF9 is now only done by explicit "reboot=pci" or a future
potential DMI key.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: make global variable static
Fix this sparse warning:
arch/x86/kernel/hpet.c:36:18: warning: symbol 'hpet_num_timers' was
not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve backtrace quality
avoid the confusion in call trace because of the lack of padding at the
tail of function.
When do_exit gets called, the return address behind call instruction is
pushed into stack. If something get wrong in do_exit, for x86_64, the
entry "kernel_execve +0x00/0xXX" rather than "child_rip +0xYY/0xZZ" is
in the call trace.
That looks confusing, so add a u2d to make the return address still part
of the original call site. (This also catches any instances of us returning
from that function somehow.)
Signed-off-by: jia zhang <jia.zhang2008@gmail.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: move some code out of .kprobes.text
KPROBE_ENTRY switches code generation to .kprobes.text, and KPROBE_END
uses .popsection to get back to the previous section (.text, normally).
Also replace ENDPROC by END, for consistency.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup of entry_64.S
Except for the order and the place of the functions, this
patch should not change the generated code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move recently introduced dwarf2 macros to dwarf2.h file.
It allow us to not duplicate them in assembly files.
Active usage of _cfi macros don't make assembly files
more obvious to understand but we already have a lot of
macros there which requires to search the definitions
of them *anyway*. But at least it make every cfi usage
one line shorter.
Also some code alignment is done.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix MSIx not enough irq numbers available regression
The manual revert of the sparse_irq patches missed to bring the number
of possible irqs back to the .27 status. This resulted in a regression
when two multichannel network cards were placed in a system with only
one IO_APIC - causing the networking driver to not have the right
IRQ and the device not coming up.
Remove the dynamic allocation logic leftovers and simply return
NR_IRQS in probe_nr_irqs() for now.
Fixes: http://lkml.org/lkml/2008/11/19/354
Reported-by: Jesper Dangaard Brouer <hawk@diku.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jesper Dangaard Brouer <hawk@diku.dk>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/hypervisor.c:37:15: warning: symbol
'get_hypervisor_tsc_freq' was not declared. Should it be static?
arch/x86/kernel/cpu/hypervisor.c:53:16: warning: symbol
'init_hypervisor' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/vmware.c:69:5: warning: symbol 'vmware_platform'
was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:89:15: warning: symbol
'vmware_get_tsc_khz' was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:107:16: warning: symbol
'vmware_set_feature_bits' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make the headers portion of signal_32.c and signal_64.c the same.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new (default-off) tracing visualization feature
Usage example:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled
Then read one of 'trace','latency_trace','trace_pipe'.
To get the best output you can compile your userspace programs with
frame pointers (at least glibc + the app you are tracing).
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use deeper function tracing depth safely
Some tests showed that function return tracing needed a more deeper depth
of function calls. But it could be unsafe to store these return addresses
to the stack.
So these arrays will now be allocated dynamically into task_struct of current
only when the tracer is activated.
Typical scheme when tracer is activated:
- allocate a return stack for each task in global list.
- fork: allocate the return stack for the newly created task
- exit: free return stack of current
- idle init: same as fork
I chose a default depth of 50. I don't have overruns anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Simplify the irq-sampled stack overflow debug check:
- eliminate an #idef
- use WARN_ONCE() to emit a single warning (all bets are off
after the first such warning anyway)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make stack overflow debug check and printout narrower
stack_overflow_check() should consider the stack usage of pt_regs, and
thus it could warn us in advance. Additionally, it looks better for
the warning time to start at INITIAL_JIFFIES.
Assuming that rsp gets close to the check point before interrupt
arrives: when interrupt really happens, thread_info will be partly
overrode.
Signed-off-by: jia zhang <jia.zhang2008@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reverts default reboot method.
Checkin 14d7ca5c57 changed the default
reboot method to "pci", a.k.a. port CF9. Unfortunately this has been
shown to cause lockups on at least two systems for which REBOOT_KBD
worked, both Thinkpads with Intel chipsets. This reverts the default
to REBOOT_KBD, while leaving the option to have "reboot=pci" specified
explicitly or via a DMI match.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix bootup crash
Even though it tested fine for me, there was still a bug in the
first patch: I have overlooked a call to ptregscall_common. This
patch fixes that, I think, but the code is never executed for
me while running a debian install... (I tested this by putting
an "1:jmp 1b" in there.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
DISABLE_INTERRUPTS(CLBR_NONE)/TRACE_IRQS_OFF is now always
executed just before paranoid_exit. Move it there.
Split out paranoidzeroentry, paranoiderrorentry, and
paranoidzeroentry_ist to get more readable macro's.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, shrink kernel image size
Also expand the paranoid_exit0 macro into nmi_exit inside the
nmi stub in the case of enabled irq-tracing.
This gives a few hundred bytes code size reduction.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The save_rest function completes a partial stack frame for use
by the PTREGSCALL macro. This also avoids the indirect call in
PTREGSCALLs.
This adds the macro movq_cfi_restore to hide the CFI_RESTORE
annotation when restoring a register from the stack frame.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
The break builds with older binutils (2.16.1):
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:282: Error: too many positional arguments
arch/x86/kernel/entry_64.S:283: Error: too many positional arguments
arch/x86/kernel/entry_64.S:284: Error: too many positional arguments
arch/x86/kernel/entry_64.S:285: Error: too many positional arguments
arch/x86/kernel/entry_64.S:286: Error: too many positional arguments
arch/x86/kernel/entry_64.S:287: Error: too many positional arguments
arch/x86/kernel/entry_64.S:288: Error: too many positional arguments
arch/x86/kernel/entry_64.S:289: Error: too many positional arguments
arch/x86/kernel/entry_64.S:290: Error: too many positional arguments
Took some time to figure out the detail that GAS chokes on: it's
negative offsets. Rearrange the calculations to make sure we never
go negative.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we migrate an interrupt from one CPU to another, we set the
move_in_progress flag and clean up the vectors later once they're not
being used. If you're unlucky and call destroy_irq() before the vectors
become un-used, the move_in_progress flag is never cleared, which causes
the interrupt to become unusable.
This was discovered by Jesse Brandeburg for whom it manifested as an
MSI-X device refusing to use MSI-X mode when the driver was unloaded
and reloaded repeatedly.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This add-on patch to x86: move entry_64.S register saving out
of the macros visually cleans up the appearance of the code by
introducing some basic helper macro's. It also adds some cfi
annotations which were missing.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Annotate xsave_cntxt_init() as "can be called outside of __init".
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix incorrect __init annotation
This patch removes the following section mismatch warning. A patch set
was send previously (http://lkml.org/lkml/2008/11/10/407). But
introduce some other problem, reported by Rufus
(http://lkml.org/lkml/2008/11/11/46). Then Ingo Molnar suggest that,
it's best to remove __init from xsave_cntxt_init(void). Which is the
second patch in this series. Now, this one removes the following
warning.
WARNING: arch/x86/kernel/built-in.o(.cpuinit.text+0x2237): Section
mismatch in reference from the function cpu_init() to the function
.init.text:init_thread_xstate()
The function __cpuinit cpu_init() references
a function __init init_thread_xstate().
If init_thread_xstate is only used by cpu_init then
annotate init_thread_xstate with a matching annotation.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Here is a combined patch that moves "save_args" out-of-line for
the interrupt macro and moves "error_entry" mostly out-of-line
for the zeroentry and errorentry macros.
The save_args function becomes really straightforward and easy
to understand, with the possible exception of the stack switch
code, which now needs to copy the return address of to the
calling function. Normal interrupts arrive with ((~vector)-0x80)
on the stack, which gets adjusted in common_interrupt:
<common_interrupt>:
(5) addq $0xffffffffffffff80,(%rsp) /* -> ~(vector) */
(4) sub $0x50,%rsp /* space for registers */
(5) callq ffffffff80211290 <save_args>
(5) callq ffffffff80214290 <do_IRQ>
<ret_from_intr>:
...
An apic interrupt stub now look like this:
<thermal_interrupt>:
(5) pushq $0xffffffffffffff05 /* ~(vector) */
(4) sub $0x50,%rsp /* space for registers */
(5) callq ffffffff80211290 <save_args>
(5) callq ffffffff80212b8f <smp_thermal_interrupt>
(5) jmpq ffffffff80211f93 <ret_from_intr>
Similarly the exception handler register saving function becomes
simpler, without the need of any parameter shuffling. The stub
for an exception without errorcode looks like this:
<overflow>:
(6) callq *0x1cad12(%rip) # ffffffff803dd448 <pv_irq_ops+0x38>
(2) pushq $0xffffffffffffffff /* no syscall */
(4) sub $0x78,%rsp /* space for registers */
(5) callq ffffffff8030e3b0 <error_entry>
(3) mov %rsp,%rdi /* pt_regs pointer */
(2) xor %esi,%esi /* no error code */
(5) callq ffffffff80213446 <do_overflow>
(5) jmpq ffffffff8030e460 <error_exit>
And one for an exception with errorcode like this:
<segment_not_present>:
(6) callq *0x1cab92(%rip) # ffffffff803dd448 <pv_irq_ops+0x38>
(4) sub $0x78,%rsp /* space for registers */
(5) callq ffffffff8030e3b0 <error_entry>
(3) mov %rsp,%rdi /* pt_regs pointer */
(5) mov 0x78(%rsp),%rsi /* load error code */
(9) movq $0xffffffffffffffff,0x78(%rsp) /* no syscall */
(5) callq ffffffff80213209 <do_segment_not_present>
(5) jmpq ffffffff8030e460 <error_exit>
Unfortunately, this last type is more than 32 bytes. But the total space
savings due to this patch is about 2500 bytes on an smp-configuration,
and I think the code is clearer than it was before. The tested kernels
were non-paravirt ones (i.e., without the indirect call at the top of
the exception handlers).
Anyhow, I tested this patch on top of a recent -tip. The machine
was an 2x4-core Xeon at 2333MHz. Measured where the delays between
(almost-)adjacent rdtsc instructions. The graphs show how much
time is spent outside of the program as a function of the measured
delay. The area under the graph represents the total time spent
outside the program. Eight instances of the rdtsctest were
started, each pinned to a single cpu. The histogams are added.
For each kernel two measurements were done: one in mostly idle
condition, the other while running "bonnie++ -f", bound to cpu 0.
Each measurement took 40 minutes runtime. See the attached graphs
for the results. The graphs overlap almost everywhere, but there
are small differences.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
arch/x86/kernel/setup.c:592: warning: 'dmi_low_memory_corruption' defined but not used
this is only used if CONFIG_X86_RESERVE_LOW_64K is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce size of the kernel image a bit
Fix:
arch/x86/kernel/genx2apic_uv_x.c:403: warning: 'uv_heartbeat_disable' defined but not used
the function is only used when CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell Optiplex 330 appears to hang on reboot. This is resolved by adding
a quirk to set bios reboot.
Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
Signed-off-by: Steve Conklin <steve.conklin@canonical.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
We can autodetect those system that need cluster apic, and update genapic
accordingly.
We can also remove wakeup.h for e7000, because it's default one is now
the same as overall default mach_wakecpu.h
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen the reach of the low-memory-protect DMI quirk
Phoenix BIOSes variously identify their vendor as "Phoenix Technologies,
LTD" or "Phoenix Technologies LTD" (without the comma.)
This patch makes the identification string in the bad_bios_dmi_table
more general (following a suggestion by Ingo Molnar), so that both
versions are handled.
Again, the patched file compiles cleanly and the patch has been tested
successfully on my machine.
Signed-off-by: Philipp Kohlbecher <xt28@gmx.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: makes device isolation the default for AMD IOMMU
Some device drivers showed double-free bugs of DMA memory while testing
them with AMD IOMMU. If all devices share the same protection domain
this can lead to data corruption and data loss. Prevent this by putting
each device into its own protection domain per default.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
this compiler warning:
arch/x86/kernel/ds.c: In function 'ds_request':
arch/x86/kernel/ds.c:368: warning: 'context' may be used uninitialized in this function
Shows that the code flow in ds_request() is buggy - it goes into
the unlock+release-context path even when the context is not allocated
yet.
First allocate the context, then do the other checks.
Also, take care with GFP allocations under the ds_lock spinlock.
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: help to find the better depth of trace
We decided to arbitrary define the depth of function return trace as
"20". Perhaps this is not enough. To help finding an optimal depth, we
measure now the overrun: the number of functions that have been missed
for the current thread. By default this is not displayed, we have to
do set a particular flag on the return tracer: echo overrun >
/debug/tracing/trace_options And the overrun will be printed on the
right.
As the trace shows below, the current 20 depth is not enough.
update_wall_time+0x37f/0x8c0 -> update_xtime_cache (345 ns) (Overruns: 2838)
update_wall_time+0x384/0x8c0 -> clocksource_get_next (1141 ns) (Overruns: 2838)
do_timer+0x23/0x100 -> update_wall_time (3882 ns) (Overruns: 2838)
tick_do_update_jiffies64+0xbf/0x160 -> do_timer (5339 ns) (Overruns: 2838)
tick_sched_timer+0x6a/0xf0 -> tick_do_update_jiffies64 (7209 ns) (Overruns: 2838)
vgacon_set_cursor_size+0x98/0x120 -> native_io_delay (2613 ns) (Overruns: 274)
vgacon_cursor+0x16e/0x1d0 -> vgacon_set_cursor_size (33151 ns) (Overruns: 274)
set_cursor+0x5f/0x80 -> vgacon_cursor (36432 ns) (Overruns: 274)
con_flush_chars+0x34/0x40 -> set_cursor (38790 ns) (Overruns: 274)
release_console_sem+0x1ec/0x230 -> up (721 ns) (Overruns: 274)
release_console_sem+0x225/0x230 -> wake_up_klogd (316 ns) (Overruns: 274)
con_flush_chars+0x39/0x40 -> release_console_sem (2996 ns) (Overruns: 274)
con_write+0x22/0x30 -> con_flush_chars (46067 ns) (Overruns: 274)
n_tty_write+0x1cc/0x360 -> con_write (292670 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2a/0x90 -> native_apic_mem_write (330 ns) (Overruns: 274)
irq_enter+0x17/0x70 -> idle_cpu (413 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2f/0x90 -> irq_enter (1525 ns) (Overruns: 274)
ktime_get_ts+0x40/0x70 -> getnstimeofday (465 ns) (Overruns: 274)
ktime_get_ts+0x60/0x70 -> set_normalized_timespec (436 ns) (Overruns: 274)
ktime_get+0x16/0x30 -> ktime_get_ts (2501 ns) (Overruns: 274)
hrtimer_interrupt+0x77/0x1a0 -> ktime_get (3439 ns) (Overruns: 274)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wakeup_secondary_cpu with hotplug
We can not put that into x86_quirks, because that is __initdata.
So try to move that to genapic, and add update_genapic in x86_quirks.
later we even could use that stub to:
1. autodetect CONFIG_ES7000_CLUSTERED_APIC
2. more correct inquire_remote_apic with apic_verbosity setting.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix incorrectly marked unstable TSC clock
Patch (commit 0d12cdd "sched: improve sched_clock() performance") has
a regression on one of the test systems here.
With the patch, I see:
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 28 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
Whereas, without the patch syncs pass fine on all CPUs:
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Due to this, TSC is marked unstable, when it is not actually unstable.
This is because syncs in check_tsc_wrap() goes away due to this commit.
As per the discussion on this thread, correct way to fix this is to add
explicit syncs as below?
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix secondary-CPU wakeup/init path with numaq and es7000
While looking at wakeup_secondary_cpu for WAKE_SECONDARY_VIA_NMI:
|#ifdef WAKE_SECONDARY_VIA_NMI
|/*
| * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
| * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
| * won't ... remember to clear down the APIC, etc later.
| */
|static int __devinit
|wakeup_secondary_cpu(int logical_apicid, unsigned long start_eip)
|{
| unsigned long send_status, accept_status = 0;
| int maxlvt;
|...
| if (APIC_INTEGRATED(apic_version[phys_apicid])) {
| maxlvt = lapic_get_maxlvt();
I noticed that there is no warning about undefined phys_apicid...
because WAKE_SECONDARY_VIA_NMI and WAKE_SECONDARY_VIA_INIT can not be
defined at the same time. So NUMAQ is using wrong wakeup_secondary_cpu.
WAKE_SECONDARY_VIA_NMI, WAKE_SECONDARY_VIA_INIT and
WAKE_SECONDARY_VIA_MIP are variants of a weird and fragile
preprocessor-driven "HAL" mechanisms to specify the kind of secondary-CPU
wakeup strategy a given x86 kernel will use.
The vast majority of systems want to use INIT for secondary wakeup - NUMAQ
uses an NMI, (old-style-) ES7000 uses 'MIP' (a firmware driven in-memory
flag to let secondaries continue).
So convert these mechanisms to x86_quirks and add a
->wakeup_secondary_cpu() method to specify the rare exception
to the sane default.
Extend genapic accordingly as well, for 32-bit.
While looking further, I noticed that functions in wakecup.h for numaq
and es7000 are different to the default in mach_wakecpu.h - but smpboot.c
will only use default mach_wakecpu.h with smphook.h.
So we need to add mach_wakecpu.h for mach_generic, to properly support
numaq and es7000, and vectorize the following SMP init methods:
int trampoline_phys_low;
int trampoline_phys_high;
void (*wait_for_init_deassert)(atomic_t *deassert);
void (*smp_callin_clear_local_apic)(void);
void (*store_NMI_vector)(unsigned short *high, unsigned short *low);
void (*restore_NMI_vector)(unsigned short *high, unsigned short *low);
void (*inquire_remote_apic)(int apicid);
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
All blame goes to: color white,red "[^[:graph:]]+$"
in .nanorc ;).
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix es7000 build
CC arch/x86/kernel/es7000_32.o
arch/x86/kernel/es7000_32.c: In function find_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:255: error: implicit declaration of function acpi_get_table_with_size
arch/x86/kernel/es7000_32.c:261: error: implicit declaration of function early_acpi_os_unmap_memory
arch/x86/kernel/es7000_32.c: In function unmap_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:277: error: implicit declaration of function __acpi_unmap_table
make[1]: *** [arch/x86/kernel/es7000_32.o] Error 1
we applied one patch out of order...
| commit a73aaedd95
| Author: Yinghai Lu <yhlu.kernel@gmail.com>
| Date: Sun Sep 14 02:33:14 2008 -0700
|
| x86: check dsdt before find oem table for es7000, v2
|
| v2: use __acpi_unmap_table()
that patch need:
x86: use early_ioremap in __acpi_map_table
x86: always explicitly map acpi memory
acpi: remove final __acpi_map_table mapping before setting acpi_gbl_permanent_mmap
acpi/x86: introduce __apci_map_table, v4
submitted to the ACPI tree but not upstream yet.
fix it until those patches applied, need to revert this one
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a problem where ds_request() returned an error without releasing the
ds lock.
Reported-by: Stephane Eranian <eranian@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the support for dynamic tracing on the function return tracer.
The whole difference with normal dynamic function tracing is that we don't need
to hook on a particular callback. The only pro that we want is to nop or set
dynamically the calls to ftrace_caller (which is ftrace_return_caller here).
Some security checks ensure that we are not trying to launch dynamic tracing for
return tracing while normal function tracing is already running.
An example of trace with getnstimeofday set as a filter:
ktime_get_ts+0x22/0x50 -> getnstimeofday (2283 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1396 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1825 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1426 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1524 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1434 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1502 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1404 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1397 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1051 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1314 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1344 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1163 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1390 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1374 ns)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix possible race condition in ftrace function return tracer
This fixes a possible race condition if index incrementation
is not immediately flushed in memory.
Thanks for Andi Kleen and Steven Rostedt for pointing out this issue
and give me this solution.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow archs more flexibility on dynamic ftrace implementations
Dynamic ftrace has largly been developed on x86. Since x86 does not
have the same limitations as other architectures, the ftrace interaction
between the generic code and the architecture specific code was not
flexible enough to handle some of the issues that other architectures
have.
Most notably, module trampolines. Due to the limited branch distance
that archs make in calling kernel core code from modules, the module
load code must create a trampoline to jump to what will make the
larger jump into core kernel code.
The problem arises when this happens to a call to mcount. Ftrace checks
all code before modifying it and makes sure the current code is what
it expects. Right now, there is not enough information to handle modifying
module trampolines.
This patch changes the API between generic dynamic ftrace code and
the arch dependent code. There is now two functions for modifying code:
ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into
a nop, where the original text is calling addr. (mod is the
module struct if called by module init)
ftrace_make_caller(rec, addr) - convert the code rec->ip that should
be a nop into a caller to addr.
The record "rec" now has a new field called "arch" where the architecture
can add any special attributes to each call site record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e51af66308, which was
wrongly hoovered up and submitted about a month after a better fix had
already been merged.
The better fix is commit cbda1ba898
("PCI/iommu: blacklist DMAR on Intel G31/G33 chipsets"), where we do
this blacklisting based on the DMI identification for the offending
motherboard, since sometimes this chipset (or at least a chipset with
the same PCI ID) apparently _does_ actually have an IOMMU.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: Fix interrupt via the apicinterrupt macro
Checkin 939b787130 changed the
"interrupt" macro, but the "interrupt" macro is also invoked
indirectly from the "apicinterrupt" macro.
The "apicinterrupt" macro probably should have its own collection of
systematic stubs for the same reason the main IRQ code does; as is it
is a huge amount of replicated code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Optimize a bit the function return tracer
This patch changes the calling convention of prepare_ftrace_return to
pass its arguments by register. This will optimize it a bit and
prepare it to support dynamic tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove spinlocks and irq disabling in function return tracer.
I've tried to figure out all of the race condition that could happen
when the tracer pushes or pops a return address trace to/from the
current thread_info.
Theory:
_ One thread can only execute on one cpu at a time. So this code
doesn't need to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an
interrupt. If an interrupt is raised, it will increase the index of
the return stack storage and then execute until the end of the
tracing to finally free the index it used. We don't need to disable
irqs.
This is theorical. In practice, I've tested it with a two-core SMP and
had no problem at all. Perhaps -tip testing could confirm it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: name change of unlikely tracer and profiler
Ingo Molnar suggested changing the config from UNLIKELY_PROFILE
to BRANCH_PROFILING. I never did like the "unlikely" name so I
went one step farther, and renamed all the unlikely configurations
to a "BRANCH" variant.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add debug check
If none of the perfctrs are free when calculating cpu_khz we default to
using ctr 3 (ie, we just choose 3). This may lead to an incorrect tsc
freq value which can cause the system to be unstable.
To aid in future debugging, WARN the user of a potential problem.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
KVM: Fix pit memory leak if unable to allocate irq source id
KVM: ia64: fix vmm_spin_{un}lock for !CONFIG_SMP
KVM: VMX: Set IGMT bit in EPT entry
KVM: Require the PCI subsystem
x86: KVM guest: fix section mismatch warning in kvmclock.c
KVM: ia64: Use guest signal mask when blocking
KVM: MMU: increase per-vcpu rmap cache alloc size
Older versions of gas don't implement the C-style != operator, they
instead want the Pascal-style <> operator. Change != to <> so we
don't break compilation with those old versions of gas.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (47 commits)
ACPI: pci_link: remove acpi_irq_balance_set() interface
fujitsu-laptop: Add DMI callback for Lifebook S6420
ACPI: EC: Don't do transaction from GPE handler in poll mode.
ACPI: EC: lower interrupt storm treshold
ACPICA: Use spinlock for acpi_{en|dis}able_gpe
ACPI: EC: restart failed command
ACPI: EC: wait for last write gpe
ACPI: EC: make kernel messages more useful when GPE storm is detected
ACPI: EC: revert msleep patch
thinkpad_acpi: fingers off backlight if video.ko is serving this functionality
sony-laptop: fingers off backlight if video.ko is serving this functionality
msi-laptop: fingers off backlight if video.ko is serving this functionality
fujitsu-laptop: fingers off backlight if video.ko is serving this functionality
eeepc-laptop: fingers off backlight if video.ko is serving this functionality
compal: fingers off backlight if video.ko is serving this functionality
asus-acpi: fingers off backlight if video.ko is serving this functionality
Acer-WMI: fingers off backlight if video.ko is serving this functionality
ACPI video: if no ACPI backlight support, use vendor drivers
ACPI: video: Ignore devices that aren't present in hardware
Delete an unwanted return statement at evgpe.c
...
Impact: make nmi_shootdown_cpus() callable from preemptible context
We need to know on which CPU we are running on, and we don't want to be
preempted while doing this.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen nmi_shootdown_cpus() availability
The X86_LOCAL_APIC #ifdef was for kdump. For !SMP, the function simply
does nothing.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make nmi_shootdown_cpus() available to the rest of the x86 platform
Now nmi_shootdown_cpus() is ready to be used by non-kdump code also.
Move it to reboot.c.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make API available to the rest of x86 platform code
Add prototype to asm/reboot.h.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend nmi_shootdown_cpus() with a callback
The reboot code will use a different function on crash_nmi_callback().
Adding a function pointer parameter to nmi_shootdown_cpus() for that.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
For the kdump-specific code that was living on nmi_shootdown_cpus().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This variable will be moved to non-kdump-specific code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The NMI CPU-halting code will be used on non-kdump cases, also
(e.g. emergency_reboot when virtualization is enabled).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bootup crash
the branch tracer missed arch/x86/vdso/vclock_gettime.c from
disabling tracing, which caused such bootup crashes:
[ 201.840097] init[1]: segfault at 7fffed3fe7c0 ip 00007fffed3fea2e sp 000077
also clean up the ugly ifdefs in arch/x86/kernel/vsyscall_64.c by
creating DISABLE_UNLIKELY_PROFILE facility for code to turn off
instrumentation on a per file basis.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new unlikely/likely profiler
Andrew Morton recently suggested having an in-kernel way to profile
likely and unlikely macros. This patch achieves that goal.
When configured, every(*) likely and unlikely macro gets a counter attached
to it. When the condition is hit, the hit and misses of that condition
are recorded. These numbers can later be retrieved by:
/debugfs/tracing/profile_likely - All likely markers
/debugfs/tracing/profile_unlikely - All unlikely markers.
# cat /debug/tracing/profile_unlikely | head
correct incorrect % Function File Line
------- --------- - -------- ---- ----
2167 0 0 do_arch_prctl process_64.c 832
0 0 0 do_arch_prctl process_64.c 804
2670 0 0 IS_ERR err.h 34
71230 5693 7 __switch_to process_64.c 673
76919 0 0 __switch_to process_64.c 639
43184 33743 43 __switch_to process_64.c 624
12740 64181 83 __switch_to process_64.c 594
12740 64174 83 __switch_to process_64.c 590
# cat /debug/tracing/profile_unlikely | \
awk '{ if ($3 > 25) print $0; }' |head -20
44963 35259 43 __switch_to process_64.c 624
12762 67454 84 __switch_to process_64.c 594
12762 67447 84 __switch_to process_64.c 590
1478 595 28 syscall_get_error syscall.h 51
0 2821 100 syscall_trace_leave ptrace.c 1567
0 1 100 native_smp_prepare_cpus smpboot.c 1237
86338 265881 75 calc_delta_fair sched_fair.c 408
210410 108540 34 calc_delta_mine sched.c 1267
0 54550 100 sched_info_queued sched_stats.h 222
51899 66435 56 pick_next_task_fair sched_fair.c 1422
6 10 62 yield_task_fair sched_fair.c 982
7325 2692 26 rt_policy sched.c 144
0 1270 100 pre_schedule_rt sched_rt.c 1261
1268 48073 97 pick_next_task_rt sched_rt.c 884
0 45181 100 sched_info_dequeued sched_stats.h 177
0 15 100 sched_move_task sched.c 8700
0 15 100 sched_move_task sched.c 8690
53167 33217 38 schedule sched.c 4457
0 80208 100 sched_info_switch sched_stats.h 270
30585 49631 61 context_switch sched.c 2619
# cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }'
39900 36577 47 pick_next_task sched.c 4397
20824 15233 42 switch_mm mmu_context_64.h 18
0 7 100 __cancel_work_timer workqueue.c 560
617 66484 99 clocksource_adjust timekeeping.c 456
0 346340 100 audit_syscall_exit auditsc.c 1570
38 347350 99 audit_get_context auditsc.c 732
0 345244 100 audit_syscall_entry auditsc.c 1541
38 1017 96 audit_free auditsc.c 1446
0 1090 100 audit_alloc auditsc.c 862
2618 1090 29 audit_alloc auditsc.c 858
0 6 100 move_masked_irq migration.c 9
1 198 99 probe_sched_wakeup trace_sched_switch.c 58
2 2 50 probe_wakeup trace_sched_wakeup.c 227
0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144
4514 2090 31 __grab_cache_page filemap.c 2149
12882 228786 94 mapping_unevictable pagemap.h 50
4 11 73 __flush_cpu_slab slub.c 1466
627757 330451 34 slab_free slub.c 1731
2959 61245 95 dentry_lru_del_init dcache.c 153
946 1217 56 load_elf_binary binfmt_elf.c 904
102 82 44 disk_put_part genhd.h 206
1 1 50 dst_gc_task dst.c 82
0 19 100 tcp_mss_split_point tcp_output.c 1126
As you can see by the above, there's a bit of work to do in rethinking
the use of some unlikelys and likelys. Note: the unlikely case had 71 hits
that were more than 25%.
Note: After submitting my first version of this patch, Andrew Morton
showed me a version written by Daniel Walker, where I picked up
the following ideas from:
1) Using __builtin_constant_p to avoid profiling fixed values.
2) Using __FILE__ instead of instruction pointers.
3) Using the preprocessor to stop all profiling of likely
annotations from vsyscall_64.c.
Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar
for their feed back on this patch.
(*) Not ever unlikely is recorded, those that are used by vsyscalls
(a few of them) had to have profiling disabled.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This removes the acpi_irq_balance_set() interface from the PCI
interrupt link driver.
x86 used acpi_irq_balance_set() to tell the PCI interrupt link
driver to configure links to minimize IRQ sharing. But the link
driver can easily figure out whether to turn on IRQ balancing
based on the IRQ model (PIC/IOAPIC/etc), so we can get rid of
that external interface.
It's better for the driver to figure this out at init-time. If
we set it externally via the x86 code, the interface reduces
modularity, and we depend on the fact that acpi_process_madt()
happens before we process the kernel command line.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Changes reboot behavior.
If port CF9 seems to be safe to touch, attempt it before trying the
keyboard controller. Port CF9 is not available on all chipsets (a
significant but decreasing number of modern chipsets don't implement
it), but port CF9 itself should in general be safe to poke (no ill
effects if unimplemented) on any system which has PCI Configuration
Method #1 or #2, as it falls inside the PCI configuration port range
in both cases. No chipset without PCI is known to have port CF9,
either, although an explicit "pci=bios" would mean we miss this and
therefore don't use port CF9. An explicit "reboot=pci" can be used to
force the use of port CF9.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move the IRQ stub generation to assembly to simplify it and for
consistency with 32 bits. Doing it in a C file with asm() statements
doesn't help clarity, and it prevents some optimizations.
Shrink the IRQ stubs down to just over four bytes per (we fit seven
into a 32-byte chunk.) This shrinks the total icache consumption of
the IRQ stubs down to an even kilobyte, if all of them are in active
use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips.
To get the most effect, cache-align the IRQ stubs.
This makes the 64-bit code match changes already done to the 32-bit
code, and should open up irqinit*.c for unification.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Shrink the IRQ stubs on 32 bits down to just over four bytes per (we
fit seven into a 32-byte chunk.) This shrinks the total icache
consumption of the IRQ stubs down to an even kilobyte, if all of them
are in active use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips (the exception being
486/Elan/Geode which have only 16-byte cachelines, but are unlikely to
have too many interrupt sources.)
To get the most effect, cache-align the IRQ stubs.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>