Impact: change calling convention of existing clock_event APIs
struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.
Another single-patch change. For safety, we BUG_ON() in
clockevents_register_device() if it's not set.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Impact: change existing irq_chip API
Not much point with gentle transition here: the struct irq_chip's
setaffinity method signature needs to change.
Fortunately, not widely used code, but hits a few architectures.
Note: In irq_select_affinity() I save a temporary in by mangling
irq_desc[irq].affinity directly. Ingo, does this break anything?
(Folded in fix from KOSAKI Motohiro)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: jeremy@xensource.com
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Impact: change calling convention of existing cpumask APIs
Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.
These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: paulus@samba.org
Cc: mingo@redhat.com
Cc: tony.luck@intel.com
Cc: ralf@linux-mips.org
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: cl@linux-foundation.org
Cc: srostedt@redhat.com
Impact: cleanup
Each SMP arch defines these themselves. Move them to a central
location.
Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
CONFIG_INIT_ALL_POSSIBLE for this rather than break them.
2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
Those archs simply have phys_cpu_present_map replaced everywhere.
3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
so I just manipulate them both in sync.
4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
declarations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
Cc: ink@jurassic.park.msu.ru
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: tony.luck@intel.com
Cc: takata@linux-m32r.org
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: paulus@samba.org
Cc: schwidefsky@de.ibm.com
Cc: lethal@linux-sh.org
Cc: wli@holomorphy.com
Cc: davem@davemloft.net
Cc: jdike@addtoit.com
Cc: mingo@redhat.com
Impact: clean up, speed up
->it_pid (was ->it_process) has also a special meaning: if it is NULL,
the timer is under deletion or it wasn't initialized yet. We can check
->it_signal != NULL instead, this way we can
- simplify sys_timer_create() a bit
- remove yet another check from lock_timer()
- move put_pid(->it_pid) into release_posix_timer() which
runs outside of ->it_lock
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: restructure, clean up code
k_itimer holds the ref to the ->it_process until sys_timer_delete(). This
allows to pin up to RLIMIT_SIGPENDING dead task_struct's. Change the code
to use "struct pid *" instead.
The patch doesn't kill ->it_process, it places ->it_pid into the union.
->it_process is still used by do_cpu_nanosleep() as before. It would be
trivial to change the nanosleep code as well, but since it uses it_process
in a special way I think it is better to keep this field for grep.
The patch bloats the kernel by 104 bytes and it also adds the new pointer,
->it_signal, to k_itimer. It is used by lock_timer() to verify that the
found timer was not created by another process. It is not clear why do we
use the global database (and thus the global idr_lock) for posix timers.
We still need the signal_struct->posix_timers which contains all useable
timers, perhaps it is better to use some form of per-process array
instead.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In my device I get many interrupts from a high speed USB device in a very
short period of time. The system spends a lot of time reprogramming the
hardware timer which is in a slower timing domain as compared to the CPU.
This results in the CPU spending a huge amount of time waiting for the
timer posting to be done. All of this reprogramming is useless as the
wake up time has not changed.
As measured using ETM trace this drops my reprogramming penalty from
almost 60% CPU load down to 15% during high interrupt rate. I can send
traces to show this.
Suppress setting of duplicate timer event when timer already stopped.
Timer programming can be very costly and can result in long cpu stall/wait
times.
[akpm@linux-foundation.org: coding-style fixes]
[tglx@linutronix.de: move the check to the right place and avoid raising
the softirq for nothing]
Signed-off-by: Richard Woodruff <r-woodruff2@ti.com>
Cc: johnstul@us.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We merge the irq/sparseirq, x86/quirks and x86/reboot trees into the
cpus4096 tree because the io-apic changes in the sparseirq change
conflict with the cpumask changes in the cpumask tree, and we
want to resolve those.
arch_reinit_sched_domains() used to call arch_update_cpu_topology()
via arch_init_sched_domains(). This call got lost with
e761b77252 ("cpu hotplug, sched: Introduce
cpu_active_map and redo sched domain managment (take 2)".
So we might end up with outdated and missing cpus in the cpu core
maps (architecture used to call arch_reinit_sched_domains if cpu
topology changed).
This adds a call to arch_update_cpu_topology in partition_sched_domains
which gets called whenever scheduling domains get updated. Which is
what is supposed to happen when cpu topology changes.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change arch_update_cpu_topology so it returns 1 if the cpu topology changed
and 0 if it didn't change. This will be useful for the next patch which adds
a call to this function in partition_sched_domains.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The trace point only caught one of many places where a task changes cpu,
put it in the right place to we get all of them.
Change the signature while we're at it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure code, cleanup
Remove BTS bits from the hw-branch-tracer (renamed from bts-tracer) and
use the ds interface.
Signed-off-by: Markus Metzger <markut.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove false positive warning
After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
it still might have pending softirqs. However take_cpu_down makes sure
that the idle task will run next instead of ksoftirqd on the taken down cpu.
The idle task will call tick_nohz_stop_sched_tick which might warn about
pending softirqs just before the cpu kills itself completely.
However the pending softirqs on the dead cpu aren't a problem because they
will be moved to an online cpu during CPU_DEAD handling.
So make sure we warn only for online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I added EXPORT_SYMBOL_GPLs for all functions part of the API
(ring_buffer.h). This is required since oprofile is using the ring
buffer and the compilation as modules would fail otherwise.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
test in 2.6.7! Bad bad bug, good good find.
The i_mmap_writable count must be incremented for VM_SHARED (just as
i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
when dup_mmap() copies the vma for fork: it has its own more optimal
version of __vma_link_file(), and I missed this out. So the count
was later going down to 0 (dangerous) when one end unmapped, then
wrapping negative (inefficient) when the other end unmapped.
The only impact on x86 would have been that setting a mandatory lock on
a file which has at some time been opened O_RDWR and mapped MAP_SHARED
(but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
when it should succeed, or succeed when it should fail.
But those architectures which rely on flush_dcache_page() to flush
userspace modifications back into the page before the kernel reads it,
may in some cases have skipped the flush after such a fork - though any
repetitive test will soon wrap the count negative, in which case it will
flush_dcache_page() unnecessarily.
Fix would be a two-liner, but mapping variable added, and comment moved.
Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
to my 966c8c12dc sprint_symbol(): use
less stack exposing a bug in slub's list_locations() -
kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
beyond the end of page provided.
The 100 slop which list_locations() allows at end of page looks roughly
enough for all the other stuff it might print after the symbol before
it checks again: break out KSYM_SYMBOL_LEN earlier than before.
Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they
need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer
where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
them.
[akpm@linux-foundation.org: ftrace.h needs module.h]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc Miles Lane <miles.lane@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running kmemtraced, which uses splice() on relayfs, causes a hard lock on
x86-64 SMP. As described by Tom Zanussi:
It looks like you hit the same problem as described here:
commit 8191ecd1d1
splice: fix infinite loop in generic_file_splice_read()
relay uses the same loop but it never got noticed or fixed.
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Tested-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: fix possible deadlock in CPU hot-remove path
This patch fixes a possible deadlock scenario in the CPU remove path.
migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
with the lock held. Then one of the tasks on the migration queue ends up
calling tg_shares_up which then also tries to acquire the same rq->lock.
[c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
[c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
[c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
[c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
[c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
[c000000058eab640] c00000000007ada4 .complete+0x54/0x80
[c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
[c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
[c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
[c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
[c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
[c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
[c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
[c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
[c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
[c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Delete excess kernel-doc notation in kernel/auditsc.c:
Warning(linux-2.6.27-git10//kernel/auditsc.c:1481): Excess function parameter or struct member 'tsk' description in 'audit_syscall_entry'
Warning(linux-2.6.27-git10//kernel/auditsc.c:1564): Excess function parameter or struct member 'tsk' description in 'audit_syscall_exit'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently audit=0 on the kernel command line does absolutely nothing.
Audit always loads and always uses its resources such as creating the
kernel netlink socket. This patch causes audit=0 to actually disable
audit. Audit will use no resources and starting the userspace auditd
daemon will not cause the kernel audit system to activate.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Documented the currently bogus state of support for CFS user groups with
user namespaces. In particular, all users in a user namespace should be
children of the user which created the user namespace. This is yet to
be implemented.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
> Ingo, this addition fixes the hotplug issue on my machine
And because we're all human...
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this warning:
kernel/trace/trace.c: In function ‘trace_vprintk’:
kernel/trace/trace.c:3626: warning: ‘flags’ may be used uninitialized in this function
shows some confusion about irq_flags / flags use here. We already have
irq_flags so remove the extra flags variable.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix SD_BALANCE_NEWIDLEand broaden its use
load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE is
set at higher level domain (3-CPU) and not in low level domain (2-MC).
pulled_task is initialised to -1 and checked for non-zero which is
always true if the lowest level sched_domain does not have
SD_BALANCE_NEWIDLE flag set.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Provide a way to pause the function graph tracer
As suggested by Steven Rostedt, the previous patch that prevented from
spinlock function tracing shouldn't use the raw_spinlock to fix it.
It's much better to follow lockdep with normal spinlock, so this patch
adds a new flag for each task to make the function graph tracer able
to be paused. We also can send an ftrace_printk whithout worrying of
the irrelevant traced spinlock during insertion.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Apply some suggestions of Steven Rostedt:
_turn tracing_selftest_running into a simple int (no need of an atomic_t)
_set it __read_mostly
_fix a comment style
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: trace more functions
When the function graph tracer is configured, three more files are not
traced to prevent only four functions to be traced. And this impacts the
normal function tracer too.
arch/x86/kernel/process_64/32.c:
I had crashes when I let this file traced. After some debugging, I saw
that the "current" task point was changed inside__swtich_to(), ie:
"write_pda(pcurrent, next_p);" inside process_64.c Since the tracer store
the original return address of the function inside current, we had
crashes. Only __switch_to() has to be excluded from tracing.
kernel/module.c and kernel/extable.c:
Because of a function used internally by the function graph tracer:
__kernel_text_address()
To let the other functions inside these files to be traced, this patch
introduces the __notrace_funcgraph function prefix which is __notrace if
function graph tracer is configured and nothing if not.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.
To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.
When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
this also makes the IRQ descriptors NUMA-local (to the site that calls
request_irq()).
This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: optimize the sched domains tree some more
The addition of SD_SERIALIZE flag added to SD_NODE_INIT prevented top level
dummy numa sched_domain to be properly degenerated on non-numa smp machine.
The reason is that in sd_parent_degenerate(), it found that the child and
parent does not have comon sched_domain flags due to SD_SERIALIZE. However,
for non-numa smp box, the top level is a dummy with a single sched_group.
Filter out SD_SERIALIZE if it is on non-numa machine to properly degenerate
top level node sched_domain. this will cut back some of the sd domain walk
in the load balancer code.
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: provide trace headers to explain a bit the output
This patch implements the print_headers callback for the function graph
tracer. These headers are output according to the current trace options.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While ideally CLONE_NEWUSER will eventually require no
privilege, the required permission checks are currently
not there. As a result, CLONE_NEWUSER has the same effect
as a setuid(0)+setgroups(1,"0"). While we already require
CAP_SYS_ADMIN, requiring CAP_SETUID and CAP_SETGID seems
appropriate.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: James Morris <jmorris@namei.org>
(These two patches are in the next-unacked branch of
git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/userns-2.6.
If they get some ACKs, then I hope to feed this into security-next.
After these two, I think we're ready to tackle userns+capabilities)
Fairsched creates a per-uid directory under /sys/kernel/uids/.
So when you clone(CLONE_NEWUSER), it tries to create
/sys/kernel/uids/0, which already exists, and you get back
-ENOMEM.
This was supposed to be fixed by sysfs tagging, but that
was postponed (ok, rejected until sysfs locking is fixed).
So, just as with network namespaces, we just don't create
those directories for user namespaces other than the init.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Impact: clean up
Using (struct pid *)-1 as the pointer for ftrace_swapper_pid is
a little confusing for others. This patch uses the address of the
actual init pid structure instead. This change is only for
clarity. It does not affect the code itself. Hopefully soon the
swapper tasks will all have their own pid structure and then
we can clean up the code a bit more.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
As suggested by Steven Rostedt, this patch provide a new macro
task_curr_ret_stack() to move the cpp conditionnal CONFIG into
the linux/ftrace.h headers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix default empty traces on function-graph-tracer
The actual ftrace_trace_task() checks if ftrace_pid_trace is allocated
and return 1 if it is true.
If it is NULL, it will check the bit of pid tracing flag for the current
task (which are not set by default).
So by default, a task is not traced.
Actually all tasks should be traced by default and filter_by_pid when
ftrace_pid_trace is allocated.
The appropriate condition should be to return 1 if filter_by_pid is
set.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acke-dby: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix tracer selfstests false results
After setting a ftrace_printk somewhere in th kernel, I saw the
Function tracer selftest failing.
When a selftest occurs, the ring buffer is lurked to see if
some entries were inserted. But concurrent insertion such as
ftrace_printk could occured at the same time and could give
false positive or negative results.
This patch prevent prevent from TRACE_PRINT entries insertion
during selftests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev:
[PATCH] fix bogus argument of blkdev_put() in pktcdvd
[PATCH 2/2] documnt FMODE_ constants
[PATCH 1/2] kill FMODE_NDELAY_NOW
[PATCH] clean up blkdev_get a little bit
[PATCH] Fix block dev compat ioctl handling
[PATCH] kill obsolete temporary comment in swsusp_close()
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
time: catch xtime_nsec underflows and fix them
posix-cpu-timers: fix clock_gettime with CLOCK_PROCESS_CPUTIME_ID
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
check_hung_task(): unsigned sysctl_hung_task_warnings cannot be less than 0
documentation: local_ops fix on_each_cpu
Add a sysctl to tweak the RSS limit used to decide when to grow
the TSB for an address space.
In order to avoid expensive divides and multiplies only simply
positive and negative powers of two are supported.
The function computed takes the number of TSB translations that will
fit at one time in the TSB of a given size, and either adds or
subtracts a percentage of entries. This final value is the
RSS limit.
See tsb_size_to_rss_limit().
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: fix hrtimer locking (reported by lockdep) in the CPU hotplug case
This addition fixes the hotplug locking issue on my machine
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
it had been put there to mark the call of blkdev_put() that
needed proper argument propagated to it; later patch in the
same series had done just that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Handle the TRACE_PRINT entries from the function grapg tracer
and output them as a C comment just below the function that called
it, as if it was a comment inside this function.
Example with an ftrace_printk inside might_sleep() function:
void __might_sleep(char *file, int line)
{
static unsigned long prev_jiffy; /* ratelimiting */
ftrace_printk("Hi I'm a comment in might_sleep() :-)");
A chunk of a resulting trace:
0) | _reiserfs_free_block() {
0) | reiserfs_read_bitmap_block() {
0) | __bread() {
0) | __getblk() {
0) | __find_get_block() {
0) 0.698 us | mark_page_accessed();
0) 2.267 us | }
0) | __might_sleep() {
0) | /* Hi I'm a comment in might_sleep() :-) */
0) 1.321 us | }
0) 5.872 us | }
0) 7.313 us | }
0) 8.718 us | }
And this patch brings two minor fixes:
- The newline after a switch-out task has disappeared
- The "|" sign just before the cpu number on task-switch has been deleted.
0) 0.616 us | pick_next_task_rt();
0) 1.457 us | _spin_trylock();
0) 0.653 us | _spin_unlock();
0) 0.728 us | _spin_trylock();
0) 0.631 us | _spin_unlock();
0) 0.729 us | native_load_sp0();
0) 0.593 us | native_load_tls();
------------------------------------------
0) cat-2834 => migrati-3
------------------------------------------
0) | finish_task_switch() {
0) 0.841 us | _spin_unlock_irq();
0) 0.616 us | post_schedule_rt();
0) 3.882 us | }
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce new lockdep API
Allow to change a held lock's class. Basically the same as the existing
code to change a subclass therefore reuse all that.
The XFS code will be able to use this to annotate their inode locking.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a bug in function filter setting
when writing function to set_graph_function, we should check whether it
has existed in set_graph_function to avoid duplicating.
Signed-off-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
This patch lets the swapper tasks of all CPUS be filtered by the
set_ftrace_pid file.
If '0' is echoed into this file, then all the idle tasks (aka swapper)
is flagged to be traced. This affects all CPU idle tasks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up, extend PID filtering to PID namespaces
Eric Biederman suggested using the struct pid for filtering on
pids in the kernel. This patch is based off of a demonstration
of an implementation that Eric sent me in an email.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: New feature
This patch makes the changes to set_ftrace_pid apply to the function
graph tracer.
# echo $$ > /debugfs/tracing/set_ftrace_pid
# echo function_graph > /debugfs/tracing/current_tracer
Will cause only the current task to be traced. Note, the trace flags are
also inherited by child processes, so the children of the shell
will also be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Use the new task struct trace flags to determine if a process should be
traced or not.
Note: this moves the searching of the pid to the slow path of setting
the pid field. This needs to be converted to the pid name space.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the file:
/debugfs/tracing/set_graph_function
which can be used along with the function graph tracer.
When this file is empty, the function graph tracer will act as
usual. When the file has a function in it, the function graph
tracer will only trace that function.
For example:
# echo blk_unplug > /debugfs/tracing/set_graph_function
# cat /debugfs/tracing/trace
[...]
------------------------------------------
| 2) make-19003 => kjournald-2219
------------------------------------------
2) | blk_unplug() {
2) | dm_unplug_all() {
2) | dm_get_table() {
2) 1.381 us | _read_lock();
2) 0.911 us | dm_table_get();
2) 1. 76 us | _read_unlock();
2) + 12.912 us | }
2) | dm_table_unplug_all() {
2) | blk_unplug() {
2) 0.778 us | generic_unplug_device();
2) 2.409 us | }
2) 5.992 us | }
2) 0.813 us | dm_table_put();
2) + 29. 90 us | }
2) + 34.532 us | }
You can add up to 32 functions into this file. Currently we limit it
to 32, but this may change with later improvements.
To add another function, use the append '>>':
# echo sys_read >> /debugfs/tracing/set_graph_function
# cat /debugfs/tracing/set_graph_function
blk_unplug
sys_read
Using the '>' will clear out the function and write anew:
# echo sys_write > /debug/tracing/set_graph_function
# cat /debug/tracing/set_graph_function
sys_write
Note, if you have function graph running while doing this, the small
time between clearing it and updating it will cause the graph to
record all functions. This should not be an issue because after
it sets the filter, only those functions will be recorded from then on.
If you need to only record a particular function then set this
file first before starting the function graph tracer. In the future
this side effect may be corrected.
The set_graph_function file is similar to the set_ftrace_filter but
it does not take wild cards nor does it allow for more than one
function to be set with a single write. There is no technical reason why
this is the case, I just do not have the time yet to implement that.
Note, dynamic ftrace must be enabled for this to appear because it
uses the dynamic ftrace records to match the name to the mcount
call sites.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix time warp bug
Alex Shi, along with Yanmin Zhang have been noticing occasional time
inconsistencies recently. Through their great diagnosis, they found that
the xtime_nsec value used in update_wall_time was occasionally going
negative. After looking through the code for awhile, I realized we have
the possibility for an underflow when three conditions are met in
update_wall_time():
1) We have accumulated a second's worth of nanoseconds, so we
incremented xtime.tv_sec and appropriately decrement xtime_nsec.
(This doesn't cause xtime_nsec to go negative, but it can cause it
to be small).
2) The remaining offset value is large, but just slightly less then
cycle_interval.
3) clocksource_adjust() is speeding up the clock, causing a
corrective amount (compensating for the increase in the multiplier
being multiplied against the unaccumulated offset value) to be
subtracted from xtime_nsec.
This can cause xtime_nsec to underflow.
Unfortunately, since we notify the NTP subsystem via second_overflow()
whenever we accumulate a full second, and this effects the error
accumulation that has already occured, we cannot simply revert the
accumulated second from xtime nor move the second accumulation to after
the clocksource_adjust call without a change in behavior.
This leaves us with (at least) two options:
1) Simply return from clocksource_adjust() without making a change if we
notice the adjustment would cause xtime_nsec to go negative.
This would work, but I'm concerned that if a large adjustment was needed
(due to the error being large), it may be possible to get stuck with an
ever increasing error that becomes too large to correct (since it may
always force xtime_nsec negative). This may just be paranoia on my part.
2) Catch xtime_nsec if it is negative, then add back the amount its
negative to both xtime_nsec and the error.
This second method is consistent with how we've handled earlier rounding
issues, and also has the benefit that the error being added is always in
the oposite direction also always equal or smaller then the correction
being applied. So the risk of a corner case where things get out of
control is lessened.
This patch fixes bug 11970, as tested by Yanmin Zhang
http://bugzilla.kernel.org/show_bug.cgi?id=11970
Reported-by: alex.shi@intel.com
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
fs/nfsd/nfs4recover.c
Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().
Signed-off-by: James Morris <jmorris@namei.org>
Impact: graph tracer race/crash fix
There is a nasy race in startup of a new process running the
function graph tracer. In fork.c:
total_forks++;
spin_unlock(¤t->sighand->siglock);
write_unlock_irq(&tasklist_lock);
ftrace_graph_init_task(p);
proc_fork_connector(p);
cgroup_post_fork(p);
return p;
The new task is free to run as soon as the tasklist_lock is released.
This is before the ftrace_graph_init_task. If the task does run
it will be using the same ret_stack and curr_ret_stack as the parent.
This will cause crashes that are difficult to debug.
This patch moves the ftrace_graph_init_task to just after the alloc_pid
code. This fixes the above race.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to output of stack trace
If a function is not found in the stack of the stack tracer, the
number printed is quite strange. This fixes the algorithm to handle
missing functions better.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
CONFIG_FUNCTION_GRAPH_TRACER depends on FUNCTION_TRACER already,
(turning it non-default) so it so making it default-n is pointless.
So enable it by default - it's a nice extension of the function tracer.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: better trace output of duration for long calls
The old duration output didn't exceeded 9999.999 us to fit the column
and the nanosecs were always 3 numbers. As Ingo suggested, it's better
to have the whole microseconds elapsed time and shift the nanosecs precision
if needed to fit the maximum 7 numbers. And usec need more number, the case
should be rare and important enough to break a bit the column alignment to
show it.
So, depending of the duration value, we now have these patterns:
u.nnn us
uu.nnn us
uuu.nnn us
uuuu.nnn us
uuuuu.nn us
uuuuuu.n us
uuuuuuuu..... us
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend function-graph output: let one know which thread called a function
This patch implements a helper function to print the couple cmdline/pid.
Its output is provided during task switching and on each row if the new
"funcgraph-proc" defualt-off option is set through trace_options file.
The output is center aligned and never exceeds 14 characters. The cmdline
is truncated over 7 chars.
But note that if the pid exceeds 6 characters, the column will overflow (but
the situation is abnormal).
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature, let entry function decide to trace or not
This patch lets the graph tracer entry function decide if the tracing
should be done at the end as well. This requires all function graph
entry functions return 1 if it should trace, or 0 if the return should
not be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Andrew Morton pointed out that the kernel convention of a variable
named page should be of type page struct. The ring buffer uses
a variable named "page" for a pointer to something else.
This patch converts those to be called "bpage" (as in "buffer page").
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new ftrace_graph_stop function
While developing more features of function graph, I hit a bug that
caused the WARN_ON to trigger in the prepare_ftrace_return function.
Well, it was hard for me to find out that was happening because the
bug would not print, it would just cause a hard lockup or reboot.
The reason is that it is not safe to call printk from this function.
Looking further, I also found that it calls unregister_ftrace_graph,
which grabs a mutex and calls kstop machine. This would definitely
lock the box up if it were to trigger.
This patch adds a fast and safe ftrace_graph_stop() which will
stop the function tracer. Then it is safe to call the WARN ON.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new API to ring buffer
This patch adds a new interface into the ring buffer that allows a
page to be read from the ring buffer on a given CPU. For every page
read, one must also be given to allow for a "swap" of the pages.
rpage = ring_buffer_alloc_read_page(buffer);
if (!rpage)
goto err;
ret = ring_buffer_read_page(buffer, &rpage, cpu, full);
if (!ret)
goto empty;
process_page(rpage);
ring_buffer_free_read_page(rpage);
The caller of these functions must handle any waits that are
needed to wait for new data. The ring_buffer_read_page will simply
return 0 if there is no data, or if "full" is set and the writer
is still on the current page.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: get ready for splice changes
This patch moves the commit and timestamp into the beginning of each
data page of the buffer. This change will allow the page to be moved
to another location (disk, network, etc) and still have information
in the page to be able to read it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix for lockdep and ftrace
The raw_local_irq_save/restore confuses lockdep. This patch
converts them to the local_irq_save/restore variants.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge x86/dumpstack into tracing/ftrace because upcoming ftrace changes
depend on cleanups already in x86/dumpstack.
Also merge to latest upstream -rc.
Impact: emit new warning
We periodically waste time tracking down problems from the genirq
framework not respecting IRQF_DISABLED for some shared IRQ cases. Linus
views this as "will not fix", but we're still left with the bugs caused by
this misbehavior.
This patch adds a nag message in request_irq(), so that drivers can fix
their IRQ handlers to avoid this problem.
Note that developers will never see the relevant bugs when they run with
LOCKDEP, so it's no wonder these bugs are hard to find. (That also means
LOCKDEP is overlooking some IRQ-related bugs involving IRQ handlers that
don't set IRQF_DISABLED...)
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix __irq_set_trigger() for IRQ_LEVEL
When recording the irq trigger type, let's also make sure
that IRQ_LEVEL gets set correctly.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend and enable the function graph tracer to 64-bit x86
This patch implements the support for function graph tracer under x86-64.
Both static and dynamic tracing are supported.
This causes some small CPP conditional asm on arch/x86/kernel/ftrace.c I
wanted to use probe_kernel_read/write to make the return address
saving/patching code more generic but it causes tracing recursion.
That would be perhaps useful to implement a notrace version of these
function for other archs ports.
Note that arch/x86/process_64.c is not traced, as in X86-32. I first
thought __switch_to() was responsible of crashes during tracing because I
believed current task were changed inside but that's actually not the
case (actually yes, but not the "current" pointer).
So I will have to investigate to find the functions that harm here, to
enable tracing of the other functions inside (but there is no issue at
this time, while process_64.c stays out of -pg flags).
A little possible race condition is fixed inside this patch too. When the
tracer allocate a return stack dynamically, the current depth is not
initialized before but after. An interrupt could occur at this time and,
after seeing that the return stack is allocated, the tracer could try to
trace it with a random uninitialized depth. It's a prevention, even if I
hadn't problems with it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix "no output from tracer" bug caused by ftrace_update_pid_func()
When disabling single thread function trace using
"echo -1 > set_ftrace_pid", the normal function trace
has to restore to original function, otherwise the normal
function trace will not work well.
Without this commit, something like below:
$ ps |grep 850
850 root 2556 S -/bin/sh
$ echo 850 > /debug/tracing/set_ftrace_pid
$ echo function > /debug/tracing/current_tracer
$ echo 1 > /debug/tracing/tracing_enabled
$ sleep 1
$ echo 0 > /debug/tracing/tracing_enabled
$ cat /debug/tracing/trace_pipe |wc -l
59704
$ echo -1 > /debug/tracing/set_ftrace_pid
$ echo 1 > /debug/tracing/tracing_enabled
$ sleep 1
$ echo 0 > /debug/tracing/tracing_enabled
$ more /debug/tracing/trace_pipe
<====== nothing output now!
it should output trace record.
Signed-off-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The description for 'D' was missing in the comment... (causing me a
minute of WTF followed by looking at more of the code)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds. To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM. As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: extend information in /proc/sched_debug
This patch adds uid information in sched_debug for CONFIG_USER_SCHED
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: prevent divide by zero error in cpu_avg_load_per_task, update
sched, cpusets: fix warning in kernel/cpuset.c
sched: prevent divide by zero error in cpu_avg_load_per_task
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq.h: fix missing/extra kernel-doc
genirq: __irq_set_trigger: change pr_warning to pr_debug
irq: fix typo
x86: apic honour irq affinity which was set in early boot
genirq: fix the affinity setting in setup_irq
genirq: keep affinities set from userspace across free/request_irq()
All architectures now use the generic compat_sys_ptrace, as should every
new architecture that needs 32bit compat (if we'll ever get another).
Remove the now superflous __ARCH_WANT_COMPAT_SYS_PTRACE define, and also
kill a comment about __ARCH_SYS_PTRACE that was added after
__ARCH_SYS_PTRACE was already gone.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Regarding the bug addressed in:
4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task
Linus points out that the fix is not complete:
> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.
So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this warning:
kernel/cpuset.c: In function ‘generate_sched_domains’:
kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function
triggers because GCC does not recognize that ndoms stays uninitialized
only if doms is NULL - but that flow is covered at the end of
generate_sched_domains().
Help out GCC by initializing this variable to 0. (that's prudent anyway)
Also, this function needs a splitup and code flow simplification:
with 160 lines length it's clearly too long.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build error on branch tracer
This should fix a build error reported on alpha in linux-next:
CC kernel/trace/trace_branch.o
kernel/trace/trace_branch.c: In function 'probe_likely_condition':
kernel/trace/trace_branch.c:44: error: implicit declaration of function 'raw_local_irq_save'
kernel/trace/trace_branch.c:76: error: implicit declaration of function 'raw_local_irq_restore'
Unfortunately, I can't test it since I don't have any Alpha build environment.
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move double_lock_balance()/double_unlock_balance() higher to fix the following
with gcc-3.4.6:
CC kernel/sched.o
In file included from kernel/sched.c:1605:
kernel/sched_rt.c: In function `find_lock_lowest_rq':
kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
kernel/sched_rt.c:1077: sorry, unimplemented: called from here
make[2]: *** [kernel/sched.o] Error 1
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, eliminate code
now that warn_on_slowpath() uses warn_slowpath(...,NULL), we can
eliminate warn_on_slowpath() altogether and use warn_slowpath().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend WARN_ON() output with DMI_PRODUCT_NAME
It's very useful for many low level WARN_ON's to find out which
motherboard has the broken BIOS etc... this patch adds a printk
to the WARN_ON code for this.
On architectures without DMI, gcc should optimize the code out.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, code reduction
warn_slowpath is a superset of warn_on_slowpath; just have
warn_on_slowpath call warn_slowpath with a NULL 3rd argument.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove unused code
__local_bh_enable has been replaced with _local_bh_enable.
As comments says "it always nests inside local_bh_enable() sections"
has not been valid now. Also there is no reason to use __local_bh_enable
anywhere, so we can remove this useless function.
Signed-off-by: Liming Wang <liming.wang@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make ftrace position computing more sane
First remove useless ->pos field. Then we needn't check seq_printf
in .show like other place.
Signed-off-by: Liming Wang <liming.wang@windriver.com>
Reviewed-by: Bruce Ashfield <bruce.ashfield@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are architectures that still have no stacktrace support.
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: increase the visual qualities of the call-graph-tracer output
This patch applies various trace output formatting changes:
- CPU is now a decimal number, followed by a parenthesis.
- Overhead is now on the second column (gives a good visibility)
- Cost is now on the third column, can't exceed 9999.99 us. It is
followed by a virtual line based on a "|" character.
- Functions calls are now the last column on the right. This way, we
haven't dynamic column (which flow is harder to follow) on its right.
- CPU and Overhead have their own option flag. They are default-on but you
can disable them easily:
echo nofuncgraph-cpu > trace_options
echo nofuncgraph-overhead > trace_options
TODO:
_ Refactoring of the thread switch output.
_ Give a default-off option to output the thread and its pid on each row.
_ Provide headers
_ ....
Here is an example of the new trace style:
0) | mutex_unlock() {
0) 0.639 us | __mutex_unlock_slowpath();
0) 1.607 us | }
0) | remove_wait_queue() {
0) 0.616 us | _spin_lock_irqsave();
0) 0.616 us | _spin_unlock_irqrestore();
0) 2.779 us | }
0) 0.495 us | n_tty_set_room();
0) ! 9999.999 us | }
0) | tty_ldisc_deref() {
0) 0.615 us | _spin_lock_irqsave();
0) 0.616 us | _spin_unlock_irqrestore();
0) 2.793 us | }
0) | current_fs_time() {
0) 0.488 us | current_kernel_time();
0) 0.495 us | timespec_trunc();
0) 2.486 us | }
0) ! 9999.999 us | }
0) ! 9999.999 us | }
0) ! 9999.999 us | }
0) | sys_read() {
0) 0.796 us | fget_light();
0) | vfs_read() {
0) | rw_verify_area() {
0) | security_file_permission() {
0) 0.488 us | cap_file_permission();
0) 1.720 us | }
0) 3. 4 us | }
0) | tty_read() {
0) 0.488 us | tty_paranoia_check();
0) | tty_ldisc_ref_wait() {
0) | tty_ldisc_try() {
0) 0.615 us | _spin_lock_irqsave();
0) 0.615 us | _spin_unlock_irqrestore();
0) 5.436 us | }
0) 6.427 us | }
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhance the output of the graph-tracer
This patch applies some ideas of Ingo Molnar and Steven Rostedt.
* Output leaf functions in one line with parenthesis, semicolon and duration
output.
* Add a second column (after cpu) for an overhead sign.
if duration > 100 us, "!"
if duration > 10 us, "+"
else " "
* Print output in us with remaining nanosec: u.n
* Print duration on the right end, following the indentation of the functions.
Use also visual clues: "-" on entry call (no duration to output) and "+" on
return (duration output).
The name of the tracer has been fixed as well: function-branch becomes
function_branch.
Here is an example of the new output:
CPU[000] dequeue_entity() { -
CPU[000] update_curr() { -
CPU[000] update_min_vruntime(); + 0.512 us
CPU[000] } + 1.504 us
CPU[000] clear_buddies(); + 0.481 us
CPU[000] update_min_vruntime(); + 0.504 us
CPU[000] } + 4.557 us
CPU[000] hrtick_update() { -
CPU[000] hrtick_start_fair(); + 0.489 us
CPU[000] } + 1.443 us
CPU[000] + } + 14.655 us
CPU[000] + } + 15.678 us
CPU[000] + } + 16.686 us
CPU[000] msecs_to_jiffies(); + 0.481 us
CPU[000] put_prev_task_fair(); + 0.504 us
CPU[000] pick_next_task_fair(); + 0.482 us
CPU[000] pick_next_task_rt(); + 0.504 us
CPU[000] pick_next_task_fair(); + 0.481 us
CPU[000] pick_next_task_idle(); + 0.489 us
CPU[000] _spin_trylock(); + 0.655 us
CPU[000] _spin_unlock(); + 0.609 us
CPU[000] ------------8<---------- thread bash-2794 ------------8<----------
CPU[000] finish_task_switch() { -
CPU[000] _spin_unlock_irq(); + 0.722 us
CPU[000] } + 2.369 us
CPU[000] ! } + 501972.605 us
CPU[000] ! } + 501973.763 us
CPU[000] copy_from_read_buf() { -
CPU[000] _spin_lock_irqsave(); + 0.670 us
CPU[000] _spin_unlock_irqrestore(); + 0.699 us
CPU[000] copy_to_user() { -
CPU[000] might_fault() { -
CPU[000] __might_sleep(); + 0.503 us
CPU[000] } + 1.632 us
CPU[000] __copy_to_user_ll(); + 0.542 us
CPU[000] } + 3.858 us
CPU[000] tty_audit_add_data() { -
CPU[000] _spin_lock_irq(); + 0.609 us
CPU[000] _spin_unlock_irq(); + 0.624 us
CPU[000] } + 3.196 us
CPU[000] _spin_lock_irqsave(); + 0.624 us
CPU[000] _spin_unlock_irqrestore(); + 0.625 us
CPU[000] + } + 13.611 us
CPU[000] copy_from_read_buf() { -
CPU[000] _spin_lock_irqsave(); + 0.624 us
CPU[000] _spin_unlock_irqrestore(); + 0.616 us
CPU[000] } + 2.820 us
CPU[000]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix divide by zero crash in scheduler rebalance irq
While testing the branch profiler, I hit this crash:
divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[<ffffffff8024a008>] [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
<IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
[<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
[<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
[<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
[<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
[<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
[<ffffffff8021047a>] do_softirq+0x94/0x1cd
[<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
[<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
[<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20
The code for cpu_avg_load_per_task has:
if (rq->nr_running)
rq->avg_load_per_task = rq->load.weight / rq->nr_running;
The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.
The branch profiler simply made the race window bigger.
This patch saves off the rq->nr_running to a local variable and uses that
for both the condition and the division.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent unnecessary stack recursion
if the resched flag was set before we entered, then don't reschedule.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new "power-tracer" ftrace plugin
This patch adds a C/P-state ftrace plugin that will generate
detailed statistics about the C/P-states that are being used,
so that we can look at detailed decisions that the C/P-state
code is making, rather than the too high level "average"
that we have today.
An example way of using this is:
mount -t debugfs none /sys/kernel/debug
echo cstate > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace | perl scripts/trace/cstate.pl > out.svg
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: locking fix
We can't call cpuset_cpus_allowed_locked() with the rq lock held.
However, the rq lock merely protects us from (1) cpu_online_mask changing
and (2) someone else changing p->cpus_allowed.
The first can't happen because we're being called from a cpu hotplug
notifier. The second doesn't really matter: we are forcing the task off
a CPU it was affine to, so we're not doing very well anyway.
So we remove the rq lock from this path, and all is good.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhancement for function graph tracer
When run on a SMP box, the function graph tracer is confusing because
it shows the different CPUS as changes in the trace.
This patch adds the annotation of 'CPU[###]' where ### is a three digit
number. The output will look similar to this:
CPU[001] dput() {
CPU[000] } 726
CPU[001] } 487
CPU[000] do_softirq() {
CPU[001] } 2221
CPU[000] __do_softirq() {
CPU[000] __local_bh_disable() {
CPU[001] unroll_tree_refs() {
CPU[000] } 569
CPU[001] } 501
CPU[000] rcu_process_callbacks() {
CPU[001] kfree() {
What makes this nice is that now you can grep the file and produce
readable format for a particular CPU.
# cat /debug/tracing/trace > /tmp/trace
# grep '^CPU\[000\]' /tmp/trace > /tmp/trace0
# grep '^CPU\[001\]' /tmp/trace > /tmp/trace1
Will give you:
# head /tmp/trace0
CPU[000] ------------8<---------- thread sshd-3899 ------------8<----------
CPU[000] inotify_dentry_parent_queue_event() {
CPU[000] } 2531
CPU[000] inotify_inode_queue_event() {
CPU[000] } 505
CPU[000] } 69626
CPU[000] } 73089
CPU[000] audit_syscall_exit() {
CPU[000] path_put() {
CPU[000] dput() {
# head /tmp/trace1
CPU[001] ------------8<---------- thread pcscd-3446 ------------8<----------
CPU[001] } 4186
CPU[001] dput() {
CPU[001] } 543
CPU[001] vfs_permission() {
CPU[001] inode_permission() {
CPU[001] shmem_permission() {
CPU[001] generic_permission() {
CPU[001] } 501
CPU[001] } 2205
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhancement to function graph tracer
Export the trace_find_cmdline so the function graph tracer can
use it to print the comms of the threads.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature
This patch enables function tracing and function return to run together.
I've tested this by enabling the stack tracer and return tracer, where
both the function entry and function return are used together with
dynamic ftrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: more efficient code for ftrace graph tracer
This patch uses the dynamic patching, when available, to patch
the function graph code into the kernel.
This patch will ease the way for letting both function tracing
and function graph tracing run together.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature to function trace a single thread
This patch adds the ability to function trace a single thread.
The file:
/debugfs/tracing/set_ftrace_pid
contains the pid to trace. Valid pids are any positive integer.
Writing any negative number to this file will disable the pid
tracing and the function tracer will go back to tracing all of
threads.
This feature works with both static and dynamic function tracing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature
This patch sets a C-like output for the function graph tracing.
For this aim, we now call two handler for each function: one on the entry
and one other on return. This way we can draw a well-ordered call stack.
The pid of the previous trace is loosely stored to be compared against
the one of the current trace to see if there were a context switch.
Without this little feature, the call tree would seem broken at
some locations.
We could use the sched_tracer to capture these sched_events but this
way of processing is much more simpler.
2 spaces have been chosen for indentation to fit the screen while deep
calls. The time of execution in nanosecs is printed just after closed
braces, it seems more easy this way to find the corresponding function.
If the time was printed as a first column, it would be not so easy to
find the corresponding function if it is called on a deep depth.
I plan to output the return value but on 32 bits CPU, the return value
can be 32 or 64, and its difficult to guess on which case we are.
I don't know what would be the better solution on X86-32: only print
eax (low-part) or even edx (high-part).
Actually it's thee same problem when a function return a 8 bits value, the
high part of eax could contain junk values...
Here is an example of trace:
sys_read() {
fget_light() {
} 526
vfs_read() {
rw_verify_area() {
security_file_permission() {
cap_file_permission() {
} 519
} 1564
} 2640
do_sync_read() {
pipe_read() {
__might_sleep() {
} 511
pipe_wait() {
prepare_to_wait() {
} 760
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 504
} 1587
clear_buddies() {
} 512
add_cfs_task_weight() {
} 519
update_min_vruntime() {
} 511
} 5602
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 496
} 1631
clear_buddies() {
} 496
update_min_vruntime() {
} 527
} 4580
hrtick_update() {
hrtick_start_fair() {
} 488
} 1489
} 13700
} 14949
} 16016
msecs_to_jiffies() {
} 496
put_prev_task_fair() {
} 504
pick_next_task_fair() {
} 489
pick_next_task_rt() {
} 496
pick_next_task_fair() {
} 489
pick_next_task_idle() {
} 489
------------8<---------- thread 4 ------------8<----------
finish_task_switch() {
} 1203
do_softirq() {
__do_softirq() {
__local_bh_disable() {
} 669
rcu_process_callbacks() {
__rcu_process_callbacks() {
cpu_quiet() {
rcu_start_batch() {
} 503
} 1647
} 3128
__rcu_process_callbacks() {
} 542
} 5362
_local_bh_enable() {
} 587
} 8880
} 9986
kthread_should_stop() {
} 669
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
calc_delta_mine() {
} 511
update_min_vruntime() {
} 511
} 2813
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch changes the name of the "return function tracer" into
function-graph-tracer which is a more suitable name for a tracing
which makes one able to retrieve the ordered call stack during
the code flow.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new ftrace plugin
A prototype for a BTS ftrace plug-in.
The tracer collects branch trace in a cyclic buffer for each cpu.
The tracer is not configurable and the trace for each snapshot is
appended when doing cat /debug/tracing/trace.
This is a proof of concept that will be extended with future patches
to become a (hopefully) useful tool.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a callback to allow an ftrace plug-in to write its own header.
Move the call to trace->open() up a few lines.
The changes are required by the BTS ftrace plug-in.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, move all hrtimer processing into hardirq context
This is an attempt at removing some of the hrtimer complexity by
reducing the number of callback modes to 1.
This means that all hrtimer callback functions will be ran from HARD-irq
context.
I went through all the 30 odd hrtimer callback functions in the kernel
and saw only one that I'm not quite sure of, which is the one in
net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.
Furthermore, the hrtimer core now calls callbacks directly with IRQs
disabled in case you try to enqueue an expired timer. If this timer is a
periodic timer (which should use hrtimer_forward() to advance its time)
then it might be possible to end up in an inf. recursive loop due to the
fact that hrtimer_forward() doesn't round up to the next timer
granularity, and therefore keeps on calling the callback - obviously
this needs a fix.
Aside from that, this seems to compile and actually boot on my dual core
test box - although I'm sure there are some bugs in, me not hitting any
makes me certain :-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build warning
this warning:
kernel/lockdep.c:584: warning: ‘print_lock_dependencies’ defined but not used
triggers because print_lock_dependencies() is only used if both
CONFIG_TRACE_IRQFLAGS and CONFIG_PROVE_LOCKING are enabled.
But adding #ifdefs is not an option here - it would spread out to 4-5
other helper functions and uglify the file. So mark this function
as __used - it's static and the compiler can eliminate it just fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The user_ns is moved from nsproxy to user_struct, so that a struct
cred by itself is sufficient to determine access (which it otherwise
would not be). Corresponding ecryptfs fixes (by David Howells) are
here as well.
Fix refcounting. The following rules now apply:
1. The task pins the user struct.
2. The user struct pins its user namespace.
3. The user namespace pins the struct user which created it.
User namespaces are cloned during copy_creds(). Unsharing a new user_ns
is no longer possible. (We could re-add that, but it'll cause code
duplication and doesn't seem useful if PAM doesn't need to clone user
namespaces).
When a user namespace is created, its first user (uid 0) gets empty
keyrings and a clean group_info.
This incorporates a previous patch by David Howells. Here
is his original patch description:
>I suggest adding the attached incremental patch. It makes the following
>changes:
>
> (1) Provides a current_user_ns() macro to wrap accesses to current's user
> namespace.
>
> (2) Fixes eCryptFS.
>
> (3) Renames create_new_userns() to create_user_ns() to be more consistent
> with the other associated functions and because the 'new' in the name is
> superfluous.
>
> (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
> beginning of do_fork() so that they're done prior to making any attempts
> at allocation.
>
> (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
> to fill in rather than have it return the new root user. I don't imagine
> the new root user being used for anything other than filling in a cred
> struct.
>
> This also permits me to get rid of a get_uid() and a free_uid(), as the
> reference the creds were holding on the old user_struct can just be
> transferred to the new namespace's creator pointer.
>
> (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
> preparation rather than doing it in copy_creds().
>
>David
>Signed-off-by: David Howells <dhowells@redhat.com>
Changelog:
Oct 20: integrate dhowells comments
1. leave thread_keyring alone
2. use current_user_ns() in set_user()
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
FUTEX_WAIT_BITSET could be used instead of FUTEX_WAIT by setting the
bit set to FUTEX_BITSET_MATCH_ANY, but FUTEX_WAIT uses CLOCK_REALTIME
while FUTEX_WAIT_BITSET uses CLOCK_MONOTONIC.
Add a flag to select CLOCK_REALTIME for FUTEX_WAIT_BITSET so glibc can
replace the FUTEX_WAIT logic which needs to do gettimeofday() calls
before and after the syscall to convert the absolute timeout to a
relative timeout for FUTEX_WAIT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Impact: Trivial API conversion
NR_CPUS -> nr_cpu_ids
cpumask_t -> struct cpumask
sizeof(cpumask_t) -> cpumask_size()
cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)
cpu_set() -> cpumask_set_cpu()
first_cpu() -> cpumask_first()
cpumask_of_cpu() -> cpumask_of()
cpus_* -> cpumask_*
There are some FIXMEs where we all archs to complete infrastructure
(patches have been sent):
cpu_coregroup_map -> cpu_coregroup_mask
node_to_cpumask* -> cpumask_of_node
There is also one FIXME where we pass an array of cpumasks to
partition_sched_domains(): this implies knowing the definition of
'struct cpumask' and the size of a cpumask. This will be fixed in a
future patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack reduction for large NR_CPUS
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
stack space.
We simply return if the allocation fails: since we don't use it we
could just pass NULL to cpupri_find and have it handle that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction, (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.
The fact cpupro_init is called both before and after the slab is
available makes for an ugly parameter unfortunately.
We also use cpumask_any_and to get rid of a temporary in cpupri_find.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction, (future) size reduction, cleanup
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
We can also use cpulist_parse() instead of doing it manually in
isolated_cpu_setup.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
stack space. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
In this case, we always alloced, but we don't need to any more.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space on the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Note the removal of the initializer of new_mask: since the first thing
we did was "cpus_and(new_mask, new_mask, cpus_allowed)" I just changed
that to "cpumask_and(new_mask, in_mask, cpus_allowed);".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
With some care, we can avoid needing a temporary cpumask (we can't
really allocate here, since we can't fail).
This version calls cpuset_cpus_allowed_locked() with the task_rq_lock
held. I'm fairly sure this works, but there might be a deadlock
hiding.
And of course, we can't get rid of the last cpumask on stack until we
can use cpumask_of_node instead of node_to_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Some jiggling here to make sure we always exit at the bottom (so we hit
the free_cpumask_var there).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>