Commit Graph

4085 Commits

Author SHA1 Message Date
Ingo Molnar
b2613e370d ftrace: build fix for ftraced_suspend
fix:

 kernel/trace/ftrace.c:1615: error: 'ftraced_suspend' undeclared (first use in this function)
 kernel/trace/ftrace.c:1615: error: (Each undeclared identifier is reported only once
 kernel/trace/ftrace.c:1615: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 16:46:50 +02:00
Steven Rostedt
60bc080090 ftrace: separate out the function enabled variable
Currently the function tracer uses the global tracer_enabled variable that
is used to keep track if the tracer is enabled or not. The function tracing
startup needs to be separated out, otherwise the internal happenings of
the tracer startup is also recorded.

This patch creates a ftrace_function_enabled variable to all the starting
of the function traces to happen after everything has been started.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:22 +02:00
Steven Rostedt
a2bb6a3d85 ftrace: add ftrace_kill_atomic
It has been suggested that I add a way to disable the function tracer
on an oops. This code adds a ftrace_kill_atomic. It is not meant to be
used in normal situations. It will disable the ftrace tracer, but will
not perform the nice shutdown that requires scheduling.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:21 +02:00
Steven Rostedt
26bc83f4cb ftrace: use current CPU for function startup
This is more of a clean up. Currently the function tracer initializes the
tracer with which ever CPU was last used for tracing. This value isn't
realy useful for function tracing, but at least it should be something other
than a random number.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:21 +02:00
Steven Rostedt
ad591240ce ftrace: start wakeup tracing after setting function tracer
Enabling the wakeup tracer before enabling the function tracing causes
some strange results due to the dynamic enabling of the functions.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:20 +02:00
Steven Rostedt
b5c21b4514 ftrace: check proper config for preempt type
There is no CONFIG_PREEMPT_DESKTOP. Use the proper entry CONFIG_PREEMPT.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:19 +02:00
Steven Rostedt
1e16c0a081 ftrace: trace schedule
After the sched_clock code has been removed from sched.c we can now trace
the scheduler. The scheduler has a lot of functions that would be worth
tracing.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:19 +02:00
Steven Rostedt
001b6767b1 ftrace: define function trace nop
When CONFIG_FTRACE is not enabled, the tracing_start_functon_trace
and tracing_stop_function_trace should be nops.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:18 +02:00
Steven Rostedt
007c05d4d2 ftrace: move sched_switch enable after markers
We have two markers now that are enabled on sched_switch. One that records
the context switching and the other that records task wake ups. Currently
we enable the tracing first and then set the markers. This causes some
confusing traces:

# tracer: sched_switch
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
       trace-cmd-3973  [00]   115.834817:   3973:120:R   +     3:  0:S
       trace-cmd-3973  [01]   115.834910:   3973:120:R   +     6:  0:S
       trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
       trace-cmd-3973  [03]   115.834910:   3973:120:R   +    12:  0:S
       trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
          <idle>-0     [02]   115.834910:      0:140:R ==>  3973:120:R

Here we see that trace-cmd with PID 3973 wakes up task 9 but the next line
shows the idle task doing a context switch to task 3973.

Enabling the tracing to _after_ the markers are set creates a much saner
output:

# tracer: sched_switch
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
          <idle>-0     [02]  7922.634225:      0:140:R ==>  4790:120:R
       trace-cmd-4789  [03]  7922.634225:      0:140:R   +  4790:120:R

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-11 15:49:18 +02:00
Abhishek Sagar
98a05ed4bd ftrace: prevent ftrace modifications while being kprobe'd, v2
add two missing chunks for ftrace+kprobe.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-03 14:46:24 +02:00
Pekka Paalanen
3e61e0c976 mmiotrace broken in linux-next (8-bit writes only)
The moment mmiotrace is enabled, I hit a NULL deref in:

IP: [<ffffffff80256e71>] __trace_special+0x17c/0x23a
Call Trace:
 [<ffffffff802573cc>] ftrace_special+0x6f/0x9a
 [<ffffffff8023e3e4>] down+0x19/0x4a
 [<ffffffff80228adc>] acquire_console_sem+0x42/0x58
 [<ffffffff8035d273>] con_flush_chars+0x28/0x43
 [<ffffffff80354a70>] write_chan+0x22e/0x334
 [<ffffffff802244e9>] ? default_wake_function+0x0/0xf
 [<ffffffff8035236d>] tty_write+0x195/0x228
 [<ffffffff80354842>] ? write_chan+0x0/0x334
 [<ffffffff8027c23a>] vfs_write+0xae/0x137
 [<ffffffff8027c6e3>] sys_write+0x47/0x70
 [<ffffffff8020b1db>] system_call_after_swapgs+0x7b/0x80

which means 'entry' in __trace_special() is NULL.

[ mingo@elte.hu: that ftrace_special() was a leftover. ]

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: proski@gnu.org
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-01 10:14:06 +02:00
Ingo Molnar
97e6722b8d Merge branch 'linus' into tracing/ftrace 2008-06-25 12:27:56 +02:00
Jason Wessel
aabdc3b8c3 kgdb: sparse fix
- Fix warning reported by sparse
kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared.
	Should it be static?

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2008-06-24 10:52:55 -05:00
Abhishek Sagar
f22f9a89ce ftrace: avoid modifying kprobe'd records
Avoid modifying the mcount call-site if there is a kprobe installed on it.
These records are not marked as failed however. This allowed the filter
rules on them to remain up-to-date. Whenever the kprobe on the corresponding
record is removed, the record gets updated as normal.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:59 +02:00
Abhishek Sagar
ecea656d1d ftrace: freeze kprobe'd records
Let records identified as being kprobe'd be marked as "frozen". The trouble
with records which have a kprobe installed on their mcount call-site is
that they don't get updated. So if such a function which is currently being
traced gets its tracing disabled due to a new filter rule (or because it
was added to the notrace list) then it won't be updated and continue being
traced. This patch allows scanning of all frozen records during tracing to
check if they should be traced.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:58 +02:00
Abhishek Sagar
395a59d0f8 ftrace: store mcount address in rec->ip
Record the address of the mcount call-site. Currently all archs except sparc64
record the address of the instruction following the mcount call-site. Some
general cleanups are entailed. Storing mcount addresses in rec->ip enables
looking them up in the kprobe hash table later on to check if they're kprobe'd.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: davem@davemloft.net
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-23 22:10:56 +02:00
Linus Torvalds
27f4837cbf Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futexes: fix fault handling in futex_lock_pi
2008-06-23 12:49:22 -07:00
Thomas Gleixner
1b7558e457 futexes: fix fault handling in futex_lock_pi
This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.

David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.

The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.

When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.

After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:

 - the task which acquired the rtmutex
 - the pending owner of the pi_state which did not get the rtmutex

Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.

One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.

Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/futex.c |   93 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 20 deletions(-)
2008-06-23 13:31:15 +02:00
Ingo Molnar
f34bfb1bee Merge branch 'linus' into tracing/ftrace 2008-06-23 11:11:42 +02:00
Ingo Molnar
198bb971e2 Merge branch 'linus' into sched/urgent 2008-06-23 11:00:26 +02:00
Linus Torvalds
1f1e2ce8a5 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
  rcupreempt: remove export of rcu_batches_completed_bh
  cpuset: limit the input of cpuset.sched_relax_domain_level
2008-06-20 12:37:13 -07:00
Oleg Nesterov
ea71a54670 sched: refactor wait_for_completion_timeout()
Simplify the code and fix the boundary condition of
wait_for_completion_timeout(,0).

We can kill the first __remove_wait_queue() as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-06-20 17:15:49 +02:00
Roland Dreier
bb10ed0994 sched: fix wait_for_completion_timeout() spurious failure under heavy load
It seems that the current implementaton of wait_for_completion_timeout()
has a small problem under very high load for the common pattern:

	if (!wait_for_completion_timeout(&done, timeout))
		/* handle failure */

because the implementation very roughly does (lots of code deleted to
show the basic flow):

	static inline long __sched
	do_wait_for_common(struct completion *x, long timeout, int state)
	{
		if (x->done)
			return timeout;

		do {
			timeout = schedule_timeout(timeout);

			if (!timeout)
				return timeout;

		} while (!x->done);

		return timeout;
	}

so if the system is very busy and x->done is not set when
do_wait_for_common() is entered, it is possible that the first call to
schedule_timeout() returns 0 because the task doing wait_for_completion
doesn't get rescheduled for a long time, even if it is woken up early
enough.

In this case, wait_for_completion_timeout() returns 0 without even
checking x->done again, and the code above falls into its failure case
purely for scheduler reasons, even if the hardware event or whatever was
being waited for happened early enough.

It would make sense to add an extra test to do_wait_for() in the timeout
case and return 1 if x->done is actually set.

A quick audit (not exhaustive) of wait_for_completion_timeout() callers
seems to indicate that no one actually cares about the return value in
the success case -- they just test for 0 (timed out) versus non-zero
(wait succeeded).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20 13:19:32 +02:00
Peter Zijlstra
8a8cde163e sched: rt: dont stop the period timer when there are tasks wanting to run
So if the group ever gets throttled, it will never wake up again.

Reported-by: "Daniel K." <dk@uw.no>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-20 11:00:19 +02:00
Bharath Ravi
d4abc238c9 sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue
This patch corrects the incorrect value of per process run-queue wait
time reported by delay statistics. The anomaly was due to the following
reason. When a process leaves the CPU and immediately starts waiting for
CPU on the runqueue (which means it remains in the TASK_RUNNABLE state),
the time of re-entry into the run-queue is never recorded. Due to this,
the waiting time on the runqueue from this point of re-entry upto the
next time it hits the CPU is not accounted for. This is solved by
recording the time of re-entry of a process leaving the CPU in the
sched_info_depart() function IF the process will go back to waiting on
the run-queue. This IF condition is verified by checking whether the
process is still in the TASK_RUNNABLE state.

The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
The values noted prior to the fix did not account for the time spent on
the runqueue waiting. After the fix, the correct values were reported
back to user space.

Signed-off-by: Bharath Ravi <bharathravi1@gmail.com>
Signed-off-by: Madhava K R  <madhavakr@gmail.com>
Cc: dhaval@linux.vnet.ibm.com
Cc: vatsa@in.ibm.com
Cc: balbir@in.ibm.com
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 14:15:28 +02:00
Jason Wessel
9c106c119e softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog().  The problem is that to touch the
softlockup watchdog, the cpu_clock code has to be called which could
involve multiple cpu locks and can lead to a hard hang if one of the
locks is held by a processor that is not going to return anytime soon
(such as could be the case with kgdb or perhaps even with some other
kind of exception).

This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.

The test case for this problem is to use the following kernel config
options:

CONFIG_KGDB_TESTS=y
CONFIG_KGDB_TESTS_ON_BOOT=y
CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"

It should be noted that kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite during the bisection.

I would consider this patch a regression fix because the problem first
appeared in commit 27ec440779 when some
logic was added to try to periodically sync the clocks.  It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks.  It became clear that accessing
this code from an nmi was the source of the lockups.  Avoiding the
access to the low level clock code from an code inside the NMI
processing also fixed the problem with the 27ec44... commit.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:45:38 +02:00
Steven Rostedt
afd38009cc rcupreempt: remove export of rcu_batches_completed_bh
In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
the header file. This does not need to be exported, and not only that,
this breaks my PPC build.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: paulus@samba.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19 09:45:37 +02:00
Li Zefan
30e0e17819 cpuset: limit the input of cpuset.sched_relax_domain_level
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
for inputs outside this range.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19 09:45:36 +02:00
Max Krasnyansky
f18f982abf sched: CPU hotplug events must not destroy scheduler domains created by the cpusets
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-06-19 09:14:51 +02:00
Peter Zijlstra
15a8641ead sched: rt-group: fix RR buglet
In tick_task_rt() we first call update_curr_rt() which can dequeue a runqueue
due to it running out of runtime, and then we try to requeue it, of it also
having exhausted its RR quota. Obviously requeueing something that is no longer
on the runqueue will not have the expected result.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:59 +02:00
Peter Zijlstra
ad2a3f13b7 sched: rt-group: heirarchy aware throttle
The bandwidth throttle code dequeues a group when it runs out of quota, and
re-queues it once the period rolls over and the quota gets refreshed.

Sadly it failed to take the hierarchy into consideration. Share more of the
enqueue/dequeue code with regular task opterations.

Also, some operations like sched_setscheduler() can dequeue/enqueue tasks that
are in throttled runqueues, we should not inadvertly re-enqueue empty runqueues
so check for that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:57 +02:00
Peter Zijlstra
7ea56616ba sched: rt-group: fix hierarchy
Don't re-set the entity's runqueue to the wrong rq after we've set it
to the right one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:56 +02:00
Dario Faggioli
49307fd6f7 sched: NULL pointer dereference while setting sched_rt_period_us
When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:

 echo 10000 > /proc/sys/kernel/sched_rt_period_us

We get this:

 BUG: unable to handle kernel NULL pointer dereference at 0000008c
 [  947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160
 [  947.683123] *pde = 00000000=20
 [  947.683782] Oops: 0000 [#1]
 [  947.684307] Modules linked in:
 [  947.684308]
 [  947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
 [  947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0
 [  947.684308] EIP is at __rt_schedulable+0x12/0x160
 [  947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
 [  947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
 [  947.684308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
 [  947.684308] Process bash (pid: 2359, tiÆcc8000 taskÇa54f00=20 task.tiÆcc8000)
 [  947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
 [  947.684308]        c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
 [  947.684308]        c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
 [  947.684308] Call Trace:
 [  947.684308]  [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50
 [  947.684308]  [<c0216d40>] ? sched_rt_handler+0x80/0x110
 [  947.684308]  [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0
 [  947.684308]  [<c02af2fa>] ? proc_sys_write+0x1a/0x20
 [  947.684308]  [<c0273c36>] ? vfs_write+0x96/0x160
 [  947.684308]  [<c02af2e0>] ? proc_sys_write+0x0/0x20
 [  947.684308]  [<c027423d>] ? sys_write+0x3d/0x70
 [  947.684308]  [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91
 [  947.684308]  =======================
 [  947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75
 f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4
 89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8
 8b
 [  947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP  0068:c6cc9ed0

We think the following patch solves the issue.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19 09:06:54 +02:00
Rabin Vincent
95e904c7da sched: fix defined-but-unused warning
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used

Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-17 10:36:58 +02:00
Ingo Molnar
f22529351f namespacecheck: fixes
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 14:44:13 +02:00
Ingo Molnar
e765ee90da Merge branch 'linus' into tracing/ftrace 2008-06-16 11:15:58 +02:00
Abhishek Sagar
a4500b84c5 ftrace: fix "notrace" filtering priority
This is a fix to give notrace filter rules priority over "set_ftrace_filter"
rules.

This fix ensures that functions which are set to be filtered and are
concurrently marked as "notrace" don't get recorded. As of now, if
a record is marked as FTRACE_FL_FILTER and is enabled, then the notrace
flag is not checked. Tested on x86-32.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-14 08:32:29 +02:00
Masami Hiramatsu
67dddaad5d kprobes: fix error checking of batch registration
Fix error checking routine to catch an error which occurs in first
__register_*probe().

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12 18:05:40 -07:00
Linus Torvalds
1b3cba8e60 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: 64-bit: fix arithmetics overflow
  sched: fair group: fix overflow(was: fix divide by zero)
  sched: fix TASK_WAKEKILL vs SIGKILL race
2008-06-12 12:55:18 -07:00
Lai Jiangshan
7a232e0350 sched: 64-bit: fix arithmetics overflow
(overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight)

A weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so it will overflow when there are
too many entities.

Although, overflow occurs very rarely, but it break fairness when
it occurs. 64-bits systems have more memory than 32-bit systems
and 64-bit systems can create more process usually, so overflow may
occur more frequently.

This patch guarantees fairness when overflow happens on 64-bit systems.
Thanks to the optimization of compiler, it changes nothing on 32-bit.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 14:29:54 +02:00
Lai Jiangshan
2e084786f6 sched: fair group: fix overflow(was: fix divide by zero)
I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64)
(use 2^32, 2^33, ...., 2^63 as shares value)

# mkdir /dev/cpuctl
# mount -t cgroup -o cpu cpuctl /dev/cpuctl
# cd /dev/cpuctl
# mkdir sub
# echo 0x8000000000000000 > sub/cpu.shares
# echo $$ > sub/tasks
oops here! divide by zero.

This is because do_div() expects the 2th parameter to be 32 bits,
but unsigned long is 64 bits in x86_64.

Peter Zijstra pointed it out that the sane thing to do is limit the
shares value to something smaller instead of using an even more
expensive divide.

Also, I found another bug about "the shares value is too large":

pid1 and pid2 are set affinity to cpu#0
pid1 is attached to cg1 and pid2 is attached to cg2

if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
then pid2 got 100% usage of cpu, and pid1 0%

if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
then pid2 got 0% usage of cpu, and pid1 100%

And a weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so the shares value should be limited
to a smaller value.

I think that (1UL << 18) is a good limited value:

1) it's not too large, we can create a lot of group before overflow
2) it's several times the weight value for nice=-19 (not too small)

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 14:23:55 +02:00
Jiri Slaby
20764ff1ef ftrace: fix printout
Do not print loglevel before "entries of %ld bytes". Move it to the previous
pr_info.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-12 11:51:03 +02:00
Ankita Garg
2b1bce1787 ftrace: disable tracing when current_tracer is set to "none"
Found that inspite of setting the current_tracer to "none", trace from
the previous trace type continued to be collected. The patch below fixes
this and causes the trace to be disabled when the "none" type is
selected.

Compile and boot tested the patch for functionality.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 14:52:30 +02:00
Ingo Molnar
040ec23d07 sched: sched_clock() lockdep fix
Sitsofe Wheeler bisected the following commit to cause a lockdep to
warn about itself and turn itself off:

> commit c6531cce6e
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Mon May 12 21:21:14 2008 +0200
>
>     sched: do not trace sched_clock

do not use raw irq flags in cpu_clock() as it causes lockdep to lose
track of the true state of the IRQ flag.

Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-06-10 14:52:14 +02:00
Abhishek Sagar
34078a5e44 ftrace: prevent freeing of all failed updates
Steven Rostedt wrote:
> If we unload a module and reload it, will it ever get converted again?

The intent was always to filter core kernel functions to prevent their freeing.
Here's a fix which should allow re-recording of module call-sites.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:59:05 +02:00
Abhishek Sagar
eb9a7bf091 ftrace: add debugfs entry 'failures'
Identify functions which had their mcount call-site updates failed. This can
help us track functions which ftrace shouldn't fiddle with, and are thus not
being traced. If there is no race with any external agent which is modifying
the mcount call-site, then this file displays no entries (normal case).

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:58:17 +02:00
Abhishek Sagar
1d74f2a0f6 ftrace: remove ftrace_ip_converted()
Remove the unneeded function ftrace_ip_converted().

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:57:49 +02:00
Abhishek Sagar
0eb967012e ftrace: prevent freeing of all failed updates
Prevent freeing of records which cause problems and correspond to function from
core kernel text. A new flag, FTRACE_FL_CONVERTED is used to mark a record
as "converted". All other records are patched lazily to NOPs. Failed records
now also remain on frace_hash table. Each invocation of ftrace_record_ip now
checks whether the traced function has ever been recorded (including past
failures) and doesn't re-record it again.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:56:57 +02:00
Oleg Nesterov
16882c1e96 sched: fix TASK_WAKEKILL vs SIGKILL race
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
this allows us to do

	current->state = TASK_INTERRUPTIBLE;
	schedule();

without fear to sleep with pending signal.

However, the code like

	current->state = TASK_KILLABLE;
	schedule();

is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
no effect).

Introduce the new helper, signal_pending_state(), and change schedule() to
use it. Hopefully it will have more users, that is why the task's state is
passed separately.

Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
This is needed to preserve the current behaviour (ptrace_notify). I hope
this check will be removed soon, but this (afaics good) change needs the
separate discussion.

The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
basically the same that schedule() does now. However, this patch of course
bloats schedule().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-10 11:37:25 +02:00
Linus Torvalds
156a9ea43a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6:
  capabilities: remain source compatible with 32-bit raw legacy capability support.
  LSM: remove stale web site from MAINTAINERS
2008-06-06 11:31:55 -07:00