Regarding the bug addressed in:
4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task
Linus points out that the fix is not complete:
> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.
So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix divide by zero crash in scheduler rebalance irq
While testing the branch profiler, I hit this crash:
divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[<ffffffff8024a008>] [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
<IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
[<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
[<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
[<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
[<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
[<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
[<ffffffff8021047a>] do_softirq+0x94/0x1cd
[<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
[<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
[<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20
The code for cpu_avg_load_per_task has:
if (rq->nr_running)
rq->avg_load_per_task = rq->load.weight / rq->nr_running;
The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.
The branch profiler simply made the race window bigger.
This patch saves off the rq->nr_running to a local variable and uses that
for both the condition and the division.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: locking fix
We can't call cpuset_cpus_allowed_locked() with the rq lock held.
However, the rq lock merely protects us from (1) cpu_online_mask changing
and (2) someone else changing p->cpus_allowed.
The first can't happen because we're being called from a cpu hotplug
notifier. The second doesn't really matter: we are forcing the task off
a CPU it was affine to, so we're not doing very well anyway.
So we remove the rq lock from this path, and all is good.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Trivial API conversion
NR_CPUS -> nr_cpu_ids
cpumask_t -> struct cpumask
sizeof(cpumask_t) -> cpumask_size()
cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)
cpu_set() -> cpumask_set_cpu()
first_cpu() -> cpumask_first()
cpumask_of_cpu() -> cpumask_of()
cpus_* -> cpumask_*
There are some FIXMEs where we all archs to complete infrastructure
(patches have been sent):
cpu_coregroup_map -> cpu_coregroup_mask
node_to_cpumask* -> cpumask_of_node
There is also one FIXME where we pass an array of cpumasks to
partition_sched_domains(): this implies knowing the definition of
'struct cpumask' and the size of a cpumask. This will be fixed in a
future patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction, (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.
The fact cpupro_init is called both before and after the slab is
available makes for an ugly parameter unfortunately.
We also use cpumask_any_and to get rid of a temporary in cpupri_find.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction, (future) size reduction, cleanup
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
We can also use cpulist_parse() instead of doing it manually in
isolated_cpu_setup.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
stack space. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
In this case, we always alloced, but we don't need to any more.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space on the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Note the removal of the initializer of new_mask: since the first thing
we did was "cpus_and(new_mask, new_mask, cpus_allowed)" I just changed
that to "cpumask_and(new_mask, in_mask, cpus_allowed);".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
With some care, we can avoid needing a temporary cpumask (we can't
really allocate here, since we can't fail).
This version calls cpuset_cpus_allowed_locked() with the task_rq_lock
held. I'm fairly sure this works, but there might be a deadlock
hiding.
And of course, we can't get rid of the last cpumask on stack until we
can use cpumask_of_node instead of node_to_cpumask.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Some jiggling here to make sure we always exit at the bottom (so we hit
the free_cpumask_var there).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack usage reduction
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space in the stack. cpumask_var_t is just a struct cpumask for
!CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
def_root_domain is static, and so its masks are initialized with
alloc_bootmem_cpumask_var. After that, alloc_cpumask_var is used.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: (future) size reduction for large NR_CPUS.
We move the 'cpumask' member of sched_group to the end, so when we
kmalloc it we can do a minimal allocation: saves space for small
nr_cpu_ids but big CONFIG_NR_CPUS. Similar trick for 'span' in
sched_domain.
This isn't quite as good as converting to a cpumask_var_t, as some
sched_groups are actually static, but it's safer: we don't have to
figure out where to call alloc_cpumask_var/free_cpumask_var.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: trivial wrap of member accesses
This eases the transition in the next patch.
We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
for_each_cpu_and, and sched_balance_self() due to getting weight before
setting sd to NULL.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new API
any_online_cpu() is a good name, but it takes a cpumask_t, not a
pointer.
There are several places where any_online_cpu() doesn't really want a
mask arg at all. Replace all callers with cpumask_any() and
cpumask_any_and().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new general API
Using lots of allocs rather than one big alloc is less efficient, but
who cares for this setup function?
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: trivial API conversion
This is a simple conversion, but note that for_each_cpu() terminates
with i >= nr_cpu_ids, not i == NR_CPUS like for_each_cpu_mask() did.
I don't convert all of them: sd->span changes in a later patch, so
change those iterators there rather than here.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
* use node_to_cpumask_ptr in place of node_to_cpumask to reduce stack
requirements in sched.c
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use deeper function tracing depth safely
Some tests showed that function return tracing needed a more deeper depth
of function calls. But it could be unsafe to store these return addresses
to the stack.
So these arrays will now be allocated dynamically into task_struct of current
only when the tracer is activated.
Typical scheme when tracer is activated:
- allocate a return stack for each task in global list.
- fork: allocate the return stack for the newly created task
- exit: free return stack of current
- idle init: same as fork
I chose a default depth of 50. I don't have overruns anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This commit:
commit f7b4cddcc5
Author: Oleg Nesterov <oleg@tv-sign.ru>
Date: Tue Oct 16 23:30:56 2007 -0700
do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(ta
Currently move_task_off_dead_cpu() is called under
write_lock_irq(tasklist). This means it can't use task_lock() which is
needed to improve migrating to take task's ->cpuset into account.
Change the code to call move_task_off_dead_cpu() with irqs enabled, and
change migrate_live_tasks() to use read_lock(tasklist).
...forgot to update the comment in front of move_task_off_dead_cpu.
Reference: http://lkml.org/lkml/2008/6/23/135
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make load-balancing more consistent
In the update_shares() path leading to tg_shares_up(), the calculation of
per-cpu cfs_rq shares is rather erratic even under moderate task wake up
rate. The problem is that the per-cpu tg->cfs_rq load weight used in the
sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
are collected at different time. Under moderate system load, we've seen
quite a bit of variation on the cfs_rq->shares and ultimately wildly
affects sched_entity's load weight.
This patch caches the result of initial per-cpu load weight when doing the
sum calculation, and then pass it down to update_group_shares_cpu() for
redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq
shares across all CPUs. It also simplifies the rounding and zero load
weight check.
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: properly rebuild sched-domains on kmalloc() failure
When cpuset failed to generate sched domains due to kmalloc()
failure, the scheduler should fallback to the single partition
'fallback_doms' and rebuild sched domains, but now it only
destroys but not rebuilds sched domains.
The regression was introduced by:
| commit dfb512ec48
| Author: Max Krasnyansky <maxk@qualcomm.com>
| Date: Fri Aug 29 13:11:41 2008 -0700
|
| sched: arch_reinit_sched_domains() must destroy domains to force rebuild
After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove unnecessary accounting call
I don't actually understand account_steal_time() and I failed to find the
commit which added account_group_system_time(), but this looks bogus.
In any case rq->idle must be single-threaded, so it can't have ->totals.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: API *CHANGE*. Must update all tracepoint users.
Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
structure in a single spot for all the kernel. It helps reducing memory
consumption, especially when declaring a lot of tracepoints, e.g. for
kmalloc tracing.
*API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
to do it. The name previously used was misleading.
Updates scheduler instrumentation to follow this API change.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Maciej Rutecki reported:
> I have this bug during suspend to disk:
>
> [ 188.592151] Enabling non-boot CPUs ...
> [ 188.592151] SMP alternatives: switching to SMP code
> [ 188.666058] BUG: using smp_processor_id() in preemptible
> [00000000]
> code: suspend_to_disk/2934
> [ 188.666064] caller is native_sched_clock+0x2b/0x80
Which, as noted by Linus, was caused by me, via:
7cbaef9c "sched: optimize sched_clock() a bit"
Move the rq locking a bit earlier in the initialization sequence,
that will make the sched_clock() call in init_idle() non-preemptible.
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix load balancer load average calculation accuracy
cpu_avg_load_per_task() returns a stale value when nr_running is 0.
It returns an older stale (caculated when nr_running was non zero) value.
This patch returns and sets rq->avg_load_per_task to zero when nr_running
is 0.
Compile and boot tested on a x86_64 box.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve CPU time accounting of tasks under the cpu accounting controller
Add hierarchical accounting to cpu accounting controller and include
cpuacct documentation.
Currently, while charging the task's cputime to its accounting group,
the accounting group hierarchy isn't updated. This patch charges the cputime
of a task to its accounting group and all its parent accounting groups.
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix hang/crash on ia64 under high load
This is ugly, but the simplest patch by far.
Unlike other similar routines, account_group_exec_runtime() could be
called "implicitly" from within scheduler after exit_notify(). This
means we can race with the parent doing release_task(), we can't just
check ->signal != NULL.
Change __exit_signal() to do spin_unlock_wait(&task_rq(tsk)->lock)
before __cleanup_signal() to make sure ->signal can't be freed under
task_rq(tsk)->lock. Note that task_rq_unlock_wait() doesn't care
about the case when tsk changes cpu/rq under us, this should be OK.
Thanks to Ingo who nacked my previous buggy patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Doug Chapman <doug.chapman@hp.com>
Impact: clean up and fix debug info printout
While looking over the sched_debug code I noticed that we printed the rq
schedstats for every cfs_rq, ammend this.
Also change nr_spead_over into an int, and fix a little buglet in
min_vruntime printing.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The #if/#endif is ugly. Change SCHED_CPUMASK_ALLOC and
SCHED_CPUMASK_FREE to static inline functions.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix rare memory leak in the sched-domains manual reconfiguration code
In the failure path, rd is not attached to a sched domain,
so it causes a leak.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have a test case which measures the variation in the amount of time
needed to perform a fixed amount of work on the preempt_rt kernel. We
started seeing deterioration in it's performance recently. The test
should never take more than 10 microseconds, but we started 5-10%
failure rate.
Using elimination method, we traced the problem to commit
1b12bbc747 (lockdep: re-annotate
scheduler runqueues).
When LOCKDEP is disabled, this patch only adds an additional function
call to double_unlock_balance(). Hence I inlined double_unlock_balance()
and the problem went away. Here is a patch to make this change.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve/change/fix wakeup-buddy scheduling
Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.
This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.
This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.
The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.
This will basically allow a task to continue after it got preempted.
In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, add debug check
It's wrong to make dattr_new = NULL if doms_new == NULL, it introduces
memory leak if dattr_new != NULL. Fortunately dattr_new is always NULL
in this case. So remove the code and add a sanity check.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>