Configurable 16-bit UID and friends support
This allows turning off the legacy 16 bit UID interfaces on embedded platforms.
text data bss dec hex filename
3330172 529036 190556 4049764 3dcb64 vmlinux-baseline
3328268 529040 190556 4047864 3dc3f8 vmlinux
From: Adrian Bunk <bunk@stusta.de>
UID16 was accidentially disabled for !EMBEDDED.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Factor out common code for different RUSAGE_xxx cases.
Don't take ->sighand->siglock in RUSAGE_SELF case, suggested by Ravikiran G
Thirumalai <kiran@scalex86.org>.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use first_cpu(cpu_possible_map) for the single-thread workqueue case. We
used to hardcode 0, but that broke on systems where !cpu_possible(0) when
workqueue_struct->cpu_workqueue_struct was changed from a static array to
alloc_percpu.
Commit id bce61dd49d ("Fix hardcoded cpu=0 in
workqueue for per_cpu_ptr() calls") fixed that for Ben's funky sparc64
system, but it regressed my Power5. Offlining cpu 0 oopses upon the next
call to queue_work for a single-thread workqueue, because now we try to
manipulate per_cpu_ptr(wq->cpu_wq, 1), which is uninitialized.
So we need to establish an unchanging "slot" for single-thread workqueues
which will have a valid percpu allocation. Since alloc_percpu keys off of
cpu_possible_map, which must not change after initialization, make this
slot == first_cpu(cpu_possible_map).
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the redundant spinlock in the function resolve_symbol() as we are
not altering the module list, and we already hold the semaphore.
Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently TAINT_FORCED_RMMOD is totally unused. Because it is marked as
TAINT_FORCED_MODULE instead when user forced a module unload. This patch
marks it correctly
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ensure that an exported symbol does not already exist in the kernel or in
some other module's exported symbol table. This is done by checking the
symbol tables for the exported symbol at the time of loading the module.
Currently this is done after the relocation of the symbol.
Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Signed-off-by: Anand Krishnan <anandhkrishnan@yahoo.co.in>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_setpgid() allows to change ->pgrp of ptraced childs.
'man setpgid' does not tell anything about that, so I consider
this behaviour is a bug.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
setsid() does not work unless the calling process is a
thread_group_leader().
'man setpgid' does not tell anything about that, so I consider this
behaviour is a bug.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
setpgid(0, pgid) or setpgid(forked_child_pid, pgid) does not work unless
the calling process is a thread_group_leader().
'man setpgid' does not tell anything about that, so I consider this
behaviour is a bug.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In fork, child should recopy parent's pgrp/tty after it has tasklist_lock.
Otherwise following a setpgid() on the parent, *after* copy_signal(), the
child will own a stale pgrp (which may be reused); (eg. if copy_mm()
sleeps a long while due to memory pressure). Similar issue for the tty.
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem. It is expected that /sbin/halt -p works exactly like
/sbin/halt, when the kernel does not implement power off functionality.
The kernel can do a lot of work in the reboot notifiers and in
device_shutdown before we even get to machine_power_off. Some of that
shutdown is not safe if you are leaving the power on, and it definitely
gets in the way of using sysrq or pressing ctrl-alt-del. Since the
shutdown happens in generic code there is no way to fix this in
architecture specific code :(
Some machines are kernel oopsing today because of this.
The simple solution is to turn LINUX_REBOOT_CMD_POWER_OFF into
LINUX_REBOOT_CMD_HALT if power_off functionality is not implemented.
This has the unfortunate side effect of disabling the power off
functionality on architectures that leave pm_power_off to null and still
implement something in machine_power_off. And it will break the build on
some architectures that don't have a pm_power_off variable at all.
On both counts I say tough.
For architectures like alpha that don't implement the pm_power_off variable
pm_power_off is declared in linux/pm.h and it is a generic part of our
power management code, and all architectures should implement it.
For architectures like parisc that have a default power off method in
machine_power_off if pm_power_off is not implemented or fails. It is easy
enough to set the pm_power_off variable. And nothing bad happens there,
the machines just stop powering off.
The current semantics are impossible without a flag at the top level so we
can avoid the problem code if a power off is not implemented. pm_power_off
is as good a flag as any with the bonus that it works without modification
on at least x86, x86_64, powerpc, and ppc today.
Andrew can you pick this up and put this in the mm tree. Kernels that
don't compile or don't power off seem saner than kernels that oops or
panic. Until we get the arch specific patches for the problem
architectures this probably isn't smart to push into the stable kernel.
Unfortunately I don't have the time at the moment to walk through every
architecture and make them work. And even if I did I couldn't test it :(
From: Hirokazu Takata <takata@linux-m32r.org>
Add pm_power_off() for build fix of arch/m32r/kernel/process.c.
From: Miklos Szeredi <miklos@szeredi.hu>
UML build fix
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hayato Fujiwara <fujiwara@linux-m32r.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch forces RCU torture threads off various CPUs in the system
allowing them to become idle and go tickless. Meant to test support for
such tickless idle CPU in RCU.
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kernels that have had Windows drivers loaded into them are undebuggable.
I've wasted a number of hours chasing bugs filed in Fedora bugzilla only to
find out much later that the user had used such 'helpers', and their
problems were unreproducable without them loaded.
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some long time ago, dentry struct was carefully tuned so that on 32 bits
UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
of memory cache lines.
Then RCU was added and dentry struct enlarged by two pointers, with nice
results for SMP, but not so good on UP, because breaking the above tuning
(128 + 8 = 136 bytes)
This patch reverts this unwanted side effect, by using an union (d_u),
where d_rcu and d_child are placed so that these two fields can share their
memory needs.
At the time d_free() is called (and d_rcu is really used), d_child is known
to be empty and not touched by the dentry freeing.
Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
the previous content of d_child is not needed if said dentry was unhashed
but still accessed by a CPU because of RCU constraints)
As dentry cache easily contains millions of entries, a size reduction is
worth the extra complexity of the ugly C union.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Paul Jackson <pj@sgi.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes unneeded sig->curr_target recalculation under 'if
(atomic_dec_and_test(&sig->count))' in __exit_signal().
When sig->count == 0 the signal can't be sent to this task and
next_thread(tsk) == tsk anyway.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
zap_other_threads() sets SIGNAL_GROUP_EXIT at the very start,
do_group_exit() doesn't need to do it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__create_workqueue() not checking return of alloc_percpu()
NULL dereference was possible.
Signed-off-by: Ben Collins <bcollins@ubuntu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While rooting aroung in the signal code trying to understand how to fix the
SIG_IGN ploy (set sig handler to SIG_IGN and flood system with high speed
repeating timers) I came across what, I think, is a problem in sigaction()
in that when processing a SIG_IGN request it flushes signals from 1 to
SIGRTMIN and leaves the rest. Attempt to fix this.
Signed-off-by: George Anzinger <george@mvista.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
What's the true meaning of the printk return value? Should it include the
priority prefix length of 3? and what about the timing information? In
both cases it was broken:
strace -e write echo 1 > /dev/kmsg
=> write(1, "1\n", 2) = 5
strace -e write echo "<1>1" > /dev/kmsg
=> write(1, "<1>1\n", 5) = 8
The returned length was "length of input string + 3", I made it "length
of string output to the log buffer".
Note that I couldn't find any printk caller in the kernel interested by its
return value besides kmsg_write.
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Acked-By: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ptrace_get_task_struct() helper that I added as part of the ptrace
consolidation is useful in variety of places that currently opencode it.
Switch them to the common helpers.
Add a ptrace_traceme() helper that needs to be explicitly called, and simplify
the ptrace_get_task_struct() interface. We don't need the request argument
now, and we return the task_struct directly, using ERR_PTR() for error
returns. It's a bit more code in the callers, but we have two sane routines
that do one thing well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use atomic_inc_not_zero for rcu files instead of special case rcuref.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the following cleanups:
- make needlessly global functions static
- every file should include the headers containing the prototypes for
it's global functions
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For systems that aren't using cpusets, but have them CONFIG_CPUSET enabled in
their kernel (eventually this may be most distribution kernels), this patch
removes even the minimal rcu_read_lock() from the memory page allocation path.
Actually, it removes that rcu call for any task that is in the root cpuset
(top_cpuset), which on systems not actively using cpusets, is all tasks.
We don't need the rcu check for tasks in the top_cpuset, because the
top_cpuset is statically allocated, so at no risk of being freed out from
underneath us.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark cpuset global 'number_of_cpusets' as __read_mostly.
This global is accessed everytime a zone is considered in the zonelist loops
beneath __alloc_pages, looking for a free memory page. If number_of_cpusets
is just one, then we can short circuit the mems_allowed check.
Since this global is read alot on a hot path, and written rarely, it is an
excellent candidate for __read_mostly.
Thanks to Christoph Lameter for the suggestion.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Optimize the cpuset impact on page allocation, the most performance critical
cpuset hook in the kernel.
On each page allocation, the cpuset hook needs to check for a possible change
in the current tasks cpuset. It can now handle the common case, of no change,
without taking any spinlock or semaphore, thanks to RCU.
Convert a spinlock on the current task to an rcu_read_lock(), saving
approximately a memory barrier and an atomic op, depending on architecture.
This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
freeing up a detached cpuset until after any critical sections referencing
that pointer.
Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove a couple of more lines of code from the cpuset hooks in the page
allocation code path.
There was a check for a NULL cpuset pointer in the routine
cpuset_update_task_memory_state() that was only needed during system boot,
after the memory subsystem was initialized, before the cpuset subsystem was
initialized, to catch a NULL task->cpuset pointer.
Add a cpuset_init_early() routine, just before the mem_init() call in
init/main.c, that sets up just enough of the init tasks cpuset structure to
render cpuset_update_task_memory_state() calls harmless.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Given the mechanism in the previous patch to handle rebinding the per-vma
mempolicies of all tasks in a cpuset that changes its memory placement, it is
now easier to handle the page migration requirements of such tasks at the same
time.
The previous code didn't actually attempt to migrate the pages of the tasks in
a cpuset whose memory placement changed until the next time each such task
tried to allocate memory. This was undesirable, as users invoking memory page
migration exected to happen when the placement changed, not some unspecified
time later when the task needed more memory.
It is now trivial to handle the page migration at the same time as the per-vma
rebinding is done.
The routine cpuset.c:update_nodemask(), which handles changing a cpusets
memory placement ('mems') now checks for the special case of being asked to
write a placement that is the same as before. It was harmless enough before
to just recompute everything again, even though nothing had changed. But page
migration is a heavy weight operation - moving pages about. So now it is
worth avoiding that if asked to move a cpuset to its current location.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix more of longstanding bug in cpuset/mempolicy interaction.
NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset
to just the Memory Nodes allowed by that cpuset. The kernel maintains
internal state for each mempolicy, tracking what nodes are used for the
MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
When a tasks cpuset memory placement changes, whether because the cpuset
changed, or because the task was attached to a different cpuset, then the
tasks mempolicies have to be rebound to the new cpuset placement, so as to
preserve the cpuset-relative numbering of the nodes in that policy.
An earlier fix handled such mempolicy rebinding for mempolicies attached to a
task.
This fix rebinds mempolicies attached to vma's (address ranges in a tasks
address space.) Due to the need to hold the task->mm->mmap_sem semaphore while
updating vma's, the rebinding of vma mempolicies has to be done when the
cpuset memory placement is changed, at which time mmap_sem can be safely
acquired. The tasks mempolicy is rebound later, when the task next attempts
to allocate memory and notices that its task->cpuset_mems_generation is
out-of-date with its cpusets mems_generation.
Because walking the tasklist to find all tasks attached to a changing cpuset
requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
affected tasks while doing the tasklist scan. In general, one cannot acquire
a semaphore (which can sleep) while already holding a spinlock (such as
tasklist_lock). So a list of mm references has to be built up during the
tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
acquired, and the vma's in that mm rebound.
Once the tasklist lock is dropped, affected tasks may fork new tasks, before
their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
point to the cpuset being rebound (there can only be one; cpuset modifications
are done under a global 'manage_sem' semaphore), and the mpol_copy code that
is used to copy a tasks mempolicies during fork catches such forking tasks,
and ensures their children are also rebound.
When a task is moved to a different cpuset, it is easier, as there is only one
task involved. It's mm->vma's are scanned, using the same
mpol_rebind_policy() as used above.
It may happen that both the mpol_copy hook and the update done via the
tasklist scan update the same mm twice. This is ok, as the mempolicies of
each vma in an mm keep track of what mems_allowed they are relative to, and
safely no-op a second request to rebind to the same nodes.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Easy little optimization hack to avoid actually having to call
cpuset_zone_allowed() and check mems_allowed, in the main page allocation
routine, __alloc_pages(). This saves several CPU cycles per page allocation
on systems not using cpusets.
A counter is updated each time a cpuset is created or removed, and whenever
there is only one cpuset in the system, it must be the root cpuset, which
contains all CPUs and all Memory Nodes. In that case, when the counter is
one, all allocations are allowed.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cleanup, reorganize and make more robust the mempolicy.c code to rebind
mempolicies relative to the containing cpuset after a tasks memory placement
changes.
The real motivator for this cleanup patch is to lay more groundwork for the
upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
after the containing cpuset memory placement changes.
NUMA mempolicies are constrained by the cpuset their task is a member of.
When either (1) a task is moved to a different cpuset, or (2) the 'mems'
mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
be recalculated, relative to their new cpuset placement.
The old code used an unreliable method of determining what was the old
mems_allowed constraining the mempolicy. It just looked at the tasks
mems_allowed value. This sort of worked with the present code, that just
rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
referring to the old nodes. But in an upcoming patch, the vma mempolicies
will be rebound as well. Then the order in which the various task and vma
mempolicies are updated will no longer be deterministic, and one can no longer
count on the task->mems_allowed holding the old value for as long as needed.
It's not even clear if the current code was guaranteed to work reliably for
task mempolicies.
So I added a mems_allowed field to each mempolicy, stating exactly what
mems_allowed the policy is relative to, and updated synchronously and reliably
anytime that the mempolicy is rebound.
Also removed a useless wrapper routine, numa_policy_rebind(), and had its
caller, cpuset_update_task_memory_state(), call directly to the rewritten
policy_rebind() routine, and made that rebind routine extern instead of
static, and added a "mpol_" prefix to its name, making it
mpol_rebind_policy().
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
needed, to obtain the mems_allowed vector of a cpuset, and replaced the
workaround in sys_migrate_pages() to call this new method.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The important code paths through alloc_pages_current() and alloc_page_vma(),
by which most kernel page allocations go, both called
cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
-Both- of these latter two routines did a tasklock, got the tasks cpuset
pointer, and checked for out of date cpuset->mems_generation.
That was a silly duplication of code and waste of CPU cycles on an important
code path.
Consolidated those two routines into a single routine, called
cpuset_update_task_memory_state(), since it updates more than just
mems_allowed.
Changed all callers of either routine to call the new consolidated routine.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix obscure, never seen in real life, cpuset fork race. The cpuset_fork()
call in fork.c was setting up the correct task->cpuset pointer after the
tasklist_lock was dropped, which briefly exposed the newly forked process with
an unsafe (copied from parent without locks or usage counter increment) cpuset
pointer.
In theory, that exposed cpuset pointer could have been pointing at a cpuset
that was already freed and removed, and in theory another task that had been
sitting on the tasklist_lock waiting to scan the task list could have raced
down the entire tasklist, found our new child at the far end, and dereferenced
that bogus cpuset pointer.
To fix, setup up the correct cpuset pointer in the new child by calling
cpuset_fork() before the new task is linked into the tasklist, and with that,
add a fork failure case, to dereference that cpuset, if the fork fails along
the way, after cpuset_fork() was called.
Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
the call to cpuset_exit() from a failed fork would not have PF_EXITING set.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
removing embedded returns and nested if's in favor of goto completion labels.
This is being done in anticipation of adding more logic to this routine, which
will favor the goto style structure.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
mark one __read_mostly.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
that the tasks in a cpuset call try_to_free_pages(), the synchronous
(direct) memory reclaim code.
This enables batch managers monitoring jobs running in dedicated cpusets to
efficiently detect what level of memory pressure that job is causing.
This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or reprioritize jobs that are
trying to use more memory than allowed on the nodes assigned them, and with
tightly coupled, long running, massively parallel scientific computing jobs
that will dramatically fail to meet required performance goals if they
start to use more memory than allowed to them.
This patch just provides a very economical way for the batch manager to
monitor a cpuset for signs of memory pressure. It's up to the batch
manager or other user code to decide what to do about it and take action.
==> Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only
systems that enable this feature will compute the metric.
Why a per-cpuset, running average:
Because this meter is per-cpuset, rather than per-task or mm, the
system load imposed by a batch scheduler monitoring this metric is
sharply reduced on large systems, because a scan of the tasklist can be
avoided on each set of queries.
Because this meter is a running average, instead of an accumulating
counter, a batch scheduler can detect memory pressure with a single
read, instead of having to read and accumulate results for a period of
time.
Because this meter is per-cpuset rather than per-task or mm, the
batch scheduler can obtain the key information, memory pressure in a
cpuset, with a single read, rather than having to query and accumulate
results over all the (dynamically changing) set of tasks in the cpuset.
A per-cpuset simple digital filter (requires a spinlock and 3 words of data
per-cpuset) is kept, and updated by any task attached to that cpuset, if it
enters the synchronous (direct) page reclaim code.
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by the tasks
in the cpuset, in units of reclaims attempted per second, times 1000.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
conversion had left one routine using bitmaps, since it involved a
corresponding change to kernel/cpuset.c
Fix that interface by replacing with a simple macro that calls nodes_subset(),
or if !CONFIG_CPUSET, returns (1).
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some simplification in checking signal delivery against concurrent exit.
Instead of using get_task_struct_rcu(), which increments the task_struct
reference count, check the reference count after acquiring sighand lock.
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
instead of tasklist_lock read-locked. This is a scalability improvement on
SMP and a preemption-latency improvement under PREEMPT_RCU.
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: William Irwin <wli@holomorphy.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.
However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size. As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp
With this patch, L1_CACHE_SHIFT_MAX can be killed
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a boolean "memory_migrate" to each cpuset, represented by a file
containing "0" or "1" in each directory below /dev/cpuset.
It defaults to false (file contains "0"). It can be set true by writing
"1" to the file.
If true, then anytime that a task is attached to the cpuset so marked, the
pages of that task will be moved to that cpuset, preserving, to the extent
practical, the cpuset-relative placement of the pages.
Also anytime that a cpuset so marked has its memory placement changed (by
writing to its "mems" file), the tasks in that cpuset will have their pages
moved to the cpusets new nodes, preserving, to the extent practical, the
cpuset-relative placement of the moved pages.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_migrate_pages implementation using swap based page migration
This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
The intent of sys_migrate is to migrate memory of a process. A process may
have migrated to another node. Memory was allocated optimally for the prior
context. sys_migrate_pages allows to shift the memory to the new node.
sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory. Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed. However, a user may decide
to manually control the migration.
This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory. sys_migrate_pages does not modify policies in contrast to Ray's
implementation.
The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset). When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets. The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.
Patch supports ia64, i386 and x86_64.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
swap migration's isolate_lru_page() currently uses an IPI to notify other
processors that the lru caches need to be drained if the page cannot be
found on the LRU. The IPI interrupt may interrupt a processor that is just
processing lru requests and cause a race condition.
This patch introduces a new function run_on_each_cpu() that uses the
keventd() to run the LRU draining on each processor. Processors disable
preemption when dealing the LRU caches (these are per processor) and thus
executing LRU draining from another process is safe.
Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race
condition.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists. This patch makes these two
variables configurable through /proc interface.
A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list. The min value for this is 8. It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.
The batch value of each per cpu pagelist is also updated as a result. It
is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can. THis
operation requires root permissions.
It won't drop dirty data, so the user should run `sync' first.
Caveats:
a) Holds inode_lock for exorbitant amounts of time.
b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.
This is a debugging feature: useful for getting consistent results between
filesystem benchmarks. We could possibly put it under a config option, but
it's less than 300 bytes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
I've spent the past 3 days digging into a glibc testsuite failure in
current CVS, specifically libc/rt/tst-cputimer1.c The thr1 and thr2
timers fire too early in the second pass of this test. The second
pass is noteworthy because it makes use of intervals, whereas the
first pass does not.
All throughout the posix-cpu-timers.c code, the calculation of the
process sched_time sum is implemented roughly as:
unsigned long long sum;
sum = tsk->signal->sched_time;
t = tsk;
do {
sum += t->sched_time;
t = next_thread(t);
} while (t != tsk);
In fact this is the exact scheme used by check_process_timers().
In the case of check_process_timers(), current->sched_time has just
been updated (via scheduler_tick(), which is invoked by
update_process_times(), which subsequently invokes
run_posix_cpu_timers()) So there is no special processing necessary
wrt. that.
In other contexts, we have to allot for the fact that tsk->sched_time
might be a bit out of date if we are current. And the
posix-cpu-timers.c code uses current_sched_time() to deal with that.
Unfortunately it does so in an erroneous and inconsistent manner in
one spot which is what results in the early timer firing.
In cpu_clock_sample_group_locked(), it does this:
cpu->sched = p->signal->sched_time;
/* Add in each other live thread. */
while ((t = next_thread(t)) != p) {
cpu->sched += t->sched_time;
}
if (p->tgid == current->tgid) {
/*
* We're sampling ourselves, so include the
* cycles not yet banked. We still omit
* other threads running on other CPUs,
* so the total can always be behind as
* much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
*/
cpu->sched += current_sched_time(current);
} else {
cpu->sched += p->sched_time;
}
The problem is the "p->tgid == current->tgid" test. If "p" is
not current, and the tgids are the same, we will add the process
t->sched_time twice into cpu->sched and omit "p"'s sched_time
which is very very very wrong.
posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
of this, so my fix is to use that here instead of this special tgid
test.
The fact that current can be one of the sub-threads of "p" points out
that we could make things a little bit more accurate, perhaps by using
sched_ns() on every thread we process in these loops. It also points
out that we don't use the most accurate value for threads in the group
actively running other cpus (and this is mentioned in the comment).
But that is a future enhancement, and this fix here definitely makes
sense.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes an issue reported by Coverity in kernel/module.c
Error reported: Cannot reach this line of code "else return ptr;"
Patch description:
This is the error path, so 'err' will be negative, the else case
is not required, this patch removes it.
Signed-off-by: Jayachandran C. <c.jayachandran@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X,
ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by
S390, 64BIT and COMPAT.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are some more places where the use of cputime_t instead of an integer
type and the associated macros is necessary for the virtual cputime accounting
on s390. Affected are the s390 specific appldata code and BSD process
accounting.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes the swsusp_info structure become the header of the image in the
literal sense (ie. it is saved to the swap and read before any other image
data with the help of the swsusp's swap map structure, so generally it is
treated in the same way as the rest of the image).
The main thing it does is to make swsusp_header contain the offset of the swap
map used to track the image data pages rather than the offset of swsusp_info.
Simultaneously, swsusp_info becomes the first image page written to the swap.
The other changes are generally consequences of the above with a few
exceptions (there's some consolidation in the image reading part as a few
functions turn into trivial wrappers around something else).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This changes the handling of swap partitions by swsusp to avoid locking of the
swap devices that are not used for suspend and, consequently, simplifies the
code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the suspend image size limit tunable via /sys/power/image_size.
It is necessary for systems on which there is a limited amount of swap
available for suspend. It can also be useful for optimizing performance of
swsusp on systems with 1 GB of RAM or more.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Limit the size of the suspend image to approx. 500 MB, which should
improve the overall performance of swsusp on systems with more than 1 GB of
RAM.
It introduces the constant IMAGE_SIZE that can be set to the preferred size
of the image (in MB) and modifies the memory-shrinking part of swsusp to
take this constant into account (500 is the default value of IMAGE_SIZE).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These two prototypes are already present in sched.h, remove duplicate
version.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a problem with the function enough_free_mem() used by
swsusp to verify if there is a sufficient number of memory pages available
to it to create and save the suspend image.
Namely, enough_free_mem() uses nr_free_pages() to obtain the number of free
memory pages, which is incorrect, because this function returns the total
number of free pages, including free highmem pages, and the highmem pages
cannot be used by swsusp for storing the image data.
The patch makes enough_free_mem() avoid counting the free highmem
pages as available to swsusp.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes swsusp free only as much memory as needed to complete the
suspend and not as much as possible. In the most of cases this should speed
up the suspend and make the system much more responsive after resume,
especially if a GUI (eg. X Windows) is used.
If needed, the old behavior (ie to free as much memory as possible during
suspend) can be restored by unsetting FAST_FREE in power.h
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces the swap map structure that can be used by swsusp for
keeping tracks of data pages written to the swap. The structure itself is
described in a comment within the patch.
The overall idea is to reduce the amount of metadata written to the swap and
to write and read the image pages sequentially, in a file-alike way. This
makes the swap-handling part of swsusp fairly independent of its
snapshot-handling part and will hopefully allow us to completely separate
these two parts in the future.
This patch is needed to remove the suspend image size limit imposed by the
limited size of the swsusp_info structure, which is essential for x86-64
systems with more than 512 MB of RAM.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the image encryption that is only used by swsusp instead of
zeroing the image after resume in order to prevent someone from reading some
confidential data from it in the future and it does not protect the image from
being read by an unauthorized person before resume. The functionality it
provides should really belong to the user space and will possibly be
reimplemented after the swap-handling functionality of swsusp is moved to the
user space.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks to Christoph for doing most of the work.
This allows automatic SMP IRQ affinity assignment other than default "all
interrupts on all CPUs" which is rather expensive. This might be useful if
the hardware can be programmed to distribute interrupts among different
CPUs, like Alpha does.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the futex code compilable and usable on NOMMU by making the attempt to
handle page faults conditional on CONFIG_MMU. If this is not enabled, then
we can assume that EFAULT returned from futex_atomic_op_inuser() is not
recoverable, and that the address lies outside of valid memory.
handle_mm_fault() is made to BUG if called on NOMMU without attempting to
invoke the actual handler (__handle_mm_fault).
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- This function returns -EINVAL all the time. Fix.
- Decruftify it a bit too.
- Writing to it doesn't seem to do what it's suppoed to do.
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
lib/lib.a(kobject_uevent.o)(.text+0x25f): In function `kobject_uevent':
: undefined reference to `__alloc_skb'
lib/lib.a(kobject_uevent.o)(.text+0x2a1): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x31d): In function `kobject_uevent':
: undefined reference to `skb_over_panic'
lib/lib.a(kobject_uevent.o)(.text+0x356): In function `kobject_uevent':
: undefined reference to `netlink_broadcast'
lib/lib.a(kobject_uevent.o)(.init.text+0x9): In function `kobject_uevent_init':
: undefined reference to `netlink_kernel_create'
make: *** [.tmp_vmlinux1] Error 1
Netlink is unconditionally enabled if CONFIG_NET, so that's OK.
kobject_uevent.o is compiled even if !CONFIG_HOTPLUG, which is lazy.
Let's compound the sin.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Leave the overloaded "hotplug" word to susbsystems which are handling
real devices. The driver core does not "plug" anything, it just exports
the state to userspace and generates events.
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This deprecates the /proc/sys/kernel/hotplug file, as all
this stuff should be in /sys some day, right? :)
In /sys/kernel/ we have now uevent_seqnum and uevent_helper.
The seqnum is no longer used by udev, as the version for this
kernel depends on netlink which events will never get
out-of-order.
Recent udev versions disable the /sbin/hotplug helper with
an init script, cause it leads to OOM on big boxes by running
hundreds of shells in parallel. It should be done now by:
echo "" > /sys/kernel/uevent_helper
(Note that "-n" does not work, cause neighter proc nor sysfs
support truncate().)
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It makes zero sense to have hotplug, but not the netlink
events enabled today. Remove this option and merge the
kobject_uevent.h header into the kobject.h header file.
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch removes all references to the bouncing address
rddunlap@osdl.org and one dead web page from the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
This is a slightly more complete fix for the previous minimal sysctl
string fix. It always terminates the returned string with a NUL, even
if the full result wouldn't fit in the user-supplied buffer.
The returned length is the full untruncated length, so that you can
tell when truncation has occurred.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For the sysctl syscall, if the user wants to get the old value of a
sysctl entry and set a new value for it in the same syscall, the old
value is always overwritten by the new value if the sysctl entry is of
string type and if the user sets its strategy to sysctl_string. This
issue lies in the strategy being run twice if the strategy is set to
sysctl_string, the general strategy sysctl_string always returns 0 if
success.
Such strategy routines as sysctl_jiffies and sysctl_jiffies_ms return 1
because they do read and write for the sysctl entry.
The strategy routine sysctl_string return 0 although it actually read
and write the sysctl entry.
According to my analysis, if a strategy routine do read and write, it
should return 1, if it just does some necessary check but not read and
write, it should return 0, for example sysctl_intvec.
Signed-off-by: Yi Yang <yang.y.yi@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the string was too long to fit in the user-supplied buffer,
the sysctl layer would zero-terminate it by writing past the
end of the buffer. Don't do that.
Noticed by Yi Yang <yang.y.yi@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr
in wake_futex() can become visible before wake_up_all() clears the lock in the
futex_q.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All the work was done to setup the file and maintain the file handles but
the access functions were zeroed out due to the #ifdef. Removing the
#ifdef allows full access to all the parameters when CONFIG_MODULES=n.
akpm: put it back again, but use CONFIG_SYSFS instead.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
http://bugzilla.kernel.org/show_bug.cgi?id=4320
Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
When multiple probes are registered at the same address and if due to some
recursion (probe getting triggered within a probe handler), we skip calling
pre_handlers and just increment nmissed field.
The below patch make sure it walks the list for multiple probes case.
Without the below patch we get incorrect results of nmissed count for
multiple probe case.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For Kprobes critical path is the path from debug break exception handler
till the control reaches kprobes exception code. No probes can be
supported in this path as we will end up in recursion.
This patch prevents this by moving the below function to safe __kprobes
section onto which no probes can be inserted.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kauditd was causing suspends to fail because it refused to freeze. Adding
a try_to_freeze() to its sleep loop solves the issue.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Acked-by: Pavel Machek <pavel@suse.cz>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When registering multiple kprobes at the same address, we leave a small
window where the kprobe hlist will not contain a reference to the
registered kprobe, leading to potentially, a system crash if the breakpoint
is hit on another processor.
Patch below now automically relpace the old kprobe with the new
kprobe from the hash list.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are several functions that might seem appropriate for a timestamp:
get_cycles()
current_kernel_time()
do_gettimeofday()
<read jiffies/jiffies_64>
Each has problems with combinations of SMP-safety, low resolution, and
monotonicity. This patch adds a new function that returns a monotonic SMP-safe
timestamp with nanosecond resolution where available.
Changes:
Split timestamp into separate patch
Moved to kernel/time.c
Renamed to getnstimestamp
Fixed unintended-pointer-arithmetic bug
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Accessing nohz_cpu_mask before incrementing rcp->cur is racy. It can cause
tickless idle CPUs to be included in rsp->cpumask, which will extend
graceperiods unnecessarily.
Fix this race. It has been tested using extensions to RCU torture module
that forces various CPUs to become idle.
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While doing some test of RCU torture module, I hit a OOPS in rcu_do_batch,
which was trying to processes callback of a module that was just removed.
This is because we weren't waiting long enough for all callbacks to fire.
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This introduces a new interface - rcu_barrier() which waits until all
the RCUs queued until this call have been completed.
Reiser4 needs this, because we do more than just freeing memory object
in our RCU callback: we also remove it from the list hanging off
super-block. This means, that before freeing reiser4-specific portion
of super-block (during umount) we have to wait until all pending RCU
callbacks are executed.
The only change of reiser4 made to the original patch, is exporting of
rcu_barrier().
Cc: Hans Reiser <reiser@namesys.com>
Cc: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a Kprobes are inserted/removed on a modules, the modules must be ref
counted so as not to allow to unload while probes are registered on that
module.
Without this patch, the probed module is free to unload, and when the
probing module unregister the probe, the kpobes code while trying to
replace the original instruction might crash.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Mao Bibo <bibo.mao@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On SMP resume from S3, we reset (INIT) the non-boot
processors to boot them cleanly. But the BIOS needs
to execute _WAK after INIT in order to properly
initialized these processors upon resume.
http://bugzilla.kernel.org/show_bug.cgi?id=5651
Signed-off-by: David Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Fix swsusp on machines not supporting S4. With recent changes, it is not
possible to trigger it using /sys filesystem. Swsusp does not really need
any support from low-level code, it is possible to reboot or halt at the
end of suspend.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
set_page_dirty() will not cope with being handed a page * which is part of
a compound page, but not the master page in that compound page. This case
can occur via access_process_vm() if you attemp to write to another
process's hugepage memory area using ptrace() (causing an oops or hang).
This patch fixes the bug by only calling set_page_dirty() from
access_process_vm() if the page is not a compound page. We already use a
similar fix in bio_set_pages_dirty() for the case of direct io to
hugepages.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the cpuset_fork() call below the write_unlock_irq call in
kernel/fork.c copy_process().
Since the cpuset-dual-semaphore-locking-overhaul.patch, the cpuset_fork()
routine acquires task_lock(), so cannot be called while holding the
tasklist_lock for write.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tracked this down on an Ultra Enterprise 3000. It's a 6-way machine. Odd
thing about this machine (and it's good for finding bugs like this) is that
the CPU id's are not 0 based. For instance, on my machine the CPU's are
6/7/10/11/14/15.
This caused some NULL pointer dereference in kernel/workqueue.c because for
single_threaded workqueue's, it hardcoded the cpu to 0.
I changed the 0's to any_online_cpu(cpu_online_mask), which cpumask.h
claims is "First cpu in mask". So this fits the same usage.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't do that - it does GFP_KERNEL allocations, for a start.
(Reported by Guillaume Thouvenin <guillaume.thouvenin@bull.net>)
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are some callers in cpufreq hotplug notify path that the lowest
function calls lock_cpu_hotplug(). The lock is already held during
cpu_up() and cpu_down() calls when the notify calls are broadcast to
registered clients.
Ideally if possible, we could disable_preempt() at the highest caller and
make sure we dont sleep in the path down in cpufreq->driver_target() calls
but the calls are so intertwined and cumbersome to cleanup.
Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in
all places.
- Removed export of cpucontrol semaphore and made it static.
- removed explicit uses of up/down with lock_cpu_hotplug()
so we can keep track of the the callers in same thread context and
just keep refcounts without calling a down() that causes a deadlock.
- Removed current_in_hotplug() uses
- Removed PF_HOTPLUG_CPU in sched.h introduced for the current_in_hotplug()
temporary workaround.
Tested with insmod of cpufreq_stat.ko, and logical online/offline
to make sure we dont have any hang situations.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: Zwane Mwaikambo <zwane@linuxpower.ca>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If unregister_console() is inadvertently called while no consoles are
registered, it will crash trying to dereference NULL pointer. It is
necessary to fix that because register_console() provides no indication
that it actually registered the console passed in. In fact, it may well
decide not to register it based on various things...
(akpm: It'd be better to make register_console() return something and fix the
callers. All 106 of them...)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The follow_page changes in get_futex_key have left it with two almost
identical blocks, when handling the rare case of a futex in a nonlinear vma.
get_user_pages will itself do that follow_page, and its additional
find_extend_vma is hardly any overhead since the vma is already cached. Let's
just delete the follow_page block and let get_user_pages do it.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most of the functions already check. Do the ones that didn't.
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For copy_one_pte's print_bad_pte to show the task correctly (instead of
"???"), dup_mmap must pass down parent vma rather than child vma.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One issue with the RCU torture test is that the current error flagging can
be lost in dmesg. This patch adds a "SUCCESS"/"FAILURE" string to the line
that flags the end of the test, where it can easily be seen with "dmesg |
tail" at the end of the test. Also adds tests of architecture-specific
memory barriers -- or, more likely, of the RCU torture test itself.
Cc: <vatsa@in.ibm.com>
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It would appear that the timespec normalize code has an off by one error.
Found in three places. Thanks to Ben for spotting.
Signed-off-by: George Anzinger<george@mvista.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sync iocbs have a life cycle that don't need a kioctx. Their retrying, if
any, is done in the context of their owner who has allocated them on the
stack.
The sole user of a sync iocb's ctx reference was aio_complete() checking for
an elevated iocb ref count that could never happen. No path which grabs an
iocb ref has access to sync iocbs.
If we were to implement sync iocb cancelation it would be done by the owner of
the iocb using its on-stack reference.
Removing this chunk from aio_complete allows us to remove the entire kioctx
instance from mm_struct, reducing its size by a third. On a i386 testing box
the slab size went from 768 to 504 bytes and from 5 to 8 per page.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes deadlock of stop_machine() vs. synchronous IPI send. The
problem is that stop_machine() disables interrupts before disabling
preemption on other CPUs. So if another CPU is preempted and then calls
something like flush_tlb_all() it will deadlock with CPU doing
stop_machine() and which can't process IPI due to disabled IRQs.
I changed stop_machine() to do the same things exactly as it does on other
CPUs, i.e. it should disable preemption first on _all_ CPUs including
itself and only after that disable IRQs.
Signed-off-by: Kirill Korotaev <dev@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Andrey Savochkin" <saw@sawoct.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the box usable for interactive work when running the RCU torture test,
by renicing the RCU torture-test threads to +19 by default. Kthreads run
at nice -5 by default.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch reverts commit c33880aadd since
it's not needed anymore. As pointed out by Roland McGrath the real fix
is to deliver all signals before returning to user space.
See http://www.ussg.iu.edu/hypermail/linux/kernel/0509.2/0683.html
A fix for s390 has been merged.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
encapsulates the rest of arch-dependent operations with thread_info access.
Two new helpers - setup_thread_stack() and end_of_stack(). For normal case
the former consists of copying thread_info of parent to new thread_info and
the latter returns pointer immediately past the end of thread_info.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
new helper - task_thread_info(task). On platforms that have thread_info
allocated separately (i.e. in default case) it simply returns
task->thread_info. m68k wants (and for good reasons) to embed its thread_info
into task_struct. So it will (in later patch) have task_thread_info() of its
own. For now we just add a macro for generic case and convert existing
instances of its body in core kernel to uses of new macro. Obviously safe -
all normal architectures get the same preprocessor output they used to get.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is wrong to acquire the semaphore and then return from
cpuset_zone_allowed without releasing it.
Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When ptrace_attach fails we need to drop the task_struct reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since few people need the support anymore, this moves the legacy
pm_xxx functions to CONFIG_PM_LEGACY, and include/linux/pm_legacy.h.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a task is being traced we never auto-reap it even if it might look
like its parent doesn't care. The tracer obviously _does_ care.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
recalc_task_prio() is called from activate_task() to calculate dynamic
priority and interactive credit for the activating task. For real-time
scheduling process, all that dynamic calculation is thrown away at the end
because rt priority is fixed. Patch to optimize recalc_task_prio() away
for rt processes.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <piggin@cyberone.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Before we did CLONE_THREAD, the way to check whether we were attaching
to ourselves was to just check "current == task", but with CLONE_THREAD
we should check that the thread group ID matches instead.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce
confusion, and make their semantics rigid. Improves efficiency of
resched_task and some cpu_idle routines.
* In resched_task:
- TIF_NEED_RESCHED is only cleared with the task's runqueue lock held,
and as we hold it during resched_task, then there is no need for an
atomic test and set there. The only other time this should be set is
when the task's quantum expires, in the timer interrupt - this is
protected against because the rq lock is irq-safe.
- If TIF_NEED_RESCHED is set, then we don't need to do anything. It
won't get unset until the task get's schedule()d off.
- If we are running on the same CPU as the task we resched, then set
TIF_NEED_RESCHED and no further action is required.
- If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set
after TIF_NEED_RESCHED has been set, then we need to send an IPI.
Using these rules, we are able to remove the test and set operation in
resched_task, and make clear the previously vague semantics of
POLLING_NRFLAG.
* In idle routines:
- Enter cpu_idle with preempt disabled. When the need_resched() condition
becomes true, explicitly call schedule(). This makes things a bit clearer
(IMO), but haven't updated all architectures yet.
- Many do a test and clear of TIF_NEED_RESCHED for some reason. According
to the resched_task rules, this isn't needed (and actually breaks the
assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock
held). So remove that. Generally one less locked memory op when switching
to the idle thread.
- Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner
most polling idle loops. The above resched_task semantics allow it to be
set until before the last time need_resched() is checked before going into
a halt requiring interrupt wakeup.
Many idle routines simply never enter such a halt, and so POLLING_NRFLAG
can be always left set, completely eliminating resched IPIs when rescheduling
the idle task.
POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The intermittent scheduling of the migration thread at ultra high priority
makes the smp nice handling see that runqueue as being heavily loaded. The
migration thread itself actually handles the balancing so its influence on
priority balancing should be ignored.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The priority biasing was off by mutliplying the total load by the total
priority bias and this ruins the ratio of loads between runqueues. This
patch should correct the ratios of loads between runqueues to be proportional
to overall load. -2nd attempt.
From: Dave Kleikamp <shaggy@austin.ibm.com>
This patch fixes a divide-by-zero error that I hit on a two-way i386
machine. rq->nr_running is tested to be non-zero, but may change by the
time it is used in the division. Saving the value to a local variable
ensures that the same value that is checked is used in the division.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To intensify the 'nice' support across physical cpus on SMP we can bias the
loads on idle rebalancing. To prevent idle rebalance from trying to pull tasks
from queues that appear heavily loaded we only bias the load if there is more
than one task running.
Add some minor micro-optimisations and have only one return from __source_load
and __target_load functions.
Fix the fact that target_load was not biased by priority when type == 0.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Real time tasks' effect on prio_bias should be based on their real time
priority level instead of their static_prio which is based on nice.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
prio_bias should only be adjusted in set_user_nice if p is actually currently
queued.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements 'nice' support across physical cpus on SMP.
It introduces an extra runqueue variable prio_bias which is the sum of the
(inverted) static priorities of all the tasks on the runqueue.
This is then used to bias busy rebalancing between runqueues to obtain good
distribution of tasks of different nice values. By biasing the balancing only
during busy rebalancing we can avoid having any significant loss of throughput
by not affecting the carefully tuned idle balancing already in place. If all
tasks are running at the same nice level this code should also have minimal
effect. The code is optimised out in the !CONFIG_SMP case.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes only the functions in swsusp.c call functions in snapshot.c
and not both ways. It also moves the check for available swap out of
swsusp_suspend() which is necessary for separating the swap-handling functions
in swsusp from the core code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch simplifies the relocation of the page backup list (aka pagedir)
during resume.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The changes made by this patch are necessary for the pagedir relocation
simplification in the next patch. Additionally, these changes allow us to
drop check_pagedir() and make get_safe_page() be a one-line wrapper around
alloc_image_page() (get_safe_page() goes to snapshot.c, because
alloc_image_page() is static and it does not make sense to export it).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If ACPI sleep is not configured, but someone still wants to run swsusp,
he'd get oops in enter_state. This is regression since 2.6.14 and this
fixes it.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On a large SMP box we get a lot of softlockup thread XX started lines.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When calling target drivers to set frequency, we take cpucontrol lock.
When we modified the code to accomodate CPU hotplug, there was an attempt
to take a double lock of cpucontrol leading to a deadlock. Since the
current thread context is already holding the cpucontrol lock, we dont need
to make another attempt to acquire it.
Now we leave a trace in current->flags indicating current thread already is
under cpucontrol lock held, so we dont attempt to do this another time.
Thanks to Andrew Morton for the beating:-)
From: Brice Goglin <Brice.Goglin@ens-lyon.org>
Build fix
(akpm: this patch is still unpleasant. Ashok continues to look for a cleaner
solution, doesn't he? ;))
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
You could open the /proc/sys/net/ipv4/conf/<if>/<whatever> file, then
wait for interface to go away, try to grab as much memory as possible in
hope to hit the (kfreed) ctl_table. Then fill it with pointers to your
function. Then do read from file you've opened and if you are lucky,
you'll get it called as ->proc_handler() in kernel mode.
So this is at least an Oops and possibly more. It does depend on an
interface going away though, so less of a security risk than it would
otherwise be.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The way we currently deal with quota and process accounting that might
keep vfsmount busy at umount time is inherently broken; we try to turn
them off just in case (not quite correctly, at that) and
a) pray umount doesn't fail (otherwise they'll stay turned off)
b) pray nobody doesn anything funny just as we turn quota off
Moreover, LSM provides hooks for doing the same sort of broken logics.
The proper way to deal with that is to introduce the second kind of
reference to vfsmount. Semantics:
- when the last normal reference is dropped, all special ones are
converted to normal ones and if there had been any, cleanup is done.
- normal reference can be cloned into a special one
- special reference can be converted to normal one; that's a no-op if
we'd already passed the point of no return (i.e. mntput() had
converted special references to normal and started cleanup).
The way it works: e.g. starting process accounting converts the vfsmount
reference pinned by the opened file into special one and turns it back
to normal when it gets shut down; acct_auto_close() is done when no
normal references are left. That way it does *not* obstruct umount(2)
and it silently gets turned off when the last normal reference to
vfsmount is gone. Which is exactly what we want...
The same should be done by LSM module that holds some internal
references to vfsmount and wants to shut them down on umount - it should
make them special and security_sb_umount_close() will be called exactly
when the last normal reference to vfsmount is gone.
quota handling is even simpler - we don't use normal file IO anymore, so
there's no need to hold vfsmounts at all. DQUOT_OFF() is done from
deactivate_super(), where it really belongs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sparc64 is unique among architectures in taking the page_table_lock in
its context switch (well, cris does too, but erroneously, and it's not
yet SMP anyway).
This seems to be a private affair between switch_mm and activate_mm,
using page_table_lock as a per-mm lock, without any relation to its uses
elsewhere. That's fine, but comment it as such; and unlock sooner in
switch_mm, more like in activate_mm (preemption is disabled here).
There is a block of "if (0)"ed code in smp_flush_tlb_pending which would
have liked to rely on the page_table_lock, in switch_mm and elsewhere;
but its comment explains how dup_mmap's flush_tlb_mm defeated it. And
though that could have been changed at any time over the past few years,
now the chance vanishes as we push the page_table_lock downwards, and
perhaps split it per page table page. Just delete that block of code.
Which leaves the mysterious spin_unlock_wait(&oldmm->page_table_lock)
in kernel/fork.c copy_mm. Textual analysis (supported by Nick Piggin)
suggests that the comment was written by DaveM, and that it relates to
the defeated approach in the sparc64 smp_flush_tlb_pending. Just delete
this block too.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I didn't find any possible modular usage in the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I didn't find any possible modular usage in the kernel.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I didn't find any possible modular usage of console_unblank in the
kernel.
This patch was already ACK'ed by Alan Cox.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert to proper kernel-doc format.
Some have extra blank lines (not allowed immed. after the function name)
or need blank lines (after all parameters). Function summary must be only
one line.
Colon (":") in a function description does weird things (causes kernel-doc
to think that it's a new section head sadly).
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Various core kernel-doc cleanups:
- add missing function parameters in ipc, irq/manage, kernel/sys,
kernel/sysctl, and mm/slab;
- move description to just above function for kernel_restart()
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reorganize the preempt_disable/enable calls to eliminate the extra preempt
depth. Changes based on Paul McKenney's review suggestions for the kprobes
RCU changeset.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the base kprobes infrastructure to use RCU for synchronization
during kprobe registration and unregistration. These changes coupled with the
arch kprobe changes (next in series):
a. serialize registration and unregistration of kprobes.
b. enable lockless execution of handlers. Handlers can now run in parallel.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the base kprobe infrastructure to track kprobe execution on a
per-cpu basis.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sys_ptrace boilerplate code (everything outside the big switch
statement for the arch-specific requests) is shared by most architectures.
This patch moves it to kernel/ptrace.c and leaves the arch-specific code as
arch_ptrace.
Some architectures have a too different ptrace so we have to exclude them.
They continue to keep their implementations. For sh64 I had to add a
sh64_ptrace wrapper because it does some initialization on the first call.
For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but
SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-By: David Howells <dhowells@redhat.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code for FUTEX_WAKE_OP calls an arch callback,
futex_atomic_op_inuser(). That callback can return an error code, but
currently the caller assumes any error is EFAULT, and will try various
things to resolve the fault before eventually giving up with EFAULT
(regardless of the original error code). This is not a theoretical case -
arch callbacks currently return -ENOSYS if the opcode they are given is
bogus.
This patch alters the code to detect non-EFAULT errors and return them
directly to the user.
Of course, whether -ENOSYS is the correct return value for the bogus opcode
case, or whether EINVAL would be more appropriate is another question.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jamie Lokier <jamie@shareable.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AIO was adding a new context's max requests to the global total before
testing if that resulting total was over the global limit. This let
innocent tasks get their new limit tested along with a racing guilty task
that was crossing the limit. This serializes the _nr accounting with a
spinlock It also switches to using unsigned long for the global totals.
Individual contexts are still limited to an unsigned int's worth of
requests by the syscall interface.
The problem and fix were verified with a simple program that spun creating
and destroying a context while holding on to another long lived context.
Before the patch a task creating a tiny context could get a spurious EAGAIN
if it raced with a task creating a very large context that overran the
limit.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a connector that reports fork, exec, id change, and exit
events for all processes to userspace. It replaces the fork_advisor patch
that ELSA is currently using. Applications that may find these events
useful include accounting/auditing (e.g. ELSA), system activity monitoring
(e.g. top), security, and resource management (e.g. CKRM).
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove unused variable, and make code less evil that way. Fix whitespace
around for-loop-like macro.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This cleans spaces between * and pointer up, and adds "int" in "unsigned
int".
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace smp_processor_id() with any_online_cpu(cpu_online_map) in order to
avoid lots of "BUG: using smp_processor_id() in preemptible [00000001]
code:..." messages in case taking a cpu online fails.
All the traces start at the last notifier_call_chain(...) in kernel/cpu.c.
Since we hold the cpu_control semaphore it shouldn't be any problem to access
cpu_online_map.
The reason why cpu_up failed is simply that the cpu that was supposed to be
taken online wasn't even there. That is because on s390 we never know when a
new cpu comes and therefore cpu_possible_map consists of only ones and doesn't
reflect reality.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
!unlikely(expr) hurts my brain. likely(!expr) is more straightforward.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>