The namespace's proc_mnt must be kern_mount-ed to make this pointer always
valid, independently of whether the user space mounted the proc or not. This
solves raced in proc_flush_task, etc. with the proc_mnt switching from NULL
to not-NULL.
The initialization is done after the init's pid is created and hashed to make
proc_get_sb() finr it and get for root inode.
Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
superblock holds the namespace we must explicitly break this circle to destroy
all the stuff. This is done after the init of the namespace dies. Running a
few steps forward - when init exits it will kill all its children, so no
proc_mnt will be needed after its death.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
create a new struct pid for the new process. Allocate pid_t's for the new
process in the new pid namespace and all ancestor pid namespaces. Make the
newly cloned process the session and process group leader.
Since the active pid namespace is special and expected to be the first entry
in pid->upid_list, preserve the order of pid namespaces.
The size of 'struct pid' is dependent on the the number of pid namespaces the
process exists in, so we use multiple pid-caches'. Only one pid cache is
created during system startup and this used by processes that exist only in
init_pid_ns.
When a process clones its pid namespace, we create additional pid caches as
necessary and use the pid cache to allocate 'struct pids' for that depth.
Note, that with this patch the newly created namespace won't work, since the
rest of the kernel still uses global pids, but this is to be fixed soon. Init
pid namespace still works.
[oleg@tv-sign.ru: merge fix]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we create new namespace we will need to allocate the struct pid, that
will have one extra struct upid in array, comparing to the parent.
Thus we need to know the new namespace (if any) in alloc_pid() to init this
struct upid properly, so move the alloc_pid() call lower in copy_process().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When searching the task by numerical id on may need to find it using global
pid (as it is done now in kernel) or by its virtual id, e.g. when sending a
signal to a task from one namespace the sender will specify the task's virtual
id and we should find the task by this value.
[akpm@linux-foundation.org: fix gfs2 linkage]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When showing pid to user or getting the pid numerical id for in-kernel use the
value of this id may differ depending on the namespace.
This set of helpers is used to get the global pid nr, the virtual (i.e. seen
by task in its namespace) nr and the nr as it is seen from the specified
namespace.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each struct upid element of struct pid has to be initialized properly, i.e.
its nr mst be allocated from appropriate pidmap and ns set to appropriate
namespace.
When allocating a new pid, we need to know the namespace this pid will live
in, so the additional argument is added to alloc_pid().
On the other hand, the rest of the kernel still uses the pid->nr and
pid->pid_chain fields, so these ones are still initialized, but this will be
removed soon.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each namespace has a parent and is characterized by its "level". Level is the
number of the namespace generation. E.g. init namespace has level 0, after
cloning new one it will have level 1, the next one - 2 and so on and so forth.
This level is not explicitly limited.
True hierarchy must have some way to find each namespace's children, but it is
not used in the patches, so this ability is not added (yet).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.
The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.
When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.
This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make task release its namespaces after it has reparented all his children to
child_reaper, but before it notifies its parent about its death.
The reason to release namespaces after reparenting is that when task exits it
may send a signal to its parent (SIGCHLD), but if the parent has already
exited its namespaces there will be no way to decide what pid to dever to him
- parent can be from different namespace.
The reason to release namespace before notifying the parent it that when task
sends a SIGCHLD to parent it can call wait() on this taks and release it. But
releasing the mnt namespace implies dropping of all the mounts in the mnt
namespace and NFS expects the task to have valid sighand pointer.
Thanks to Oleg for pointing out some races that can apear and helping with
patches and fixes.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A pid namespace is a "view" of a particular set of tasks on the system. They
work in a similar way to filesystem namespaces. A file (or a process) can be
accessed in multiple namespaces, but it may have a different name in each. In
a filesystem, this name might be /etc/passwd in one namespace, but
/chroot/etc/passwd in another.
For processes, a process may have pid 1234 in one namespace, but be pid 1 in
another. This allows new pid namespaces to have basically arbitrary pids, and
not have to worry about what pids exist in other namespaces. This is
essential for checkpoint/restart where a restarted process's pid might collide
with an existing process on the system's pid.
In this particular implementation, pid namespaces have a parent-child
relationship, just like processes. A process in a pid namespace may see all
of the processes in the same namespace, as well as all of the processes in all
of the namespaces which are children of its namespace. Processes may not,
however, see others which are in their parent's namespace, but not in their
own. The same goes for sibling namespaces.
The know issue to be solved in the nearest future is signal handling in the
namespace boundary. That is, currently the namespace's init is treated like
an ordinary task that can be killed from within an namespace. Ideally, the
signal handling by the namespace's init should have two sides: when signaling
the init from its namespace, the init should look like a real init task, i.e.
receive only those signals, that is explicitly wants to; when signaling the
init from one of the parent namespaces, init should look like an ordinary
task, i.e. receive any signal, only taking the general permissions into
account.
The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and
we eventually came to almost the same implementation, which differed in some
details. This set is based on Pavel's patches, but it includes comments and
patches that from Sukadev.
Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made
valuable advises on how to make this set cleaner.
This patch:
We have to call exit_task_namespaces() only after the exiting task has
reparented all his children and is sure that no other threads will reparent
theirs for it. Why this is needed is explained in appropriate patch. This
one only reworks the forget_original_parent() so that after calling this a
task cannot be/become parent of any other task.
We check PF_EXITING instead of ->exit_state while choosing the new parent.
Note that tasklits_lock acts as a barrier, everyone who takes tasklist after
us (when forget_original_parent() drops it) must see PF_EXITING.
The other changes are just cleanups. They just move some code from
exit_notify to forget_original_parent(). It is a bit silly to declare
ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
forget_original_parent(), unlock-lock-unlock tasklist, and then use
ptrace_dead.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/time/clocksource.c: Convert list_for_each to
list_for_each_entry in clocksource_resume(),
sysfs_override_clocksource() and show_available_clocksources()
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the following scenario:
code path 1:
my_function() -> lock(L1); ...; flush_workqueue(); ...
code path 2:
run_workqueue() -> my_work() -> ...; lock(L1); ...
you can get a deadlock when my_work() is queued or running
but my_function() has acquired L1 already.
This patch adds a pseudo-lock to each workqueue to make lockdep
warn about this scenario.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists. This is
slow on read-only paths and may be impossible in some cases.
E.g. Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.
On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.
The access to other task namespaces is proposed to be performed
like this:
rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
/ *
* work with the namespaces here
* e.g. get the reference on one of them
* /
} / *
* NULL task_nsproxy() means that this task is
* almost dead (zombie)
* /
rcu_read_unlock();
This patch has passed the review by Eric and Oleg :) and,
of course, tested.
[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move alloc_pid() into copy_process(). This will keep all pid and pid
namespace code together and simplify error handling when we support multiple
pid namespaces.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_init() is an ambiguous name for the pid==1 check. Split it into
is_global_init() and is_container_init().
A cgroup init has it's tsk->pid == 1.
A global init also has it's tsk->pid == 1 and it's active pid namespace
is the init_pid_ns. But rather than check the active pid namespace,
compare the task structure with 'init_pid_ns.child_reaper', which is
initialized during boot to the /sbin/init process and never changes.
Changelog:
2.6.22-rc4-mm2-pidns1:
- Use 'init_pid_ns.child_reaper' to determine if a given task is the
global init (/sbin/init) process. This would improve performance
and remove dependence on the task_pid().
2.6.21-mm2-pidns2:
- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
ppc,avr32}/traps.c for the _exception() call to is_global_init().
This way, we kill only the cgroup if the cgroup's init has a
bug rather than force a kernel panic.
[akpm@linux-foundation.org: fix comment]
[sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
[bunk@stusta.de: kernel/pid.c: remove unused exports]
[sukadev@us.ibm.com: Fix capability.c to work with threaded init]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename the child_reaper() function to task_child_reaper() to be similar to
other task_* functions and to distinguish the function from 'struct
pid_namspace.child_reaper'.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With multiple pid namespaces, a process is known by some pid_t in every
ancestor pid namespace. Every time the process forks, the child process also
gets a pid_t in every ancestor pid namespace.
While a process is visible in >=1 pid namespaces, it can see pid_t's in only
one pid namespace. We call this pid namespace it's "active pid namespace",
and it is always the youngest pid namespace in which the process is known.
This patch defines and uses a wrapper to find the active pid namespace of a
process. The implementation of the wrapper will be changed in when support
for multiple pid namespaces are added.
Changelog:
2.6.22-rc4-mm2-pidns1:
- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
task_active_pid_ns() in child_reaper() since task->nsproxy
can be NULL during task exit (so child_reaper() continues to
use init_pid_ns).
to implement child_reaper() since init_pid_ns.child_reaper to
implement child_reaper() since tsk->nsproxy can be NULL during exit.
2.6.21-rc6-mm1:
- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
process can have multiple pid namespaces.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Pavel Emelianov <xemul@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzel <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add kmem_cache to pid_namespace to allocate pids from.
Since both implementations expand the struct pid to carry more numerical
values each namespace should have separate cache to store pids of different
sizes.
Each kmem cache is name "pid_<NR>", where <NR> is the number of numerical ids
on the pid. Different namespaces with same level of nesting will have same
caches.
This patch has two FIXMEs that are to be fixed after we reach the consensus
about the struct pid itself.
The first one is that the namespace to free the pid from in free_pid() must be
taken from pid. Now the init_pid_ns is used.
The second FIXME is about the cache allocation. When we do know how long the
object will be then we'll have to calculate this size in create_pid_cachep.
Right now the sizeof(struct pid) value is used.
[akpm@linux-foundation.org: coding-style repair]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The set of functions process_session, task_session, process_group and
task_pgrp is confusing, as the names can be mixed with each other when looking
at the code for a long time.
The proposals are to
* equip the functions that return the integer with _nr suffix to
represent that fact,
* and to make all functions work with task (not process) by making
the common prefix of the same name.
For monotony the routines signal_session() and set_signal_session() are
replaced with task_session_nr() and set_task_session(), especially since they
are only used with the explicit task->signal dereference.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.
This version names cgroups which are automatically created using
cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
cloned process. (Thanks Pavel for the idea) This is safe because if the
process unshares again, it will create
/cgroups/(...)/node_<pid>/node_<pid>
The only possibilities (AFAICT) for a -EEXIST on unshare are
1. pid wraparound
2. a process fails an unshare, then tries again.
Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the
node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.
Changelog:
Name cloned cgroups as "node_<pid>".
[clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)
This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface. A new set of cgroup operations are
registered with commands and attributes. It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.
The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add. Currently
user space requests for statistics by passing the cgroup file
descriptor. Statistics about the state of all the tasks in the cgroup
is returned to user space.
TODO's/NOTE:
This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.
Sample output
# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0
# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
If the approach looks good, I'll enhance and post the user space utility for
the same
Feedback, comments, test results are always welcome!
[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example subsystem exports debugging information as an aid to diagnosing
refcount leaks, etc, in the cgroup framework.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.
Portions contributed by Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the filesystem support logic from the cpusets system and makes cpusets
a cgroup subsystem
The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
passed through to the cgroup filesystem with the appropriate options to
emulate the old cpuset filesystem behaviour.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the following files to the cgroup filesystem:
notify_on_release - configures/reports whether the cgroup subsystem should
attempt to run a release script when this cgroup becomes unused
release_agent - configures/reports the release agent to be used for this
hierarchy (top level in each hierarchy only)
releasable - reports whether this cgroup would have been auto-released if
notify_on_release was true and a release agent was configured (mainly useful
for debugging)
To avoid locking issues, invoking the userspace release agent is done via a
workqueue task; cgroups that need to have their release agents invoked by
the workqueue task are linked on to a list.
[pj@sgi.com: Need to include kmod.h]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the struct css_set embedded in task_struct with a pointer; all tasks
that have the same set of memberships across all hierarchies will share a
css_set object, and will be linked via their css_sets field to the "tasks"
list_head in the css_set.
Assuming that many tasks share the same cgroup assignments, this reduces
overall space usage and keeps the size of the task_struct down (three pointers
added to task_struct compared to a non-cgroups kernel, no matter how many
subsystems are registered).
[akpm@linux-foundation.org: fix a printk]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for cgroup_clone(), a way to create new cgroups intended to
be used for systems such as namespace unsharing. A new subsystem callback,
post_clone(), is added to allow subsystems to automatically configure cloned
cgroups.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the necessary hooks to the fork() and exit() paths to ensure
that new children inherit their parent's cgroup assignments, and that
exiting processes release reference counts on their cgroups.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add write_uint() helper method for cgroup subsystems
This helper is analagous to the read_uint() helper method for
reporting u64 values to userspace. It's designed to reduce the amount
of boilerplate requierd for creating new cgroup subsystems.
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the per-directory "tasks" file for cgroupfs mounts; this allows the
user to determine which tasks are members of a cgroup by reading a
cgroup's "tasks", and to move a task into a cgroup by writing its pid to
its "tasks".
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.
The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.
This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.
Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.
With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign. This cpuset code would occassionally trigger that warning.
The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is separate notifier header, but no separate notifier .c file.
Extract notifier code out of kernel/sys.c which will remain for
misc syscalls I hope. Merge kernel/die_notifier.c into kernel/notifier.c.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
hrtimer: hook compat_sys_nanosleep up to high res timer code
hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier
Get rid of sparse related warnings from places that use integer as NULL
pointer.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds items to the taststats struct to account for user and system
time based on scaling the CPU frequency and instruction issue rates.
Adds account_(user|system)_time_scaled callbacks which architectures
can use to account for time using this mechanism.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just removing white space at the end of lines.
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Large chunks of 5 spaces instead of tabs.
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lots of converting spaces to tabs.
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
can change the capabilities of another process, p2. This is not the
meaning that was intended for this capability at all, and this
implementation came about purely because, without filesystem capabilities,
there was no way to use capabilities without one process bestowing them on
another.
Since we now have a filesystem support for capabilities we can fix the
implementation of CAP_SETPCAP.
The most significant thing about this change is that, with it in effect, no
process can set the capabilities of another process.
The capabilities of a program are set via the capability convolution
rules:
pI(post-exec) = pI(pre-exec)
pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
pE(post-exec) = fE ? pP(post-exec) : 0
at exec() time. As such, the only influence the pre-exec() program can
have on the post-exec() program's capabilities are through the pI
capability set.
The correct implementation for CAP_SETPCAP (and that enabled by this patch)
is that it can be used to add extra pI capabilities to the current process
- to be picked up by subsequent exec()s when the above convolution rules
are applied.
Here is how it works:
Let's say we have a process, p. It has capability sets, pE, pP and pI.
Generally, p, can change the value of its own pI to pI' where
(pI' & ~pI) & ~pP = 0.
That is, the only new things in pI' that were not present in pI need to
be present in pP.
The role of CAP_SETPCAP is basically to permit changes to pI beyond
the above:
if (pE & CAP_SETPCAP) {
pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */
}
This capability is useful for things like login, which (say, via
pam_cap) might want to raise certain inheritable capabilities for use
by the children of the logged-in user's shell, but those capabilities
are not useful to or needed by the login program itself.
One such use might be to limit who can run ping. You set the
capabilities of the 'ping' program to be "= cap_net_raw+i", and then
only shells that have (pI & CAP_NET_RAW) will be able to run
it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
would have to also have (pP & CAP_NET_RAW) in order to raise this
capability and pass it on through the inheritable set.
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After adding checking to register_sysctl_table and finding a whole new set
of bugs. Missed by countless code reviews and testers I have finally lost
patience with the binary sysctl interface.
The binary sysctl interface has been sort of deprecated for years and
finding a user space program that uses the syscall is more difficult then
finding a needle in a haystack. Problems continue to crop up, with the in
kernel implementation. So since supporting something that no one uses is
silly, deprecate sys_sysctl with a sufficient grace period and notice that
the handful of user space applications that care can be fixed or replaced.
The /proc/sys sysctl interface that people use will continue to be
supported indefinitely.
This patch moves the tested warning about sysctls from the path where
sys_sysctl to a separate path called from both implementations of
sys_sysctl, and it adds a proper entry into
Documentation/feature-removal-schedule.
Allowing us to revisit this in a couple years time and actually kill
sys_sysctl.
[lethal@linux-sh.org: sysctl: Fix syscall disabled build]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that the net/irda code didn't register any of it's binary paths
in the global sysctl.h header file so I missed them completely when making an
authoritative list of binary sysctl paths in the kernel. So add them to the
list of valid binary sysctl paths.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Well it turns out after I dug into the problems a little more I was returning
a few false positives so this patch updates my logic to remove them.
- Don't complain about 0 ctl_names in sysctl_check_binary_path
It is valid for someone to remove the sysctl binary interface
and still keep the same sysctl proc interface.
- Count ctl_names and procnames as matching if they both don't
exist.
- Only warn about missing min&max when the generic functions care.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After going through the kernels sysctl tables several times it has become
clear that code review and testing is just not effective in prevent
problematic sysctl tables from being used in the stable kernel. I certainly
can't seem to fix the problems as fast as they are introduced.
Therefore this patch adds sysctl_check_table which is called when a sysctl
table is registered and checks to see if we have a problematic sysctl table.
The biggest part of the code is the table of valid binary sysctl entries, but
since we have frozen our set of binary sysctls this table should not need to
change, and it makes it much easier to detect when someone unintentionally
adds a new binary sysctl value.
As best as I can determine all of the several hundred errors spewed on boot up
now are legitimate.
[bunk@kernel.org: kernel/sysctl_check.c must #include <linux/string.h>]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It looks like we inadvertently killed the cad_pid binary sysctl support when
cap_pid was changed to be a struct pid. Since no one has complained just
remove the binary path.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of having a bunch of ifdefs in sysctl.c move all of the pty sysctl
logic into drivers/char/pty.c
As well as cleaning up the logic this prevents sysctl_check_table from
complaining that the root table has a NULL data pointer on something with
generic methods.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio-nr, aio-max-nr, acpi_video_flags are unsigned long values which sysctl
does not handle properly with a 64bit kernel and a 32bit user space.
Since no one is likely to be using the binary sysctl values and the ascii
interface still works, this patch just removes support for the binary sysctl
interface from the kernel.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions are all wrapper functions for the proc interface that are
needed for them to work correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There as been no easy way to wrap the default sysctl strategy routine except
for returning 0. Which is not always what we want. The few instances I have
seen that want different behaviour have written their own version of
sysctl_data. While not too hard it is unnecessary code and has the potential
for extra bugs.
So to make these situations easier and make that part of sysctl more symetric
I have factord sysctl_data out of do_sysctl_strategy and exported as a
function everyone can use.
Further having sysctl_data be an explicit function makes checking for badly
formed sysctl tables much easier.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't
needed and is a bit of a nuisance because it makes it harder to recognize
ctl_table is a type name.
So this patch removes it from the generic sysctl code. Hopefully I will have
enough energy to send the rest of my patches will follow and to remove it from
the rest of the kernel.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The functions in a CPU notifier chain is called with CPU_UP_PREPARE event
before making the CPU online. If one of the callback returns NOTIFY_BAD, it
stops to deliver CPU_UP_PREPARE event, and CPU online operation is canceled.
Then CPU_UP_CANCELED event is delivered to the functions in a CPU notifier
chain again.
This CPU_UP_CANCELED event is delivered to the functions which have been
called with CPU_UP_PREPARE, not delivered to the functions which haven't been
called with CPU_UP_PREPARE.
The problem that makes existing cpu hotplug error handlings complex is that
the CPU_UP_CANCELED event is delivered to the function that has returned
NOTIFY_BAD, too.
Usually we don't expect to call destructor function against the object that
has failed to initialize. It is like:
err = register_something();
if (err) {
unregister_something();
return err;
}
So it is natural to deliver CPU_UP_CANCELED event only to the functions that
have returned NOTIFY_OK with CPU_UP_PREPARE event and not to call the function
that have returned NOTIFY_BAD. This is what this patch is doing.
Otherwise, every cpu hotplug notifiler has to track whether notifiler event is
failed or not for each cpu. (drivers/base/topology.c is doing this with
topology_dev_map)
Similary this patch makes same thing with CPU_DOWN_PREPARE and CPU_DOWN_FAILED
evnets.
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If memchr argument is longer than strlen(kp->name), there will be some
weird result.
It will casuse duplicate filenames in sysfs for the "nousb". kernel
warning messages are as bellow:
sysfs: duplicate filename 'usbcore' can not be created
WARNING: at fs/sysfs/dir.c:416 sysfs_add_one()
[<c01c4750>] sysfs_add_one+0xa0/0xe0
[<c01c4ab8>] create_dir+0x48/0xb0
[<c01c4b69>] sysfs_create_dir+0x29/0x50
[<c024e0fb>] create_dir+0x1b/0x50
[<c024e3b6>] kobject_add+0x46/0x150
[<c024e2da>] kobject_init+0x3a/0x80
[<c053b880>] kernel_param_sysfs_setup+0x50/0xb0
[<c053b9ce>] param_sysfs_builtin+0xee/0x130
[<c053ba33>] param_sysfs_init+0x23/0x60
[<c024d062>] __next_cpu+0x12/0x20
[<c052aa30>] kernel_init+0x0/0xb0
[<c052aa30>] kernel_init+0x0/0xb0
[<c052a856>] do_initcalls+0x46/0x1e0
[<c01bdb12>] create_proc_entry+0x52/0x90
[<c0158d4c>] register_irq_proc+0x9c/0xc0
[<c01bda94>] proc_mkdir_mode+0x34/0x50
[<c052aa30>] kernel_init+0x0/0xb0
[<c052aa92>] kernel_init+0x62/0xb0
[<c0104f83>] kernel_thread_helper+0x7/0x14
=======================
kobject_add failed for usbcore with -EEXIST, don't try to register things with the same name in the same directory.
[<c024e466>] kobject_add+0xf6/0x150
[<c053b880>] kernel_param_sysfs_setup+0x50/0xb0
[<c053b9ce>] param_sysfs_builtin+0xee/0x130
[<c053ba33>] param_sysfs_init+0x23/0x60
[<c024d062>] __next_cpu+0x12/0x20
[<c052aa30>] kernel_init+0x0/0xb0
[<c052aa30>] kernel_init+0x0/0xb0
[<c052a856>] do_initcalls+0x46/0x1e0
[<c01bdb12>] create_proc_entry+0x52/0x90
[<c0158d4c>] register_irq_proc+0x9c/0xc0
[<c01bda94>] proc_mkdir_mode+0x34/0x50
[<c052aa30>] kernel_init+0x0/0xb0
[<c052aa92>] kernel_init+0x62/0xb0
[<c0104f83>] kernel_thread_helper+0x7/0x14
=======================
Module 'usbcore' failed to be added to sysfs, error number -17
The system will be unstable now.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On platforms that copy sys_tz into the vdso (currently only x86_64, soon to
include powerpc), it is possible for the vdso to get out of sync if a user
calls (admittedly unusual) settimeofday(NULL, ptr).
This patch adds a hook for architectures that set
CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also
updatee their copy in the vdso.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
during 2.6.9 development. Commit introduced io_wait field which remained
write-only than and still remains write-only.
Also garbage collect macros which "use" io_wait.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make hibernation_platform_enter() execute the enter-a-sleep-state sequence
instead of the mixed shutdown-with-entering-S4 thing.
Replace the shutting down of devices done by kernel_shutdown_prepare(), before
entering the ACPI S4 sleep state, with suspending them and the shutting down
of sysdevs with calling device_power_down(PMSG_SUSPEND) (just like before
entering S1 or S3, but the target state is now S4). Also, disable the
nonboot CPUs before entering the sleep state (S4), which generally always is a
good idea.
This is known to fix the "double disk spin down during hibernation" on some
machines, eg. HPC nx6325 (ref. http://lkml.org/lkml/2007/8/7/316 and the
following thread). Moreover, it has been reported to make
/sys/class/rtc/rtc0/wakealarm work correctly with hibernation for some users.
It also generally causes the hibernation state (ACPI S4) to be entered faster.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following scenario leads to total confusion of the platform firmware on
some boxes (eg. HPC nx6325):
* Hibernate with ACPI enabled
* Resume passing "acpi=off" to the boot kernel
To prevent this from happening it's necessary to check if ACPI is enabled (and
enable it if that's not the case) _right_ _after_ control has been transfered
from the boot kernel to the image kernel, before device_power_up() is called
(ie. with interrupts disabled). Enabling ACPI after calling
device_power_up() turns out to be insufficient.
For this reason, introduce new hibernation callback ->leave() that will be
executed before device_power_up() by the restored image kernel. To make it
work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c
(it's name is changed to "create_image", which is more up to the point).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the bits needed for supporting arbitrary boot kernels to the common
hibernation code.
To support arbitrary boot kernels, make it possible to replace the 'struct
new_utsname' and the kernel version in the hibernation image header by some
architecture specific data that will be used to verify if the image is valid
and to restore the image.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop
the serial console from being suspended when the rest of the machine goes
to sleep. This is incredibly useful for debugging power management-related
things; however, having it as a compile-time option has proved to be
incredibly inconvenient for us (OLPC). There are plenty of times that we
want serial console to not suspend, but for the most part we'd like serial
console to be suspended.
This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel
boot parameter (no_console_suspend). By default, the serial console will
be suspended along with the rest of the system; by passing
'no_console_suspend' to the kernel during boot, serial console will remain
alive during suspend.
For now, this is pretty serial console specific; further fixes could be
applied to make this work for things like netconsole.
Signed-off-by: Andres Salomon <dilinger@debian.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Measure the time of the freezing of tasks, even if it doesn't fail.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Increase the freezer's verbosity a bit, so that it's easier to read problem
reports related to it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the unused EXPORT_SYMBOL(pm_power_off_prepare).
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The freezer should not send signals to kernel threads, since that may lead to
subtle problems. In particular, commit
b74d0deb96 has changed recalc_sigpending_tsk()
so that it doesn't clear TIF_SIGPENDING. For this reason, if the freezer
continues to send fake signals to kernel threads and the freezing of kernel
threads fails, some of them may be running with TIF_SIGPENDING set forever.
Accordingly, recalc_sigpending_tsk() shouldn't set the task's TIF_SIGPENDING
flag if TIF_FREEZE is set.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tasks should go to the refrigerator only if explicitly requested to do that by
the freezer and not as a result of inheriting the TIF_FREEZE flag set from the
parent. Make it happen.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The syncing of filesystems from within the freezer is generally not needed.
Also, if there's an ext3 filesystem loopback-mounted from a FUSE one, the
syncing results in writes to it and deadlocks. Similarly, it will deadlock if
FUSE implements sync.
Change freeze_processes() so that it doesn't execute sys_sync() and make the
suspend and hibernation code path sync filesystems independently of the
freezer.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in
analogy with 'struct platform_suspend_ops'.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During hibernation we also need to tell the ACPI core that we're going to put
the system into the S4 sleep state. For this reason, an additional method in
'struct hibernation_ops' is needed, playing the role of set_target() in
'struct platform_suspend_operations'. Moreover, the role of the .prepare()
method is now different, so it's better to introduce another method, that in
general may be different from .prepare(), that will be used to prepare the
platform for creating the hibernation image (.prepare() is used anyway to
notify the platform that we're going to enter the low power state after the
image has been saved).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable suspend_ops representing the set of global platform-specific
suspend-related operations, used by the PM core, need not be exported outside
of kernel/power/main.c . Make it static.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no reason why the .prepare() and .finish() methods in 'struct
platform_suspend_ops' should take any arguments, since architectures don't use
these methods' argument in any practically meaningful way (ie. either the
target system sleep state is conveyed to the platform by .set_target(), or
there is only one suspend state supported and it is indicated to the PM core
by .valid(), or .prepare() and .finish() aren't defined at all). There also
is no reason why .finish() should return any result.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name of 'struct pm_ops' suggests that it is related to the power
management in general, but in fact it is only related to suspend. Moreover,
its name should indicate what this structure is used for, so it seems
reasonable to change it to 'struct platform_suspend_ops'. In that case, the
name of the global variable of this type used by the PM core and the names of
related functions should be changed accordingly.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
suspend_enter() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we have high res timers on ppc64 I thought Id test them. It turns
out compat_sys_nanosleep hasnt been converted to the hrtimer code and so
is limited to HZ resolution.
The follow patch converts compat_sys_nanosleep to use high res timers.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull the copy_to_user out of hrtimer_nanosleep and into the callers
(common_nsleep, sys_nanosleep) in preparation for converting
compat_sys_nanosleep to use hrtimers.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
schedstat is useful in investigating CPU scheduler behavior. Ideally,
I think it is beneficial to have it on all the time. However, the
cost of turning it on in production system is quite high, largely due
to number of events it collects and also due to its large memory
footprint.
Most of the fields probably don't need to be full 64-bit on 64-bit
arch. Rolling over 4 billion events will most like take a long time
and user space tool can be made to accommodate that. I'm proposing
kernel to cut back most of variable width on 64-bit system. (note,
the following patch doesn't affect 32-bit system).
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
printk: add the KERN_CONT annotation (which is empty string but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The recent wait_for_completion() cleanups:
commit 8cbbe86dfc
Author: Andi Kleen <ak@suse.de>
Date: Mon Oct 15 17:00:14 2007 +0200
sched: cleanup: refactor common code of sleep_on / wait_for_completion
Refactor common code of sleep_on / wait_for_completion
broke the return value of wait_for_completion_interruptible().
Previously it returned 0 on success, now -1. Fix that.
Problem found by Geert Uytterhoeven.
[ mingo: fixed whitespace damage ]
Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Doh, I completely missed that devices marked DUMMY are not running
the set_mode function. So we force broadcasting, but we keep the
local APIC timer running.
Let the clock event layer mark the device _after_ switching it off.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exports
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For those who don't care about CONFIG_SECURITY.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of
task_rq_lock(rq->idle), rq->idle can't change its task_rq().
This makes the code a bit more symmetrical with migrate_dead_tasks()'s path
which uses spin_lock_irq/spin_unlock_irq.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>