Adds a function for returning the value of a resource counter member, in a
form suitable for use in a cgroup read_u64 control file method.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.
[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the externs its global
functions (in this case for ns_cgroup_clone()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a code warning: symbol 'p' shadows an earlier one
This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.c
Independently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revision
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the needlessly global cgroup_enable_task_cg_lists() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make eCryptfs key module subsystem respect namespaces.
Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build. With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.
[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file. For some
serious complaining of kernel, we need repeat the warnings, so here I isolate
the ratelimit part of printk.c to a standalone file.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following an experimental deletion of the unnecessary directive
#include <linux/slab.h>
from the header file <linux/percpu.h>, these files under kernel/ were exposed
as needing to include one of <linux/slab.h> or <linux/gfp.h>, so explicit
includes were added where necessary.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From the POV of synchronization, there should be no need to call
wake_up_process() with the 'kthread_create_lock' being held.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0xc60): Section mismatch in reference from the function kvm_init() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x33869a): Section mismatch in reference from the function xfs_icsb_init_counters() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x5556a1): Section mismatch in reference from the function acpi_processor_install_hotplug_notify() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0xfe6b28): Section mismatch in reference from the function cpufreq_register_driver() to the function .cpuinit.text:register_cpu_notifier()
register_cpu_notifier() are only really defined when HOTPLUG_CPU is enabled.
So references to the function are OK.
Annotate it with __ref so we do not get warnings from callers and do not get
warnings for the functions/data used by register_cpu_notifier().
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0x75c8d): Section mismatch in reference from the function take_cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d2a): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d4d): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75de4): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75e33): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
cpu_down is only used from code surrounded by HOTPLUG_CPU so any references to
__cpuinit is OK.
Add a few __ref to tech modpost to ignore the references.
This is just papering over the fact that the cpu hotplug code is fragile with
respect to use of HOTPLUG_CPU and in many cases rely on __cpuinit to get rid
of code when HOTPLUG_CPU is not enabled. For now this is the least invasive
change.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warning:
WARNING: vmlinux.o(.text+0x75f4e): Section mismatch in reference from the function unregister_cpu_notifier() to the variable .cpuinit.data:cpu_chain
We know that unregister_cpu_notifier is using HOTPLUG_CPU
stuff - so ignore these references.
Annotating unregister_cpu_notifier had been another option
but this caused far more warnings since not all callers were
annotated __cpuinit.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the RUSAGE_THREAD option for the getrusage system call. This is
essentially Roland's patch from http://lkml.org/lkml/2008/1/18/589, but the
line about RUSAGE_LWP line has been removed, as suggested by Ulrich and
Christoph.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel is sent to tainted within the warn_on_slowpath() function, and
whenever a warning occurs the new taint flag 'W' is set. This is useful to
know if a warning occurred before a BUG by preserving the warning as a flag
in the taint state.
This does not work on architectures where WARN_ON has its own definition.
These archs are:
1. s390
2. superh
3. avr32
4. parisc
The maintainers of these architectures have been added in the Cc: list
in this email to alert them to the situation.
The documentation in oops-tracing.txt has been updated to include the
new flag.
Signed-off-by: Nur Hussein <nurhussein@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer
kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Reviewed-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My recent additions to compat_ptrace_request made it mandatory
for CONFIG_COMPAT arch's to define copy_siginfo_from_user32.
This broke some builds, though they all really should get cleaned
up in that way.
Since all the arch's that actually call compat_ptrace_request have
now been cleaned up to use the generic compat_sys_ptrace, we can
avoid the build problems on the crufty arch's by changing the
conditionals on the definition.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The scheduler hrtimer bits in 2.6.25 introduced a circular lock
dependency in a rare code path:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.25-sched-devel.git-x86-latest.git #19
-------------------------------------------------------
X/2980 is trying to acquire lock:
(&rq->rq_lock_key#2){++..}, at: [<ffffffff80230146>] task_rq_lock+0x56/0xa0
but task is already holding lock:
(&cpu_base->lock){++..}, at: [<ffffffff80257ae1>] lock_hrtimer_base+0x31/0x60
which lock already depends on the new lock.
The scenario which leads to this is:
posix-timer signal is delivered
-> posix-timer is rearmed
timer is already expired in hrtimer_enqueue()
-> softirq is raised
To prevent this we need to move the raise of the softirq out of the
base->lock protected code path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Prior to suspend, we allocate and switch to a new VT; after suspend, we switch
back to the original VT. This can be slow, and is completely unnecessary if
the framebuffer we're using can restore video properly.
This adds a hook that allows drivers to select whether or not to do this vt
switch, and changes the gxfb driver to call this hook. It also adds a module
param to gxfb to allow controlling of the vt switch (defaulting to no switch).
(Note: I'm not convinced that console_sem is the best way to protect this, but
we should probably have some form of locking..)
[akpm@linux-foundation.org: build fix]
Signed-off-by: Andres Salomon <dilinger@debian.org>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce unregister_/register_kprobes() for kprobe batch registration. This
can reduce waiting time for synchronized_sched() when a lot of probes have to
be unregistered at once.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prohibit users from probing preempt_schedule(). One way of prohibiting the
user from probing functions is by marking such functions with __kprobes. But
this method doesn't work for those functions, which are already marked to
different section like preempt_schedule() (belongs to __sched section). So we
use blacklist approach to refuse user from probing these functions.
In blacklist approach we populate the blacklisted function's starting address
and its size in kprobe_blacklist structure. Then we verify the user specified
address against start and end of the blacklisted function. So any attempt to
register probe on blacklisted functions will be rejected.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead. That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.
Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege. Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.
Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege. In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0. This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()). This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:
prctl(PR_SET_SECUREBITS, 0x2f);
This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.
[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch renames mpol_copy() to mpol_dup() because, well, that's what it
does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
existing mempolicy, allocates a new one and copies the contents.
In a later patch, I want to use the name mpol_copy() to copy the contents from
one mempolicy to another like, e.g., strcpy() does for strings.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a change that was requested some time ago by Mel Gorman. Makes sense
to me, so here it is.
Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object. However, ...
The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy. The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy. In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add some values of page flags to the vmcoreinfo data.
The vmcoreinfo data has the minimum debugging information only for dump
filtering. makedumpfile (dump filtering command) gets it to distinguish
unnecessary pages, and makedumpfile creates a small dumpfile.
An old makedumpfile (v1.2.4 or before) had assumed some values of page flags
internally, and this implementation could not follow the change of these
values. For example, Christoph Lameter is changing these values by the
follwing patch: http://lkml.org/lkml/2008/2/29/463
So a new makedumpfile (v1.2.5) came to need these values and I created this
patch to let the kernel output them.
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was used to compensate because MAX_NR_ZONES was not available to the
#ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of
__ZONE_COUNT.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_PAGEFLAGS specifies the number of page flags we are using. From that we
can calculate the number of bits leftover that can be used for zone, node (and
maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use
NR_PAGEFLAGS.
Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
These field widths have to be available to the preprocessor.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of enums create constants that are not available to the preprocessor
when building the kernel (f.e. MAX_NR_ZONES).
Arch code already has a way to export constants calculated to the preprocessor
through the asm-offsets.c file. Generate something similar for the core
kernel through kbuild.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist.
A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist. I.e., pages will be allocated from the closest allowed node with
available memory.
[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a substraction operation. As
the zone_idx is often required, it should be quickly accessible. The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.
This patch introduces a struct zoneref to store a zone pointer and a zone
index. The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index as
well as the node index.
[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using hrtimer with timer->cb_mode == HRTIMER_CB_SOFTIRQ
in some cases the clockevent is not programmed.
This happens, if:
- a timer is rearmed while it's state is HRTIMER_STATE_CALLBACK
- hrtimer_reprogram() returns -ETIME, when it is called after
CALLBACK is finished. This occurs if the new timer->expires
is in the past when CALLBACK is done.
In this case, the timer needs to be removed from the tree and put
onto the pending list again.
The patch is against 2.6.22.5, but AFAICS, it is relevant
for 2.6.25 also (in run_hrtimer_pending()).
Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.
Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
This patch has a small common code hit, namely making dup_mm non-static.
Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Arrgghhh...
Sorry about that, I'd been sure I'd folded that one, but it actually got
lost. Please apply - that breaks execve().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
it will break the system if somebody ever calls it for anything
else (we can't modify ->files of somebody else). Lose the
task_struct * argument.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.
As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no guarantee that there is physical ram below 4GB, and in
fact many boxes don't have exactly that.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix __aggregate_redistribute_shares() related lockup reported by
David S. Miller.
The problem this code tries to solve is 'accurately' calculating the 'fair'
share of the group weight for each cpu. The current code falls back to a global
group rebalance in case the sched_domain's span it looks at has no shares, but
does have tasks.
The reason it gets stuck here, is because its inherently racy - if someone
steals the last task after we compute the agg->rq_weight, but before we
rebalance, we'll never get out of the loop.
We could of course go fix that, but while looking at this issue I found that
this 'fallback' wasn't nearly as rare as I'd hoped it to be. In fact its quite
common - and given it walks the whole machine, thats very bad.
The new approach is simple (why didn't I think of it before?), we set the
aggregate shares to the full task group weight, and each larger sched domain
that encounters an aggregate shares larger than the weight, clips it (it
already re-distributes anyway).
This nicely converges to the desired global picture where the sum of all
shares equals the task group weight.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Miller reported:
|--------------->
the following commit:
| commit 27ec440779
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Thu Feb 28 21:00:21 2008 +0100
|
| sched: make cpu_clock() globally synchronous
|
| Alexey Zaytsev reported (and bisected) that the introduction of
| cpu_clock() in printk made the timestamps jump back and forth.
|
| Make cpu_clock() more reliable while still keeping it fast when it's
| called frequently.
|
| Signed-off-by: Ingo Molnar <mingo@elte.hu>
causes watchdog triggers when a cpu exits NOHZ state when it has been
there for >= the soft lockup threshold, for example here are some
messages from a 128 cpu Niagara2 box:
[ 168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239]
[ 168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0]
[ 168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511]
[ 168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0]
[ 169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0]
[ 169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515]
[ 169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0]
[ 169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0]
[ 169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#109 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#117 stuck for 92s! [swapper:0]
[ 169.171483] BUG: soft lockup - CPU#40 stuck for 80s! [dd:3239]
[ 169.331483] BUG: soft lockup - CPU#13 stuck for 86s! [swapper:0]
[ 169.351500] BUG: soft lockup - CPU#43 stuck for 101s! [dd:3239]
[ 169.531482] BUG: soft lockup - CPU#9 stuck for 129s! [mkdir:4565]
[ 169.595754] BUG: soft lockup - CPU#20 stuck for 93s! [swapper:0]
[ 169.626787] BUG: soft lockup - CPU#52 stuck for 93s! [swapper:0]
[ 169.626787] BUG: soft lockup - CPU#84 stuck for 92s! [swapper:0]
[ 169.636812] BUG: soft lockup - CPU#116 stuck for 94s! [swapper:0]
It's simple enough to trigger this by doing a 10 minute sleep after a
fresh bootup then starting a parallel kernel build.
I suspect this might be reintroducing a problem we've had and fixed
before, see the thread:
http://marc.info/?l=linux-kernel&m=119546414004065&w=2
<---------------|
touch the softlockup watchdog when exiting NOHZ state - we are
obviously not locked up.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A recent change prevents SGI Altix from booting.
This patch fixes the problem.
The regresson was introduced in commit 434d53b00d
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
iwlwifi: Fix built-in compilation of iwlcore
net: Unexport move_addr_to_{kernel,user}
rt2x00: Select LEDS_CLASS.
iwlwifi: Select LEDS_CLASS.
leds: Do not guard NEW_LEDS with HAS_IOMEM
[IPSEC]: Fix catch-22 with algorithm IDs above 31
time: Export set_normalized_timespec.
tcp: Make use of before macro in tcp_input.c
hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c
[NETNS]: Remove empty ->init callback.
[DCCP]: Convert do_gettimeofday() to getnstimeofday().
[NETNS]: Don't initialize err variable twice.
[NETNS]: The ip6_fib_timer can work with garbage on net namespace stop.
[IPV4]: Convert do_gettimeofday() to getnstimeofday().
[IPV4]: Make icmp_sk_init() static.
[IPV6]: Make struct ip6_prohibit_entry_template static.
tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c
[NET]: Expose netdevice dev_id through sysfs
skbuff: fix missing kernel-doc notation
[ROSE]: Fix soft lockup wrt. rose_node_list_lock
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()
[PATCH] proc_readfd_common() race fix
[PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()
[PATCH] teach seq_file to discard entries
[PATCH] umount_tree() will unhash everything itself
[PATCH] get rid of more nameidata passing in namespace.c
[PATCH] switch a bunch of LSM hooks from nameidata to path
[PATCH] lock exclusively in collect_mounts() and drop_collected_mounts()
[PATCH] move a bunch of declarations to fs/internal.h
The only reason to have separated __...() for those was to keep them inlined
for local users in exit.c. Since Alexey removed the inline on those, there's
no reason whatsoever to keep them around; just collapse with normal variants.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add missing kernel-doc in kernel/sched.c:
Warning(linux-2.6.25-git3//kernel/sched.c:7044): No description found for parameter 'span'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>