Fix some issues Steve Grubb had with the way NetLabel was using the audit
subsystem. This should make NetLabel more consistent with other kernel
generated audit messages specifying configuration changes.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Accepted connections of types other than AF_INET, AF_INET6, AF_UNIX won't
have an appropriate label derived from the peer, so don't use it.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: (33 commits)
IB/ipath: Fix lockdep error upon "ifconfig ibN down"
IB/ipath: Fix races with ib_resize_cq()
IB/ipath: Support new PCIE device, QLE7142
IB/ipath: Set CPU affinity early
IB/ipath: Fix EEPROM read when driver is compiled with -Os
IB/ipath: Fix and recover TXE piobuf and PBC parity errors
IB/ipath: Change HT CRC message to indicate how to resolve problem
IB/ipath: Clean up module exit code
IB/ipath: Call mtrr_del with correct arguments
IB/ipath: Flush RWQEs if access error or invalid error seen
IB/ipath: Improved support for PowerPC
IB/ipath: Drop unnecessary "(void *)" casts
IB/ipath: Support multiple simultaneous devices of different types
IB/ipath: Fix mismatch in shifts and masks for printing debug info
IB/ipath: Fix compiler warnings and errors on non-x86_64 systems
IB/ipath: Print more informative parity error messages
IB/ipath: Ensure that PD of MR matches PD of QP checking the Rkey
IB/ipath: RC and UC should validate SLID and DLID
IB/ipath: Only allow complete writes to flash
IB/ipath: Count SRQs properly
...
* git://oss.sgi.com:8090/xfs/xfs-2.6: (49 commits)
[XFS] Remove v1 dir trace macro - missed in a past commit.
[XFS] 955947: Infinite loop in xfs_bulkstat() on formatter() error
[XFS] pv 956241, author: nathans, rv: vapo - make ino validation checks
[XFS] pv 956240, author: nathans, rv: vapo - Minor fixes in
[XFS] Really fix use after free in xfs_iunpin.
[XFS] Collapse sv_init and init_sv into just the one interface.
[XFS] standardize on one sema init macro
[XFS] Reduce endian flipping in alloc_btree, same as was done for
[XFS] Minor cleanup from dio locking fix, remove an extra conditional.
[XFS] Fix kmem_zalloc_greedy warnings on 64 bit platforms.
[XFS] pv 955157, rv bnaujok - break the loop on EFAULT formatter() error
[XFS] pv 955157, rv bnaujok - break the loop on formatter() error
[XFS] Fixes the leak in reservation space because we weren't ungranting
[XFS] Add lock annotations to xfs_trans_update_ail and
[XFS] Fix a porting botch on the realtime subvol growfs code path.
[XFS] Minor code rearranging and cleanup to prevent some coverity false
[XFS] Remove a no-longer-correct debug assert from dio completion
[XFS] Add a greedy allocation interface, allocating within a min/max size
[XFS] Improve error handling for the zero-fsblock extent detection code.
[XFS] Be more defensive with page flags (error/private) for metadata
...
Fix undefined reference in i2c_sibyte_exit().
drivers/built-in.o: In function `i2c_sibyte_exit':
i2c-sibyte.c:(.exit.text+0x368): undefined reference to `i2c_del_bus'
i2c-sibyte.c:(.exit.text+0x368): relocation truncated to fit: R_MIPS_26 against `i2c_del_bus'
i2c-sibyte.c:(.exit.text+0x38c): undefined reference to `i2c_del_bus'
i2c-sibyte.c:(.exit.text+0x38c): relocation truncated to fit: R_MIPS_26 against `i2c_del_bus'
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix obscure race condition in kernel/cpuset.c attach_task() code.
There is basically zero chance of anyone accidentally being harmed by this
race.
It requires a special 'micro-stress' load and a special timing loop hacks
in the kernel to hit in less than an hour, and even then you'd have to hit
it hundreds or thousands of times, followed by some unusual and senseless
cpuset configuration requests, including removing the top cpuset, to cause
any visibly harm affects.
One could, with perhaps a few days or weeks of such effort, get the
reference count on the top cpuset below zero, and manage to crash the
kernel by asking to remove the top cpuset.
I found it by code inspection.
The race was introduced when 'the_top_cpuset_hack' was introduced, and one
piece of code was not updated. An old check for a possibly null task
cpuset pointer needed to be changed to a check for a task marked
PF_EXITING. The pointer can't be null anymore, thanks to
the_top_cpuset_hack (documented in kernel/cpuset.c). But the task could
have gone into PF_EXITING state after it was found in the task_list scan.
If a task is PF_EXITING in this code, it is possible that its task->cpuset
pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
than because the top_cpuset was that tasks last valid cpuset. In that
case, the wrong cpuset reference counter would be decremented.
The fix is trivial. Instead of failing the system call if the tasks cpuset
pointer is null here, fail it if the task is in PF_EXITING state.
The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
the top_cpuset is done without locking, so could happen at anytime. But it
is done during the exit handling, after the PF_EXITING flag is set. So if
we verify that a task is still not PF_EXITING after we copy out its cpuset
pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
hack references to the top_cpuset.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a note about "format=flowed" when sending patches and explain how to
fix mozilla. Thunderbird has the similar options.
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
the locking self-test:
------------>
| Locking API testsuite:
----------------------------------------------------------------------------
| spin |wlock |rlock |mutex | wsem | rsem |
--------------------------------------------------------------------------
A-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok |
A-B-B-C-C-D-D-A deadlock: ok |FAILED| ok | ok | ok | ok |
A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok |
A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok |FAILED|
after much debugging it turned out to be caused by accidental chain-hash
key collisions. The current hash is:
#define iterate_chain_key(key1, key2) \
(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
(key2))
where MAX_LOCKDEP_KEYS_BITS is 11. This hash is pretty good as it will
shift by 5 bits in every iteration, where every new ID 'mixed' into the
hash would have up to 11 bits. But because there was a 6 bits overlap
between subsequent IDs and their high bits tended to be similar, there was
a chance for accidental chain-hash collision for a low number of locks
held.
the solution is to shift by 11 bits:
#define iterate_chain_key(key1, key2) \
(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
(key2))
This keeps the hash perfect up to 5 locks held, but even above that the
hash is still good because 11 bits is a relative prime to the total 64
bits, so a complete match will only occur after 64 held locks (which doesnt
happen in Linux). Even after 5 locks held, entropy of the 5 IDs mixed into
the hash is already good enough so that overlap doesnt generate a colliding
hash ID.
with this change the false positives went away.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o As per ELF specifications, it looks like that elf note "namesz" field
contains the length of "name" including the size of null character. And
currently we are filling "namesz" without taking into the consideration
the null character size.
o Kexec-tools performs this check deligently hence I ran into the issue
while trying to open /proc/kcore in kexec-tools for some info.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This unlock/lock on a super-unlikely path isn't worth the kernel text.
Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Perform a code cleanup against the expand_fdtable() and expand_files()
functions inside fs/file.c. It aims to make the flow of code within these
functions simpler and easier to understand, via added comments and modest
refactoring.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* fix copright typo
* remove trailing whitespace
* remove Kernel Traffic from Resources. Zack, it was great reading!
* Name Arjan by name and fix URL of "How to NOT" paper.
* Remove "Last updated" tag.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add tty locking around the audit and accounting code.
The whole current->signal-> locking is all deeply strange but it's for
someone else to sort out. Add rather than replace the lock for acct.c
Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If you send a priority character (as is done for flow control) then the tty
driver can either have its own method for "jumping the queue" or the characrer
can be queued normally. In the latter case we call the write method but
without the atomic_write_lock taken elsewhere.
Make this consistent. Note that the send_xchar method if implemented remains
outside of the lock as it can jump ahead of a current write so must not be
locked out by it.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The driver has no business doing this work itself any more and hasn't for some
years. When the new speed stuff goes in this will break entirely so fix it up
ready.
Also remove a #if 0 around a comment....
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the chip detected "oscillator stop" condition, show an warning message.
And initialize it with the Epoch time instead of leaving it with unknown
date/time.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All sound/sound_firmware.c contains is mod_firmware_load() that is a legacy
API only used by some OSS drivers.
This patch builds it into an own sound_firmware module that is only built
depending on CONFIG_SOUND_PRIME making the kernel slightly smaller for ALSA
users.
[alan@lxorguk.ukuu.org.uk: comment fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I had to look back: this code was extracted from the module.c code in 2005.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patches solve the following problem: We want to grant access to devices
based on who is logged in from where, etc. This includes switching back and
forth between multiple user sessions, etc.
Using ACLs to define device access for logged-in users gives us all the
flexibility we need in order to fully solve the problem.
Device special files nowadays usually live on tmpfs, hence tmpfs ACLs.
Different distros have come up with solutions that solve the problem to
different degrees: SUSE uses a resource manager which tracks login sessions
and sets ACLs on device inodes as appropriate. RedHat uses pam_console, which
changes the primary file ownership to the logged-in user. Others use a set of
groups that users must be in in order to be granted the appropriate accesses.
The freedesktop.org project plans to implement a combination of a
console-tracker and a HAL-device-list based solution to grant access to
devices to users, and more distros will likely follow this approach.
These patches have first been posted here on 2 February 2005, and again
on 8 January 2006. We have been shipping them in SLES9 and SLES10 with
no problems reported. The previous submission is archived here:
http://lkml.org/lkml/2006/1/8/229http://lkml.org/lkml/2006/1/8/230http://lkml.org/lkml/2006/1/8/231
This patch:
Add some infrastructure for access control lists on in-memory
filesystems such as tmpfs.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
POSIX states that poll() shall fail with EINVAL if nfds > OPEN_MAX. In
this context, POSIX is referring to sysconf(OPEN_MAX), which is the value
of current->signal->rlim[RLIMIT_NOFILE].rlim_cur in the linux kernel, not
the compile-time constant which happens to also be named OPEN_MAX. In the
current code, an application may poll up to max_fdset file descriptors,
even if this exceeds RLIMIT_NOFILE. The current code also breaks
applications which poll more than max_fdset descriptors, which worked circa
2.4.18 when the check was against NR_OPEN, which is 1024*1024. This patch
enforces the limit precisely as POSIX defines, even if RLIMIT_NOFILE has
been changed at run time with ulimit -n.
To elaborate on the rationale for this, there are three cases:
1) RLIMIT_NOFILE is at the default value of 1024
In this (default) case, the patch changes nothing. Calls with nfds > 1024
fail with EINVAL both before and after the patch, and calls with nfds <=
1024 pass the check both before and after the patch, since 1024 is the
initial value of max_fdset.
2) RLIMIT_NOFILE has been raised above the default
In this case, poll() becomes more permissive, allowing polling up to
RLIMIT_NOFILE file descriptors even if less than 1024 have been opened.
The patch won't introduce new errors here. If an application somehow
depends on poll() failing when it polls with duplicate or invalid file
descriptors, it's already broken, since this is already allowed below 1024,
and will also work above 1024 if enough file descriptors have been open at
some point to cause max_fdset to have been increased above nfds.
3) RLIMIT_NOFILE has been lowered below the default
In this case, the system administrator or the user has gone out of their
way to protect the system from inefficient (or malicious) applications
wasting kernel memory. The current code allows polling up to 1024 file
descriptors even if RLIMIT_NOFILE is much lower, which is not what the user
or administrator intended. Well-written applications which only poll
valid, unique file descriptors will never notice the difference, because
they'll hit the limit on open() first. If an application gets broken
because of the patch in this case, then it was already poorly/maliciously
designed, and allowing it to work in the past was a violation of POSIX and
a DoS risk on low-resource systems.
With this patch, poll() will permit exactly what POSIX suggests, no more,
no less, and for any run-time value set with ulimit -n, not just 256 or
1024. There are existing apps which which poll a large number of file
descriptors, some of which may be invalid, and if those numbers stradle
1024, they currently fail with or without the patch in -mm, though they
worked fine under 2.4.18.
Signed-off-by: Chris Snook <csnook@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For len equal to 4, we never call sppp_lcp_conf_parse_options(),
therefore rmagic does not get initialized.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Acked-by: Paul Fulghum <paulkf@microgate.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I've been using systemtap for some debugging and I noticed that it can't
probe a lot of modules. Turns out it's kind of silly, the sections section
of /sys/module is limited to 32byte filenames and many of the actual
sections are a a bit longer than that.
[akpm@osdl.org: rewrite to use dymanic allocation]
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I just got a bounce telling me my contributions aren't welcome.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
People search maintainers for NBD and then decide it is not
maintained.
(akpm: ditto LVM. And other things, but I forget what they were)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This cleans up SubmittingPatches a bit.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Constify two structs.
Correct some typos.
Compile-tested and run-tested (module inserted) on 2.6.18-rc4-mm3.
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch limits the messages when ldisc open faulures happen. It happens
under memory pressure.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
akpm draws my attention to the fact that sysctl(VM_PAGE_CLUSTER) might
conceivably change page_cluster to 0 while valid_swaphandles() is in the
middle of using it, leading to an embarrassingly long loop: take a local
snapshot of page_cluster and work with that.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ratelimit_pages in page-writeback.c is recalculated (in set_ratelimit())
every time a CPU is hot-added/removed. But this value is not recalculated
when new pages are hot-added.
This patch fixes that problem by calling set_ratelimit() when new pages
are hot-added.
[akpm@osdl.org: cleanups]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
page-writeback.c has a static local variable "total_pages", which is the
total number of pages in the system.
There is a global variable "vm_total_pages", which is the total number of
pages the VM controls.
Both are assigned from the return value of nr_free_pagecache_pages().
This patch removes the local variable and uses the global variable in that
place.
One more issue with the local static variable "total_pages" is that it is
not updated when new pages are hot-added. Since vm_total_pages is updated
when new pages are hot-added, this patch fixes that problem too.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This change corrects the logic on the preprocessor conditionals that
include support for ISA port i/o (/dev/ioports) into the mem character
driver.
This fixes the following error when building for powerpc platforms with
CONFIG_PCI=n.
drivers/built-in.o: undefined reference to `pci_io_base'
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Acked-by: Linas Vepstas <lins@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I think this is a time to step down from my SUPERH architecture
maintainerships. The major development issues for this port seem to shift
on the hardwares I can't access and I have no recent activity on kernel. I
shouldn't qualify as a maintainer of SUPERH port now and there is no
problem because Paul is actively maintaining it. The attached patch drops
my name, address and web URL from MAINTAINERS file.
Signed-off-by: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace current->fs by fs helper variable to reduce some indirection
overhead and (at least at the moment, before the current_thread_info() %gs
PDA improvement is available) get rid of more costly current references.
Reduces fs/namei.o from 37786 to 37082 Bytes (704 Bytes saved).
[akpm@osdl.org: cleanup]
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch (as776) adds a new chapter to Documentation/CodingStyle,
explaining the circumstances under which a function should return 0 for
failure and non-zero for success as opposed to a negative error code for
failure and 0 for success.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Un-inlining rwsem_down_failed_common() (two callsites) reduced lib/rwsem.o
on my Athlon, gcc 4.1.2 from 5935 to 5480 Bytes (455 Bytes saved).
I thus guess that reduced icache footprint (and better function caching) is
worth more than any function call overhead.
Signed-off-by: Andreas Mohr <andi@lisas.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds Jim Lewis to the MAINTAINERS file for the Spidernet
network driver.
Signed-off-by: James K Lewis <jklewis@us.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Forward port of the patch by Solar and ported by Julio.
Compiles, boots, and passes my looptorturetest.sh.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Julio Auto <mindvortex@gmail.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This appears to be a verbatim copy-n-paste of the GPL.
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
it could remove a CPU or Node from the top cpuset, while leaving it still
in some child cpusets.
One basic rule of cpusets is that each cpusets cpus and mems are subsets of
its parents. The cpuset hot unplug code violated this rule.
So the cpuset hotunplug handler must walk down the tree, removing any
removed CPU or Node from all cpusets.
However, it is not allowed to make a cpusets cpus or mems become empty.
They can only transition from empty to non-empty, not back.
So if the last CPU or Node would be removed from a cpuset by the above
walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
that still has something online, and copy its CPU or Memory placement.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code. Make this top cpus file read-only.
On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.
If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset. This is a surprising regression
over earlier kernels that didn't have cpusets enabled.
One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.
This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets. The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed. With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes. Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.
In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map. Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.
[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A previous patch to allow an exiting task to OOM kill itself (and thereby
avoid a little deadlock) introduced a problem. We don't want the
PF_EXITING task, even if it is 'current', to access mem reserves if there
is already a TIF_MEMDIE process in the system sucking up reserves.
Also make the commenting a little bit clearer, and note that our current
scheme of effectively single threading the OOM killer is not itself
perfect.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- It is not possible to have task->mm == &init_mm.
- task_lock() buys nothing for 'if (!p->mm)' check.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No logic changes, but imho easier to read.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only one usage of TASK_DEAD outside of last schedule path,
select_bad_process:
for_each_task(p) {
if (!p->mm)
continue;
...
if (p->state == TASK_DEAD)
continue;
...
TASK_DEAD state is set at the end of do_exit(), this means that p->mm
was already set == NULL by exit_mm(), so this task was already rejected
by 'if (!p->mm)' above.
Note also that the caller holds tasklist_lock, this means that p can't
pass exit_notify() and then set TASK_DEAD when p->mm != NULL.
Also, remove open-coded is_init().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>