Commit Graph

1151 Commits

Author SHA1 Message Date
john stultz
260a42309b [PATCH] Time: Let user request precision from current_tick_length()
Change the current_tick_length() function so it takes an argument which
specifies how much precision to return in shifted nanoseconds.  This provides
a simple way to convert between NTPs internal nanoseconds shifted by
(SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
clocksource abstraction.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
ad596171ed [PATCH] Time: Use clocksource infrastructure for update_wall_time
Modify the update_wall_time function so it increments time using the
clocksource abstraction instead of jiffies.  Since the only clocksource driver
currently provided is the jiffies clocksource, this should result in no
functional change.  Additionally, a timekeeping_init and timekeeping_resume
function has been added to initialize and maintain some of the new timekeping
state.

[hirofumi@mail.parknet.co.jp: fixlet]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
734efb467b [PATCH] Time: Clocksource Infrastructure
This introduces the clocksource management infrastructure.  A clocksource is a
driver-like architecture generic abstraction of a free-running counter.  This
code defines the clocksource structure, and provides management code for
registering, selecting, accessing and scaling clocksources.

Additionally, this includes the trivial jiffies clocksource, a lowest common
denominator clocksource, provided mainly for use as an example.

[hirofumi@mail.parknet.co.jp: Don't enable IRQ too early]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
Ingo Molnar
81615b624a [PATCH] Convert kernel/cpu.c to mutexes
Convert kernel/cpu.c from semaphore to mutex.

I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
fit mutex semantics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Ingo Molnar
1fb00c6cbd [PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqs
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts
disabled, and they expect interrupts to stay disabled, else they crash.

Work around this bug by making mutex debugging variants save/restore irq
flags.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Linus Torvalds
3448097fcc Revert "swsusp special saveable pages support" commits
This reverts commits

  3e3318dee0 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
  b6370d96e0 [PATCH] swsusp: i386 mark special saveable/unsaveable pages
  ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support

because not only do they apparently cause page faults on x86, the
infrastructure doesn't compile on powerpc.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 18:41:00 -07:00
Linus Torvalds
72cf2709bf Fix PM_TRACE dependency: works only on 32-bit x86 for now
Not that x86-64 and other architecture support should be difficult to
add (trivial fixups to the data format and add the proper linker script
entry).

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:04:15 -07:00
KaiGai Kohei
77787bfb44 [PATCH] pacct: none-delayed process accounting accumulation
In current 2.6.17 implementation, signal_struct refered from task_struct is
used for per-process data structure.  The pacct facility also uses it as a
per-process data structure to store stime, utime, minflt, majflt.  But those
members are saved in __exit_signal().  It's too late.

For example, if some threads exits at same time, pacct facility has a
possibility to drop accountings for a part of those threads.  (see, the
following 'The results of original 2.6.17 kernel') I think accounting
information should be completely collected into the per-process data structure
before writing out an accounting record.

This patch fixes this matter.  Accumulation of stime, utime, minflt and majflt
are done before generating accounting record.

[mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:25 -07:00
KaiGai Kohei
f6ec29a42d [PATCH] pacct: avoidance to refer the last thread as a representation of the process
When pacct facility generate an 'ac_flag' field in accounting record, it
refers a task_struct of the thread which died last in the process.  But any
other task_structs are ignored.

Therefore, pacct facility drops ASU flag even if root-privilege operations are
used by any other threads except the last one.  In addition, AFORK flag is
always set when the thread of group-leader didn't die last, although this
process has called execve() after fork().

We have a same matter in ac_exitcode.  The recorded ac_exitcode is an exit
code of the last thread in the process.  There is a possibility this exitcode
is not the group leader's one.
2006-06-25 10:01:25 -07:00
KaiGai Kohei
0e4648141a [PATCH] pacct: add pacct_struct to fix some pacct bugs.
The pacct facility need an i/o operation when an accounting record is
generated.  There is a possibility to wake OOM killer up.  If OOM killer is
activated, it kills some processes to make them release process memory
regions.

But acct_process() is called in the killed processes context before calling
exit_mm(), so those processes cannot release own memory.  In the results, any
processes stop in this point and it finally cause a system stall.
2006-06-25 10:01:25 -07:00
Randy Dunlap
9e37bd301e [PATCH] kthread: move kernel-doc and put it into DocBook
Move kthread API kernel-doc from kthread.h to kthread.c & fix it.
Add kthread API to kernel-api DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:24 -07:00
Randy Dunlap
fa9799e33d [PATCH] ktime/hrtimer: fix kernel-doc comments
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:23 -07:00
Heiko Carstens
fc75cdfa5b [PATCH] cpu hotplug: fix CPU_UP_CANCEL handling
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
called with CPU_UP_CANCELED.  A few of these callbacks assume that on
CPU_UP_PREPARE a pointer to task has been stored in a percpu array.  This
assumption is not true if CPU_UP_PREPARE fails and the following calls to
kthread_bind() in CPU_UP_CANCELED will cause an addressing exception
because of passing a NULL pointer.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Serge E. Hallyn
8bdd1d1250 [PATCH] kthread: convert stop_machine into a kthread
- Update stop_machine.c to spawn stop_machine as kthreads rather than the
  deprecated kernel_threads.

- Update stop_machine to use the more efficient kthread_bind() before
  running task in place of set_cpus_allowed() after.

[akpm@osdl.org: remove now-wrong set_cpus_allowed()]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Anton Blanchard
2aa92581fb [PATCH] Link error when futexes are disabled on 64bit architectures
If futexes are disabled we fail to link on ppc64.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:17 -07:00
akpm@osdl.org
838cd153a5 [PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs.

Among the NPTL test failures seen are some arising from sigsuspend problems
for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked
despite glibc's carefully excluding it from sets of signals to block.
Specifically, testing suggests it blocks signal N^32 instead of signal N,
so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead.

glibc's sigset_t uses an array of unsigned long, as does the kernel.
In both cases, signal N+1 is represented as
(1UL << (N % (8 * sizeof (unsigned long)))) in word number
(N / (8 * sizeof (unsigned long))).

Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an
array of 64-bit words.  For little-endian, the layout is the same, with
signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for
big-endian, userspace has that layout while in the kernel each 8 bytes have
the two halves swapped from the userspace layout.

The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace
sigset to kernel format.  If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses
logic of the form

  set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 )

to convert the userspace sigset to a kernel one.  This looks correct to me
for both big and little endian, given that in userspace compat->sig[1] will
represent signals 33-64, and so will the high 32 bits of set->sig[0] in the
kernel.  If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for
__MIPSEL__, it uses

  set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 );

which seems incorrect for both big and little endian, and would
explain the observed symptoms.

This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect
then that macro serves no purpose, in which case something like the
following patch would seem appropriate to remove it.

Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Stephen Hemminger
eab03ac7bd [PATCH] Get rid of /proc/sys/proc
The table is empty, why does it still exist?

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Jan Engelhardt
3b9c04106b [PATCH] printk time parameter
Currently, enabling/disabling printk timestamps is only possible through
reboot (bootparam) or recompile.  I normally do not run with timestamps
(since syslog handles that in a good manner), but for measuring small
kernel delays (e.g.  irq probing - see parport thread) I needed subsecond
precision, but then again, just for some minutes rather than all kernel
messages to come.  The following patch adds a module_param() with which the
timestamps can be en-/disabled in a live system through
/sys/modules/printk/parameters/printk_time.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:13 -07:00
Matt Helsley
11e64757f9 [PATCH] Remove unecessary NULL check in kernel/acct.c
copy_process() appears to be the only caller of acct_clear_integrals() and
does not pass in NULL task pointers.  Remove the unecessary check.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:09 -07:00
Andreas Mohr
3b364b8d58 [PATCH] constify parts of kernel/power/
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:08 -07:00
Andrew Morton
b61367732f [PATCH] schedule_on_each_cpu(): reduce kmalloc() size
schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU
64-bit.

Rework it so that we do one 8192-byte allocation and then a pile of tiny ones,
via alloc_percpu().  This has a much higher chance of success (100% in the
current VM).

This also has the effect of reducing the memory requirements from NR_CPUS*n to
num_possible_cpus()*n.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Adrian Bunk
83cc5ed3c4 [PATCH] kernel/sys.c: cleanups
- proper prototypes for the following functions:
  - ctrl_alt_del()  (in include/linux/reboot.h)
  - getrusage()     (in include/linux/resource.h)
- make the following needlessly global functions static:
  - kernel_restart_prepare()
  - kernel_kexec()

[akpm@osdl.org: compile fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:06 -07:00
Michael Ellerman
76a8ad2939 [PATCH] Make printk work for really early debugging
Currently printk is no use for early debugging because it refuses to
actually print anything to the console unless
cpu_online(smp_processor_id()) is true.

The stated explanation is that console drivers may require per-cpu
resources, or otherwise barf, because the system is not yet setup
correctly.  Fair enough.

However some console drivers might be quite happy running early during
boot, in fact we have one, and so it'd be nice if printk understood that.

So I added a flag (which I would have called CON_BOOT, but that's taken)
called CON_ANYTIME, which indicates that a console is happy to be called
anytime, even if the cpu is not yet online.

Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus
console driver that BUG()s if called while offline.  No problems AFAICT.
Built for i386 UP & SMP.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:05 -07:00
Alan Stern
bbb1747d4e [PATCH] Allow raw_notifier callouts to unregister themselves
Since raw_notifier chains don't benefit from any centralized locking
protections, they shouldn't suffer from the associated limitations.  Under
some circumstances it might make sense for a raw_notifier callout routine
to unregister itself from the notifier chain.  This patch (as678) changes
the notifier core to allow for such things.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Paul Mackerras
bfe5d83419 [PATCH] Define __raw_get_cpu_var and use it
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled.  For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.

This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Jesper Juhl
f867d2a2e5 [PATCH] ensure NULL deref can't possibly happen in is_exported()
If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is
given a NULL 'mod' and lookup_symbol(name, __start___ksymtab,
__stop___ksymtab) returns 0, then we'll end up dereferencing a NULL
pointer.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:59 -07:00
Linus Torvalds
eb71c87a49 Add some basic resume trace facilities
Considering that there isn't a lot of hw we can depend on during resume,
this is about as good as it gets.

This is x86-only for now, although the basic concept (and most of the
code) will certainly work on almost any platform.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-24 14:44:01 -07:00
Eric Sesterhenn
125e18745f [PATCH] More BUG_ON conversion
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Jan Beulich
908dcecda1 [PATCH] adjust handle_IRR_event() return type
Correct the return type of handle_IRQ_event() (inconsistency noticed during
Xen development), and remove redundant declarations.  The return type
adjustment required breaking out the definition of irqreturn_t into a
separate header, in order to satisfy current include order dependencies.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Porpoise
3439dd86e3 [PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop
When CONFIG_BASE_SAMLL=1, cascade() in may enter the infinite loop.
Because of CONFIG_BASE_SMALL=1(TVR_BITS=6 and TVN_BITS=4), the list
base->tv5 may cascade into base->tv5.  So, the kernel enters the infinite
loop in the function cascade().

I created a test module to verify this bug, and a patch to fix it.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/timer.h>
#if 0
#include <linux/kdb.h>
#else
#define kdb_printf printk
#endif

#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

#define TV_SIZE(N)  (N*TVN_BITS  + TVR_BITS)

struct timer_list timer0;
struct timer_list dummy_timer1;
struct timer_list dummy_timer2;

void dummy_timer_fun(unsigned long data) {
}
unsigned long j=0;
void check_timer_base(unsigned long data)
{
        kdb_printf("check_timer_base %08x\n",jiffies);
        mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF);
}

int init_module(void)
{
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0,jiffies+1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;

        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j=jiffies;
        j&=(~((1<<TV_SIZE(3))-1));
        j+=(1<<TV_SIZE(3));
        j+=(1<<TV_SIZE(4));

        kdb_printf("mod_timer %08x\n",j);

        mod_timer(&dummy_timer1, j );
        mod_timer(&dummy_timer2, j );

        return 0;
}

void cleanup_module()
{
        del_timer_sync(&timer0);
        del_timer_sync(&dummy_timer1);
        del_timer_sync(&dummy_timer2);
}

(Cleanups from Oleg)

[oleg@tv-sign.ru: use list_replace_init()]
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Oleg Nesterov
626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Randy Dunlap
862f5f0133 [PATCH] Doc: add audit & acct to DocBook
Fix one audit kernel-doc description (one parameter was missing).
Add audit*.c interfaces to DocBook.
Add BSD accounting interfaces to DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Paul E. McKenney
d83015b8f6 [PATCH] Make RCU API inaccessible to non-GPL Linux kernel modules
Remove synchronize_kernel() (deprecated 2-APR-2005 in
http://lkml.org/lkml/2005/4/3/11) and makes the RCU API inaccessible to
non-GPL Linux kernel modules (as was announced more than one year ago in
http://lkml.org/lkml/2005/4/3/8).  Tested on x86 and ppc64.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Jes Sorensen
55f4e8d156 [PATCH] kernel/sys.c doesn't need init.h
kernel/sys.c doesn't have anything in it relying on linux/init.h -
remove the include.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Andrew Morton
57ae250861 [PATCH] CONFIG_NET=n build fix
Cc: Greg KH <greg@kroah.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:06 -07:00
Andreas Mohr
83d4e6e7fb [PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely()
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:05 -07:00
Daniel Walker
89d0cf01c0 [PATCH] invert irq/migration.c brach prediction
If you get to that point in the code it means that desc->move_irq is set,
pending_irq_cpumask[irq] and cpu_online_map should have a value.  Still
pretty good chance anding those two you'll still have a value.  So these
two branch predictors should be inverted.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Ingo Molnar
8e0a43d8fa [PATCH] cond_resched() might_sleep() fix
add the __might_sleep() check back to cond_resched().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Prasanna Meda
6e66726047 [PATCH] dup fd error fix
Set errorp in dup_fd, it will be used in sys_unshare also.

Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Andrew Morton
0ae26f1b31 [PATCH] mmput() might sleep
exit_aio() and exit_mmap() can sleep.  But it's easy to accidentally call
mmput() from inside locks.

Cc: Dave Peterson <dsp@llnl.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:03 -07:00
Jeff Moyer
c330dda908 [PATCH] Add a sysfs file to determine if a kexec kernel is loaded
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded.  Each
file contains a simple boolean value indicating whether the relevant kernel
has been loaded into memory.  The motivation for this is geared around
support.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:02 -07:00
Rafael J. Wysocki
968808b895 [PATCH] swsusp: use less memory during resume
Make swsusp allocate only as much memory as needed to store the image data
and metadata during resume.

Without this patch swsusp additionally allocates many page frames that will
conflict with the "original" locations of the image data and are considered
as "unsafe", treating them as "eaten" pages (ie.  allocated but unusable).

The patch makes swsusp allocate as many pages as it'll need to store the
data read from the image in one shot, creating a list of allocated "safe"
pages, and use the observation that all pages allocated by it are marked
with the PG_nosave and PG_nosave_free flags set.   Namely, when it's about
to load an image page, swsusp can check whether the page frame
corresponding to the "original" location of this page has been allocated
(ie.  if the page frame has the PG_nosave and PG_nosave_free flags set) and
if so, it can load the page directly into this page frame.   Otherwise it
uses an allocated "safe" page from the list to store the data that will be
copied to their "original" location later on.

This allows us to save many page copyings and page allocations during
resume and in the future it may allow us to load images greater than 50% of
the normal zone.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: "Pavel Machek" <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:00 -07:00
Adrian Bunk
7bff24e255 [PATCH] kernel/power/snapshot.c: cleanups
- make needlessly global functions static
- make dummy functions static inline

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Rafael J. Wysocki
a938c356d5 [PATCH] swsusp: take lowmem reserves into account
swsusp allocates memory from the normal zone, so it cannot use lowmem
reserve pages from the lower zones.  Therefore it should not count these
pages as available to it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Shaohua Li
ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support
1. Add architecture specific pages save/restore support.  Next two patches
   will use this to save/restore 'ACPI NVS' pages.

2. Allow reserved pages 'nosave'.  This could avoid save/restore BIOS
   reserved pages.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Zhang Yanmin
1b61b910e9 [PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.

1) In function do_irq_balance, after kernel finds the min_loaded cpu but
   before calling set_pending_irq to really pin the selected_irq to the
   target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
   Later on, when the irq is acked, kernel would calls
   move_native_irq=>desc->handler->set_affinity to change the irq affinity.
    However, every function pointed by
   hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
   always changes irq_affinity[irq] to cpumask.  Next time when recalling
   do_irq_balance, it has to do cpu_ands again with
   irq_affinity[selected_irq], but irq_affinity[selected_irq] already
   becomes one cpu selected by the first irq balance.

2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
   issue.

[akpm@osdl.org: cleanups]
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:57 -07:00
David Quigley
22fb52dd73 [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)
Add a security hook call to enable security modules to control the ability
to attach a task to a cpuset.  While limited control over this operation is
possible via permission checks on the pseudo fs interface, those checks are
not sufficient to control access to the target task, which is looked up in
this function.  The existing task_setscheduler hook is re-used for this
operation since this falls under the same class of operations.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:54 -07:00
David Quigley
e7834f8fcc [PATCH] SELinux: add security hooks to {get,set}affinity
This patch adds LSM hooks into the setaffinity and getaffinity functions to
enable security modules to control these operations between tasks with
task_setscheduler and task_getscheduler LSM hooks.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
9216dfad4f [PATCH] move_pages: fix 32 -> 64 bit compat function
The definition of the third parameter is a pointer to an array of virtual
addresses which give us some trouble.  The existing code calculated the
wrong address in the array since I used void to avoid having to specify a
type.

I now use the correct type "compat_uptr_t __user *" in the definition of
the function in kernel/compat.c.

However, I used __u32 in syscalls.h.  Would have to include compat.h there
in order to provide the same definition which would generate an ugly
include situation.

On both ia64 and x86_64 compat_uptr_t is u32. So this works although
parameter declarations differ.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
1b2db9fb7a [PATCH] sys_move_pages: 32bit support (i386, x86_64)
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)

Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.

Add compat_sys_move_pages to the x86_64 32bit binary layer.  Note that it is
not up to date so I added the missing pieces.  Not sure if this is done the
right way.

[akpm@osdl.org: compile fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
742755a1d8 [PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES	The page is now on the indicated node.

-ENOENT		Page is not present

-EACCES		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM		The page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Invalid address (no VMA or zero page).

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. The page must be written
		back in order to move it since the page is dirty and the
		filesystem does not provide a migration function that
		would allow the moving of dirty pages.

-EINVAL		A dirty page cannot be moved. The filesystem does not provide
		a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT		No pages found that would require moving. All pages
		are either already on the target node, not present, had an
		invalid address or could not be moved because they were
		mapped by multiple processes.

-EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
		or an attempt to move a process belonging to another user.

-EACCES		One of the target nodes is not allowed by the current cpuset.

-ENODEV		One of the target nodes is not online.

-ESRCH		Process does not exist.

-E2BIG		Too many pages to move.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

  Detailed results for sys_move_pages()

  Pass a pointer to an integer to get_new_page() that may be used to
  indicate where the completion status of a migration operation should be
  placed.  This allows sys_move_pags() to report back exactly what happened to
  each page.

  Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Rafael J. Wysocki
d6277db4ab [PATCH] swsusp: rework memory shrinker
Rework the swsusp's memory shrinker in the following way:

- Simplify balance_pgdat() by removing all of the swsusp-related code
  from it.

- Make shrink_all_memory() use shrink_slab() and a new function
  shrink_all_zones() which calls shrink_active_list() and
  shrink_inactive_list() directly for each zone in a way that's optimized
  for suspend.

In shrink_all_memory() we try to free exactly as many pages as the caller
asks for, preferably in one shot, starting from easier targets.   If slab
caches are huge, they are most likely to have enough pages to reclaim.
 The inactive lists are next (the zones with more inactive pages go first)
etc.

Each time shrink_all_memory() attempts to shrink the active and inactive
lists for each zone in 5 passes.   In the first pass, only the inactive
lists are taken into consideration.   In the next two passes the active
lists are also shrunk, but mapped pages are not reclaimed.   In the last
two passes the active and inactive lists are shrunk and mapped pages are
reclaimed as well.  The aim of this is to alter the reclaim logic to choose
the best pages to keep on resume and improve the responsiveness of the
resumed system.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:48 -07:00
KAMEZAWA Hiroyuki
fadd8fbd15 [PATCH] support for panic at OOM
This patch adds panic_on_oom sysctl under sys.vm.

When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
processes.  And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
the same way as it does today.  Of course, the default value is 0 and only
root can modifies it.

In general, oom_killer works well and kill rogue processes.  So the whole
system can survive.  But there are environments where panic is preferable
rather than kill some processes.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:47 -07:00
David Howells
726c334223 [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.

This complements the get_sb() patch.  That reduced the significance of
sb->s_root, allowing NFS to place a fake root there.  However, NFS does
require a dentry to use as a target for the statfs operation.  This permits
the root in the vfsmount to be used instead.

linux/mount.h has been added where necessary to make allyesconfig build
successfully.

Interest has also been expressed for use with the FUSE and XFS filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
David Howells
454e2398be [PATCH] VFS: Permit filesystem to override root dentry on mount
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers.  For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing.  In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

 (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
     pointer argument and return an integer, so most filesystems have to change
     very little.

 (*) If one of the convenience function is not used, then get_sb() should
     normally call simple_set_mnt() to instantiate the vfsmount. This will
     always return 0, and so can be tail-called from get_sb().

 (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
     dcache upon superblock destruction rather than shrink_dcache_anon().

     This is required because the superblock may now have multiple trees that
     aren't actually bound to s_root, but that still need to be cleaned up. The
     currently called functions assume that the whole tree is rooted at s_root,
     and that anonymous dentries are not the roots of trees which results in
     dentries being left unculled.

     However, with the way NFS superblock sharing are currently set to be
     implemented, these assumptions are violated: the root of the filesystem is
     simply a dummy dentry and inode (the real inode for '/' may well be
     inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
     with child trees.

     [*] Anonymous until discovered from another tree.

 (*) The documentation has been adjusted, including the additional bit of
     changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
Linus Torvalds
45c091bb2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
  [POWERPC] re-enable OProfile for iSeries, using timer interrupt
  [POWERPC] support ibm,extended-*-frequency properties
  [POWERPC] Extra sanity check in EEH code
  [POWERPC] Dont look for class-code in pci children
  [POWERPC] Fix mdelay badness on shared processor partitions
  [POWERPC] disable floating point exceptions for init
  [POWERPC] Unify ppc syscall tables
  [POWERPC] mpic: add support for serial mode interrupts
  [POWERPC] pseries: Print PCI slot location code on failure
  [POWERPC] spufs: one more fix for 64k pages
  [POWERPC] spufs: fail spu_create with invalid flags
  [POWERPC] spufs: clear class2 interrupt status before wakeup
  [POWERPC] spufs: fix Makefile for "make clean"
  [POWERPC] spufs: remove stop_code from struct spu
  [POWERPC] spufs: fix spu irq affinity setting
  [POWERPC] spufs: further abstract priv1 register access
  [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
  [POWERPC] spufs: dont try to access SPE channel 1 count
  [POWERPC] spufs: use kzalloc in create_spu
  [POWERPC] spufs: fix initial state of wbox file
  ...

Manually resolved conflicts in:
	drivers/net/phy/Makefile
	include/asm-powerpc/spu.h
2006-06-22 22:11:30 -07:00
Ravikiran G Thirumalai
de047c1bcd [PATCH] avoid tasklist_lock at getrusage for multithreaded case too
Avoid taking tasklist_lock for at getrusage for the multithreaded case too.
We don't need to take the tasklist lock for thread traversal of a process
since Oleg's do-__unhash_process-under-siglock.patch and related work.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:57 -07:00
Andrew Morton
6cc0719181 [PATCH] suspend_console() warning fix
kernel/power/main.c: In function 'suspend_prepare':
kernel/power/main.c:89: warning: implicit declaration of function 'suspend_console'
kernel/power/main.c: In function 'suspend_finish':
kernel/power/main.c:137: warning: implicit declaration of function 'resume_console'

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:56 -07:00
Michael LeMay
d720024e94 [PATCH] selinux: add hooks for key subsystem
Introduce SELinux hooks to support the access key retention subsystem
within the kernel.  Incorporate new flask headers from a modified version
of the SELinux reference policy, with support for the new security class
representing retained keys.  Extend the "key_alloc" security hook with a
task parameter representing the intended ownership context for the key
being allocated.  Attach security information to root's default keyrings
within the SELinux initialization routine.

Has passed David's testsuite.

Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:55 -07:00
Linus Torvalds
d9eaec9e29 Merge branch 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (25 commits)
  [PATCH] make set_loginuid obey audit_enabled
  [PATCH] log more info for directory entry change events
  [PATCH] fix AUDIT_FILTER_PREPEND handling
  [PATCH] validate rule fields' types
  [PATCH] audit: path-based rules
  [PATCH] Audit of POSIX Message Queue Syscalls v.2
  [PATCH] fix se_sen audit filter
  [PATCH] deprecate AUDIT_POSSBILE
  [PATCH] inline more audit helpers
  [PATCH] proc_loginuid_write() uses simple_strtoul() on non-terminated array
  [PATCH] update of IPC audit record cleanup
  [PATCH] minor audit updates
  [PATCH] fix audit_krule_to_{rule,data} return values
  [PATCH] add filtering by ppid
  [PATCH] log ppid
  [PATCH] collect sid of those who send signals to auditd
  [PATCH] execve argument logging
  [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
  [PATCH] audit_panic() is audit-internal
  [PATCH] inotify (5/5): update kernel documentation
  ...

Manual fixup of conflict in unclude/linux/inotify.h
2006-06-20 15:37:56 -07:00
Linus Torvalds
2edc322d42 Merge git://git.infradead.org/~dwmw2/rbtree-2.6
* git://git.infradead.org/~dwmw2/rbtree-2.6:
  [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
  Update UML kernel/physmem.c to use rb_parent() accessor macro
  [RBTREE] Update hrtimers to use rb_parent() accessor macro.
  [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
  [RBTREE] Merge colour and parent fields of struct rb_node.
  [RBTREE] Remove dead code in rb_erase()
  [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
  [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
  [RBTREE] Update key.c to use rb_parent() accessor macro.
  [RBTREE] Update ext3 to use rb_parent() accessor macro.
  [RBTREE] Change rbtree off-tree marking in I/O schedulers.
  [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20 14:51:22 -07:00
Linus Torvalds
be967b7e2f Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (199 commits)
  [MTD] NAND: Fix breakage all over the place
  [PATCH] NAND: fix remaining OOB length calculation
  [MTD] NAND Fixup NDFC merge brokeness
  [MTD NAND] S3C2410 driver cleanup
  [MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation.
  [JFFS2] Check CRC32 on dirent and data nodes each time they're read
  [JFFS2] When retiring nextblock, allocate a node_ref for the wasted space
  [JFFS2] Mark XATTR support as experimental, for now
  [JFFS2] Don't trust node headers before the CRC is checked.
  [MTD] Restore MTD_ROM and MTD_RAM types
  [MTD] assume mtd->writesize is 1 for NOR flashes
  [MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles
  [MTD] Prepare physmap for 64-bit-resources
  [JFFS2] Fix more breakage caused by janitorial meddling.
  [JFFS2] Remove stray __exit from jffs2_compressors_exit()
  [MTD] Allow alternate JFFS2 mount variant for root filesystem.
  [MTD] Disconnect struct mtd_info from ABI
  [MTD] replace MTD_RAM with MTD_GENERIC_TYPE
  [MTD] replace MTD_ROM with MTD_GENERIC_TYPE
  [MTD] remove a forgotten MTD_XIP
  ...
2006-06-20 14:50:31 -07:00
Steve Grubb
41757106b9 [PATCH] make set_loginuid obey audit_enabled
Hi,

I was doing some testing and noticed that when the audit system was disabled,
I was still getting messages about the loginuid being set. The following patch
makes audit_set_loginuid look at in_syscall to determine if it should create
an audit event. The loginuid will continue to be set as long as there is a context.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:29 -04:00
Amy Griffis
9c937dcc71 [PATCH] log more info for directory entry change events
When an audit event involves changes to a directory entry, include
a PATH record for the directory itself.  A few other notable changes:

    - fixed audit_inode_child() hooks in fsnotify_move()
    - removed unused flags arg from audit_inode()
    - added audit log routines for logging a portion of a string

Here's some sample output.

before patch:
type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149821605.320:26):  cwd="/root"
type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

after patch:
type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149822032.332:24):  cwd="/root"
type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0
type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Amy Griffis
6a2bceec0e [PATCH] fix AUDIT_FILTER_PREPEND handling
Clear AUDIT_FILTER_PREPEND flag after adding rule to list.  This
fixes three problems when a rule is added with the -A syntax:

    - auditctl displays filter list as "(null)"
    - the rule cannot be removed using -d
    - a duplicate rule can be added with -a

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Al Viro
0a73dccc4f [PATCH] validate rule fields' types
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
Amy Griffis
f368c07d72 [PATCH] audit: path-based rules
In this implementation, audit registers inotify watches on the parent
directories of paths specified in audit rules.  When audit's inotify
event handler is called, it updates any affected rules based on the
filesystem event.  If the parent directory is renamed, removed, or its
filesystem is unmounted, audit removes all rules referencing that
inotify watch.

To keep things simple, this implementation limits location-based
auditing to the directory entries in an existing directory.  Given
a path-based rule for /foo/bar/passwd, the following table applies:

    passwd modified -- audit event logged
    passwd replaced -- audit event logged, rules list updated
    bar renamed     -- rule removed
    foo renamed     -- untracked, meaning that the rule now applies to
		       the new location

Audit users typically want to have many rules referencing filesystem
objects, which can significantly impact filtering performance.  This
patch also adds an inode-number-based rule hash to mitigate this
situation.

The patch is relative to the audit git tree:
http://kernel.org/git/?p=linux/kernel/git/viro/audit-current.git;a=summary
and uses the inotify kernel API:
http://lkml.org/lkml/2006/6/1/145

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
George C. Wilson
20ca73bc79 [PATCH] Audit of POSIX Message Queue Syscalls v.2
This patch adds audit support to POSIX message queues.  It applies cleanly to
the lspp.b15 branch of Al Viro's git tree.  There are new auxiliary data
structures, and collection and emission routines in kernel/auditsc.c.  New hooks
in ipc/mqueue.c collect arguments from the syscalls.

I tested the patch by building the examples from the POSIX MQ library tarball.
Build them -lrt, not against the old MQ library in the tarball.  Here's the URL:
http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz
Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive,
mq_notify, mq_getsetattr.  mq_unlink has no new hooks.  Please see the
corresponding userspace patch to get correct output from auditd for the new
record types.

[fixes folded]

Signed-off-by: George Wilson <ltcgcw@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:26 -04:00
Al Viro
014149cce1 [PATCH] deprecate AUDIT_POSSBILE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Al Viro
d8945bb51a [PATCH] inline more audit helpers
pull checks for ->audit_context into inlined wrappers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Linda Knippers
ac03221a4f [PATCH] update of IPC audit record cleanup
The following patch addresses most of the issues with the IPC_SET_PERM
records as described in:
https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html
and addresses the comments I received on the record field names.

To summarize, I made the following changes:

1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM
   record is emitted in the failure case as well as the success case.
   This matches the behavior in sys_shmctl().  I could simplify the
   code in sys_msgctl() and semctl_down() slightly but it would mean
   that in some error cases we could get an IPC_SET_PERM record
   without an IPC record and that seemed odd.

2. No change to the IPC record type, given no feedback on the backward
   compatibility question.

3. Removed the qbytes field from the IPC record.  It wasn't being
   set and when audit_ipc_obj() is called from ipcperms(), the
   information isn't available.  If we want the information in the IPC
   record, more extensive changes will be necessary.  Since it only
   applies to message queues and it isn't really permission related, it
   doesn't seem worth it.

4. Removed the obj field from the IPC_SET_PERM record.  This means that
   the kern_ipc_perm argument is no longer needed.

5. Removed the spaces and renamed the IPC_SET_PERM field names.  Replaced iuid and
   igid fields with ouid and ogid in the IPC record.

I tested this with the lspp.22 kernel on an x86_64 box.  I believe it
applies cleanly on the latest kernel.

-- ljk

Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:24 -04:00
Serge E. Hallyn
5d136a010d [PATCH] minor audit updates
Just a few minor proposed updates.  Only the last one will
actually affect behavior.  The rest are just misleading
code.

Several AUDIT_SET functions return 'old' value, but only
return value <0 is checked for.  So just return 0.

propagate audit_set_rate_limit and audit_set_backlog_limit
error values

In audit_buffer_free, the audit_freelist_count was being
incremented even when we discard the return buffer, so
audit_freelist_count can end up wrong.  This could cause
the actual freelist to shrink over time, eventually
threatening to degrate audit performance.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Amy Griffis
0a3b483e83 [PATCH] fix audit_krule_to_{rule,data} return values
Don't return -ENOMEM when callers of these functions are checking for
a NULL return.  Bug noticed by Serge Hallyn.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Al Viro
3c66251e57 [PATCH] add filtering by ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
f46038ff7d [PATCH] log ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
e1396065e0 [PATCH] collect sid of those who send signals to auditd
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
473ae30bc7 [PATCH] execve argument logging
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
9044e6bca5 [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
We should not send a pile of replies while holding audit_netlink_mutex
since we hold the same mutex when we receive commands.  As the result,
we can get blocked while sending and sit there holding the mutex while
auditctl is unable to send the next command and get around to receiving
what we'd sent.

Solution: create skb and put them into a queue instead of sending;
once we are done, send what we've got on the list.  The former can
be done synchronously while we are handling AUDIT_LIST or AUDIT_LIST_RULES;
we are holding audit_netlink_mutex at that point.  The latter is done
asynchronously and without messing with audit_netlink_mutex.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:20 -04:00
Amy Griffis
2d9048e201 [PATCH] inotify (1/5): split kernel API from userspace support
The following series of patches introduces a kernel API for inotify,
making it possible for kernel modules to benefit from inotify's
mechanism for watching inodes.  With these patches, inotify will
maintain for each caller a list of watches (via an embedded struct
inotify_watch), where each inotify_watch is associated with a
corresponding struct inode.  The caller registers an event handler and
specifies for which filesystem events their event handler should be
called per inotify_watch.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:17 -04:00
Linus Torvalds
557240b48e Add support for suspending and resuming the whole console subsystem
Trying to suspend/resume with console messages flying all around is
doomed to failure, when the devices that the messages are trying to
go to are being shut down.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-19 18:16:01 -07:00
Oleg Nesterov
f53ae1dc34 [PATCH] arm_timer: remove a racy and obsolete PF_EXITING check
arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state)
in run_posix_cpu_timers().

However, for some reason it does so only for CPUCLOCK_PERTHREAD
case (which is imho wrong).

Also, this check is not reliable, PF_EXITING could be set on
another cpu without any locks/barriers just after the check,
so it can't prevent from attaching the timer to the exiting
task.

The previous patch makes this check unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
30f1e3dd8c [PATCH] run_posix_cpu_timers: remove a bogus BUG_ON()
do_exit() clears ->it_##clock##_expires, but nothing prevents
another cpu to attach the timer to exiting process after that.
arm_timer() tries to protect against this race, but the check
is racy.

After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and
before do_exit() calls 'schedule() local timer interrupt can find
tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu
does sys_wait4) interrupted task has ->signal == NULL.

At this moment exiting task has no pending cpu timers, they were
cleanuped in __exit_signal()->posix_cpu_timers_exit{,_group}(),
so we can just return from irq.

John Stultz recently confirmed this bug, see

	http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
8f17fc20bf [PATCH] check_process_timers: fix possible lockup
If the local timer interrupt happens just after do_exit() sets PF_EXITING
(and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call
check_process_timers() with tasklist_lock + ->siglock held and

	check_process_timers:

		t = tsk;
		do {
			....

			do {
				t = next_thread(t);
			} while (unlikely(t->flags & PF_EXITING));
		} while (t != tsk);

the outer loop will never stop.

Actually, the window is bigger.  Another process can attach the timer
after ->it_xxx_expires was cleared (see the next commit) and the 'if
(PF_EXITING)' check in arm_timer() is racy (see the one after that).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Sam Ravnborg
b817f6feff kbuild: check license compatibility when building modules
Modules that uses GPL symbols can no longer be build with kbuild,
the build will fail during the modpost step.
When a GPL-incompatible module uses a EXPORT_SYMBOL_GPL_FUTURE symbol
then warn during modpost so author are actually notified.

The actual license compatibility check is shared with the kernel
to make sure it is in sync.

Patch originally from: Andreas Gruenbacher <agruen@suse.de> and
Ram Pai <linuxram@us.ibm.com>

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-06-09 21:53:55 +02:00
Anton Blanchard
651d765d0b [PATCH] Add a prctl to change the endianness of a process.
This new prctl is intended for changing the execution mode of the
processor, on processors that support both a little-endian mode and a
big-endian mode.  It is intended for use by programs such as
instruction set emulators (for example an x86 emulator on PowerPC),
which may find it convenient to use the processor in an alternate
endianness mode when executing translated instructions.

Note that this does not imply the existence of a fully-fledged ABI for
both endiannesses, or of compatibility code for converting system
calls done in the non-native endianness mode.  The program is expected
to arrange for all of its system call arguments to be presented in the
native endianness.

Switching between big and little-endian mode will require some care in
constructing the instruction sequence for the switch.  Generally the
instructions up to the instruction that invokes the prctl system call
will have to be in the old endianness, and subsequent instructions
will have to be in the new endianness.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-09 21:24:13 +10:00
Stephen Hemminger
8d16b76421 [PATCH] hrtimer: export symbols
From: Stephen Hemminger <shemminger@osdl.org>

I want to use the hrtimer's in the netem (Network Emulator) qdisc.  But the
necessary symbols aren't exported for module use.

Also needed by SystemTap.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Stone, Joshua I" <joshua.i.stone@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:11 -07:00
Linus Torvalds
f1adad78dd Revert "[PATCH] sched: fix interactive task starvation"
This reverts commit 5ce74abe78 (and its
dependent commit 8a5bc075b8), because of
audio underruns.

Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed
the exact cause of the underruns:

  "Audio underruns galore, with only ogg123 and firefox (browsing the
   GIT tree online is also a nice trigger by the way).

   If I back it out, everything is fine for me again."

Cc: Rene Herman <rene.herman@keyaccess.nl>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 18:54:09 -07:00
Zachary Amsden
0662b71322 [PATCH] Fix a NO_IDLE_HZ timer bug
Under certain timing conditions, a race during boot occurs where timer
ticks are being processed on remote CPUs.  The remote timer ticks can
increment jiffies, and if this happens during a window when a timeout is
very close to expiring but a local tick has not yet been delivered, you can
end up with

1) No softirq pending
2) A local timer wheel which is not synced to jiffies
3) No high resolution timer active
4) A local timer which is supposed to fire before the current jiffies value.

In this circumstance, the comparison in next_timer_interrupt overflows,
because the base of the comparison for high resolution timers is jiffies,
but for the softirq timer wheel, it is relative the the current base of the
wheel (jiffies_base).

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:21 -07:00
Paul Jackson
92d1dbd274 [PATCH] cpuset: might_sleep_if check in cpuset_zones_allowed
It's too easy to incorrectly call cpuset_zone_allowed() in an atomic
context without __GFP_HARDWALL set, and when done, it is not noticed until
a tight memory situation forces allocations to be tried outside the current
cpuset.

Add a 'might_sleep_if()' check, to catch this earlier on, instead of
waiting for a similar check in the mutex_lock() code, which is only rarely
invoked.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
Paul Jackson
36be57ffe3 [PATCH] cpuset: update cpuset_zones_allowed comment
Update the kernel/cpuset.c:cpuset_zone_allowed() comment.

The rule for when mm/page_alloc.c should call cpuset_zone_allowed()
was intended to be:

  Don't call cpuset_zone_allowed() if you can't sleep, unless you
  pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  the code that might scan up ancestor cpusets and sleep.

The explanation of this rule in the comment above cpuset_zone_allowed() was
stale, as a result of a restructuring of some __alloc_pages() code in
November 2005.

Rewrite that comment ...

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
David Woodhouse
18594822fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-16 01:19:52 +01:00
Trent Piepho
5e37661389 [PATCH] symbol_put_addr() locks kernel
Even since a previous patch:

Fix race between CONFIG_DEBUG_SLABALLOC and modules
Sun, 27 Jun 2004 17:55:19 +0000 (17:55 +0000)
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=92b3db26d31cf21b70e3c1eadc56c179506d8fbe

The function symbol_put_addr() will deadlock the kernel.

symbol_put_addr() would acquire modlist_lock, then while holding the lock call
two functions kernel_text_address() and module_text_address() which also try
to acquire the same lock.  This deadlocks the kernel of course.

This patch changes symbol_put_addr() to not acquire the modlist_lock, it
doesn't need it since it never looks at the module list directly.  Also, it
now uses core_kernel_text() instead of kernel_text_address().  The latter has
an additional check for addr inside a module, but we don't need to do that
since we call module_text_address() (the same function kernel_text_address
uses) ourselves.

Signed-off-by: Trent Piepho <xyzzy@speakeasy.org>
Cc: Zwane Mwaikambo <zwane@fsmlabs.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Johannes Stezenbach <js@linuxtv.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Heiko Carstens
986733e01d [PATCH] RCU: introduce rcu_needs_cpu() interface
With "Paul E. McKenney" <paulmck@us.ibm.com>

Introduce rcu_needs_cpu() interface.  This can be used to tell if there
will be a new rcu batch on a cpu soon by looking at the curlist pointer.
This can be used to avoid to enter a tickless idle state where the cpu
would miss that a new batch is ready when rcu_start_batch would be called
on a different cpu.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Linus Torvalds
f358166a94 ptrace_attach: fix possible deadlock schenario with irqs
Eric Biederman points out that we can't take the task_lock while holding
tasklist_lock for writing, because another CPU that holds the task lock
might take an interrupt that then tries to take tasklist_lock for writing.

Which would be a nasty deadlock, with one CPU spinning forever in an
interrupt handler (although admittedly you need to really work at
triggering it ;)

Since the ptrace_attach() code is special and very unusual, just make it
be extra careful, and use trylock+repeat to avoid the possible deadlock.

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-11 11:08:49 -07:00
David Woodhouse
6f18a022fb Finally remove the obnoxious inter_module_xxx()
This was already a bad plan when I argued against adding it in the first
place. Good riddance.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-08 22:40:05 +01:00
Linus Torvalds
f5b40e363a Fix ptrace_attach()/ptrace_traceme()/de_thread() race
This holds the task lock (and, for ptrace_attach, the tasklist_lock)
over the actual attach event, which closes a race between attacking to a
thread that is either doing a PTRACE_TRACEME or getting de-threaded.

Thanks to Oleg Nesterov for reminding me about this, and Chris Wright
for noticing a lost return value in my first version.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-07 10:49:33 -07:00
Steve Grubb
2ad312d209 [PATCH] Audit Filter Performance
While testing the watch performance, I noticed that selinux_task_ctxid()
was creeping into the results more than it should. Investigation showed
that the function call was being called whether it was needed or not. The
below patch fixes this.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:07 -04:00
Steve Grubb
073115d6b2 [PATCH] Rework of IPC auditing
1) The audit_ipc_perms() function has been split into two different
functions:
        - audit_ipc_obj()
        - audit_ipc_set_perm()

There's a key shift here...  The audit_ipc_obj() collects the uid, gid,
mode, and SElinux context label of the current ipc object.  This
audit_ipc_obj() hook is now found in several places.  Most notably, it
is hooked in ipcperms(), which is called in various places around the
ipc code permforming a MAC check.  Additionally there are several places
where *checkid() is used to validate that an operation is being
performed on a valid object while not necessarily having a nearby
ipcperms() call.  In these locations, audit_ipc_obj() is called to
ensure that the information is captured by the audit system.

The audit_set_new_perm() function is called any time the permissions on
the ipc object changes.  In this case, the NEW permissions are recorded
(and note that an audit_ipc_obj() call exists just a few lines before
each instance).

2) Support for an AUDIT_IPC_SET_PERM audit message type.  This allows
for separate auxiliary audit records for normal operations on an IPC
object and permissions changes.  Note that the same struct
audit_aux_data_ipcctl is used and populated, however there are separate
audit_log_format statements based on the type of the message.  Finally,
the AUDIT_IPC block of code in audit_free_aux() was extended to handle
aux messages of this new type.  No more mem leaks I hope ;-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:04 -04:00
Steve Grubb
ce29b682e2 [PATCH] More user space subject labels
Hi,

The patch below builds upon the patch sent earlier and adds subject label to
all audit events generated via the netlink interface. It also cleans up a few
other minor things.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:01 -04:00
Steve Grubb
e7c3497013 [PATCH] Reworked patch for labels on user space messages
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.

[updated:
>  Stephen Smalley also wanted to change a variable from isec to tsec in the
>  user sid patch.                                                              ]

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:58 -04:00
Steve Grubb
9c7aa6aa74 [PATCH] change lspp ipc auditing
Hi,

The patch below converts IPC auditing to collect sid's and convert to context
string only if it needs to output an audit record. This patch depends on the
inode audit change patch already being applied.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:56 -04:00
Steve Grubb
1b50eed9ca [PATCH] audit inode patch
Previously, we were gathering the context instead of the sid. Now in this patch,
we gather just the sid and convert to context only if an audit event is being
output.

This patch brings the performance hit from 146% down to 23%

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:53 -04:00
Darrel Goeddel
3dc7e3153e [PATCH] support for context based audit filtering, part 2
This patch provides the ability to filter audit messages based on the
elements of the process' SELinux context (user, role, type, mls sensitivity,
and mls clearance).  It uses the new interfaces from selinux to opaquely
store information related to the selinux context and to filter based on that
information.  It also uses the callback mechanism provided by selinux to
refresh the information when a new policy is loaded.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:36 -04:00
Al Viro
97e94c4530 [PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:21 -04:00
Al Viro
5411be59db [PATCH] drop task argument of audit_syscall_{entry,exit}
... it's always current, and that's a good thing - allows simpler locking.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:18 -04:00
Al Viro
e495149b17 [PATCH] drop gfp_mask in audit_log_exit()
now we can do that - all callers are process-synchronous and do not hold
any locks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:16 -04:00
Al Viro
fa84cb935d [PATCH] move call of audit_free() into do_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:13 -04:00
Al Viro
45d9bb0e37 [PATCH] deal with deadlocks in audit_free()
Don't assume that audit_log_exit() et.al. are called for the context of
current; pass task explictly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:07 -04:00
Andrew Morton
13e87ec686 [PATCH] request_irq(): remove warnings from irq probing
- Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning.
  Some drivers like to use request_irq() to find an unused interrupt slot.

- Use it in i82365.c

- Kill unused SA_PROBE.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
dean gaudet
47bb789973 [PATCH] off-by-1 in kernel/power/main.c
There's an off-by-1 in kernel/power/main.c:state_store() ...  if your
kernel just happens to have some non-zero data at pm_states[PM_SUSPEND_MAX]
(i.e.  one past the end of the array) then it'll let you write anything you
want to /sys/power/state and in response the box will enter S5.

Signed-off-by: dean gaudet <dean@arctic.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
Chandra Seetharaman
83d722f7e1 [PATCH] Remove __devinit and __cpuinit from notifier_call definitions
Few of the notifier_chain_register() callers use __init in the definition
of notifier_call.  It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).

This patch fixes all such usages to _not_ have the notifier_call __init
section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:30:03 -07:00
Chandra Seetharaman
649bbaa484 [PATCH] Remove __devinitdata from notifier block definitions
Few of the notifier_chain_register() callers use __devinitdata in the
definition of notifier_block data structure.  It is incorrect as the
data structure should be available after the initializations (they do
not unregister them during initializations).

This was leading to an oops when notifier_chain_register() call is
invoked for those callback chains after initialization.

This patch fixes all such usages to _not_ have the notifier_block data
structure in the init data section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:27:50 -07:00
David Woodhouse
ed198cb497 [RBTREE] Update hrtimers to use rb_parent() accessor macro.
Also switch it to use the same method of using off-tree nodes as
everyone else now does -- set them to point to themselves.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-22 02:38:50 +01:00
Linus Torvalds
402a26f0c0 Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] block/elevator.c: remove unused exports
  [PATCH] splice: fix smaller sized splice reads
  [PATCH] Don't inherit ->splice_pipe across forks
  [patch] cleanup: use blk_queue_stopped
  [PATCH] Document online io scheduler switching
2006-04-20 08:17:04 -07:00
Ananth N Mavinakayanahalli
7522a8423b [PATCH] kprobes: NULL out non-relevant fields in struct kretprobe
In cases where a struct kretprobe's *_handler fields are non-NULL, it is
possible to cause a system crash, due to the possibility of calls ending up
in zombie functions.  Documentation clearly states that unused *_handlers
should be set to NULL, but kprobe users sometimes fail to do so.

Fix it by setting the non-relevant fields of the struct kretprobe to NULL.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:03 -07:00
Jens Axboe
a0aa7f68af [PATCH] Don't inherit ->splice_pipe across forks
It's really task private, so clear that field on fork after copying
task structure.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-20 13:05:33 +02:00
OGAWA Hirofumi
5a7b46b369 [PATCH] Add more prevent_tail_call()
Those also break userland regs like following.

   00000000 <sys_chown16>:
      0:	0f b7 44 24 0c       	movzwl 0xc(%esp),%eax
      5:	83 ca ff             	or     $0xffffffff,%edx
      8:	0f b7 4c 24 08       	movzwl 0x8(%esp),%ecx
      d:	66 83 f8 ff          	cmp    $0xffffffff,%ax
     11:	0f 44 c2             	cmove  %edx,%eax
     14:	66 83 f9 ff          	cmp    $0xffffffff,%cx
     18:	0f 45 d1             	cmovne %ecx,%edx
     1b:	89 44 24 0c          	mov    %eax,0xc(%esp)
     1f:	89 54 24 08          	mov    %edx,0x8(%esp)
     23:	e9 fc ff ff ff       	jmp    24 <sys_chown16+0x24>

where the tailcall at the end overwrites the incoming stack-frame.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
[ I would _really_ like to have a way to tell gcc about calling
  conventions. The "prevent_tail_call()" macro is pretty ugly ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 16:27:18 -07:00
Rafael J. Wysocki
4a3b98a422 [PATCH] swsusp: prevent possible image corruption on resume
The function free_pagedir() used by swsusp for freeing its internal data
structures clears the PG_nosave and PG_nosave_free flags for each page
being freed.

However, during resume PG_nosave_free set means that the page in
question is "unsafe" (ie.  it will be overwritten in the process of
restoring the saved system state from the image), so it should not be
used for the image data.

Therefore free_pagedir() should not clear PG_nosave_free if it's called
during resume (otherwise "unsafe" pages freed by it may be used for
storing the image data and the data may get corrupted later on).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
5e85d4abe3 [PATCH] task: Make task list manipulations RCU safe
While we can currently walk through thread groups, process groups, and
sessions with just the rcu_read_lock, this opens the door to walking the
entire task list.

We already have all of the other RCU guarantees so there is no cost in
doing this, this should be enough so that proc can stop taking the
tasklist lock during readdir.

prev_task was killed because it has no users, and using it will miss new
tasks when doing an rcu traversal.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
64541d1970 [PATCH] kill unushed __put_task_struct_cb
Somehow in the midst of dotting i's and crossing t's during
the merge up to rc1 we wound up keeping __put_task_struct_cb
when it should have been killed as it no longer has any users.
Sorry I probably should have caught this while it was
still in the -mm tree.

Having the old code there gets confusing when reading
through the code and trying to understand what is
happening.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 17:43:57 -07:00
Adrian Bunk
78a596b449 [PATCH] remove kernel/power/pm.c:pm_unregister()
Since the last user is removed in -mm, we can now remove this long deprecated
function.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14 12:25:26 -07:00
Roland McGrath
e57a505984 [PATCH] fix non-leader exec under ptrace
This reverts most of commit 30e0fca6c1.
It broke the case of non-leader MT exec when ptraced.
I think the bug it was intended to fix was already addressed by commit
788e05a67c.

Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 08:59:13 -07:00
Oleg Nesterov
a145410dcc [PATCH] __group_complete_signal: remove bogus BUG_ON
Commit e56d090310

   [PATCH] RCU signal handling

made this BUG_ON() unsafe. This code runs under ->siglock,
while switch_exec_pids() takes tasklist_lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 07:34:01 -07:00
Linus Torvalds
88dd9c16ce Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] vfs: add splice_write and splice_read to documentation
  [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
  [PATCH] splice: warning fix
  [PATCH] another round of fs/pipe.c cleanups
  [PATCH] splice: comment styles
  [PATCH] splice: add Ingo as addition copyright holder
  [PATCH] splice: unlikely() optimizations
  [PATCH] splice: speedups and optimizations
  [PATCH] pipe.c/fifo.c code cleanups
  [PATCH] get rid of the PIPE_*() macros
  [PATCH] splice: speedup __generic_file_splice_read
  [PATCH] splice: add direct fd <-> fd splicing support
  [PATCH] splice: add optional input and output offsets
  [PATCH] introduce a "kernel-internal pipe object" abstraction
  [PATCH] splice: be smarter about calling do_page_cache_readahead()
  [PATCH] splice: optimize the splice buffer mapping
  [PATCH] splice: cleanup __generic_file_splice_read()
  [PATCH] splice: only call wake_up_interruptible() when we really have to
  [PATCH] splice: potential !page dereference
  [PATCH] splice: mark the io page as accessed
2006-04-11 06:34:02 -07:00
Joe Korty
5ef37b1964 [PATCH] add cpu_relax to hrtimer_cancel
Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel().

Signed-off-by: Joe Korty <joe.korty@ccur.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:42 -07:00
Christoph Hellwig
d824e66a9a [PATCH] build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is set
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:41 -07:00
Adrian Bunk
aa7271076a [PATCH] the scheduled unexport of panic_timeout
Implement the scheduled unexport of panic_timeout.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Andrew Morton
ba6edfcd17 [PATCH] timer initialisation fix
We need the boot CPU's tvec_bases[] entry to be initialised super-early in
boot, for early_serial_setup().  That runs within setup_arch(), before even
per-cpu areas are initialised.

The patch changes tvec_bases to use compile-time initialisation, and adds a
separate array `tvec_base_done' to keep track of which CPU has had its
tvec_bases[] entry initialised (because we can no longer use the zeroness of
that tvec_bases[] entry to determine whether it has been initialised).

Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this.

Cc: Eugene Surovegin <ebs@ebshome.net>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Hyok S. Choi
3016b42153 [PATCH] frv: define MMU mode specific syscalls as 'cond_syscall' and clean up unneeded macros
For some architectures, a few syscalls are not linked in noMMU mode.  In
that case, the MMU depending syscalls are needed to be defined as
'cond_syscall'.  For example, ARM architecture selectively links sys_mlock
by the mode configuration.

In case of FRV, it has been managed by #ifdef CONFIG_MMU macro in
arch/frv/kernel/entry.S.  However these conditional macros are just
duplicates if they were defined as cond_syscall.  Compilation test is done
with FRV toolchains for both of MMU and noMMU mode.

Signed-off-by: Hyok S. Choi <hyok.choi@samsung.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:33 -07:00
Mike Galbraith
8a5bc075b8 [PATCH] sched: don't awaken RT tasks on expired array
RT tasks are being awakened on the expired array when expired_starving() is
true, whereas they really should be excluded.  Fix.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Mike Galbraith
5ce74abe78 [PATCH] sched: fix interactive task starvation
Fix a starvation problem that occurs when a stream of highly interactive tasks
delay an array switch for extended periods despite EXPIRED_STARVING(rq) being
true.  AFAIKT, the only choice is to enqueue awakening tasks on the expired
array in this case.

Without this patch, it can be nearly impossible to remotely login to a busy
server, and interactive shell commands can starve for minutes.

Also, convert the EXPIRED_STARVING macro into an inline function which humans
can understand.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Jens Axboe
b92ce55893 [PATCH] splice: add direct fd <-> fd splicing support
It's more efficient for sendfile() emulation. Basically we cache an
internal private pipe and just use that as the intermediate area for
pages. Direct splicing is not available from sys_splice(), it is only
meant to be used for sendfile() emulation.

Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
exit for the normal fast path.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-11 13:52:07 +02:00
Jordan Hargrave
b20367a6c2 [PATCH] x86_64: Fix drift with HPET timer enabled
If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).

If HZ changes, this drift can become even more pronounced.

HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.

Vojtech comments:

  "It's not entirely correct (it assumes the HPET ticks totally
   exactly), but it's significantly better than assuming the PIT error
   there."

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09 11:53:53 -07:00
Eric Sesterhenn
9f31252cb6 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:45:55 +02:00
Eric Sesterhenn
fda8bd78a1 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:44:47 +02:00
Eric Sesterhenn
524223ca81 BUG_ON() Conversion in kernel/ptrace.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:43:40 +02:00
Kalin KOZHUHAROV
8ba8e95ed1 Fix comments: s/granuality/granularity/
I was grepping through the code and some `grep ganularity -R .` didn't
catch what I thought. Then looking closer I saw the term "granuality"
used in only four places (in comments) and granularity in many more
places describing the same idea. Some other facts:

dictionary.com does not know such a word
define:granuality on google is not found (and pages for granuality are
mostly related to patches to the kernel)
it has not been discussed as a term on LKML, AFAICS (=Can Search)

To be consistent, I think granularity should be used everywhere.

Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:41:22 +02:00
Eric Sesterhenn
8abd8e298e BUG_ON() Conversion in kernel/printk.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:21:17 +02:00
Adrian Bunk
3e6e952d1d help text: SOFTWARE_SUSPEND doesn't need ACPI
The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays
the information that it doesn't need ACPI, too, is even more helpful.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:03:08 +02:00
Kirill Korotaev
4286229868 [PATCH] wrong error path in dup_fd() leading to oopses in RCU
Wrong error path in dup_fd() - it should return NULL on error,
not an address of already freed memory :/

Triggered by OpenVZ stress test suite.

What is interesting is that it was causing different oopses in RCU like
below:
Call Trace:
   [<c013492c>] rcu_do_batch+0x2c/0x80
   [<c0134bdd>] rcu_process_callbacks+0x3d/0x70
   [<c0126cf3>] tasklet_action+0x73/0xe0
   [<c01269aa>] __do_softirq+0x10a/0x130
   [<c01058ff>] do_softirq+0x4f/0x60
   =======================
   [<c0113817>] smp_apic_timer_interrupt+0x77/0x110
   [<c0103b54>] apic_timer_interrupt+0x1c/0x24
  Code:  Bad EIP value.
   <0>Kernel panic - not syncing: Fatal exception in interrupt

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Dmitry Mishin <dim@openvz.org>
Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:25:46 -08:00
Eric W. Biederman
92476d7fc0 [PATCH] pidhash: Refactor the pid hash table
Simplifies the code, reduces the need for 4 pid hash tables, and makes the
code more capable.

In the discussions I had with Oleg it was felt that to a large extent the
cleanup itself justified the work.  With struct pid being dynamically
allocated meant we could create the hash table entry when the pid was
allocated and free the hash table entry when the pid was freed.  Instead of
playing with the hash lists when ever a process would attach or detach to a
process.

For myself the fact that it gave what my previous task_ref patch gave for free
with simpler code was a big win.  The problem is that if you hold a reference
to struct task_struct you lock in 10K of low memory.  If you do that in a user
controllable way like /proc does, with an unprivileged but hostile user space
application with typical resource limits of 1000 fds and 100 processes I can
trigger the OOM killer by consuming all of low memory with task structs, on a
machine wight 1GB of low memory.

If I instead hold a reference to struct pid which holds a pointer to my
task_struct, I don't suffer from that problem because struct pid is 2 orders
of magnitude smaller.  In fact struct pid is small enough that most other
kernel data structures dwarf it, so simply limiting the number of referring
data structures is enough to prevent exhaustion of low memory.

This splits the current struct pid into two structures, struct pid and struct
pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
struct pid_link is the per process linkage into the hash tables and lives in
struct task_struct.  struct pid is given an indepedent lifetime, and holds
pointers to each of the pid types.

The independent life of struct pid simplifies attach_pid, and detach_pid,
because we are always manipulating the list of pids and not the hash table.
In addition in giving struct pid an indpendent life it makes the concept much
more powerful.

Kernel data structures can now embed a struct pid * instead of a pid_t and
not suffer from pid wrap around problems or from keeping unnecessarily
large amounts of memory allocated.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:00 -08:00
Eric W. Biederman
8c7904a00b [PATCH] task: RCU protect task->usage
A big problem with rcu protected data structures that are also reference
counted is that you must jump through several hoops to increase the reference
count.  I think someone finally implemented atomic_inc_not_zero(&count) to
automate the common case.  Unfortunately this means you must special case the
rcu access case.

When data structures are only visible via rcu in a manner that is not
determined by the reference count on the object (i.e.  tasks are visible until
their zombies are reaped) there is a much simpler technique we can employ.
Simply delaying the decrement of the reference count until the rcu interval is
over.

What that means is that the proc code that looks up a task and later
wants to sleep can now do:

rcu_read_lock();
task = find_task_by_pid(some_pid);
if (task) {
	get_task_struct(task);
}
rcu_read_unlock();

The effect on the rest of the kernel is that put_task_struct becomes cheaper
and immediate, and in the case where the task has been reaped it frees the
task immediate instead of unnecessarily waiting an until the rcu interval is
over.

Cleanup of task_struct does not happen when its reference count drops to
zero, instead cleanup happens when release_task is called.  Tasks can only
be looked up via rcu before release_task is called.  All rcu protected
members of task_struct are freed by release_task.

Therefore we can move call_rcu from put_task_struct into release_task.  And
we can modify release_task to not immediately release the reference count
but instead have it call put_task_struct from the function it gives to
call_rcu.

The end result:

- get_task_struct is safe in an rcu context where we have just looked
  up the task.

- put_task_struct() simplifies into its old pre rcu self.

This reorganization also makes put_task_struct uncallable from modules as
it is not exported but it does not appear to be called from any modules so
this should not be an issue, and is trivially fixed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Andrew Morton
158d9ebd19 [PATCH] resurrect __put_task_struct
This just got nuked in mainline.  Bring it back because Eric's patches use it.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Eric W. Biederman
390e2ff077 [PATCH] Make setsid() more robust
The core problem: setsid fails if it is called by init.  The effect in 2.6.16
and the earlier kernels that have this problem is that if you do a "ps -j 1 or
ps -ej 1" you will see that init and several of it's children have process
group and session == 0.  Instead of process group == session == 1.  Despite
init calling setsid.

The reason it fails is that daemonize calls set_special_pids(1,1) on kernel
threads that are launched before /sbin/init is called.

The only remaining effect in that current->signal->leader == 0 for init
instead of 1.  And the setsid call fails.  No one has noticed because
/sbin/init does not check the return value of setsid.

In 2.4 where we don't have the pidhash table, and daemonize doesn't exist
setsid actually works for init.

I care a lot about pid == 1 not being a special case that we leave broken,
because of the container/jail work that I am doing.

- Carefully allow init (pid == 1) to call setsid despite the kernel using
  its session.

- Use find_task_by_pid instead of find_pid because find_pid taking a
  pidtype is going away.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Thomas Gleixner
9741ef964d [PATCH] futex: check and validate timevals
The futex timeval is not checked for correctness.  The change does not
break existing applications as the timeval is supplied by glibc (and glibc
always passes a correct value), but the glibc-internal tests for this
functionality fail.

Signed-off-by: Thomas Gleixner <tglx@tglx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
d425b274ba [PATCH] sched: activate SCHED BATCH expired
To increase the strength of SCHED_BATCH as a scheduling hint we can
activate batch tasks on the expired array since by definition they are
latency insensitive tasks.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
7c4bb1f9b3 [PATCH] sched: remove on runqueue requeueing
On runqueue time is used to elevate priority in schedule().

In the code it currently requeues tasks even if their priority is not
elevated, which would end up placing them at the end of their runqueue
array effectively delaying them instead of improving their priority.

Bug spotted by Mike Galbraith <efault@gmx.de>

This patch removes this requeueing.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
5138930e6a [PATCH] sched: include noninteractive sleep in idle detect
Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so
they need to be included in the idle detection code.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
e72ff0bb2c [PATCH] sched: dont decrease idle sleep avg
We watch for tasks that sleep extended periods and don't allow one single
prolonged sleep period from elevating priority to maximum bonus to prevent cpu
bound tasks from getting high priority with single long sleeps.  There is a
bug in the current code that also penalises tasks that already have high
priority.  Correct that bug.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Con Kolivas
e7c38cb49c [PATCH] sched: make task_noninteractive use sleep_type
Alterations to the pipe code in the kernel made it possible for relative
starvation to occur with tasks that slept waiting on a pipe getting unfair
priority bonuses even if they were otherwise fully cpu bound so the
TASK_NONINTERACTIVE flag was introduced which prevented any change to
sleep_avg while sleeping waiting on a pipe.  This change also leads to the
converse though, preventing any priority boost from occurring in truly
interactive tasks that wait on pipes.

Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE
which will allow a linear bonus to priority based on sleep time thus allowing
interactive tasks to get high priority if they sleep enough.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00