android_kernel_xiaomi_sm8350

Author	SHA1	Message	Date
Ingo Molnar	b29739f902	[PATCH] pi-futex: scheduler support for pi Add framework to boost/unboost the priority of RT tasks. This consists of: - caching the 'normal' priority in ->normal_prio - providing a functions to set/get the priority of the task - make sched_setscheduler() aware of boosting The effective_prio() cleanups also fix a priority-calculation bug pointed out by Andrey Gelman, in set_user_nice(). has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrey Gelman <agelman@012.net.il> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:46 -07:00
Ingo Molnar	e2970f2fb6	[PATCH] pi-futex: futex code cleanups We are pleased to announce "lightweight userspace priority inheritance" (PI) support for futexes. The following patchset and glibc patch implements it, ontop of the robust-futexes patchset which is included in 2.6.16-mm1. We are calling it lightweight for 3 reasons: - in the user-space fastpath a PI-enabled futex involves no kernel work (or any other PI complexity) at all. No registration, no extra kernel calls - just pure fast atomic ops in userspace. - in the slowpath (in the lock-contention case), the system call and scheduling pattern is in fact better than that of normal futexes, due to the 'integrated' nature of FUTEX_LOCK_PI. [more about that further down] - the in-kernel PI implementation is streamlined around the mutex abstraction, with strict rules that keep the implementation relatively simple: only a single owner may own a lock (i.e. no read-write lock support), only the owner may unlock a lock, no recursive locking, etc. Priority Inheritance - why, oh why??? ------------------------------------- Many of you heard the horror stories about the evil PI code circling Linux for years, which makes no real sense at all and is only used by buggy applications and which has horrible overhead. Some of you have dreaded this very moment, when someone actually submits working PI code ;-) So why would we like to see PI support for futexes? We'd like to see it done purely for technological reasons. We dont think it's a buggy concept, we think it's useful functionality to offer to applications, which functionality cannot be achieved in other ways. We also think it's the right thing to do, and we think we've got the right arguments and the right numbers to prove that. We also believe that we can address all the counter-arguments as well. For these reasons (and the reasons outlined below) we are submitting this patch-set for upstream kernel inclusion. What are the benefits of PI? The short reply: ---------------- User-space PI helps achieving/improving determinism for user-space applications. In the best-case, it can help achieve determinism and well-bound latencies. Even in the worst-case, PI will improve the statistical distribution of locking related application delays. The longer reply: ----------------- Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we can see it in the kernel [which is a quite complex program in itself], lockless structures are rather the exception than the norm - the current ratio of lockless vs. locky code for shared data structures is somewhere between 1:10 and 1:100. Lockless is hard, and the complexity of lockless algorithms often endangers to ability to do robust reviews of said code. I.e. critical RT apps often choose lock structures to protect critical data structures, instead of lockless algorithms. Furthermore, there are cases (like shared hardware, or other resource limits) where lockless access is mathematically impossible. Media players (such as Jack) are an example of reasonable application design with multiple tasks (with multiple priority levels) sharing short-held locks: for example, a highprio audio playback thread is combined with medium-prio construct-audio-data threads and low-prio display-colory-stuff threads. Add video and decoding to the mix and we've got even more priority levels. So once we accept that synchronization objects (locks) are an unavoidable fact of life, and once we accept that multi-task userspace apps have a very fair expectation of being able to use locks, we've got to think about how to offer the option of a deterministic locking implementation to user-space. Most of the technical counter-arguments against doing priority inheritance only apply to kernel-space locks. But user-space locks are different, there we cannot disable interrupts or make the task non-preemptible in a critical section, so the 'use spinlocks' argument does not apply (user-space spinlocks have the same priority inversion problems as other user-space locking constructs). Fact is, pretty much the only technique that currently enables good determinism for userspace locks (such as futex-based pthread mutexes) is priority inheritance: Currently (without PI), if a high-prio and a low-prio task shares a lock [this is a quite common scenario for most non-trivial RT applications], even if all critical sections are coded carefully to be deterministic (i.e. all critical sections are short in duration and only execute a limited number of instructions), the kernel cannot guarantee any deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely. Implementation: --------------- As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to normal futex-based locks: a 0 value means unlocked, and a value==TID means locked. (This is the same method as used by list-based robust futexes.) Userspace uses atomic ops to lock/unlock these mutexes without entering the kernel. To handle the slowpath, we have added two new futex ops: FUTEX_LOCK_PI FUTEX_UNLOCK_PI If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work: if there is no futex-queue attached to the futex address yet then the code looks up the task that owns the futex [it has put its own TID into the futex value], and attaches a 'PI state' structure to the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, kernel-based synchronization object. The 'other' task is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex acquired, and it sets the futex value to its own TID and returns. Userspace has no other work to perform - it now owns the lock, and futex value contains FUTEX_WAITERS\|TID. If the unlock side fastpath succeeds, [i.e. userspace manages to do a TID -> 0 atomic transition of the futex value], then no kernel work is triggered. If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes up any potential waiters. Note that under this approach, contrary to other PI-futex approaches, there is no prior 'registration' of a PI-futex. [which is not quite possible anyway, due to existing ABI properties of pthread mutexes.] Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex. glibc support: -------------- Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too, no additional kernel changes are needed for that). [NOTE: The glibc patch is obviously inofficial and unsupported without matching upstream kernel functionality.] the patch-queue and the glibc patch can also be downloaded from: http://redhat.com/~mingo/PI-futex-patches/ Many thanks go to the people who helped us create this kernel feature: Steven Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan van de Ven, Oleg Nesterov and others. Credits for related prior projects goes to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others. Clean up the futex code, before adding more features to it: - use u32 as the futex field type - that's the ABI - use __user and pointers to u32 instead of unsigned long - code style / comment style cleanups - rename hash-bucket name from 'bh' to 'hb'. I checked the pre and post futex.o object files to make sure this patch has no code effects. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:46 -07:00
Steven Rostedt	66e5393a78	[PATCH] BUG() if setscheduler is called from interrupt context Thomas Gleixner is adding the call to a rtmutex function in setscheduler. This call grabs a spin_lock that is not always protected by interrupts disabled. So this means that setscheduler cant be called from interrupt context. To prevent this from happening in the future, this patch adds a BUG_ON(in_interrupt()) in that function. (Thanks to akpm <aka. Andrew Morton> for this suggestion). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:46 -07:00
Oleg Nesterov	9fea80e4d9	[PATCH] sched: uninline task_rq_lock() Saves 543 bytes from sched.o (gcc 3.3.3). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Siddha, Suresh B	5c45bf279d	[PATCH] sched: mc/smt power savings sched policy sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in /sys/devices/system/cpu/ control the MC/SMT power savings policy for the scheduler. Based on the values (1-enable, 0-disable) for these controls, sched groups cpu power will be determined for different domains. When power savings policy is enabled and under light load conditions, scheduler will minimize the physical packages/cpu cores carrying the load and thus conserving power(with a perf impact based on the workload characteristics... see OLS 2005 CMP kernel scheduler paper for more details..) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri	369381694d	[PATCH] sched_domai: Allocate sched_group structures dynamically As explained here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2 there is a problem with sharing sched_group structures between two separate sched_group structures for different sched_domains. The patch has been tested and found to avoid the kernel lockup problem described in above URL. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri	15f0b676a4	[PATCH] sched_domai: Use kmalloc_node The sched group structures used to represent various nodes need to be allocated from respective nodes (as suggested here also: http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html) Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri	d3a5aa9858	[PATCH] sched_domai: Don't use GFP_ATOMIC Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based allocation. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri	51888ca25a	[PATCH] sched_domain: handle kmalloc failure Try to handle mem allocation failures in build_sched_domains by bailing out and cleaning up thus-far allocated memory. The patch has a direct consequence that we disable load balancing completely (even at sibling level) upon any memory allocation failure. [Lee.Schermerhorn@hp.com: bugfix] Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:45 -07:00
Peter Williams	615052dc3b	[PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks() Problem: To help distribute high priority tasks evenly across the available CPUs move_tasks() does not, under some circumstances, skip tasks whose load weight is bigger than the designated amount. Because the highest priority task on the busiest queue may be on the expired array it may be moved as a result of this mechanism. Apart from not being the most desirable way to redistribute the high priority tasks (we'd rather move the second highest priority task), there is a risk that this could set up a loop with this task bouncing backwards and forwards between the two queues. (This latter possibility can be demonstrated by running a nice==-20 CPU bound task on an otherwise quiet 2 CPU system.) Solution: Modify the mechanism so that it does not override skip for the highest priority task on the CPU. Of course, if there are more than one tasks at the highest priority then it will allow the override for one of them as this is a desirable redistribution of high priority tasks. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Peter Williams	50ddd96917	[PATCH] sched: modify move_tasks() to improve load balancing outcomes Problem: The move_tasks() function is designed to move UP TO the amount of load it is asked to move and in doing this it skips over tasks looking for ones whose load weights are less than or equal to the remaining load to be moved. This is (in general) a good thing but it has the unfortunate result of breaking one of the original load balancer's good points: namely, that (within the limits imposed by the active/expired array model and the fact the expired is processed first) it moves high priority tasks before low priority ones and this means there's a good chance (see active/expired problem for why it's only a chance) that the highest priority task on the queue but not actually on the CPU will be moved to the other CPU where (as a high priority task) it may preempt the current task. Solution: Modify move_tasks() so that high priority tasks are not skipped when moving them will make them the highest priority task on their new run queue. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Peter Williams	2dd73a4f09	[PATCH] sched: implement smpnice Problem: The introduction of separate run queues per CPU has brought with it "nice" enforcement problems that are best described by a simple example. For the sake of argument suppose that on a single CPU machine with a nice==19 hard spinner and a nice==0 hard spinner running that the nice==0 task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and 2 nice==0 hard spinners running. The user of this system would be entitled to expect that the nice==0 tasks each get 95% of a CPU and the nice==19 tasks only get 5% each. However, whether this expectation is met is pretty much down to luck as there are four equally likely distributions of the tasks to the CPUs that the load balancing code will consider to be balanced with loads of 2.0 for each CPU. Two of these distributions involve one nice==0 and one nice==19 task per CPU and in these circumstances the users expectations will be met. The other two distributions both involve both nice==0 tasks being on one CPU and both nice==19 being on the other CPU and each task will get 50% of a CPU and the user's expectations will not be met. Solution: The solution to this problem that is implemented in the attached patch is to use weighted loads when determining if the system is balanced and, when an imbalance is detected, to move an amount of weighted load between run queues (as opposed to a number of tasks) to restore the balance. Once again, the easiest way to explain why both of these measures are necessary is to use a simple example. Suppose that (in a slight variation of the above example) that we have a two CPU system with 4 nice==0 and 4 nice=19 hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and the 4 nice==19 tasks are on the other CPU. The weighted loads for the two CPUs would be 4.0 and 0.2 respectively and the load balancing code would move 2 tasks resulting in one CPU with a load of 2.0 and the other with load of 2.2. If this was considered to be a big enough imbalance to justify moving a task and that task was moved using the current move_tasks() then it would move the highest priority task that it found and this would result in one CPU with a load of 3.0 and the other with a load of 1.2 which would result in the movement of a task in the opposite direction and so on -- infinite loop. If, on the other hand, an amount of load to be moved is calculated from the imbalance (in this case 0.1) and move_tasks() skips tasks until it find ones whose contributions to the weighted load are less than this amount it would move two of the nice==19 tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with loads of 2.1 for each CPU. One of the advantages of this mechanism is that on a system where all tasks have nice==0 the load balancing calculations would be mathematically identical to the current load balancing code. Notes: struct task_struct: has a new field load_weight which (in a trade off of space for speed) stores the contribution that this task makes to a CPU's weighted load when it is runnable. struct runqueue: has a new field raw_weighted_load which is the sum of the load_weight values for the currently runnable tasks on this run queue. This field always needs to be updated when nr_running is updated so two new inline functions inc_nr_running() and dec_nr_running() have been created to make sure that this happens. This also offers a convenient way to optimize away this part of the smpnice mechanism when CONFIG_SMP is not defined. int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate). int move_tasks(): The modifications to this function were complicated by the fact that active_load_balance() uses it to move exactly one task without checking whether an imbalance actually exists. This precluded the simple overloading of max_nr_move with max_load_move and necessitated the addition of the latter as an extra argument to the function. The internal implementation is then modified to move up to max_nr_move tasks and max_load_move of weighted load. This slightly complicates the code where move_tasks() is called and if ever active_load_balance() is changed to not use move_tasks() the implementation of move_tasks() should be simplified accordingly. struct sched_group find_busiest_group(): Similar to try_to_wake_up(), there are places in this function where SCHED_LOAD_SCALE is used to represent the load contribution of a single task and the same issues are created. A similar solution is adopted except that it is now the average per task contribution to a group's load (as opposed to a run queue) that is required. As this value is not directly available from the group it is calculated on the fly as the queues in the groups are visited when determining the busiest group. A key change to this function is that it is no longer to scale down imbalance on exit as move_tasks() uses the load in its scaled form. void set_user_nice(): has been modified to update the task's load_weight field when it's nice value and also to ensure that its run queue's raw_weighted_load field is updated if it was runnable. From: "Siddha, Suresh B" <suresh.b.siddha@intel.com> With smpnice, sched groups with highest priority tasks can mask the imbalance between the other sched groups with in the same domain. This patch fixes some of the listed down scenarios by not considering the sched groups which are lightly loaded. a) on a simple 4-way MP system, if we have one high priority and 4 normal priority tasks, with smpnice we would like to see the high priority task scheduled on one cpu, two other cpus getting one normal task each and the fourth cpu getting the remaining two normal tasks. but with current smpnice extra normal priority task keeps jumping from one cpu to another cpu having the normal priority task. This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including the cpu with high priority task in max_load calculations but including that in total and avg_load calcuations.. leading to max_load < avg_load and load balance between cpus running normal priority tasks(2 Vs 1) will always show imbalanace as one normal priority and the extra normal priority task will keep moving from one cpu to another cpu having normal priority task.. b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority task each.. P2 and P3 are idle. With this patch, one of the normal priority tasks on P1 will be moved to P2 or P3.. c) With the current weighted smp nice calculations, it doesn't always make sense to look at the highest weighted runqueue in the busy group.. Consider a load balance scenario on a DP with HT system, with Package-0 containing one high priority and one low priority, Package-1 containing one low priority(with other thread being idle).. Package-1 thinks that it need to take the low priority thread from Package-0. And find_busiest_queue() returns the cpu thread with highest priority task.. And ultimately(with help of active load balance) we move high priority task to Package-1. And same continues with Package-0 now, moving high priority task from package-1 to package-0.. Even without the presence of active load balance, load balance will fail to balance the above scenario.. Fix find_busiest_queue to use "imbalance" when it is lightly loaded. [kernel@kolivas.org: sched: store weighted load on up] [kernel@kolivas.org: sched: add discrete weighted cpu load function] [suresh.b.siddha@intel.com: sched: remove dead code] Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Kirill Korotaev	efc30814a8	[PATCH] sched: CPU hotplug race vs. set_cpus_allowed() There is a race between set_cpus_allowed() and move_task_off_dead_cpu(). __migrate_task() doesn't report any err code, so task can be left on its runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a possible target. Also, chaning cpus_allowed mask requires rq->lock being held. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-By: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Steven Rostedt	cc94abfcbc	[PATCH] unnecessary long index i in sched Unless we expect to have more than 2G CPUs, there's no reason to have 'i' as a long long here. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Con Kolivas	72d2854d4e	[PATCH] sched: fix interactive ceiling code The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect and not explicit enough. The sleep boost is not supposed to be any larger than without this code and the comment is not clear enough about what exactly it does, just the reason it does it. Fix it. There is a ceiling to the priority beyond which tasks that only ever sleep for very long periods cannot surpass. Fix it. Prevent the on-runqueue bonus logic from defeating the idle sleep logic. Opportunity to micro-optimise. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Steven Rostedt	d444886e14	[PATCH] sched: simplify bitmap definition Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Chen, Kenneth W	c96d145e71	[PATCH] sched: fix smt nice lock contention and optimization Initial report and lock contention fix from Chris Mason: Recent benchmarks showed some performance regressions between 2.6.16 and 2.6.5. We tracked down one of the regressions to lock contention in schedule heavy workloads (~70,000 context switches per second) kernel/sched.c:dependent_sleeper() was responsible for most of the lock contention, hammering on the run queue locks. The patch below is more of a discussion point than a suggested fix (although it does reduce lock contention significantly). The dependent_sleeper code looks very expensive to me, especially for using a spinlock to bounce control between two different siblings in the same cpu. It is further optimized: * perform dependent_sleeper check after next task is determined * convert wake_sleeping_dependent to use trylock * skip smt runqueue check if trylock fails * optimize double_rq_lock now that smt nice is converted to trylock * early exit in searching first SD_SHARE_CPUPOWER domain * speedup fast path of dependent_sleeper [akpm@osdl.org: cleanup] Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Chris Mason <mason@suse.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:44 -07:00
Chandra Seetharaman	26c2143b63	[PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only Make notifier_calls associated with cpu_notifier as __cpuinit. __cpuinit makes sure that the function is init time only unless CONFIG_HOTPLUG_CPU is defined. [akpm@osdl.org: section fix] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:41 -07:00
Chandra Seetharaman	65edc68c34	[PATCH] cpu hotplug: make [un]register_cpu_notifier init time only CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined). So, cpu_notifier functionality need to be available only at init time. This patch makes register_cpu_notifier() available only at init time, unless CONFIG_HOTPLUG_CPU is defined. This patch exports register_cpu_notifier() and unregister_cpu_notifier() only if CONFIG_HOTPLUG_CPU is defined. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:41 -07:00
Chandra Seetharaman	054cc8a2d8	[PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17 This patch reverts notifier_block changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:41 -07:00
Chandra Seetharaman	9c7b216d23	[PATCH] cpu hotplug: revert init patch submitted for 2.6.17 In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a band-aid solution to solve that problem. In the process, i undid all the changes you both were making to ensure that these notifiers were available only at init time (unless CONFIG_HOTPLUG_CPU is defined). We deferred the real fix to 2.6.18. Here is a set of patches that fixes the XFS problem cleanly and makes the cpu notifiers available only at init time (unless CONFIG_HOTPLUG_CPU is defined). If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run time. This patch reverts the notifier_call changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:40 -07:00
Paul E. McKenney	c32e066057	[PATCH] rcutorture: add call_rcu_bh() operations Add operations for the call_rcu_bh() variant of RCU. Also add an rcu_batches_completed_bh() function, which is needed by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:40 -07:00
Paul E. McKenney	72e9bb5492	[PATCH] rcutorture: add ops vector and Classic RCU ops Add an ops vector to rcutorture, and add the ops for Classic RCU. Update the rcutorture documentation to reflect slight change to the dmesg formats. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:40 -07:00
Paul E. McKenney	29766f1eb3	[PATCH] rcutorture: catchup doc fixes for idle-hz tests This just catches the RCU torture documentation up with the recent fixes that test RCU for architectures that turn of the scheduling-clock interrupt for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up an obsolete comment as well. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:40 -07:00
Randy Dunlap	1dbe83c344	[PATCH] fix kernel-doc in kernel/ dir Fix kernel-doc parameters in kernel/ Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:39 -07:00
Ingo Molnar	34af946a22	[PATCH] spin/rwlock init cleanups locking init cleanups: - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK() - convert rwlocks in a similar manner this patch was generated automatically. Motivation: - cleanliness - lockdep needs control of lock initialization, which the open-coded variants do not give - it's also useful for -rt and for lock debugging in general Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:39 -07:00
Randy Dunlap	a7807a32bb	[PATCH] poison: add & use more constants Add more poison values to include/linux/poison.h. It's not clear to me whether some others should be added or not, so I haven't added any of these: ./include/linux/libata.h:#define ATA_TAG_POISON 0xfafbfcfdU ./arch/ppc/8260_io/fcc_enet.c:1918: memset((char )(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32); ./drivers/usb/mon/mon_text.c:429: memset(mem, 0xe5, sizeof(struct mon_event_text)); ./drivers/char/ftape/lowlevel/ftape-ctl.c:738: memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE); ./drivers/block/sx8.c:/ 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */ Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:38 -07:00
Ingo Molnar	e6e5494cb2	[PATCH] vdso: randomize the i386 vDSO by moving it into a vma Move the i386 VDSO down into a vma and thus randomize it. Besides the security implications, this feature also helps debuggers, which can COW a vma-backed VDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMWare) too, which typically live in the same high-mapped address space as the VDSO, hence whenever the VDSO is used, they get lots of guest pagefaults and have to fix such guest accesses up - which slows things down instead of speeding things up (the primary purpose of the VDSO). There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support for older glibcs that still rely on a prelinked high-mapped VDSO. Newer distributions (using glibc 2.3.3 or later) can turn this option off. Turning it off is also recommended for security reasons: attackers cannot use the predictable high-mapped VDSO page as syscall trampoline anymore. There is a new vdso=[0\|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned on/off. (This version of the VDSO-randomization patch also has working ELF coredumping, the previous patch crashed in the coredumping code.) This code is a combined work of the exec-shield VDSO randomization code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell started this patch and i completed it. [akpm@osdl.org: cleanups] [akpm@osdl.org: compile fix] [akpm@osdl.org: compile fix 2] [akpm@osdl.org: compile fix 3] [akpm@osdl.org: revernt MAXMEM change] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:38 -07:00
KAMEZAWA Hiroyuki	2842f11419	[PATCH] catch valid mem range at onlining memory This patch allows hot-add memory which is not aligned to section. Now, hot-added memory has to be aligned to section size. Considering big section sized archs, this is not useful. When hot-added memory is registerd as iomem resoruce by iomem resource patch, we can make use of that information to detect valid memory range. Note: With this, not-aligned memory can be registerd. To allow hot-add memory with holes, we have to do more work around add_memory(). (It doesn't allows add memory to already existing mem section.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:36 -07:00
Andrew Morton	5c31f2738a	[PATCH] pm_trace is dangerous CONFIG_PM_TRACES scrogs your RTC. Mark it as experimental, and defaulting to `off'. Also beef up the help message a bit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:35 -07:00
Randy Dunlap	7f32a25f63	[PATCH] kernel/acct: fix function definition kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-27 17:32:35 -07:00
Greg Kroah-Hartman	6550e07f41	[PATCH] 64bit Resource: finally enable 64bit resource sizes Introduce the Kconfig entry and actually switch to a 64bit value, if wanted, for resource_size_t. Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2006-06-27 09:24:00 -07:00
Greg Kroah-Hartman	d75fc8bbcc	[PATCH] 64bit resource: change resource core to use resource_size_t Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2006-06-27 09:23:59 -07:00
Greg Kroah-Hartman	685143ac1f	[PATCH] 64bit resource: fix up printks for resources in arch and core code This is needed if we wish to change the size of the resource structures. Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com> and Andrew Morton. (tweaked by Andy Isaacson <adi@hexapodia.org>) Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Andy Isaacson <adi@hexapodia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2006-06-27 09:23:59 -07:00
Linus Torvalds	2a2ed2db35	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild * git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits) kbuild: trivial fixes in Makefile kbuild: adding symbols in Kconfig and defconfig to TAGS kbuild: replace abort() with exit(1) kbuild: support for %.symtypes files kbuild: fix silentoldconfig recursion kbuild: add option for stripping modules while installing them kbuild: kill some false positives from modpost kbuild: export-symbol usage report generator kbuild: fix make -rR breakage kbuild: append -dirty for updated but uncommited changes kbuild: append git revision for all untagged commits kbuild: fix module.symvers parsing in modpost kbuild: ignore make's built-in rules & variables kbuild: bugfix with initramfs kbuild: modpost build fix kbuild: check license compatibility when building modules kbuild: export-type enhancement to modpost.c kbuild: add dependency on kernel.release to the package targets kbuild: `make kernelrelease' speedup kconfig: KCONFIG_OVERWRITECONFIG ...	2006-06-26 11:05:15 -07:00
Linus Torvalds	81a07d7588	Merge branch 'x86-64' * x86-64: (83 commits) [PATCH] x86_64: x86_64 stack usage debugging [PATCH] x86_64: (resend) x86_64 stack overflow debugging [PATCH] x86_64: msi_apic.c build fix [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs [PATCH] x86_64: Avoid broadcasting NMI IPIs [PATCH] x86_64: fix apic error on bootup [PATCH] x86_64: enlarge window for stack growth [PATCH] x86_64: Minor string functions optimizations [PATCH] x86_64: Move export symbols to their C functions [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR [PATCH] x86_64: Fix modular pc speaker [PATCH] x86_64: remove sys32_ni_syscall() [PATCH] x86_64: Do not use -ffunction-sections for modules [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle [PATCH] x86_64: adjust kstack_depth_to_print default [PATCH] i386/x86-64: adjust /proc/interrupts column headings [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels [PATCH] x86_64: Fix fast check in safe_smp_processor_id [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status ... Manual resolve of trivial conflict in arch/i386/kernel/Makefile	2006-06-26 10:51:09 -07:00
Andi Kleen	495ab9c045	[PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status During some profiling I noticed that default_idle causes a lot of memory traffic. I think that is caused by the atomic operations to clear/set the polling flag in thread_info. There is actually no reason to make this atomic - only the idle thread does it to itself, other CPUs only read it. So I moved it into ti->status. Converted i386/x86-64/ia64 for now because that was the easiest way to fix ACPI which also manipulates these flags in its idle function. Cc: Nick Piggin <npiggin@novell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Len Brown <len.brown@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:48:21 -07:00
Jan Beulich	83f4fcce7f	[PATCH] x86_64: allow unwinder to build without module support Add proper conditionals to be able to build with CONFIG_MODULES=n. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:48:18 -07:00
Jan Beulich	c33bd9aac0	[PATCH] i386/x86-64: fall back to old-style call trace if no unwinding If no unwinding is possible at all for a certain exception instance, fall back to the old style call trace instead of not showing any trace at all. Also, allow setting the stack trace mode at the command line. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:48:18 -07:00
Jan Beulich	4552d5dc08	[PATCH] x86_64: reliable stack trace support These are the generic bits needed to enable reliable stack traces based on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will enable x86-64 and i386 to make use of this. Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities for improvement. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:48:17 -07:00
Andi Kleen	bebfa1013e	[PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warnings Sometimes e.g. with crashme the compat layer warnings can be noisy. Add a way to turn them off by gating all output through compat_printk that checks a global sysctl. The default is not changed. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:48:16 -07:00
Peter Williams	b78709cfd4	[PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval() The introduction of SCHED_BATCH scheduling class with a value of 3 means that the expression (p->policy & SCHED_FIFO) will return true if policy is SCHED_BATCH or SCHED_FIFO. Unfortunately, this expression is used in sys_sched_rr_get_interval() and in the absence of a comment to say that this is intentional I presume that it is unintentional and erroneous. The fix is to change the expression to (p->policy == SCHED_FIFO). Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 10:02:41 -07:00
Oleg Nesterov	cf2dfbfbf4	[PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXIT After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we can remove this check from copy_process() because we already checked !signal_pending(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:27 -07:00
Oleg Nesterov	d5f70c00ad	[PATCH] coredump: kill ptrace related stuff With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to the thread group. This means that a TASK_TRACED task 1. Will be awakened by signal_wake_up(1) 2. Can't sleep again via ptrace_notify() 3. Can't go to do_signal_stop() after return from ptrace_stop() in get_signal_to_deliver() So we can remove all ptrace related stuff from coredump path. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:27 -07:00
Eric W. Biederman	df26c40e56	[PATCH] proc: Cleanup proc_fd_access_allowed In process of getting proc_fd_access_allowed to work it has developed a few warts. In particular the special case that always allows introspection and the special case to allow inspection of kernel threads. The special case for introspection is needed for /proc/self/mem. The special case for kernel threads really should be overridable by security modules. So consolidate these checks into ptrace.c:may_attach(). The check to always allow introspection is trivial. The check to allow access to kernel threads, and zombies is a little trickier. mem_read and mem_write already verify an mm exists so it isn't needed twice. proc_fd_access_allowed only doesn't want a check to verify task->mm exits, s it prevents all access to kernel threads. So just move the task->mm check into ptrace_attach where it is needed for practical reasons. I did a quick audit and none of the security modules in the kernel seem to care if they are passed a task without an mm into security_ptrace. So the above move should be safe and it allows security modules to come up with more restrictive policy. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:26 -07:00
Eric W. Biederman	13b41b0949	[PATCH] proc: Use struct pid not struct task_ref Incrementally update my proc-dont-lock-task_structs-indefinitely patches so that they work with struct pid instead of struct task_ref. Mostly this is a straight 1-1 substitution. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:26 -07:00
Eric W. Biederman	99f8955183	[PATCH] proc: don't lock task_structs indefinitely Every inode in /proc holds a reference to a struct task_struct. If a directory or file is opened and remains open after the the task exits this pinning continues. With 8K stacks on a 32bit machine the amount pinned per file descriptor is about 10K. Normally I would figure a reasonable per user process limit is about 100 processes. With 80 processes, with a 1000 file descriptors each I can trigger the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless data. This patch replaces the struct task_struct pointer with a pointer to a struct task_ref which has a struct task_struct pointer. The so the pinning of dead tasks does not happen. The code now has to contend with the fact that the task may now exit at any time. Which is a little but not muh more complicated. With this change it takes about 1000 processes each opening up 1000 file descriptors before I can trigger the OOM killer. Much better. [mlp@google.com: task_mmu small fixes] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Paul Jackson <pj@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Albert Cahalan <acahalan@gmail.com> Signed-off-by: Prasanna Meda <mlp@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:25 -07:00
Eric W. Biederman	48e6484d49	[PATCH] proc: Rewrite the proc dentry flush on exit optimization To keep the dcache from filling up with dead /proc entries we flush them on process exit. However over the years that code has gotten hairy with a dentry_pointer and a lock in task_struct and misdocumented as a correctness feature. I have rewritten this code to look and see if we have a corresponding entry in the dcache and if so flush it on process exit. This removes the extra fields in the task_struct and allows me to trivially handle the case of a /proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:24 -07:00
Anil S Keshavamurthy	e6f47f978b	[PATCH] Notify page fault call chain With this patch Kprobes now registers for page fault notifications only when their is an active probe registered. Once all the active probes are unregistered their is no need to be notified of page faults and kprobes unregisters itself from the page fault notifications. Hence we will have ZERO side effects when no probes are active. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:22 -07:00
Anil S Keshavamurthy	3d5631e063	[PATCH] Kprobes registers for notify page fault Kprobes now registers for page fault notifications. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:22 -07:00
mao, bibo	3672165677	[PATCH] Kprobe: multi kprobe posthandler for booster If there are multi kprobes on the same probepoint, there will be one extra aggr_kprobe on the head of kprobe list. The aggr_kprobe has aggr_post_handler/aggr_break_handler whether the other kprobe post_hander/break_handler is NULL or not. This patch modifies this, only when there is one or more kprobe in the list whose post_handler is not NULL, post_handler of aggr_kprobe will be set as aggr_post_handler. [soshima@redhat.com: !CONFIG_PREEMPT fix] Signed-off-by: bibo, mao <bibo.mao@intel.com> Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: Satoshi Oshima <soshima@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:22 -07:00
Roman Zippel	19923c190e	[PATCH] fix and optimize clock source update This fixes the clock source updates in update_wall_time() to correctly track the time coming in via current_tick_length(). Optimize the fast paths to be as short as possible to keep the overhead low. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:21 -07:00
john stultz	a275254975	[PATCH] time: rename clocksource functions As suggested by Roman Zippel, change clocksource functions to use clocksource_xyz rather then xyz_clocksource to avoid polluting the namespace. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:21 -07:00
john stultz	5d0cf410e9	[PATCH] Time: i386 Clocksource Drivers Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc). With this patch, the conversion of the i386 arch to the generic timekeeping code should be complete. The patch should be fairly straight forward, only adding the new clocksources. [hirofumi@mail.parknet.co.jp: acpi_pm cleanup] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:21 -07:00
john stultz	cf3c769b4b	[PATCH] Time: Introduce arch generic time accessors Introduces clocksource switching code and the arch generic time accessor functions that use the clocksource infrastructure. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:20 -07:00
john stultz	5eb6d20533	[PATCH] Time: Use clocksource abstraction for NTP adjustments Instead of incrementing xtime by tick_nsec + ntp adjustments, use the clocksource abstraction to increment and scale time. Using the clocksource abstraction allows other clocksources to be used consistently in the face of late or lost ticks, while preserving the existing behavior via the jiffies clocksource. This removes the need to keep time_phase adjustments as we just use the current_tick_length() function as the NTP interface and accumulate time using shifted nanoseconds. The basics of this design was by Roman Zippel, however it is my own interpretation and implementation, so the credit should go to him and the blame to me. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:20 -07:00
john stultz	260a42309b	[PATCH] Time: Let user request precision from current_tick_length() Change the current_tick_length() function so it takes an argument which specifies how much precision to return in shifted nanoseconds. This provides a simple way to convert between NTPs internal nanoseconds shifted by (SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the clocksource abstraction. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:20 -07:00
john stultz	ad596171ed	[PATCH] Time: Use clocksource infrastructure for update_wall_time Modify the update_wall_time function so it increments time using the clocksource abstraction instead of jiffies. Since the only clocksource driver currently provided is the jiffies clocksource, this should result in no functional change. Additionally, a timekeeping_init and timekeeping_resume function has been added to initialize and maintain some of the new timekeping state. [hirofumi@mail.parknet.co.jp: fixlet] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:20 -07:00
john stultz	734efb467b	[PATCH] Time: Clocksource Infrastructure This introduces the clocksource management infrastructure. A clocksource is a driver-like architecture generic abstraction of a free-running counter. This code defines the clocksource structure, and provides management code for registering, selecting, accessing and scaling clocksources. Additionally, this includes the trivial jiffies clocksource, a lowest common denominator clocksource, provided mainly for use as an example. [hirofumi@mail.parknet.co.jp: Don't enable IRQ too early] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:20 -07:00
Ingo Molnar	81615b624a	[PATCH] Convert kernel/cpu.c to mutexes Convert kernel/cpu.c from semaphore to mutex. I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to fit mutex semantics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:16 -07:00
Ingo Molnar	1fb00c6cbd	[PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqs It seems ppc64 wants to lock mutexes in early bootup code, with interrupts disabled, and they expect interrupts to stay disabled, else they crash. Work around this bug by making mutex debugging variants save/restore irq flags. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-26 09:58:16 -07:00
Linus Torvalds	3448097fcc	Revert "swsusp special saveable pages support" commits This reverts commits `3e3318dee0` [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages `b6370d96e0` [PATCH] swsusp: i386 mark special saveable/unsaveable pages `ce4ab0012b` [PATCH] swsusp: add architecture special saveable pages support because not only do they apparently cause page faults on x86, the infrastructure doesn't compile on powerpc. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 18:41:00 -07:00
Linus Torvalds	72cf2709bf	Fix PM_TRACE dependency: works only on 32-bit x86 for now Not that x86-64 and other architecture support should be difficult to add (trivial fixups to the data format and add the proper linker script entry). Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:04:15 -07:00
KaiGai Kohei	77787bfb44	[PATCH] pacct: none-delayed process accounting accumulation In current 2.6.17 implementation, signal_struct refered from task_struct is used for per-process data structure. The pacct facility also uses it as a per-process data structure to store stime, utime, minflt, majflt. But those members are saved in __exit_signal(). It's too late. For example, if some threads exits at same time, pacct facility has a possibility to drop accountings for a part of those threads. (see, the following 'The results of original 2.6.17 kernel') I think accounting information should be completely collected into the per-process data structure before writing out an accounting record. This patch fixes this matter. Accumulation of stime, utime, minflt and majflt are done before generating accounting record. [mingo@elte.hu: fix acct_collect() siglock bug found by lockdep] Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:25 -07:00
KaiGai Kohei	f6ec29a42d	[PATCH] pacct: avoidance to refer the last thread as a representation of the process When pacct facility generate an 'ac_flag' field in accounting record, it refers a task_struct of the thread which died last in the process. But any other task_structs are ignored. Therefore, pacct facility drops ASU flag even if root-privilege operations are used by any other threads except the last one. In addition, AFORK flag is always set when the thread of group-leader didn't die last, although this process has called execve() after fork(). We have a same matter in ac_exitcode. The recorded ac_exitcode is an exit code of the last thread in the process. There is a possibility this exitcode is not the group leader's one.	2006-06-25 10:01:25 -07:00
KaiGai Kohei	0e4648141a	[PATCH] pacct: add pacct_struct to fix some pacct bugs. The pacct facility need an i/o operation when an accounting record is generated. There is a possibility to wake OOM killer up. If OOM killer is activated, it kills some processes to make them release process memory regions. But acct_process() is called in the killed processes context before calling exit_mm(), so those processes cannot release own memory. In the results, any processes stop in this point and it finally cause a system stall.	2006-06-25 10:01:25 -07:00
Randy Dunlap	9e37bd301e	[PATCH] kthread: move kernel-doc and put it into DocBook Move kthread API kernel-doc from kthread.h to kthread.c & fix it. Add kthread API to kernel-api DocBook. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:24 -07:00
Randy Dunlap	fa9799e33d	[PATCH] ktime/hrtimer: fix kernel-doc comments Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:23 -07:00
Heiko Carstens	fc75cdfa5b	[PATCH] cpu hotplug: fix CPU_UP_CANCEL handling If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be called with CPU_UP_CANCELED. A few of these callbacks assume that on CPU_UP_PREPARE a pointer to task has been stored in a percpu array. This assumption is not true if CPU_UP_PREPARE fails and the following calls to kthread_bind() in CPU_UP_CANCELED will cause an addressing exception because of passing a NULL pointer. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:22 -07:00
Serge E. Hallyn	8bdd1d1250	[PATCH] kthread: convert stop_machine into a kthread - Update stop_machine.c to spawn stop_machine as kthreads rather than the deprecated kernel_threads. - Update stop_machine to use the more efficient kthread_bind() before running task in place of set_cpus_allowed() after. [akpm@osdl.org: remove now-wrong set_cpus_allowed()] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:22 -07:00
Anton Blanchard	2aa92581fb	[PATCH] Link error when futexes are disabled on 64bit architectures If futexes are disabled we fail to link on ppc64. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:17 -07:00
akpm@osdl.org	838cd153a5	[PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__ I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs. Among the NPTL test failures seen are some arising from sigsuspend problems for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked despite glibc's carefully excluding it from sets of signals to block. Specifically, testing suggests it blocks signal N^32 instead of signal N, so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead. glibc's sigset_t uses an array of unsigned long, as does the kernel. In both cases, signal N+1 is represented as (1UL << (N % (8 * sizeof (unsigned long)))) in word number (N / (8 * sizeof (unsigned long))). Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an array of 64-bit words. For little-endian, the layout is the same, with signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for big-endian, userspace has that layout while in the kernel each 8 bytes have the two halves swapped from the userspace layout. The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace sigset to kernel format. If __COMPAT_ENDIAN_SWAP__ is not set, this uses logic of the form set->sig[0] = compat->sig[0] \| (((long)compat->sig[1]) << 32 ) to convert the userspace sigset to a kernel one. This looks correct to me for both big and little endian, given that in userspace compat->sig[1] will represent signals 33-64, and so will the high 32 bits of set->sig[0] in the kernel. If however __COMPAT_ENDIAN_SWAP__ is set, as it is for __MIPSEL__, it uses set->sig[0] = compat->sig[1] \| (((long)compat->sig[0]) << 32 ); which seems incorrect for both big and little endian, and would explain the observed symptoms. This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect then that macro serves no purpose, in which case something like the following patch would seem appropriate to remove it. Signed-off-by: Joseph Myers <joseph@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:15 -07:00
Stephen Hemminger	eab03ac7bd	[PATCH] Get rid of /proc/sys/proc The table is empty, why does it still exist? Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:15 -07:00
Jan Engelhardt	3b9c04106b	[PATCH] printk time parameter Currently, enabling/disabling printk timestamps is only possible through reboot (bootparam) or recompile. I normally do not run with timestamps (since syslog handles that in a good manner), but for measuring small kernel delays (e.g. irq probing - see parport thread) I needed subsecond precision, but then again, just for some minutes rather than all kernel messages to come. The following patch adds a module_param() with which the timestamps can be en-/disabled in a live system through /sys/modules/printk/parameters/printk_time. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:13 -07:00
Matt Helsley	11e64757f9	[PATCH] Remove unecessary NULL check in kernel/acct.c copy_process() appears to be the only caller of acct_clear_integrals() and does not pass in NULL task pointers. Remove the unecessary check. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:09 -07:00
Andreas Mohr	3b364b8d58	[PATCH] constify parts of kernel/power/ Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:08 -07:00
Andrew Morton	b61367732f	[PATCH] schedule_on_each_cpu(): reduce kmalloc() size schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU 64-bit. Rework it so that we do one 8192-byte allocation and then a pile of tiny ones, via alloc_percpu(). This has a much higher chance of success (100% in the current VM). This also has the effect of reducing the memory requirements from NR_CPUSn to num_possible_cpus()n. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:07 -07:00
Adrian Bunk	83cc5ed3c4	[PATCH] kernel/sys.c: cleanups - proper prototypes for the following functions: - ctrl_alt_del() (in include/linux/reboot.h) - getrusage() (in include/linux/resource.h) - make the following needlessly global functions static: - kernel_restart_prepare() - kernel_kexec() [akpm@osdl.org: compile fix] Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:06 -07:00
Michael Ellerman	76a8ad2939	[PATCH] Make printk work for really early debugging Currently printk is no use for early debugging because it refuses to actually print anything to the console unless cpu_online(smp_processor_id()) is true. The stated explanation is that console drivers may require per-cpu resources, or otherwise barf, because the system is not yet setup correctly. Fair enough. However some console drivers might be quite happy running early during boot, in fact we have one, and so it'd be nice if printk understood that. So I added a flag (which I would have called CON_BOOT, but that's taken) called CON_ANYTIME, which indicates that a console is happy to be called anytime, even if the cpu is not yet online. Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus console driver that BUG()s if called while offline. No problems AFAICT. Built for i386 UP & SMP. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:05 -07:00
Alan Stern	bbb1747d4e	[PATCH] Allow raw_notifier callouts to unregister themselves Since raw_notifier chains don't benefit from any centralized locking protections, they shouldn't suffer from the associated limitations. Under some circumstances it might make sense for a raw_notifier callout routine to unregister itself from the notifier chain. This patch (as678) changes the notifier core to allow for such things. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:01 -07:00
Paul Mackerras	bfe5d83419	[PATCH] Define __raw_get_cpu_var and use it There are several instances of per_cpu(foo, raw_smp_processor_id()), which is semantically equivalent to __get_cpu_var(foo) but without the warning that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For those architectures with optimized per-cpu implementations, namely ia64, powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower code than __get_cpu_var(), so it would be preferable to use __get_cpu_var on those platforms. This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x, raw_smp_processor_id()) on architectures that use the generic per-cpu implementation, and turns into __get_cpu_var(x) on the architectures that have an optimized per-cpu implementation. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:01:01 -07:00
Jesper Juhl	f867d2a2e5	[PATCH] ensure NULL deref can't possibly happen in is_exported() If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is given a NULL 'mod' and lookup_symbol(name, __start___ksymtab, __stop___ksymtab) returns 0, then we'll end up dereferencing a NULL pointer. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-25 10:00:59 -07:00
Linus Torvalds	eb71c87a49	Add some basic resume trace facilities Considering that there isn't a lot of hw we can depend on during resume, this is about as good as it gets. This is x86-only for now, although the basic concept (and most of the code) will certainly work on almost any platform. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-24 14:44:01 -07:00
Eric Sesterhenn	125e18745f	[PATCH] More BUG_ON conversion Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:08 -07:00
Jan Beulich	908dcecda1	[PATCH] adjust handle_IRR_event() return type Correct the return type of handle_IRQ_event() (inconsistency noticed during Xen development), and remove redundant declarations. The return type adjustment required breaking out the definition of irqreturn_t into a separate header, in order to satisfy current include order dependencies. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:08 -07:00
Porpoise	3439dd86e3	[PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop When CONFIG_BASE_SAMLL=1, cascade() in may enter the infinite loop. Because of CONFIG_BASE_SMALL=1(TVR_BITS=6 and TVN_BITS=4), the list base->tv5 may cascade into base->tv5. So, the kernel enters the infinite loop in the function cascade(). I created a test module to verify this bug, and a patch to fix it. #include <linux/kernel.h> #include <linux/module.h> #include <linux/init.h> #include <linux/timer.h> #if 0 #include <linux/kdb.h> #else #define kdb_printf printk #endif #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6) #define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8) #define TVN_SIZE (1 << TVN_BITS) #define TVR_SIZE (1 << TVR_BITS) #define TVN_MASK (TVN_SIZE - 1) #define TVR_MASK (TVR_SIZE - 1) #define TV_SIZE(N) (N*TVN_BITS + TVR_BITS) struct timer_list timer0; struct timer_list dummy_timer1; struct timer_list dummy_timer2; void dummy_timer_fun(unsigned long data) { } unsigned long j=0; void check_timer_base(unsigned long data) { kdb_printf("check_timer_base %08x\n",jiffies); mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF); } int init_module(void) { init_timer(&timer0); timer0.data = (unsigned long)0; timer0.function = check_timer_base; mod_timer(&timer0,jiffies+1); init_timer(&dummy_timer1); dummy_timer1.data = (unsigned long)0; dummy_timer1.function = dummy_timer_fun; init_timer(&dummy_timer2); dummy_timer2.data = (unsigned long)0; dummy_timer2.function = dummy_timer_fun; j=jiffies; j&=(~((1<<TV_SIZE(3))-1)); j+=(1<<TV_SIZE(3)); j+=(1<<TV_SIZE(4)); kdb_printf("mod_timer %08x\n",j); mod_timer(&dummy_timer1, j ); mod_timer(&dummy_timer2, j ); return 0; } void cleanup_module() { del_timer_sync(&timer0); del_timer_sync(&dummy_timer1); del_timer_sync(&dummy_timer2); } (Cleanups from Oleg) [oleg@tv-sign.ru: use list_replace_init()] Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:08 -07:00
Oleg Nesterov	626ab0e69d	[PATCH] list: use list_replace_init() instead of list_splice_init() list_splice_init(list, head) does unneeded job if it is known that list_empty(head) == 1. We can use list_replace_init() instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:07 -07:00
Randy Dunlap	862f5f0133	[PATCH] Doc: add audit & acct to DocBook Fix one audit kernel-doc description (one parameter was missing). Add audit*.c interfaces to DocBook. Add BSD accounting interfaces to DocBook. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:07 -07:00
Paul E. McKenney	d83015b8f6	[PATCH] Make RCU API inaccessible to non-GPL Linux kernel modules Remove synchronize_kernel() (deprecated 2-APR-2005 in http://lkml.org/lkml/2005/4/3/11) and makes the RCU API inaccessible to non-GPL Linux kernel modules (as was announced more than one year ago in http://lkml.org/lkml/2005/4/3/8). Tested on x86 and ppc64. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:07 -07:00
Jes Sorensen	55f4e8d156	[PATCH] kernel/sys.c doesn't need init.h kernel/sys.c doesn't have anything in it relying on linux/init.h - remove the include. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:07 -07:00
Andrew Morton	57ae250861	[PATCH] CONFIG_NET=n build fix Cc: Greg KH <greg@kroah.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:06 -07:00
Andreas Mohr	83d4e6e7fb	[PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely() Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:05 -07:00
Daniel Walker	89d0cf01c0	[PATCH] invert irq/migration.c brach prediction If you get to that point in the code it means that desc->move_irq is set, pending_irq_cpumask[irq] and cpu_online_map should have a value. Still pretty good chance anding those two you'll still have a value. So these two branch predictors should be inverted. Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:04 -07:00
Ingo Molnar	8e0a43d8fa	[PATCH] cond_resched() might_sleep() fix add the __might_sleep() check back to cond_resched(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:04 -07:00
Prasanna Meda	6e66726047	[PATCH] dup fd error fix Set errorp in dup_fd, it will be used in sys_unshare also. Signed-off-by: Prasanna Meda <mlp@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:04 -07:00
Andrew Morton	0ae26f1b31	[PATCH] mmput() might sleep exit_aio() and exit_mmap() can sleep. But it's easy to accidentally call mmput() from inside locks. Cc: Dave Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:03 -07:00
Jeff Moyer	c330dda908	[PATCH] Add a sysfs file to determine if a kexec kernel is loaded Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded. Each file contains a simple boolean value indicating whether the relevant kernel has been loaded into memory. The motivation for this is geared around support. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:02 -07:00
Rafael J. Wysocki	968808b895	[PATCH] swsusp: use less memory during resume Make swsusp allocate only as much memory as needed to store the image data and metadata during resume. Without this patch swsusp additionally allocates many page frames that will conflict with the "original" locations of the image data and are considered as "unsafe", treating them as "eaten" pages (ie. allocated but unusable). The patch makes swsusp allocate as many pages as it'll need to store the data read from the image in one shot, creating a list of allocated "safe" pages, and use the observation that all pages allocated by it are marked with the PG_nosave and PG_nosave_free flags set. Namely, when it's about to load an image page, swsusp can check whether the page frame corresponding to the "original" location of this page has been allocated (ie. if the page frame has the PG_nosave and PG_nosave_free flags set) and if so, it can load the page directly into this page frame. Otherwise it uses an allocated "safe" page from the list to store the data that will be copied to their "original" location later on. This allows us to save many page copyings and page allocations during resume and in the future it may allow us to load images greater than 50% of the normal zone. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: "Pavel Machek" <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:43:00 -07:00
Adrian Bunk	7bff24e255	[PATCH] kernel/power/snapshot.c: cleanups - make needlessly global functions static - make dummy functions static inline Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:59 -07:00
Rafael J. Wysocki	a938c356d5	[PATCH] swsusp: take lowmem reserves into account swsusp allocates memory from the normal zone, so it cannot use lowmem reserve pages from the lower zones. Therefore it should not count these pages as available to it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:59 -07:00
Shaohua Li	ce4ab0012b	[PATCH] swsusp: add architecture special saveable pages support 1. Add architecture specific pages save/restore support. Next two patches will use this to save/restore 'ACPI NVS' pages. 2. Allow reserved pages 'nosave'. This could avoid save/restore BIOS reserved pages. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@suspend2.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:59 -07:00
Zhang Yanmin	1b61b910e9	[PATCH] x86: kernel irq balance doesn't work On i386, kernel irq balance doesn't work. 1) In function do_irq_balance, after kernel finds the min_loaded cpu but before calling set_pending_irq to really pin the selected_irq to the target cpu, kernel does a cpus_and with irq_affinity[selected_irq]. Later on, when the irq is acked, kernel would calls move_native_irq=>desc->handler->set_affinity to change the irq affinity. However, every function pointed by hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask) always changes irq_affinity[irq] to cpumask. Next time when recalling do_irq_balance, it has to do cpu_ands again with irq_affinity[selected_irq], but irq_affinity[selected_irq] already becomes one cpu selected by the first irq balance. 2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same issue. [akpm@osdl.org: cleanups] Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:57 -07:00
David Quigley	22fb52dd73	[PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c) Add a security hook call to enable security modules to control the ability to attach a task to a cpuset. While limited control over this operation is possible via permission checks on the pseudo fs interface, those checks are not sufficient to control access to the target task, which is looked up in this function. The existing task_setscheduler hook is re-used for this operation since this falls under the same class of operations. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:54 -07:00
David Quigley	e7834f8fcc	[PATCH] SELinux: add security hooks to {get,set}affinity This patch adds LSM hooks into the setaffinity and getaffinity functions to enable security modules to control these operations between tasks with task_setscheduler and task_getscheduler LSM hooks. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:53 -07:00
Christoph Lameter	9216dfad4f	[PATCH] move_pages: fix 32 -> 64 bit compat function The definition of the third parameter is a pointer to an array of virtual addresses which give us some trouble. The existing code calculated the wrong address in the array since I used void to avoid having to specify a type. I now use the correct type "compat_uptr_t __user *" in the definition of the function in kernel/compat.c. However, I used __u32 in syscalls.h. Would have to include compat.h there in order to provide the same definition which would generate an ugly include situation. On both ia64 and x86_64 compat_uptr_t is u32. So this works although parameter declarations differ. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:53 -07:00
Christoph Lameter	1b2db9fb7a	[PATCH] sys_move_pages: 32bit support (i386, x86_64) sys_move_pages() support for 32bit (i386 plus x86_64 compat layer) Add support for move_pages() on i386 and also add the compat functions necessary to run 32 bit binaries on x86_64. Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is not up to date so I added the missing pieces. Not sure if this is done the right way. [akpm@osdl.org: compile fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:53 -07:00
Christoph Lameter	742755a1d8	[PATCH] page migration: sys_move_pages(): support moving of individual pages move_pages() is used to move individual pages of a process. The function can be used to determine the location of pages and to move them onto the desired node. move_pages() returns status information for each page. long move_pages(pid, number_of_pages_to_move, addresses_of_pages[], nodes[] or NULL, status[], flags); The addresses of pages is an array of void * pointing to the pages to be moved. The nodes array contains the node numbers that the pages should be moved to. If a NULL is passed instead of an array then no pages are moved but the status array is updated. The status request may be used to determine the page state before issuing another move_pages() to move pages. The status array will contain the state of all individual page migration attempts when the function terminates. The status array is only valid if move_pages() completed successfullly. Possible page states in status[]: 0..MAX_NUMNODES The page is now on the indicated node. -ENOENT Page is not present -EACCES Page is mapped by multiple processes and can only be moved if MPOL_MF_MOVE_ALL is specified. -EPERM The page has been mlocked by a process/driver and cannot be moved. -EBUSY Page is busy and cannot be moved. Try again later. -EFAULT Invalid address (no VMA or zero page). -ENOMEM Unable to allocate memory on target node. -EIO Unable to write back page. The page must be written back in order to move it since the page is dirty and the filesystem does not provide a migration function that would allow the moving of dirty pages. -EINVAL A dirty page cannot be moved. The filesystem does not provide a migration function and has no ability to write back pages. The flags parameter indicates what types of pages to move: MPOL_MF_MOVE Move pages that are only mapped by the process. MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes. Requires sufficient capabilities. Possible return codes from move_pages() -ENOENT No pages found that would require moving. All pages are either already on the target node, not present, had an invalid address or could not be moved because they were mapped by multiple processes. -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt to migrate pages in a kernel thread. -EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges. or an attempt to move a process belonging to another user. -EACCES One of the target nodes is not allowed by the current cpuset. -ENODEV One of the target nodes is not online. -ESRCH Process does not exist. -E2BIG Too many pages to move. -ENOMEM Not enough memory to allocate control array. -EFAULT Parameters could not be accessed. A test program for move_pages() may be found with the patches on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3 From: Christoph Lameter <clameter@sgi.com> Detailed results for sys_move_pages() Pass a pointer to an integer to get_new_page() that may be used to indicate where the completion status of a migration operation should be placed. This allows sys_move_pags() to report back exactly what happened to each page. Wish there would be a better way to do this. Looks a bit hacky. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:53 -07:00
Rafael J. Wysocki	d6277db4ab	[PATCH] swsusp: rework memory shrinker Rework the swsusp's memory shrinker in the following way: - Simplify balance_pgdat() by removing all of the swsusp-related code from it. - Make shrink_all_memory() use shrink_slab() and a new function shrink_all_zones() which calls shrink_active_list() and shrink_inactive_list() directly for each zone in a way that's optimized for suspend. In shrink_all_memory() we try to free exactly as many pages as the caller asks for, preferably in one shot, starting from easier targets. If slab caches are huge, they are most likely to have enough pages to reclaim. The inactive lists are next (the zones with more inactive pages go first) etc. Each time shrink_all_memory() attempts to shrink the active and inactive lists for each zone in 5 passes. In the first pass, only the inactive lists are taken into consideration. In the next two passes the active lists are also shrunk, but mapped pages are not reclaimed. In the last two passes the active and inactive lists are shrunk and mapped pages are reclaimed as well. The aim of this is to alter the reclaim logic to choose the best pages to keep on resume and improve the responsiveness of the resumed system. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:48 -07:00
KAMEZAWA Hiroyuki	fadd8fbd15	[PATCH] support for panic at OOM This patch adds panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modifies it. In general, oom_killer works well and kill rogue processes. So the whole system can survive. But there are environments where panic is preferable rather than kill some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:47 -07:00
David Howells	726c334223	[PATCH] VFS: Permit filesystem to perform statfs with a known root dentry Give the statfs superblock operation a dentry pointer rather than a superblock pointer. This complements the get_sb() patch. That reduced the significance of sb->s_root, allowing NFS to place a fake root there. However, NFS does require a dentry to use as a target for the statfs operation. This permits the root in the vfsmount to be used instead. linux/mount.h has been added where necessary to make allyesconfig build successfully. Interest has also been expressed for use with the FUSE and XFS filesystems. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:45 -07:00
David Howells	454e2398be	[PATCH] VFS: Permit filesystem to override root dentry on mount Extend the get_sb() filesystem operation to take an extra argument that permits the VFS to pass in the target vfsmount that defines the mountpoint. The filesystem is then required to manually set the superblock and root dentry pointers. For most filesystems, this should be done with simple_set_mnt() which will set the superblock pointer and then set the root dentry to the superblock's s_root (as per the old default behaviour). The get_sb() op now returns an integer as there's now no need to return the superblock pointer. This patch permits a superblock to be implicitly shared amongst several mount points, such as can be done with NFS to avoid potential inode aliasing. In such a case, simple_set_mnt() would not be called, and instead the mnt_root and mnt_sb would be set directly. The patch also makes the following changes: () the get_sb_() convenience functions in the core kernel now take a vfsmount pointer argument and return an integer, so most filesystems have to change very little. () If one of the convenience function is not used, then get_sb() should normally call simple_set_mnt() to instantiate the vfsmount. This will always return 0, and so can be tail-called from get_sb(). () generic_shutdown_super() now calls shrink_dcache_sb() to clean up the dcache upon superblock destruction rather than shrink_dcache_anon(). This is required because the superblock may now have multiple trees that aren't actually bound to s_root, but that still need to be cleaned up. The currently called functions assume that the whole tree is rooted at s_root, and that anonymous dentries are not the roots of trees which results in dentries being left unculled. However, with the way NFS superblock sharing are currently set to be implemented, these assumptions are violated: the root of the filesystem is simply a dummy dentry and inode (the real inode for '/' may well be inaccessible), and all the vfsmounts are rooted on anonymous[] dentries with child trees. [] Anonymous until discovered from another tree. () The documentation has been adjusted, including the additional bit of changing ext2_ into foo_* in the documentation. [akpm@osdl.org: convert ipath_fs, do other stuff] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-23 07:42:45 -07:00
Linus Torvalds	45c091bb2d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc * git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits) [POWERPC] re-enable OProfile for iSeries, using timer interrupt [POWERPC] support ibm,extended-*-frequency properties [POWERPC] Extra sanity check in EEH code [POWERPC] Dont look for class-code in pci children [POWERPC] Fix mdelay badness on shared processor partitions [POWERPC] disable floating point exceptions for init [POWERPC] Unify ppc syscall tables [POWERPC] mpic: add support for serial mode interrupts [POWERPC] pseries: Print PCI slot location code on failure [POWERPC] spufs: one more fix for 64k pages [POWERPC] spufs: fail spu_create with invalid flags [POWERPC] spufs: clear class2 interrupt status before wakeup [POWERPC] spufs: fix Makefile for "make clean" [POWERPC] spufs: remove stop_code from struct spu [POWERPC] spufs: fix spu irq affinity setting [POWERPC] spufs: further abstract priv1 register access [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts [POWERPC] spufs: dont try to access SPE channel 1 count [POWERPC] spufs: use kzalloc in create_spu [POWERPC] spufs: fix initial state of wbox file ... Manually resolved conflicts in: drivers/net/phy/Makefile include/asm-powerpc/spu.h	2006-06-22 22:11:30 -07:00
Ravikiran G Thirumalai	de047c1bcd	[PATCH] avoid tasklist_lock at getrusage for multithreaded case too Avoid taking tasklist_lock for at getrusage for the multithreaded case too. We don't need to take the tasklist lock for thread traversal of a process since Oleg's do-__unhash_process-under-siglock.patch and related work. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-22 15:05:57 -07:00
Andrew Morton	6cc0719181	[PATCH] suspend_console() warning fix kernel/power/main.c: In function 'suspend_prepare': kernel/power/main.c:89: warning: implicit declaration of function 'suspend_console' kernel/power/main.c: In function 'suspend_finish': kernel/power/main.c:137: warning: implicit declaration of function 'resume_console' Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-22 15:05:56 -07:00
Michael LeMay	d720024e94	[PATCH] selinux: add hooks for key subsystem Introduce SELinux hooks to support the access key retention subsystem within the kernel. Incorporate new flask headers from a modified version of the SELinux reference policy, with support for the new security class representing retained keys. Extend the "key_alloc" security hook with a task parameter representing the intended ownership context for the key being allocated. Attach security information to root's default keyrings within the SELinux initialization routine. Has passed David's testsuite. Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-22 15:05:55 -07:00
Linus Torvalds	d9eaec9e29	Merge branch 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (25 commits) [PATCH] make set_loginuid obey audit_enabled [PATCH] log more info for directory entry change events [PATCH] fix AUDIT_FILTER_PREPEND handling [PATCH] validate rule fields' types [PATCH] audit: path-based rules [PATCH] Audit of POSIX Message Queue Syscalls v.2 [PATCH] fix se_sen audit filter [PATCH] deprecate AUDIT_POSSBILE [PATCH] inline more audit helpers [PATCH] proc_loginuid_write() uses simple_strtoul() on non-terminated array [PATCH] update of IPC audit record cleanup [PATCH] minor audit updates [PATCH] fix audit_krule_to_{rule,data} return values [PATCH] add filtering by ppid [PATCH] log ppid [PATCH] collect sid of those who send signals to auditd [PATCH] execve argument logging [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES [PATCH] audit_panic() is audit-internal [PATCH] inotify (5/5): update kernel documentation ... Manual fixup of conflict in unclude/linux/inotify.h	2006-06-20 15:37:56 -07:00
Linus Torvalds	2edc322d42	Merge git://git.infradead.org/~dwmw2/rbtree-2.6 * git://git.infradead.org/~dwmw2/rbtree-2.6: [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency Update UML kernel/physmem.c to use rb_parent() accessor macro [RBTREE] Update hrtimers to use rb_parent() accessor macro. [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node. [RBTREE] Merge colour and parent fields of struct rb_node. [RBTREE] Remove dead code in rb_erase() [RBTREE] Update JFFS2 to use rb_parent() accessor macro. [RBTREE] Update eventpoll.c to use rb_parent() accessor macro. [RBTREE] Update key.c to use rb_parent() accessor macro. [RBTREE] Update ext3 to use rb_parent() accessor macro. [RBTREE] Change rbtree off-tree marking in I/O schedulers. [RBTREE] Add accessor macros for colour and parent fields of rb_node	2006-06-20 14:51:22 -07:00
Linus Torvalds	be967b7e2f	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (199 commits) [MTD] NAND: Fix breakage all over the place [PATCH] NAND: fix remaining OOB length calculation [MTD] NAND Fixup NDFC merge brokeness [MTD NAND] S3C2410 driver cleanup [MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation. [JFFS2] Check CRC32 on dirent and data nodes each time they're read [JFFS2] When retiring nextblock, allocate a node_ref for the wasted space [JFFS2] Mark XATTR support as experimental, for now [JFFS2] Don't trust node headers before the CRC is checked. [MTD] Restore MTD_ROM and MTD_RAM types [MTD] assume mtd->writesize is 1 for NOR flashes [MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles [MTD] Prepare physmap for 64-bit-resources [JFFS2] Fix more breakage caused by janitorial meddling. [JFFS2] Remove stray __exit from jffs2_compressors_exit() [MTD] Allow alternate JFFS2 mount variant for root filesystem. [MTD] Disconnect struct mtd_info from ABI [MTD] replace MTD_RAM with MTD_GENERIC_TYPE [MTD] replace MTD_ROM with MTD_GENERIC_TYPE [MTD] remove a forgotten MTD_XIP ...	2006-06-20 14:50:31 -07:00
Steve Grubb	41757106b9	[PATCH] make set_loginuid obey audit_enabled Hi, I was doing some testing and noticed that when the audit system was disabled, I was still getting messages about the loginuid being set. The following patch makes audit_set_loginuid look at in_syscall to determine if it should create an audit event. The loginuid will continue to be set as long as there is a context. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:29 -04:00
Amy Griffis	9c937dcc71	[PATCH] log more info for directory entry change events When an audit event involves changes to a directory entry, include a PATH record for the directory itself. A few other notable changes: - fixed audit_inode_child() hooks in fsnotify_move() - removed unused flags arg from audit_inode() - added audit log routines for logging a portion of a string Here's some sample output. before patch: type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255 type=CWD msg=audit(1149821605.320:26): cwd="/root" type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0 after patch: type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255 type=CWD msg=audit(1149822032.332:24): cwd="/root" type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0 type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0 Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:28 -04:00
Amy Griffis	6a2bceec0e	[PATCH] fix AUDIT_FILTER_PREPEND handling Clear AUDIT_FILTER_PREPEND flag after adding rule to list. This fixes three problems when a rule is added with the -A syntax: - auditctl displays filter list as "(null)" - the rule cannot be removed using -d - a duplicate rule can be added with -a Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:28 -04:00
Al Viro	0a73dccc4f	[PATCH] validate rule fields' types Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:27 -04:00
Amy Griffis	f368c07d72	[PATCH] audit: path-based rules In this implementation, audit registers inotify watches on the parent directories of paths specified in audit rules. When audit's inotify event handler is called, it updates any affected rules based on the filesystem event. If the parent directory is renamed, removed, or its filesystem is unmounted, audit removes all rules referencing that inotify watch. To keep things simple, this implementation limits location-based auditing to the directory entries in an existing directory. Given a path-based rule for /foo/bar/passwd, the following table applies: passwd modified -- audit event logged passwd replaced -- audit event logged, rules list updated bar renamed -- rule removed foo renamed -- untracked, meaning that the rule now applies to the new location Audit users typically want to have many rules referencing filesystem objects, which can significantly impact filtering performance. This patch also adds an inode-number-based rule hash to mitigate this situation. The patch is relative to the audit git tree: http://kernel.org/git/?p=linux/kernel/git/viro/audit-current.git;a=summary and uses the inotify kernel API: http://lkml.org/lkml/2006/6/1/145 Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:27 -04:00
George C. Wilson	20ca73bc79	[PATCH] Audit of POSIX Message Queue Syscalls v.2 This patch adds audit support to POSIX message queues. It applies cleanly to the lspp.b15 branch of Al Viro's git tree. There are new auxiliary data structures, and collection and emission routines in kernel/auditsc.c. New hooks in ipc/mqueue.c collect arguments from the syscalls. I tested the patch by building the examples from the POSIX MQ library tarball. Build them -lrt, not against the old MQ library in the tarball. Here's the URL: http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive, mq_notify, mq_getsetattr. mq_unlink has no new hooks. Please see the corresponding userspace patch to get correct output from auditd for the new record types. [fixes folded] Signed-off-by: George Wilson <ltcgcw@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:26 -04:00
Al Viro	014149cce1	[PATCH] deprecate AUDIT_POSSBILE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:25 -04:00
Al Viro	d8945bb51a	[PATCH] inline more audit helpers pull checks for ->audit_context into inlined wrappers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:25 -04:00
Linda Knippers	ac03221a4f	[PATCH] update of IPC audit record cleanup The following patch addresses most of the issues with the IPC_SET_PERM records as described in: https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html and addresses the comments I received on the record field names. To summarize, I made the following changes: 1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM record is emitted in the failure case as well as the success case. This matches the behavior in sys_shmctl(). I could simplify the code in sys_msgctl() and semctl_down() slightly but it would mean that in some error cases we could get an IPC_SET_PERM record without an IPC record and that seemed odd. 2. No change to the IPC record type, given no feedback on the backward compatibility question. 3. Removed the qbytes field from the IPC record. It wasn't being set and when audit_ipc_obj() is called from ipcperms(), the information isn't available. If we want the information in the IPC record, more extensive changes will be necessary. Since it only applies to message queues and it isn't really permission related, it doesn't seem worth it. 4. Removed the obj field from the IPC_SET_PERM record. This means that the kern_ipc_perm argument is no longer needed. 5. Removed the spaces and renamed the IPC_SET_PERM field names. Replaced iuid and igid fields with ouid and ogid in the IPC record. I tested this with the lspp.22 kernel on an x86_64 box. I believe it applies cleanly on the latest kernel. -- ljk Signed-off-by: Linda Knippers <linda.knippers@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:24 -04:00
Serge E. Hallyn	5d136a010d	[PATCH] minor audit updates Just a few minor proposed updates. Only the last one will actually affect behavior. The rest are just misleading code. Several AUDIT_SET functions return 'old' value, but only return value <0 is checked for. So just return 0. propagate audit_set_rate_limit and audit_set_backlog_limit error values In audit_buffer_free, the audit_freelist_count was being incremented even when we discard the return buffer, so audit_freelist_count can end up wrong. This could cause the actual freelist to shrink over time, eventually threatening to degrate audit performance. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:23 -04:00
Amy Griffis	0a3b483e83	[PATCH] fix audit_krule_to_{rule,data} return values Don't return -ENOMEM when callers of these functions are checking for a NULL return. Bug noticed by Serge Hallyn. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:23 -04:00
Al Viro	3c66251e57	[PATCH] add filtering by ppid Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:22 -04:00
Al Viro	f46038ff7d	[PATCH] log ppid Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:22 -04:00
Al Viro	e1396065e0	[PATCH] collect sid of those who send signals to auditd Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:21 -04:00
Al Viro	473ae30bc7	[PATCH] execve argument logging Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:21 -04:00
Al Viro	9044e6bca5	[PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES We should not send a pile of replies while holding audit_netlink_mutex since we hold the same mutex when we receive commands. As the result, we can get blocked while sending and sit there holding the mutex while auditctl is unable to send the next command and get around to receiving what we'd sent. Solution: create skb and put them into a queue instead of sending; once we are done, send what we've got on the list. The former can be done synchronously while we are handling AUDIT_LIST or AUDIT_LIST_RULES; we are holding audit_netlink_mutex at that point. The latter is done asynchronously and without messing with audit_netlink_mutex. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:20 -04:00
Amy Griffis	2d9048e201	[PATCH] inotify (1/5): split kernel API from userspace support The following series of patches introduces a kernel API for inotify, making it possible for kernel modules to benefit from inotify's mechanism for watching inodes. With these patches, inotify will maintain for each caller a list of watches (via an embedded struct inotify_watch), where each inotify_watch is associated with a corresponding struct inode. The caller registers an event handler and specifies for which filesystem events their event handler should be called per inotify_watch. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Acked-by: Robert Love <rml@novell.com> Acked-by: John McCutchan <john@johnmccutchan.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-06-20 05:25:17 -04:00
Linus Torvalds	557240b48e	Add support for suspending and resuming the whole console subsystem Trying to suspend/resume with console messages flying all around is doomed to failure, when the devices that the messages are trying to go to are being shut down. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-19 18:16:01 -07:00
Oleg Nesterov	f53ae1dc34	[PATCH] arm_timer: remove a racy and obsolete PF_EXITING check arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state) in run_posix_cpu_timers(). However, for some reason it does so only for CPUCLOCK_PERTHREAD case (which is imho wrong). Also, this check is not reliable, PF_EXITING could be set on another cpu without any locks/barriers just after the check, so it can't prevent from attaching the timer to the exiting task. The previous patch makes this check unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-17 10:52:13 -07:00
Oleg Nesterov	30f1e3dd8c	[PATCH] run_posix_cpu_timers: remove a bogus BUG_ON() do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu to attach the timer to exiting process after that. arm_timer() tries to protect against this race, but the check is racy. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule() local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) interrupted task has ->signal == NULL. At this moment exiting task has no pending cpu timers, they were cleanuped in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from irq. John Stultz recently confirmed this bug, see http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-17 10:52:13 -07:00
Oleg Nesterov	8f17fc20bf	[PATCH] check_process_timers: fix possible lockup If the local timer interrupt happens just after do_exit() sets PF_EXITING (and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call check_process_timers() with tasklist_lock + ->siglock held and check_process_timers: t = tsk; do { .... do { t = next_thread(t); } while (unlikely(t->flags & PF_EXITING)); } while (t != tsk); the outer loop will never stop. Actually, the window is bigger. Another process can attach the timer after ->it_xxx_expires was cleared (see the next commit) and the 'if (PF_EXITING)' check in arm_timer() is racy (see the one after that). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-06-17 10:52:13 -07:00
Sam Ravnborg	b817f6feff	kbuild: check license compatibility when building modules Modules that uses GPL symbols can no longer be build with kbuild, the build will fail during the modpost step. When a GPL-incompatible module uses a EXPORT_SYMBOL_GPL_FUTURE symbol then warn during modpost so author are actually notified. The actual license compatibility check is shared with the kernel to make sure it is in sync. Patch originally from: Andreas Gruenbacher <agruen@suse.de> and Ram Pai <linuxram@us.ibm.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>	2006-06-09 21:53:55 +02:00
Anton Blanchard	651d765d0b	[PATCH] Add a prctl to change the endianness of a process. This new prctl is intended for changing the execution mode of the processor, on processors that support both a little-endian mode and a big-endian mode. It is intended for use by programs such as instruction set emulators (for example an x86 emulator on PowerPC), which may find it convenient to use the processor in an alternate endianness mode when executing translated instructions. Note that this does not imply the existence of a fully-fledged ABI for both endiannesses, or of compatibility code for converting system calls done in the non-native endianness mode. The program is expected to arrange for all of its system call arguments to be presented in the native endianness. Switching between big and little-endian mode will require some care in constructing the instruction sequence for the switch. Generally the instructions up to the instruction that invokes the prctl system call will have to be in the old endianness, and subsequent instructions will have to be in the new endianness. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>	2006-06-09 21:24:13 +10:00
Stephen Hemminger	8d16b76421	[PATCH] hrtimer: export symbols From: Stephen Hemminger <shemminger@osdl.org> I want to use the hrtimer's in the netem (Network Emulator) qdisc. But the necessary symbols aren't exported for module use. Also needed by SystemTap. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Stone, Joshua I" <joshua.i.stone@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-31 16:27:11 -07:00
Linus Torvalds	f1adad78dd	Revert "[PATCH] sched: fix interactive task starvation" This reverts commit `5ce74abe78` (and its dependent commit `8a5bc075b8`), because of audio underruns. Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed the exact cause of the underruns: "Audio underruns galore, with only ogg123 and firefox (browsing the GIT tree online is also a nice trigger by the way). If I back it out, everything is fine for me again." Cc: Rene Herman <rene.herman@keyaccess.nl> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-21 18:54:09 -07:00
Zachary Amsden	0662b71322	[PATCH] Fix a NO_IDLE_HZ timer bug Under certain timing conditions, a race during boot occurs where timer ticks are being processed on remote CPUs. The remote timer ticks can increment jiffies, and if this happens during a window when a timeout is very close to expiring but a local tick has not yet been delivered, you can end up with 1) No softirq pending 2) A local timer wheel which is not synced to jiffies 3) No high resolution timer active 4) A local timer which is supposed to fire before the current jiffies value. In this circumstance, the comparison in next_timer_interrupt overflows, because the base of the comparison for high resolution timers is jiffies, but for the softirq timer wheel, it is relative the the current base of the wheel (jiffies_base). Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-21 12:59:21 -07:00
Paul Jackson	92d1dbd274	[PATCH] cpuset: might_sleep_if check in cpuset_zones_allowed It's too easy to incorrectly call cpuset_zone_allowed() in an atomic context without __GFP_HARDWALL set, and when done, it is not noticed until a tight memory situation forces allocations to be tried outside the current cpuset. Add a 'might_sleep_if()' check, to catch this earlier on, instead of waiting for a similar check in the mutex_lock() code, which is only rarely invoked. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-21 12:59:18 -07:00
Paul Jackson	36be57ffe3	[PATCH] cpuset: update cpuset_zones_allowed comment Update the kernel/cpuset.c:cpuset_zone_allowed() comment. The rule for when mm/page_alloc.c should call cpuset_zone_allowed() was intended to be: Don't call cpuset_zone_allowed() if you can't sleep, unless you pass in the __GFP_HARDWALL flag set in gfp_flag, which disables the code that might scan up ancestor cpusets and sleep. The explanation of this rule in the comment above cpuset_zone_allowed() was stale, as a result of a restructuring of some __alloc_pages() code in November 2005. Rewrite that comment ... Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-21 12:59:18 -07:00
David Woodhouse	18594822fc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2006-05-16 01:19:52 +01:00
Trent Piepho	5e37661389	[PATCH] symbol_put_addr() locks kernel Even since a previous patch: Fix race between CONFIG_DEBUG_SLABALLOC and modules Sun, 27 Jun 2004 17:55:19 +0000 (17:55 +0000) http://www.kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=92b3db26d31cf21b70e3c1eadc56c179506d8fbe The function symbol_put_addr() will deadlock the kernel. symbol_put_addr() would acquire modlist_lock, then while holding the lock call two functions kernel_text_address() and module_text_address() which also try to acquire the same lock. This deadlocks the kernel of course. This patch changes symbol_put_addr() to not acquire the modlist_lock, it doesn't need it since it never looks at the module list directly. Also, it now uses core_kernel_text() instead of kernel_text_address(). The latter has an additional check for addr inside a module, but we don't need to do that since we call module_text_address() (the same function kernel_text_address uses) ourselves. Signed-off-by: Trent Piepho <xyzzy@speakeasy.org> Cc: Zwane Mwaikambo <zwane@fsmlabs.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Johannes Stezenbach <js@linuxtv.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-15 11:20:55 -07:00
Heiko Carstens	986733e01d	[PATCH] RCU: introduce rcu_needs_cpu() interface With "Paul E. McKenney" <paulmck@us.ibm.com> Introduce rcu_needs_cpu() interface. This can be used to tell if there will be a new rcu batch on a cpu soon by looking at the curlist pointer. This can be used to avoid to enter a tickless idle state where the cpu would miss that a new batch is ready when rcu_start_batch would be called on a different cpu. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-15 11:20:55 -07:00
Linus Torvalds	f358166a94	ptrace_attach: fix possible deadlock schenario with irqs Eric Biederman points out that we can't take the task_lock while holding tasklist_lock for writing, because another CPU that holds the task lock might take an interrupt that then tries to take tasklist_lock for writing. Which would be a nasty deadlock, with one CPU spinning forever in an interrupt handler (although admittedly you need to really work at triggering it ;) Since the ptrace_attach() code is special and very unusual, just make it be extra careful, and use trylock+repeat to avoid the possible deadlock. Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-11 11:08:49 -07:00
David Woodhouse	6f18a022fb	Finally remove the obnoxious inter_module_xxx() This was already a bad plan when I argued against adding it in the first place. Good riddance. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2006-05-08 22:40:05 +01:00
Linus Torvalds	f5b40e363a	Fix ptrace_attach()/ptrace_traceme()/de_thread() race This holds the task lock (and, for ptrace_attach, the tasklist_lock) over the actual attach event, which closes a race between attacking to a thread that is either doing a PTRACE_TRACEME or getting de-threaded. Thanks to Oleg Nesterov for reminding me about this, and Chris Wright for noticing a lost return value in my first version. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-05-07 10:49:33 -07:00
Steve Grubb	2ad312d209	[PATCH] Audit Filter Performance While testing the watch performance, I noticed that selinux_task_ctxid() was creeping into the results more than it should. Investigation showed that the function call was being called whether it was needed or not. The below patch fixes this. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:10:07 -04:00
Steve Grubb	073115d6b2	[PATCH] Rework of IPC auditing 1) The audit_ipc_perms() function has been split into two different functions: - audit_ipc_obj() - audit_ipc_set_perm() There's a key shift here... The audit_ipc_obj() collects the uid, gid, mode, and SElinux context label of the current ipc object. This audit_ipc_obj() hook is now found in several places. Most notably, it is hooked in ipcperms(), which is called in various places around the ipc code permforming a MAC check. Additionally there are several places where *checkid() is used to validate that an operation is being performed on a valid object while not necessarily having a nearby ipcperms() call. In these locations, audit_ipc_obj() is called to ensure that the information is captured by the audit system. The audit_set_new_perm() function is called any time the permissions on the ipc object changes. In this case, the NEW permissions are recorded (and note that an audit_ipc_obj() call exists just a few lines before each instance). 2) Support for an AUDIT_IPC_SET_PERM audit message type. This allows for separate auxiliary audit records for normal operations on an IPC object and permissions changes. Note that the same struct audit_aux_data_ipcctl is used and populated, however there are separate audit_log_format statements based on the type of the message. Finally, the AUDIT_IPC block of code in audit_free_aux() was extended to handle aux messages of this new type. No more mem leaks I hope ;-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:10:04 -04:00
Steve Grubb	ce29b682e2	[PATCH] More user space subject labels Hi, The patch below builds upon the patch sent earlier and adds subject label to all audit events generated via the netlink interface. It also cleans up a few other minor things. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:10:01 -04:00
Steve Grubb	e7c3497013	[PATCH] Reworked patch for labels on user space messages The below patch should be applied after the inode and ipc sid patches. This patch is a reworking of Tim's patch that has been updated to match the inode and ipc patches since its similar. [updated: > Stephen Smalley also wanted to change a variable from isec to tsec in the > user sid patch. ] Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:09:58 -04:00
Steve Grubb	9c7aa6aa74	[PATCH] change lspp ipc auditing Hi, The patch below converts IPC auditing to collect sid's and convert to context string only if it needs to output an audit record. This patch depends on the inode audit change patch already being applied. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:09:56 -04:00
Steve Grubb	1b50eed9ca	[PATCH] audit inode patch Previously, we were gathering the context instead of the sid. Now in this patch, we gather just the sid and convert to context only if an audit event is being output. This patch brings the performance hit from 146% down to 23% Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:09:53 -04:00
Darrel Goeddel	3dc7e3153e	[PATCH] support for context based audit filtering, part 2 This patch provides the ability to filter audit messages based on the elements of the process' SELinux context (user, role, type, mls sensitivity, and mls clearance). It uses the new interfaces from selinux to opaquely store information related to the selinux context and to filter based on that information. It also uses the callback mechanism provided by selinux to refresh the information when a new policy is loaded. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:09:36 -04:00
Al Viro	97e94c4530	[PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:06:21 -04:00
Al Viro	5411be59db	[PATCH] drop task argument of audit_syscall_{entry,exit} ... it's always current, and that's a good thing - allows simpler locking. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:06:18 -04:00
Al Viro	e495149b17	[PATCH] drop gfp_mask in audit_log_exit() now we can do that - all callers are process-synchronous and do not hold any locks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:06:16 -04:00
Al Viro	fa84cb935d	[PATCH] move call of audit_free() into do_exit() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:06:13 -04:00
Al Viro	45d9bb0e37	[PATCH] deal with deadlocks in audit_free() Don't assume that audit_log_exit() et.al. are called for the context of current; pass task explictly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-05-01 06:06:07 -04:00
Andrew Morton	13e87ec686	[PATCH] request_irq(): remove warnings from irq probing - Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning. Some drivers like to use request_irq() to find an unused interrupt slot. - Use it in i82365.c - Kill unused SA_PROBE. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-28 08:33:46 -07:00
dean gaudet	47bb789973	[PATCH] off-by-1 in kernel/power/main.c There's an off-by-1 in kernel/power/main.c:state_store() ... if your kernel just happens to have some non-zero data at pm_states[PM_SUSPEND_MAX] (i.e. one past the end of the array) then it'll let you write anything you want to /sys/power/state and in response the box will enter S5. Signed-off-by: dean gaudet <dean@arctic.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-28 08:33:46 -07:00
Chandra Seetharaman	83d722f7e1	[PATCH] Remove __devinit and __cpuinit from notifier_call definitions Few of the notifier_chain_register() callers use __init in the definition of notifier_call. It is incorrect as the function definition should be available after the initializations (they do not unregister them during initializations). This patch fixes all such usages to _not_ have the notifier_call __init section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-26 08:30:03 -07:00
Chandra Seetharaman	649bbaa484	[PATCH] Remove __devinitdata from notifier block definitions Few of the notifier_chain_register() callers use __devinitdata in the definition of notifier_block data structure. It is incorrect as the data structure should be available after the initializations (they do not unregister them during initializations). This was leading to an oops when notifier_chain_register() call is invoked for those callback chains after initialization. This patch fixes all such usages to _not_ have the notifier_block data structure in the init data section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-26 08:27:50 -07:00
David Woodhouse	ed198cb497	[RBTREE] Update hrtimers to use rb_parent() accessor macro. Also switch it to use the same method of using off-tree nodes as everyone else now does -- set them to point to themselves. Signed-off-by: David Woodhouse <dwmw2@infradead.org>	2006-04-22 02:38:50 +01:00
Linus Torvalds	402a26f0c0	Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block * 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block: [PATCH] block/elevator.c: remove unused exports [PATCH] splice: fix smaller sized splice reads [PATCH] Don't inherit ->splice_pipe across forks [patch] cleanup: use blk_queue_stopped [PATCH] Document online io scheduler switching	2006-04-20 08:17:04 -07:00
Ananth N Mavinakayanahalli	7522a8423b	[PATCH] kprobes: NULL out non-relevant fields in struct kretprobe In cases where a struct kretprobe's _handler fields are non-NULL, it is possible to cause a system crash, due to the possibility of calls ending up in zombie functions. Documentation clearly states that unused _handlers should be set to NULL, but kprobe users sometimes fail to do so. Fix it by setting the non-relevant fields of the struct kretprobe to NULL. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-20 07:54:03 -07:00
Jens Axboe	a0aa7f68af	[PATCH] Don't inherit ->splice_pipe across forks It's really task private, so clear that field on fork after copying task structure. Signed-off-by: Jens Axboe <axboe@suse.de>	2006-04-20 13:05:33 +02:00
OGAWA Hirofumi	5a7b46b369	[PATCH] Add more prevent_tail_call() Those also break userland regs like following. 00000000 <sys_chown16>: 0: 0f b7 44 24 0c movzwl 0xc(%esp),%eax 5: 83 ca ff or $0xffffffff,%edx 8: 0f b7 4c 24 08 movzwl 0x8(%esp),%ecx d: 66 83 f8 ff cmp $0xffffffff,%ax 11: 0f 44 c2 cmove %edx,%eax 14: 66 83 f9 ff cmp $0xffffffff,%cx 18: 0f 45 d1 cmovne %ecx,%edx 1b: 89 44 24 0c mov %eax,0xc(%esp) 1f: 89 54 24 08 mov %edx,0x8(%esp) 23: e9 fc ff ff ff jmp 24 <sys_chown16+0x24> where the tailcall at the end overwrites the incoming stack-frame. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> [ I would _really_ like to have a way to tell gcc about calling conventions. The "prevent_tail_call()" macro is pretty ugly ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-19 16:27:18 -07:00
Rafael J. Wysocki	4a3b98a422	[PATCH] swsusp: prevent possible image corruption on resume The function free_pagedir() used by swsusp for freeing its internal data structures clears the PG_nosave and PG_nosave_free flags for each page being freed. However, during resume PG_nosave_free set means that the page in question is "unsafe" (ie. it will be overwritten in the process of restoring the saved system state from the image), so it should not be used for the image data. Therefore free_pagedir() should not clear PG_nosave_free if it's called during resume (otherwise "unsafe" pages freed by it may be used for storing the image data and the data may get corrupted later on). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-19 09:13:49 -07:00
Eric W. Biederman	5e85d4abe3	[PATCH] task: Make task list manipulations RCU safe While we can currently walk through thread groups, process groups, and sessions with just the rcu_read_lock, this opens the door to walking the entire task list. We already have all of the other RCU guarantees so there is no cost in doing this, this should be enough so that proc can stop taking the tasklist lock during readdir. prev_task was killed because it has no users, and using it will miss new tasks when doing an rcu traversal. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-19 09:13:49 -07:00
Eric W. Biederman	64541d1970	[PATCH] kill unushed __put_task_struct_cb Somehow in the midst of dotting i's and crossing t's during the merge up to rc1 we wound up keeping __put_task_struct_cb when it should have been killed as it no longer has any users. Sorry I probably should have caught this while it was still in the -mm tree. Having the old code there gets confusing when reading through the code and trying to understand what is happening. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-14 17:43:57 -07:00
Adrian Bunk	78a596b449	[PATCH] remove kernel/power/pm.c:pm_unregister() Since the last user is removed in -mm, we can now remove this long deprecated function. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2006-04-14 12:25:26 -07:00
Roland McGrath	e57a505984	[PATCH] fix non-leader exec under ptrace This reverts most of commit `30e0fca6c1`. It broke the case of non-leader MT exec when ptraced. I think the bug it was intended to fix was already addressed by commit `788e05a67c`. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-14 08:59:13 -07:00
Oleg Nesterov	a145410dcc	[PATCH] __group_complete_signal: remove bogus BUG_ON Commit `e56d090310` [PATCH] RCU signal handling made this BUG_ON() unsafe. This code runs under ->siglock, while switch_exec_pids() takes tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 07:34:01 -07:00
Linus Torvalds	88dd9c16ce	Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block * 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block: [PATCH] vfs: add splice_write and splice_read to documentation [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_* [PATCH] splice: warning fix [PATCH] another round of fs/pipe.c cleanups [PATCH] splice: comment styles [PATCH] splice: add Ingo as addition copyright holder [PATCH] splice: unlikely() optimizations [PATCH] splice: speedups and optimizations [PATCH] pipe.c/fifo.c code cleanups [PATCH] get rid of the PIPE_*() macros [PATCH] splice: speedup __generic_file_splice_read [PATCH] splice: add direct fd <-> fd splicing support [PATCH] splice: add optional input and output offsets [PATCH] introduce a "kernel-internal pipe object" abstraction [PATCH] splice: be smarter about calling do_page_cache_readahead() [PATCH] splice: optimize the splice buffer mapping [PATCH] splice: cleanup __generic_file_splice_read() [PATCH] splice: only call wake_up_interruptible() when we really have to [PATCH] splice: potential !page dereference [PATCH] splice: mark the io page as accessed	2006-04-11 06:34:02 -07:00
Joe Korty	5ef37b1964	[PATCH] add cpu_relax to hrtimer_cancel Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel(). Signed-off-by: Joe Korty <joe.korty@ccur.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:42 -07:00
Christoph Hellwig	d824e66a9a	[PATCH] build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is set Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:41 -07:00
Adrian Bunk	aa7271076a	[PATCH] the scheduled unexport of panic_timeout Implement the scheduled unexport of panic_timeout. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:40 -07:00
Andrew Morton	ba6edfcd17	[PATCH] timer initialisation fix We need the boot CPU's tvec_bases[] entry to be initialised super-early in boot, for early_serial_setup(). That runs within setup_arch(), before even per-cpu areas are initialised. The patch changes tvec_bases to use compile-time initialisation, and adds a separate array `tvec_base_done' to keep track of which CPU has had its tvec_bases[] entry initialised (because we can no longer use the zeroness of that tvec_bases[] entry to determine whether it has been initialised). Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this. Cc: Eugene Surovegin <ebs@ebshome.net> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:40 -07:00
Hyok S. Choi	3016b42153	[PATCH] frv: define MMU mode specific syscalls as 'cond_syscall' and clean up unneeded macros For some architectures, a few syscalls are not linked in noMMU mode. In that case, the MMU depending syscalls are needed to be defined as 'cond_syscall'. For example, ARM architecture selectively links sys_mlock by the mode configuration. In case of FRV, it has been managed by #ifdef CONFIG_MMU macro in arch/frv/kernel/entry.S. However these conditional macros are just duplicates if they were defined as cond_syscall. Compilation test is done with FRV toolchains for both of MMU and noMMU mode. Signed-off-by: Hyok S. Choi <hyok.choi@samsung.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:33 -07:00
Mike Galbraith	8a5bc075b8	[PATCH] sched: don't awaken RT tasks on expired array RT tasks are being awakened on the expired array when expired_starving() is true, whereas they really should be excluded. Fix. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:30 -07:00
Mike Galbraith	5ce74abe78	[PATCH] sched: fix interactive task starvation Fix a starvation problem that occurs when a stream of highly interactive tasks delay an array switch for extended periods despite EXPIRED_STARVING(rq) being true. AFAIKT, the only choice is to enqueue awakening tasks on the expired array in this case. Without this patch, it can be nearly impossible to remotely login to a busy server, and interactive shell commands can starve for minutes. Also, convert the EXPIRED_STARVING macro into an inline function which humans can understand. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-11 06:18:30 -07:00
Jens Axboe	b92ce55893	[PATCH] splice: add direct fd <-> fd splicing support It's more efficient for sendfile() emulation. Basically we cache an internal private pipe and just use that as the intermediate area for pages. Direct splicing is not available from sys_splice(), it is only meant to be used for sendfile() emulation. Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at exit for the normal fast path. Signed-off-by: Jens Axboe <axboe@suse.de>	2006-04-11 13:52:07 +02:00
Jordan Hargrave	b20367a6c2	[PATCH] x86_64: Fix drift with HPET timer enabled If the HPET timer is enabled, the clock can drift by ~3 seconds a day. This is due to the HPET timer not being initialized with the correct setting (still using PIT count). If HZ changes, this drift can become even more pronounced. HPET patch initializes tick_nsec with correct tick_nsec settings for HPET timer. Vojtech comments: "It's not entirely correct (it assumes the HPET ticks totally exactly), but it's significantly better than assuming the PIT error there." Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-04-09 11:53:53 -07:00
Eric Sesterhenn	9f31252cb6	BUG_ON() Conversion in kernel/signal.c this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-02 13:45:55 +02:00
Eric Sesterhenn	fda8bd78a1	BUG_ON() Conversion in kernel/signal.c this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-02 13:44:47 +02:00
Eric Sesterhenn	524223ca81	BUG_ON() Conversion in kernel/ptrace.c this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-02 13:43:40 +02:00
Kalin KOZHUHAROV	8ba8e95ed1	Fix comments: s/granuality/granularity/ I was grepping through the code and some `grep ganularity -R .` didn't catch what I thought. Then looking closer I saw the term "granuality" used in only four places (in comments) and granularity in many more places describing the same idea. Some other facts: dictionary.com does not know such a word define:granuality on google is not found (and pages for granuality are mostly related to patches to the kernel) it has not been discussed as a term on LKML, AFAICS (=Can Search) To be consistent, I think granularity should be used everywhere. Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-01 01:41:22 +02:00
Eric Sesterhenn	8abd8e298e	BUG_ON() Conversion in kernel/printk.c this changes if() BUG(); constructs to BUG_ON() which is cleaner, contains unlikely() and can better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-01 01:21:17 +02:00
Adrian Bunk	3e6e952d1d	help text: SOFTWARE_SUSPEND doesn't need ACPI The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays the information that it doesn't need ACPI, too, is even more helpful. Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-04-01 01:03:08 +02:00
Kirill Korotaev	4286229868	[PATCH] wrong error path in dup_fd() leading to oopses in RCU Wrong error path in dup_fd() - it should return NULL on error, not an address of already freed memory :/ Triggered by OpenVZ stress test suite. What is interesting is that it was causing different oopses in RCU like below: Call Trace: [<c013492c>] rcu_do_batch+0x2c/0x80 [<c0134bdd>] rcu_process_callbacks+0x3d/0x70 [<c0126cf3>] tasklet_action+0x73/0xe0 [<c01269aa>] __do_softirq+0x10a/0x130 [<c01058ff>] do_softirq+0x4f/0x60 ======================= [<c0113817>] smp_apic_timer_interrupt+0x77/0x110 [<c0103b54>] apic_timer_interrupt+0x1c/0x24 Code: Bad EIP value. <0>Kernel panic - not syncing: Fatal exception in interrupt Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Dmitry Mishin <dim@openvz.org> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-Off-By: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:25:46 -08:00
Eric W. Biederman	92476d7fc0	[PATCH] pidhash: Refactor the pid hash table Simplifies the code, reduces the need for 4 pid hash tables, and makes the code more capable. In the discussions I had with Oleg it was felt that to a large extent the cleanup itself justified the work. With struct pid being dynamically allocated meant we could create the hash table entry when the pid was allocated and free the hash table entry when the pid was freed. Instead of playing with the hash lists when ever a process would attach or detach to a process. For myself the fact that it gave what my previous task_ref patch gave for free with simpler code was a big win. The problem is that if you hold a reference to struct task_struct you lock in 10K of low memory. If you do that in a user controllable way like /proc does, with an unprivileged but hostile user space application with typical resource limits of 1000 fds and 100 processes I can trigger the OOM killer by consuming all of low memory with task structs, on a machine wight 1GB of low memory. If I instead hold a reference to struct pid which holds a pointer to my task_struct, I don't suffer from that problem because struct pid is 2 orders of magnitude smaller. In fact struct pid is small enough that most other kernel data structures dwarf it, so simply limiting the number of referring data structures is enough to prevent exhaustion of low memory. This splits the current struct pid into two structures, struct pid and struct pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one. struct pid_link is the per process linkage into the hash tables and lives in struct task_struct. struct pid is given an indepedent lifetime, and holds pointers to each of the pid types. The independent life of struct pid simplifies attach_pid, and detach_pid, because we are always manipulating the list of pids and not the hash table. In addition in giving struct pid an indpendent life it makes the concept much more powerful. Kernel data structures can now embed a struct pid * instead of a pid_t and not suffer from pid wrap around problems or from keeping unnecessarily large amounts of memory allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:19:00 -08:00
Eric W. Biederman	8c7904a00b	[PATCH] task: RCU protect task->usage A big problem with rcu protected data structures that are also reference counted is that you must jump through several hoops to increase the reference count. I think someone finally implemented atomic_inc_not_zero(&count) to automate the common case. Unfortunately this means you must special case the rcu access case. When data structures are only visible via rcu in a manner that is not determined by the reference count on the object (i.e. tasks are visible until their zombies are reaped) there is a much simpler technique we can employ. Simply delaying the decrement of the reference count until the rcu interval is over. What that means is that the proc code that looks up a task and later wants to sleep can now do: rcu_read_lock(); task = find_task_by_pid(some_pid); if (task) { get_task_struct(task); } rcu_read_unlock(); The effect on the rest of the kernel is that put_task_struct becomes cheaper and immediate, and in the case where the task has been reaped it frees the task immediate instead of unnecessarily waiting an until the rcu interval is over. Cleanup of task_struct does not happen when its reference count drops to zero, instead cleanup happens when release_task is called. Tasks can only be looked up via rcu before release_task is called. All rcu protected members of task_struct are freed by release_task. Therefore we can move call_rcu from put_task_struct into release_task. And we can modify release_task to not immediately release the reference count but instead have it call put_task_struct from the function it gives to call_rcu. The end result: - get_task_struct is safe in an rcu context where we have just looked up the task. - put_task_struct() simplifies into its old pre rcu self. This reorganization also makes put_task_struct uncallable from modules as it is not exported but it does not appear to be called from any modules so this should not be an issue, and is trivially fixed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Andrew Morton	158d9ebd19	[PATCH] resurrect __put_task_struct This just got nuked in mainline. Bring it back because Eric's patches use it. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Eric W. Biederman	390e2ff077	[PATCH] Make setsid() more robust The core problem: setsid fails if it is called by init. The effect in 2.6.16 and the earlier kernels that have this problem is that if you do a "ps -j 1 or ps -ej 1" you will see that init and several of it's children have process group and session == 0. Instead of process group == session == 1. Despite init calling setsid. The reason it fails is that daemonize calls set_special_pids(1,1) on kernel threads that are launched before /sbin/init is called. The only remaining effect in that current->signal->leader == 0 for init instead of 1. And the setsid call fails. No one has noticed because /sbin/init does not check the return value of setsid. In 2.4 where we don't have the pidhash table, and daemonize doesn't exist setsid actually works for init. I care a lot about pid == 1 not being a special case that we leave broken, because of the container/jail work that I am doing. - Carefully allow init (pid == 1) to call setsid despite the kernel using its session. - Use find_task_by_pid instead of find_pid because find_pid taking a pidtype is going away. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Thomas Gleixner	9741ef964d	[PATCH] futex: check and validate timevals The futex timeval is not checked for correctness. The change does not break existing applications as the timeval is supplied by glibc (and glibc always passes a correct value), but the glibc-internal tests for this functionality fail. Signed-off-by: Thomas Gleixner <tglx@tglx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Con Kolivas	d425b274ba	[PATCH] sched: activate SCHED BATCH expired To increase the strength of SCHED_BATCH as a scheduling hint we can activate batch tasks on the expired array since by definition they are latency insensitive tasks. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Con Kolivas	7c4bb1f9b3	[PATCH] sched: remove on runqueue requeueing On runqueue time is used to elevate priority in schedule(). In the code it currently requeues tasks even if their priority is not elevated, which would end up placing them at the end of their runqueue array effectively delaying them instead of improving their priority. Bug spotted by Mike Galbraith <efault@gmx.de> This patch removes this requeueing. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Con Kolivas	5138930e6a	[PATCH] sched: include noninteractive sleep in idle detect Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so they need to be included in the idle detection code. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:59 -08:00
Con Kolivas	e72ff0bb2c	[PATCH] sched: dont decrease idle sleep avg We watch for tasks that sleep extended periods and don't allow one single prolonged sleep period from elevating priority to maximum bonus to prevent cpu bound tasks from getting high priority with single long sleeps. There is a bug in the current code that also penalises tasks that already have high priority. Correct that bug. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Con Kolivas	e7c38cb49c	[PATCH] sched: make task_noninteractive use sleep_type Alterations to the pipe code in the kernel made it possible for relative starvation to occur with tasks that slept waiting on a pipe getting unfair priority bonuses even if they were otherwise fully cpu bound so the TASK_NONINTERACTIVE flag was introduced which prevented any change to sleep_avg while sleeping waiting on a pipe. This change also leads to the converse though, preventing any priority boost from occurring in truly interactive tasks that wait on pipes. Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE which will allow a linear bonus to priority based on sleep time thus allowing interactive tasks to get high priority if they sleep enough. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Con Kolivas	3dee386e14	[PATCH] sched: cleanup task_activated() The activated flag in task_struct is used to track different sleep types and its usage is somewhat obfuscated. Convert the variable to an enum with more descriptive names without altering the function. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Jack Steiner	db1b1fefc2	[PATCH] sched: reduce overhead of calc_load Currently, count_active_tasks() calls both nr_running() & nr_interruptible(). Each of these functions does a "for_each_cpu" & reads values from the runqueue of each cpu. Although this is not a lot of instructions, each runqueue may be located on different node. Depending on the architecture, a unique TLB entry may be required to access each runqueue. Since there may be more runqueues than cpu TLB entries, a scan of all runqueues can trash the TLB. Each memory reference incurs a TLB miss & refill. In addition, the runqueue cacheline that contains nr_running & nr_uninterruptible may be evicted from the cache between the two passes. This causes unnecessary cache misses. Combining nr_running() & nr_interruptible() into a single function substantially reduces the TLB & cache misses on large systems. This should have no measureable effect on smaller systems. On a 128p IA64 system running a memory stress workload, the new function reduced the overhead of calc_load() from 605 usec/call to 324 usec/call. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Dimitri Sivanich	3055addadb	[PATCH] hrtimer: call get_softirq_time() only when necessary in run_hrtimer_queue() It seems that run_hrtimer_queue() is calling get_softirq_time() more often than it needs to. With this patch, it only calls get_softirq_time() if there's a pending timer. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Thomas Gleixner	669d7868ae	[PATCH] hrtimer: use generic sleeper for nanosleep Replace the nanosleep private sleeper functionality by the generic hrtimer sleeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Thomas Gleixner	00362e33f6	[PATCH] hrtimer: create generic sleeper The removal of the data field in the hrtimer structure enforces the embedding of the timer into another data structure. nanosleep now uses a private implementation of the most common used timer callback function (simple task wakeup). In order to avoid the reimplentation of such functionality all over the place a generic hrtimer_sleeper functionality is created. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:58 -08:00
Andrew Morton	7529c30116	[PATCH] modules: permit Dual-MIT/GPL licenses One of the LEDs driver files wants to use this. Probably drivers/mtd/maps/ipaq-flash.c wants to convert as well - right now it'll be tainting the kernel. Cc: David Woodhouse <dwmw2@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Bowler <jbowler@acm.org> Cc: "'Richard Purdie'" <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:56 -08:00
Paul Jackson	e4e364e865	[PATCH] cpuset: memory migration interaction fix Fix memory migration so that it works regardless of what cpuset the invoking task is in. If a task invoked a memory migration, by doing one of: 1) writing a different nodemask to a cpuset 'mems' file, or 2) writing a tasks pid to a different cpuset's 'tasks' file, where the cpuset had its 'memory_migrate' option turned on, then the allocation of the new pages for the migrated task(s) was constrained by the invoking tasks cpuset. If this task wasn't in a cpuset that allowed the requested memory nodes, the memory migration would happen to some other nodes that were in that invoking tasks cpuset. This was usually surprising and puzzling behaviour: Why didn't the pages move? Why did the pages move -there-? To fix this, temporarilly change the invoking tasks 'mems_allowed' task_struct field to the nodes the migrating tasks is moving to, so that new pages can be allocated there. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:55 -08:00
Paul Jackson	2741a559a0	[PATCH] cpuset: unsafe mm reference fix Fix unsafe reference to a tasks mm struct, by moving the reference inside of a convenient nearby properly guarded code block. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:55 -08:00
Paul Jackson	4a01c8d5be	[PATCH] cpuset: task_lock comment fix Fix cpuset comment involving case of a tasks cpuset pointer being NULL. Thanks to "the_top_cpuset_hack", this code no longer sees NULL task->cpuset pointers. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:55 -08:00
KaiGai Kohei	bb231fe3a5	[PATCH] Fix pacct bug in multithreading case. I noticed a bug on the process accounting facility. In multi-threading process, some data would be recorded incorrectly when the group_leader dies earlier than one or more threads. The attached patch fixes this problem. See below. 'bugacct' is a test program that create a worker thread after 4 seconds sleeping, then the group_leader dies soon. The worker thread consume CPU/Memory for 6 seconds, then exit. We can estimate 10 seconds as etime and 6 seconds as stime + utime. This is a sample program which the group_leader dies earlier than other threads. The results of same binary execution on different kernel are below. -- accounted records -------------------- \| btime \| utime \| stime \| etime \| minflt \| majflt \| comm \| original \| 13:16:40 \| 0.00 \| 0.00 \| 6.10 \| 171 \| 0 \| bugacct \| patched \| 13:20:21 \| 5.83 \| 0.18 \| 10.03 \| 32776 \| 0 \| bugacct \| () bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is appropriate. -- Test results in original kernel ------ $ date; time -p ./bugacct Tue Mar 28 13:16:36 JST 2006 <- But pacct said btime is 13:16:40 real 10.11 <- But pacct said etime is 6.10 user 5.96 <- But pacct said utime is 0.00 sys 0.14 <- But pacct said stime is 0.00 $ -- Test results in patched kernel ------- $ date; time -p ./bugacct Tue Mar 28 13:20:21 JST 2006 real 10.04 user 5.83 sys 0.19 $ In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and minflt incorrectly. In my opinion, this problem is caused by an assumption that group_leader dies last. The following section calculates process running time for etime and btime. But it means running time of the thread that dies last, not process. The start_time of the first thread in the process (group_leader) should be reduced from uptime to calculate etime and btime correctly. ---- do_acct_process() in kernel/acct.c: / calculate run_time in nsec/ do_posix_clock_monotonic_gettime(&uptime); run_time = (u64)uptime.tv_secNSEC_PER_SEC + uptime.tv_nsec; run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC + current->start_time.tv_nsec; ---- The following section calculates stime and utime of the process. But it might count the utime and stime of the group_leader duplicatly and ignore the utime and stime of the thread dies last, when one or more threads remain after group_leader dead. The ac_utime should be calculated as the sum of the signal->utime and utime of the thread dies last. The ac_stime should be done also. ---- do_acct_process() in kernel/acct.c: jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime, current->signal->utime)); ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies)); jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime, current->signal->stime)); ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies)); ---- The part of the minflt/majflt calculation has same problem. This patch solves those problems, I think. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:54 -08:00
OGAWA Hirofumi	9b41046cd0	[PATCH] Don't pass boot parameters to argv_init[] The boot cmdline is parsed in parse_early_param() and parse_args(,unknown_bootoption). And __setup() is used in obsolete_checksetup(). start_kernel() -> parse_args() -> unknown_bootoption() -> obsolete_checksetup() If __setup()'s callback (->setup_func()) returns 1 in obsolete_checksetup(), obsolete_checksetup() thinks a parameter was handled. If ->setup_func() returns 0, obsolete_checksetup() tries other ->setup_func(). If all ->setup_func() that matched a parameter returns 0, a parameter is seted to argv_init[]. Then, when runing /sbin/init or init=app, argv_init[] is passed to the app. If the app doesn't ignore those arguments, it will warning and exit. This patch fixes a wrong usage of it, however fixes obvious one only. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:53 -08:00
Oleg Nesterov	a2c348fe01	[PATCH] __mod_timer: simplify ->base changing Since base and new_base are of the same type now, we can save one 'if' branch and simplify the code a bit. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:53 -08:00
Oleg Nesterov	3691c5199e	[PATCH] kill __init_timer_base in favor of boot_tvec_bases Commit `a4a6198b80`: [PATCH] tvec_bases too large for per-cpu data introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at compile time. This means we can kill __init_timer_base and move timer_base_s's content into tvec_t_base_s. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:52 -08:00
Pavel Machek	85b6bce365	[PATCH] Fix suspend with traced tasks strace /bin/bash misbehaves after resume; this fixes it. (akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to do the right thing). Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-31 12:18:50 -08:00
Oleg Nesterov	547679087b	[PATCH] send_sigqueue: simplify and fix the race send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock. This is unsafe: 'p' can exit in between and set ->sighand = NULL. The race is theoretical, the window is tiny and irqs are disabled by the caller, so I don't think we need the fix for -stable tree. Convert send_sigqueue() to use lock_task_sighand() helper. Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the comment is wrong. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	a1d5e21e3e	[PATCH] do_notify_parent_cldstop: remove 'to_self' param The previous patch has changed callsites of do_notify_parent_cldstop() so that to_self == (->ptrace & PT_PTRACED) always (as it should be). We can remove this parameter now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	883606a7c9	[PATCH] finish_stop: don't check stop_count < 0 Remove an obscure 'stop_count < 0' check in finish_stop(). The previous patch made this case impossible. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	dac27f4a09	[PATCH] simplify do_signal_stop() do_signal_stop() considers 'thread_group_empty()' as a special case. This was needed to avoid taking tasklist_lock. Since this lock is unneeded any longer, we can remove this special case and simplify the code even more. Also, before this patch, finish_stop() was called with stop_count == -1 for 'thread_group_empty()' case. This is not strictly wrong, but confusing and unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	a7e5328a06	[PATCH] cleanup __exit_signal->cleanup_sighand path Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This makes the exit path more understandable and allows us to do cleanup_sighand() outside of ->siglock protected section. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	4a2c7a7837	[PATCH] make fork() atomic wrt pgrp/session signals Eric W. Biederman wrote: > > Ok. SUSV3/Posix is clear, fork is atomic with respect > to signals. Either a signal comes before or after a > fork but not during. (See the rationale section). > http://www.opengroup.org/onlinepubs/000095399/functions/fork.html > > The tasklist_lock does not stop forks from adding to a process > group. The forks stall while the tasklist_lock is held, but a fork > that began before we grabbed the tasklist_lock simply completes > afterwards, and the child does not receive the signal. This also means that SIGSTOP or sig_kernel_coredump() signal can't be delivered to pgrp/session reliably. With this patch copy_process() returns -ERESTARTNOINTR when it detects a pending signal, fork() will be restarted transparently after handling the signals. This patch also deletes now unneeded "group_stop_count > 0" check, copy_process() can no longer succeed while group stop in progress. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	47e65328a7	[PATCH] pids: kill PIDTYPE_TGID This patch kills PIDTYPE_TGID pid_type thus saving one hash table in kernel/pid.c and speeding up subthreads create/destroy a bit. It is also a preparation for the further tref/pids rework. This patch adds 'struct list_head thread_group' to 'struct task_struct' instead. We don't detach group leader from PIDTYPE_PID namespace until another thread inherits it's ->pid == ->tgid, so we are safe wrt premature free_pidmap(->tgid) call. Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID). Should the need arise, we can use find_task_by_pid()->group_leader. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-By: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	88531f725b	[PATCH] do_sigaction: don't take tasklist_lock do_sigaction() does not need tasklist_lock anymore, we can simplify the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:44 -08:00
Oleg Nesterov	aacc90944d	[PATCH] do_group_exit: don't take tasklist_lock do_group_exit() takes tasklist_lock for zap_other_threads(), this is unneeded now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	a122b341b7	[PATCH] do_signal_stop: don't take tasklist_lock do_signal_stop() does not need tasklist_lock anymore. So it does not need to do misc re-checks, and we can simplify the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	6108ccd3e2	[PATCH] relax sig_needs_tasklist() handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK signals anymore. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	7d7185c818	[PATCH] sys_times: don't take tasklist_lock sys_times: don't take tasklist_lock Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	5876700cd3	[PATCH] do __unhash_process() under ->siglock This patch moves __unhash_process() call from realease_task() to __exit_signal(), so __detach_pid() is called with ->siglock held. This means we don't need tasklist_lock to iterate over thread group anymore: copy_process() was already changed to do attach_pid() under ->siglock. Eric's "pidhash-kill-switch_exec_pids.patch" from -mm changed de_thread() so it doesn't touch PIDTYPE_TGID. NOTE: de_thread() still needs some attention. It still changes task->pid lockless. Taking ->sighand.siglock here allows to do more tasklist_lock removals. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	35f5cad8c4	[PATCH] revert "Optimize sys_times for a single thread process" This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in sys_times(). The reason is that the next patch breaks memory ordering which is needed for that optimization. tasklist_lock in sys_times() will be eliminated completely by further patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	6a14c5c9da	[PATCH] move __exit_signal() to kernel/exit.c __exit_signal() is private to release_task() now. I think it is better to make it static in kernel/exit.c and export flush_sigqueue() instead - this function is much more simple and straightforward. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	c81addc9d3	[PATCH] rename __exit_sighand to cleanup_sighand Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to copy_sighand(). This matches copy_signal/cleanup_signal naming, and I think it is easier to follow. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	29ff471234	[PATCH] cleanup __exit_signal() This patch factors out duplicated code under 'if' branches. Also, BUG_ON() conversions and whitespace cleanups. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:43 -08:00
Oleg Nesterov	6b3934ef52	[PATCH] copy_process: cleanup bad_fork_cleanup_signal __exit_signal() does important cleanups atomically under ->siglock. It is also called from copy_process's error path. This is not good, for example we can't move __unhash_process() under ->siglock for that reason. We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under 'bad_fork_cleanup_sighand:' label. For copy_process() case it is sufficient to just backout copy_signal(), nothing more. Again, nobody can see this task yet. For CLONE_THREAD case we just decrement signal->count, otherwise nobody can see this ->signal and we can free it lockless. This patch assumes it is safe to do exit_thread_group_keys() without tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:42 -08:00
Oleg Nesterov	7001510d0c	[PATCH] copy_process: cleanup bad_fork_cleanup_sighand The only caller of exit_sighand(tsk) is copy_process's error path. We can call __exit_sighand() directly and kill exit_sighand(). This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no external references, nobody can see it, and IF (clone_flags & CLONE_SIGHAND) At least 'current' has a reference to ->sighand, this means atomic_dec_and_test(sighand->count) can't be true. ELSE Nobody can see this ->sighand, this means we can free it without any locking. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:42 -08:00
Oleg Nesterov	a9e88e84b5	[PATCH] introduce sig_needs_tasklist() helper In my opinion this patch cleans up the code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:42 -08:00
Oleg Nesterov	f63ee72e0f	[PATCH] introduce lock_task_sighand() helper Add lock_task_sighand() helper and converts group_send_sig_info() to use it. Hopefully we will have more users soon. This patch also removes '!sighand->count' and '!p->usage' checks, I think they both are bogus, racy and unneeded (but probably it makes sense to restore them as BUG_ON()s). ->sighand is cleared and it's ->count is decremented in release_task() with sighand->siglock held, so it is a bug to have '!p->usage \|\| !->count' after we already locked and verified it is the same. On the other hand, an already dead task without ->sighand can have a non-zero ->usage due to ptrace, for example. If we read the stale value of ->sighand we must see the change after spin_lock(), because that change was done while holding that same old ->sighand.siglock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:42 -08:00
Oleg Nesterov	aa1757f90b	[PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCU This patch borrows a clever Hugh's 'struct anon_vma' trick. Without tasklist_lock held we can't trust task->sighand until we locked it and re-checked that it is still the same. But this means we don't need to defer 'kmem_cache_free(sighand)'. We can return the memory to slab immediately, all we need is to be sure that sighand->siglock can't dissapear inside rcu protected section. To do so we need to initialize ->siglock inside ctor function, SLAB_DESTROY_BY_RCU does the rest. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:42 -08:00
Oleg Nesterov	1f09f9749c	[PATCH] release_task: replace open-coded ptrace_unlink() Use ptrace_unlink() instead of open-coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	8292d633ad	[PATCH] wait_for_helper: trivial style cleanup Use NULL instead of (... *)0 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	6ac781b11a	[PATCH] reparent_thread: use remove_parent/add_parent Use remove_parent/add_parent instead of open coding. No changes in kernel/exit.o Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	73b9ebfe12	[PATCH] pidhash: don't count idle threads fork_idle() does unhash_process() just after copy_process(). Contrary, boot_cpu's idle thread explicitely registers itself for each pid_type with nr = 0. copy_process() already checks p->pid != 0 before process_counts++, I think we can just skip attach_pid() calls and job control inits for idle threads and kill unhash_process(). We don't need to cleanup ->proc_dentry in fork_idle() because with this patch idle threads are never hashed in kernel/pid.c:pid_hash[]. We don't need to hash pid == 0 in pidmap_init(). free_pidmap() is never called with pid == 0 arg, so it will never be reused. So it is still possible to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV. However with this patch we don't hash pid == 0 for PIDTYPE_PID case. We still have have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and kernel threads which don't call daemonize(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	c97d98931a	[PATCH] kill SET_LINKS/REMOVE_LINKS Both SET_LINKS() and SET_LINKS/REMOVE_LINKS() have exactly one caller, and these callers already check thread_group_leader(). This patch kills theese macros, they mix two different things: setting process's parent and registering it in init_task.tasks list. Callers are updated to do these actions by hand. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	9b678ece42	[PATCH] don't use REMOVE_LINKS/SET_LINKS for reparenting There are places where kernel uses REMOVE_LINKS/SET_LINKS while changing process's ->parent. Use add_parent/remove_parent instead, they don't abuse of global process list. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	8fafabd86f	[PATCH] remove add_parent()'s parent argument add_parent(p, parent) is always called with parent == p->parent, and it makes no sense to do it differently. This patch removes this argument. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:41 -08:00
Oleg Nesterov	d799f03597	[PATCH] choose_new_parent: remove unused arg, sanitize exit_state check 'child_reaper' arg is not used in choose_new_parent(). "->exit_state >= EXIT_ZOMBIE" check is a leftover, was valid when EXIT_ZOMBIE lived in ->state var. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-03-28 18:36:40 -08:00

... 3 4 5 6 7 ...

1307 Commits