commit e5a16b1f9eec0af7cfa0830304b41c1c0833cf9f
Author: Len Brown <len.brown@intel.com>
Date: Tue Oct 2 23:44:44 2007 -0400
cpuidle: shrink diff
processor_idle.c | 440 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 429 insertions(+), 11 deletions(-)
Signed-off-by: Len Brown <len.brown@intel.com>
commit dfbb9d5aedfb18848a3e0d6f6e3e4969febb209c
Author: Len Brown <len.brown@intel.com>
Date: Wed Sep 26 02:17:55 2007 -0400
cpuidle: reduce diff size
Reduces the cpuidle processor_idle.c diff vs 2.6.22 from this
processor_idle.c | 2006 ++++++++++++++++++++++++++-----------------
1 file changed, 1219 insertions(+), 787 deletions(-)
to this:
processor_idle.c | 502 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 458 insertions(+), 44 deletions(-)
...for the purpose of making the cpuidle patch less invasive
and easier to review.
no functional changes. build tested only.
Signed-off-by: Len Brown <len.brown@intel.com>
commit 889172fc915f5a7fe20f35b133cbd205ce69bf6c
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Sep 13 13:40:05 2007 -0700
cpuidle: Retain old ACPI policy for !CONFIG_CPU_IDLE
Retain the old policy in processor_idle, so that when CPU_IDLE is not
configured, old C-state policy will still be used. This provides a
clean gradual migration path from old ACPI policy to new cpuidle
based policy.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 9544a8181edc7ecc33b3bfd69271571f98ed08bc
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Sep 13 13:39:17 2007 -0700
cpuidle: Configure governors by default
Quoting Len "Do not give an option to users to shoot themselves in the foot".
Remove the configurability of ladder and menu governors as they are
needed for the default policy of cpuidle. That way users will not be able to
have cpuidle without any policy, losing all C-state power savings.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8975059a2c1e56cfe83d1bcf031bcf4cb39be743
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:27:07 2007 -0400
CPUIDLE: load ACPI properly when CPUIDLE is disabled
Change the registration return codes for when CPUIDLE
support is not compiled into the kernel. As a result, the ACPI
processor driver will load properly even if CPUIDLE is unavailable.
However, it may be possible to cleanup the ACPI processor driver further
and eliminate some dead code paths.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit e0322e2b58dd1b12ec669bf84693efe0dc2414a8
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:26:06 2007 -0400
CPUIDLE: remove cpuidle_get_bm_activity()
Remove cpuidle_get_bm_activity() and update the governors
accordingly.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 18a6e770d5c82ba26653e53d240caa617e09e9ab
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:58 2007 -0400
CPUIDLE: max_cstate fix
Currently max_cstate is limited to 0, resulting in no idle processor
power management on ACPI platforms. This patch restores the value to
the array size.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1fdc0887286179b40ce24bcdbde663172e205ef0
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:40 2007 -0400
CPUIDLE: handle BM detection inside the ACPI Processor driver
Update the ACPI processor driver to detect BM activity and
limit state entry depth internally, rather than exposing such
requirements to CPUIDLE. As a result, CPUIDLE can drop this
ACPI-specific interface and become more platform independent. BM
activity is now handled much more aggressively than it was in the
original implementation, so some testing coverage may be needed to
verify that this doesn't introduce any DMA buffer under-run issues.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0ef38840db666f48e3cdd2b769da676c57228dd9
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:25:14 2007 -0400
CPUIDLE: menu governor updates
Tweak the menu governor to more effectively handle non-timer
break events. Non-timer break events are detected by comparing the
actual sleep time to the expected sleep time. In future revisions, it
may be more reliable to use the timer data structures directly.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit bb4d74fca63fa96cf3ace644b15ae0f12b7df5a1
Author: Adam Belay <abelay@novell.com>
Date: Tue Aug 21 18:24:40 2007 -0400
CPUIDLE: fix 'current_governor' sysfs entry
Allow the "current_governor" sysfs entry to properly handle
input terminated with '\n'.
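A store handler typically strips a single trailing newline before comparing
governor names; a minimal sketch of that step (the helper name is
illustrative, not the actual driver code):

	#include <string.h>

	/* strip one trailing '\n' before comparing governor names */
	static void strip_newline(char *buf)
	{
		size_t len = strlen(buf);

		if (len && buf[len - 1] == '\n')
			buf[len - 1] = '\0';
	}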
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit df3c71559bb69b125f1a48971bf0d17f78bbdf47
Author: Len Brown <len.brown@intel.com>
Date: Sun Aug 12 02:00:45 2007 -0400
cpuidle: fix IA64 build (again)
Signed-off-by: Len Brown <len.brown@intel.com>
commit a02064579e3f9530fd31baae16b1fc46b5a7bca8
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:39:27 2007 -0400
cpuidle: Remove support for runtime changing of max_cstate
Remove support for runtime changeability of max_cstate. Drivers can use
the latency APIs instead.
max_cstate can still be used as a boot-time option and DMI override.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0912a44b13adf22f5e3f607d263aed23b4910d7e
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:39:16 2007 -0400
cpuidle: Remove ACPI cstate_limit calls from ipw2100
ipw2100 already has code to use the acceptable_latency interfaces to limit
the C-state. Remove the calls to acpi_set_cstate_limit and
acpi_get_cstate_limit as they are redundant.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c649a76e76be6bff1fd770d0a775798813a3f6e0
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Sun Aug 12 01:35:39 2007 -0400
cpuidle: compile fix for pause and resume functions
Fix the compilation failure when cpuidle is not compiled in.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Adam Belay <adam.belay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 2305a5920fb8ee6ccec1c62ade05aa8351091d71
Author: Adam Belay <abelay@novell.com>
Date: Thu Jul 19 00:49:00 2007 -0400
cpuidle: re-write
Some portions have been rewritten to make the code cleaner and lighter
weight. The following is a list of changes:
1.) the state name is now included in the sysfs interface
2.) detection, hotplug, and available state modifications are handled by
CPUIDLE drivers directly
3.) the CPUIDLE idle handler is only ever installed when at least one
cpuidle_device is enabled and ready
4.) the menu governor BM code no longer overflows
5.) the sysfs attributes are now printed as unsigned integers, avoiding
negative values
6.) a variety of other small cleanups
Also, idle drivers are no longer swappable at runtime through the
CPUIDLE sysfs interface. On i386 and x86_64 most idle handlers (e.g.
poll, mwait, halt, etc.) don't benefit from an infrastructure that
supports multiple states, so I think using a more general-case idle
handler selection mechanism would be cleaner.
Signed-off-by: Adam Belay <abelay@novell.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit df25b6b56955714e6e24b574d88d1fd11f0c3ee5
Author: Len Brown <len.brown@intel.com>
Date: Tue Jul 24 17:08:21 2007 -0400
cpuidle: fix IA64 build
Signed-off-by: Len Brown <len.brown@intel.com>
commit fd6ada4c14488755ff7068860078c437431fbccd
Author: Adrian Bunk <bunk@stusta.de>
Date: Mon Jul 9 11:33:13 2007 -0700
cpuidle: static
make cpuidle_replace_governor() static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c1d4a2cebcadf2429c0c72e1d29aa2a9684c32e0
Author: Adrian Bunk <bunk@stusta.de>
Date: Tue Jul 3 00:54:40 2007 -0400
cpuidle: static
This patch makes the needlessly global struct menu_governor static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit dbf8780c6e8d572c2c273da97ed1cca7608fd999
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Tue Jul 3 00:49:14 2007 -0400
export symbol tick_nohz_get_sleep_length
ERROR: "tick_nohz_get_sleep_length" [drivers/cpuidle/governors/menu.ko] undefined!
ERROR: "tick_nohz_get_idle_jiffies" [drivers/cpuidle/governors/menu.ko] undefined!
And please be sure to get your changes to core kernel suitably reviewed.
Cc: Adam Belay <abelay@novell.com>
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 29f0e248e7017be15f99febf9143a2cef00b2961
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Tue Jul 3 00:43:04 2007 -0400
tick.h needs hrtimer.h
It uses hrtimers.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit e40cede7d63a029e92712a3fe02faee60cc38fb4
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:40:34 2007 -0400
cpuidle: first round of documentation updates
Documentation changes based on Pavel's feedback.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 83b42be2efece386976507555c29e7773a0dfcd1
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:39:25 2007 -0400
cpuidle: add rating to the governors and pick the one with highest rating by default
Introduce a governor rating scheme to pick the right governor by default.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit d2a74b8c5e8f22def4709330d4bfc4a29209b71c
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:38:08 2007 -0400
cpuidle: make cpuidle sysfs driver governor switch off by default
Make default cpuidle sysfs to show current_governor and current_driver in
read-only mode. More elaborate available_governors and available_drivers with
writeable current_governor and current_driver interface only appear with
"cpuidle_sysfs_switch" boot parameter.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1f60a0e80bf83cf6b55c8845bbe5596ed8f6307b
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:37:00 2007 -0400
cpuidle: menu governor: change the early break condition
Change the C-state early-breakout algorithm in the menu governor.
We only look at early breakouts that result in wakeups shorter than the idle
state's target_residency. If such a breakout is frequent enough, eliminate
that idle state for a timeout period.
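The idea can be modeled roughly like this (a simplified sketch; the names,
the frequency threshold, and the timeout are illustrative, not the actual
governor code):

	struct state_stats {
		unsigned int short_breaks;	/* wakeups below target_residency */
		unsigned long banned_until;	/* skip the state until this time */
	};

	/* called after each wakeup with the measured sleep time */
	static void note_wakeup(struct state_stats *s, unsigned int measured_us,
				unsigned int target_residency_us, unsigned long now)
	{
		if (measured_us < target_residency_us && ++s->short_breaks >= 4) {
			s->short_breaks = 0;
			s->banned_until = now + 1000;	/* timeout period */
		}
	}

	static int state_allowed(const struct state_stats *s, unsigned long now)
	{
		return now >= s->banned_until;
	}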
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 45a42095cf64b003b4a69be3ce7f434f97d7af51
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:35:38 2007 -0400
cpuidle: fix uninitialized variable in sysfs routine
Fix the uninitialized usage of ret.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 80dca7cdba3e6ee13eae277660873ab9584eb3be
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:34:16 2007 -0400
cpuidle: reenable /proc/acpi//power interface for the time being
Keep /proc/acpi/processor/CPU*/power around for a while, as powertop depends
on it. It will be marked deprecated and removed in the future. powertop can
use the cpuidle interfaces instead.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 589c37c2646c5e3813a51255a5ee1159cb4c33fc
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Jul 3 00:32:37 2007 -0400
cpuidle: menu governor and hrtimer compile fix
Compile fix for menu governor.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0ba80bd9ab3ed304cb4f19b722e4cc6740588b5e
Author: Len Brown <len.brown@intel.com>
Date: Thu May 31 22:51:43 2007 -0400
cpuidle: build fix - cpuidle vs ipw2100 module
ERROR: "acpi_set_cstate_limit" [drivers/net/wireless/ipw2100.ko] undefined!
Signed-off-by: Len Brown <len.brown@intel.com>
commit d7d8fa7f96a7f7682be7c6cc0cc53fa7a18c3b58
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:47:07 2007 -0400
cpuidle: add the 'menu' governor
Here is my first take at implementing an idle PM governor that takes
full advantage of NO_HZ. I call it the 'menu' governor because it
considers the full list of idle states before each entry.
I've kept the implementation fairly simple. It attempts to guess the
next residency time and then chooses a state that would meet at least
the break-even point between power savings and entry cost. To this end,
it selects the deepest idle state that satisfies the following
constraints:
1. If the idle time elapsed since bus master activity was detected
is below a threshold (currently 20 ms), then limit the selection
to C2-type or above.
2. Do not choose a state with a break-even residency that exceeds
the expected time remaining until the next timer interrupt.
3. Do not choose a state with a break-even residency that exceeds
the elapsed time between the last pair of break events,
excluding timer interrupts.
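Put together, the selection can be sketched as follows (a simplified model
of the three constraints; the types and names are illustrative, not the real
governor code):

	enum ctype { TYPE_C1, TYPE_C2, TYPE_C3 };

	struct state {
		unsigned int break_even_us;	/* minimum worthwhile residency */
		enum ctype type;
	};

	/* pick the deepest state (ordered shallow to deep) satisfying 1-3 */
	static int menu_pick(const struct state *s, int nr,
			     unsigned int us_until_next_timer,
			     unsigned int us_between_last_breaks,
			     int bm_recently_active)
	{
		int i, best = 0;

		for (i = 0; i < nr; i++) {
			if (bm_recently_active && s[i].type > TYPE_C2)
				continue;	/* constraint 1 */
			if (s[i].break_even_us > us_until_next_timer)
				continue;	/* constraint 2 */
			if (s[i].break_even_us > us_between_last_breaks)
				continue;	/* constraint 3 */
			best = i;
		}
		return best;
	}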
This governor has an advantage over the "ladder" governor because it
proactively checks how much time remains until the next timer interrupt
using the tick infrastructure. Also, it handles device interrupt
activity more intelligently by not including timer interrupts in break
event calculations. Finally, it doesn't make policy decisions using the
number of state entries, which can have variable residency times (NO_HZ
makes these potentially very large), and instead only considers sleep
time deltas.
The menu governor can be selected during runtime using the cpuidle sysfs
interface like so:
"echo "menu" > /sys/devices/system/cpu/cpuidle/current_governor"
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit a4bec7e65aa3b7488b879d971651cc99a6c410fe
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:47:03 2007 -0400
cpuidle: export time until next timer interrupt using NO_HZ
Expose information about the time remaining until the next
timer interrupt expires by utilizing the dynticks infrastructure.
Also modify the main idle loop to allow dynticks to handle
non-interrupt break events (e.g. DMA). Finally, expose sleep ticks
information to external code. Thomas Gleixner is responsible for much
of the code in this patch. However, I've made some additional changes,
so I'm probably responsible if there are any bugs or oversights :)
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 2929d8996fbc77f41a5ff86bb67cdde3ca7d2d72
Author: Adam Belay <abelay@novell.com>
Date: Sat Mar 24 03:46:58 2007 -0400
cpuidle: governor API changes
This patch prepares cpuidle for the menu governor. It adds an optional
stage after idle state entry to give the governor an opportunity to
check why the state was exited. Also it makes sure the idle loop
returns after each state entry, allowing the appropriate dynticks code
to run.
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 3a7fd42f9825c3b03e364ca59baa751bb350775f
Author: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Apr 26 00:03:59 2007 -0700
cpuidle: hang fix
Prevent a hang on x86-64, when the ACPI processor driver is added as a
module on a system that does not support C-states.
x86-64 expects all idle handlers to enable interrupts before returning from
the idle handler. This is due to enter_idle(), exit_idle() races. Make
cpuidle_idle_call() conform to this when there is no pm_idle_old.
Also, make cpuidle look at the return value of attach_driver() and set
current_driver to NULL if attach fails on all CPUs.
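The interrupt-enable requirement can be modeled like this (stubs standing in
for the kernel environment; a sketch, not the kernel code):

	static void (*pm_idle_old)(void);

	static void local_irq_enable_stub(void)
	{
		/* stands in for the kernel's local_irq_enable() */
	}

	/* with no usable cpuidle device and no old handler to fall back to,
	 * enable interrupts before returning, as x86-64 expects of every
	 * idle handler */
	static void cpuidle_fallback_idle(void)
	{
		if (pm_idle_old)
			pm_idle_old();
		else
			local_irq_enable_stub();
	}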
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 4893339a142afbd5b7c01ffadfd53d14746e858e
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:09 2007 +0800
cpuidle: add support for max_cstate limit
With the CPUIDLE framework, the max_cstate parameter (to limit the maximum
CPU C-state) is ignored. Some systems require it to ignore C2/C3,
and some drivers like ipw require it too.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 43bbbbe1cb998cbd2df656f55bb3bfe30f30e7d1
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:13 2007 +0800
cpuidle: add cpuidle_force_redetect_devices API
add the cpuidle_force_redetect_devices API,
which forces all CPUs to redetect their idle states.
The next patch will use it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit d1edadd608f24836def5ec483d2edccfb37b1d19
Author: Shaohua Li <shaohua.li@intel.com>
Date: Thu Apr 26 10:40:01 2007 +0800
cpuidle: fix sysfs related issue
Fix two cpuidle sysfs issues:
a. make the kobject dynamically allocated
b. fix the sysfs init ordering to avoid a suspend/resume issue
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 7169a5cc0d67b263978859672e86c13c23a5570d
Author: Randy Dunlap <randy.dunlap@oracle.com>
Date: Wed Mar 28 22:52:53 2007 -0400
cpuidle: 1-bit field must be unsigned
A 1-bit bitfield has no room for a sign bit.
drivers/cpuidle/governors/ladder.c:54:16: error: dubious bitfield without explicit `signed' or `unsigned'
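The rule in miniature (field name illustrative): a signed 1-bit field can
hold only 0 and -1, so flag bits must be declared unsigned:

	struct flags {
		unsigned int enabled:1;	/* holds 0 or 1 */
		/* int enabled:1; would hold only 0 and -1 */
	};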
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 4658620158dc2fbd9e4bcb213c5b6fb5d05ba7d4
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Wed Mar 28 22:52:41 2007 -0400
cpuidle: fix boot hang
Patch for cpuidle boot hang reported by Larry Finger here.
http://www.ussg.iu.edu/hypermail/linux/kernel/0703.2/2025.html
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Larry Finger <larry.finger@lwfinger.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit c17e168aa6e5fe3851baaae8df2fbc1cf11443a9
Author: Len Brown <len.brown@intel.com>
Date: Wed Mar 7 04:37:53 2007 -0500
cpuidle: ladder does not depend on ACPI
build fix for CONFIG_ACPI=n
In file included from drivers/cpuidle/governors/ladder.c:21:
include/acpi/processor.h:88: error: expected specifier-qualifier-list before 'acpi_integer'
include/acpi/processor.h:106: error: expected specifier-qualifier-list before 'acpi_integer'
include/acpi/processor.h:168: error: expected specifier-qualifier-list before 'acpi_handle'
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8c91d958246bde68db0c3f0c57b535962ce861cb
Author: Adrian Bunk <bunk@stusta.de>
Date: Tue Mar 6 02:29:40 2007 -0800
cpuidle: make code static
This patch makes the following needlessly global code static:
- driver.c: __cpuidle_find_driver()
- governor.c: __cpuidle_find_governor()
- ladder.c: struct ladder_governor
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 0c39dc3187094c72c33ab65a64d2017b21f372d2
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Wed Mar 7 02:38:22 2007 -0500
cpu_idle: fix build break
This patch fixes a build breakage with !CONFIG_HOTPLUG_CPU and
CONFIG_CPU_IDLE.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 8112e3b115659b07df340ef170515799c0105f82
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Tue Mar 6 02:29:39 2007 -0800
cpuidle: build fix for !CPU_IDLE
Fix the compile issues when CPU_IDLE is not configured.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 1eb4431e9599cd25e0d9872f3c2c8986821839dd
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:54:57 2007 -0800
cpuidle take2: Basic documentation for cpuidle
Documentation for cpuidle infrastructure
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit ef5f15a8b79123a047285ec2e3899108661df779
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:54:03 2007 -0800
cpuidle take2: Hookup ACPI C-states driver with cpuidle
Hook ACPI C-states onto the generic cpuidle infrastructure.
drivers/acpi/processor_idle.c is now an ACPI C-states driver that registers
with the cpuidle infrastructure, and the policy part is removed from
drivers/acpi/processor_idle.c. We use cpuidle governors instead.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Len Brown <len.brown@intel.com>
commit 987196fa82d4db52c407e8c9d5dec884ba602183
Author: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Date: Thu Feb 22 13:52:57 2007 -0800
cpuidle take2: Core cpuidle infrastructure
Announcing 'cpuidle', a new CPU power management infrastructure to manage
idle CPUs in a clean and efficient manner.
cpuidle separates out the drivers that can provide support for multiple types
of idle states and policy governors that decide on what idle state to use
at run time.
A cpuidle driver can support multiple idle states based on parameters like
varying power consumption, wakeup latency, etc (ACPI C-states for example).
A cpuidle governor can be usage model specific (laptop, server,
laptop on battery etc).
The main advantage of the infrastructure is that it allows independent
development of drivers and governors, and allows for better CPU power
management.
A huge thanks to Adam Belay and Shaohua Li who were part of this mini-project
since its beginning and are greatly responsible for this patchset.
This patch:
Core cpuidle infrastructure.
Introduces a new abstraction layer for cpuidle:
* manages drivers that can support multiple idle states. Drivers can be
generic or particular to specific hardware/platforms
* allows plugging in multiple policy governors that make the idle-state
policy decision
* the core also has a set of sysfs interfaces with which administrators can
see the supported drivers and governors and switch them at run time
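In outline, a driver describes the states its hardware supports and a
governor picks among them at run time; a compressed sketch of that split
(field and function names are illustrative, not the actual cpuidle API):

	struct idle_state {
		unsigned int exit_latency_us;
		unsigned int power_mw;
	};

	/* driver side: enumerate the hardware's idle states */
	static struct idle_state states[] = {
		{ .exit_latency_us = 1,   .power_mw = 1000 },	/* C1-like */
		{ .exit_latency_us = 100, .power_mw = 500 },	/* C3-like */
	};

	/* governor side: trade power for wakeup latency at run time */
	static int governor_select(unsigned int latency_req_us)
	{
		int i, pick = 0;

		for (i = 0; i < 2; i++)
			if (states[i].exit_latency_us <= latency_req_us)
				pick = i;
		return pick;
	}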
Signed-off-by: Adam Belay <abelay@novell.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Async signals should not be reported as sent by current in audit log. As
it is, we call audit_signal_info() too early in check_kill_permission().
Note that check_kill_permission() has that test already - it needs to know
if it should apply current-based permission checks. So the solution is to
move the call of audit_signal_info() between those.
Bogosity in question is easily reproduced - add a rule watching for e.g.
kill(2) from specific process (so that audit_signal_info() would not
short-circuit to nothing), say load_policy, watch the bogus OBJ_PID entry
in audit logs claiming that write(2) on selinuxfs file issued by
load_policy(8) had somehow managed to send a signal to syslogd...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using /proc/timer_stats on ppc64 I noticed the events/sec field wasn't
accurate. Sometimes the integer part was incorrect due to rounding (we
weren't taking the fractional seconds into consideration).
The fraction part is also wrong; we need to pad the printf statement and
take the bottom three digits of 1000 times the value.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling handle_futex_death in exit_robust_list for the different robust
mutexes of a thread basically frees the mutex. Another thread might grab
the lock immediately which updates the next pointer of the mutex.
fetch_robust_entry over the next pointer might therefore branch into the
robust mutex list of a different thread. This can cause two problems: 1)
some mutexes held by the dead thread are not getting freed and 2) some
mutexes held by a different thread are freed.
The next pointer needs to be read before calling handle_futex_death.
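The required ordering, in sketch form (simplified; the real
exit_robust_list() also deals with the pi bit and userspace pointers):

	struct robust_entry { struct robust_entry *next; };

	static void walk_robust_list(struct robust_entry *entry,
				     struct robust_entry *head,
				     void (*handle_death)(struct robust_entry *))
	{
		while (entry != head) {
			/* read next BEFORE freeing the lock: afterwards
			 * another thread may take it and rewrite ->next */
			struct robust_entry *next = entry->next;

			handle_death(entry);
			entry = next;
		}
	}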
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to disable all CPUs other than the boot CPU (usually 0) before
attempting to power-off modern SMP machines. This fixes the
hang-on-poweroff issue on my MythTV SMP box, and also on Thomas Gleixner's
new toybox.
Signed-off-by: Mark Lord <mlord@pobox.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a desperate attempt to fix the suspend/resume problem on Andrew's
VAIO I added a workaround which enforced the broadcast of the oneshot
timer on resume. This was actually resolving the problem on the VAIO
but was just a stupid workaround, which was not tackling the root
cause: the assignment of lower idle C-States in the ACPI processor_idle
code. The cpuidle patches, which utilize the dynamic tick feature and
go faster into deeper C-states exposed the problem again. The correct
solution is the previous patch, which prevents lower C-states across
the suspend/resume.
Remove the enforcement code, including the conditional broadcast timer
arming, which helped to paper over the real problem for quite some time.
The oneshot broadcast flag for the cpu, which runs the resume code can
never be set at the time when this code is executed. It only gets set,
when the CPU is entering a lower idle C-State.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This simplifies the signalfd code by no longer keeping it attached to the
sighand for its whole lifetime.
This way, the signalfd remains attached to the sighand only during
poll(2) (and select and epoll) and read(2). This also allows removing
all the custom "tsk == current" checks in kernel/signal.c, since
dequeue_signal() will only be called by "current".
I think this is also what Ben was suggesting some time ago.
The external effect of this is that a thread can extract only its own
private signals and the group ones. I think this is an acceptable
behaviour, in that those are the signals the thread would be able to
fetch without signalfd.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using rt_mutex, a NULL pointer dereference occurs at
enqueue_task_rt. Here is the scenario:
1) there are two threads: thread A is fair_sched_class and
thread B is rt_sched_class.
2) Thread A is boosted up to rt_sched_class, because thread A
holds an rt_mutex lock and thread B is waiting for the lock.
3) At this time, when thread A creates a new thread C, thread
C gets rt_sched_class.
4) When doing wake_up_new_task() for thread C, the priority
of thread C is out of the RT priority range, because the
normal priority of thread A is not an RT priority. This
corrupts data by overflowing the rt_prio_array.
The new thread C should be fair_sched_class.
A new thread should have a valid scheduler class before queuing;
this patch sets the suitable scheduler class.
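The shape of the fix can be modeled like this (a sketch; the priority cutoff
follows the kernel's MAX_RT_PRIO convention, the rest is illustrative):

	#define MAX_RT_PRIO 100
	#define SCHED_NORMAL 0

	struct task { int prio; int policy; };

	/* the child takes the parent's normal (un-boosted) priority, and its
	 * scheduler class/policy must be consistent with that priority */
	static void fixup_new_task(struct task *child, int parent_normal_prio)
	{
		child->prio = parent_normal_prio;
		if (child->prio >= MAX_RT_PRIO)		/* outside the RT range */
			child->policy = SCHED_NORMAL;	/* fair_sched_class */
	}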
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
more aggressive, by moving the yielding task to the last position
in the rbtree.
with sched_compat_yield=0:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop
with sched_compat_yield=1:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
It turned out that the user namespace is released during the do_exit() in
exit_task_namespaces(), but the struct user_struct is released only during the
put_task_struct(), i.e. MUCH later.
On debug kernels with poisoned slabs this will cause the oops in
uid_hash_remove() because the head of the chain, which resides inside the
struct user_namespace, will be already freed and poisoned.
Since the uid hash itself is required only when someone can search it, i.e.
when the namespace is alive, we can safely unhash all the user_struct-s from
it during the namespace exiting. The subsequent free_uid() will complete the
user_struct destruction.
For example, the following simple program,

	#define _GNU_SOURCE	/* for clone() */
	#include <sched.h>

	char stack[2 * 1024 * 1024];

	int f(void *foo)
	{
		return 0;
	}

	int main(void)
	{
		/* 0x10000000 == CLONE_NEWUSER */
		clone(f, stack + 1 * 1024 * 1024, 0x10000000, 0);
		return 0;
	}

run on a kernel with CONFIG_USER_NS turned on, will oops the
kernel immediately.
This was spotted during OpenVZ kernel testing.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Surprisingly (spotted by Alexey Dobriyan), the uid hash still uses
list_heads, thus occupying twice as much space as it needs to. Convert it
to hlist_heads.
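The saving is visible from the head types themselves: an hlist head is a
single pointer where a list head is two (these are the stock kernel
definitions):

	struct list_head  { struct list_head *next, *prev; };	/* two words */
	struct hlist_head { struct hlist_node *first; };	/* one word */
	struct hlist_node { struct hlist_node *next, **pprev; };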
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct utsname is copied from master one without any exclusion.
Here is sample output from one proggie doing
sethostname("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
sethostname("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb");
and another
clone(,, CLONE_NEWUTS, ...)
uname()
hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaabbbbb'
hostname = 'bbbaaaaaaaaaaaaaaaaaaaaaaaaaaa'
hostname = 'aaaaaaaabbbbbbbbbbbbbbbbbbbbbb'
hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaabbbb'
hostname = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaabb'
hostname = 'aaabbbbbbbbbbbbbbbbbbbbbbbbbbb'
hostname = 'bbbbbbbbbbbbbbbbaaaaaaaaaaaaaa'
Hostname is sometimes corrupted.
Yes, even _the_ simplest namespace activity had bug in it. :-(
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Taking a cpu offline removes the cpu from the online mask before the
CPU_DEAD notification is done. The clock events layer does the cleanup
of the dead CPU from the CPU_DEAD notifier chain. tick_do_timer_cpu is
used to avoid xtime lock contention by assigning the task of jiffies
xtime updates to one CPU. If a CPU is taken offline, then this
assignment becomes stale. This went unnoticed because most of the time
the offline CPU went dead before the online CPU reached __cpu_die(),
where the CPU_DEAD state is checked. In the case that the offline CPU did
not reach the DEAD state before we reach __cpu_die(), the code in there
goes to sleep for 100ms. Due to the stale time update assignment, the
system is stuck forever.
Take the assignment away when a cpu is no longer in the cpu_online_mask.
We do this in the last call to tick_nohz_stop_sched_tick(), when the offline
CPU is on its way to the final play_dead() idle entry.
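The fix is a fragment along these lines in tick_nohz_stop_sched_tick():

	if (unlikely(!cpu_online(cpu))) {
		if (cpu == tick_do_timer_cpu)
			tick_do_timer_cpu = -1;	/* drop the stale assignment */
	}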
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a cpu goes offline it is removed from the broadcast masks. If the
mask becomes empty the code shuts down the broadcast device. This is
wrong, because the broadcast device needs to be ready for the online
cpu going idle (into a c-state, which stops the local apic timer).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The jinxed VAIO refuses to resume without hitting keys on the keyboard
when this is not enforced. It is unclear why the cpu ends up in a lower
C State without notifying the clock events layer, but enforcing the
oneshot broadcast here is safe.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Timekeeping resume adjusts xtime by adding the slept time in seconds and
resets the reference value of the clock source (clock->cycle_last).
clock->cycle last is used to calculate the delta between the last xtime
update and the readout of the clock source in __get_nsec_offset(). xtime
plus the offset is the current time. The resume code ignores the delta
which had already elapsed between the last xtime update and the actual
time of suspend. If the suspend time is short, then we can see time
going backwards on resume.
Suspend:
offs_s = clock->read() - clock->cycle_last;
now = xtime + offs_s;
timekeeping_suspend_time = read_rtc();
Resume:
sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;
if sleep_time_seconds == 0 and offs_r < offs_s, then time goes
backwards.
Fix this by storing the offset from the last xtime update and add it to
xtime during resume, when we reset clock->cycle_last:
sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
xtime += offs_s; /* Fixup xtime offset at suspend time */
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;
Thanks to Marcelo for tracking this down on the OLPC and providing the
necessary details to analyze the root cause.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Tosatti <marcelo@kvack.org>
Lockdep complains about the access of rtc in timekeeping_suspend
inside the interrupt disabled region of the write locked xtime lock.
Move the access outside.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Seems to me that this timer will only get started on platforms that say
they don't want it?
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gabriel Paubert <paubert@iram.es>
Cc: Zachary Amsden <zach@vmware.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The semantics of call_usermodehelper_pipe() used to be that it would fork
the helper, and wait for the kernel thread to be started. This was
implemented by setting sub_info.wait to 0 (implicitly), and doing a
wait_for_completion().
As part of the cleanup done in 0ab4dc9227,
call_usermodehelper_pipe() was changed to pass 1 as the value for wait to
call_usermodehelper_exec().
This is equivalent to setting sub_info.wait to 1, which is a change from
the previous behaviour. Using 1 instead of 0 causes
__call_usermodehelper() to start the kernel thread running
wait_for_helper(), rather than directly calling ____call_usermodehelper().
The end result is that the calling kernel code blocks until the user mode
helper finishes. As the helper is expecting input on stdin, and now no one
is writing anything, everything locks up (observed in do_coredump).
The fix is to change the 1 to UMH_WAIT_EXEC (aka 0), indicating that we
want to wait for the kernel thread to be started, but not for the helper to
finish.
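So the call site becomes, in sketch form:

	/* wait only until the helper has been started (UMH_WAIT_EXEC == 0),
	 * not until it exits */
	return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);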
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The futex list traversal on the compat side appears to have
a bug.
Its loop termination condition compares:
while (compat_ptr(uentry) != &head->list)
But that can't be right because "uentry" has the special
"pi" indicator bit still potentially set at bit 0. This
is cleared by fetch_robust_entry() into the "entry"
return value.
What this seems to mean is that the list won't terminate
when list iteration gets back to the head. And we'll
also process the list head like a normal entry, which could
cause all kinds of problems.
So we should check for equality with "entry". That pointer
is of the non-compat type so we have to do a little casting
to keep the compiler and sparse happy.
The same problem can in theory occur with the 'pending'
variable, although that has not been reported from users
so far.
Based on the original patch from David Miller.
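The corrected condition, with the cast described above (fragment):

	while (entry != (struct robust_list __user *) &head->list) {
		...
	}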
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When PTRACE_SYSCALL was used and then PTRACE_DETACH is used, the
TIF_SYSCALL_TRACE flag is left set on the formerly-traced task. This
means that when a new tracer comes along and does PTRACE_ATTACH, it's
possible he gets a syscall tracing stop even though he's never used
PTRACE_SYSCALL. This happens if the task was in the middle of a system
call when the second PTRACE_ATTACH was done. The symptom is an
unexpected SIGTRAP when the tracer thinks that only SIGSTOP should have
been provoked by his ptrace calls so far.
A few machines already fixed this in ptrace_disable (i386, ia64, m68k).
But all other machines do not, and still have this bug. On x86_64, this
constitutes a regression in IA32 compatibility support.
Since all machines now use TIF_SYSCALL_TRACE for this, I put the
clearing of TIF_SYSCALL_TRACE in the generic ptrace_detach code rather
than adding it to every other machine's ptrace_disable.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fix ideal_runtime:
- do not scale it using niced_granularity();
it is measured against sum_exec_delta, so it's wall-time, not fair-time.
- move the whole check into __check_preempt_curr_fair()
so that wakeup preemption can also benefit from the new logic.
this also results in code size reduction:
text data bss dec hex filename
13391 228 1204 14823 39e7 sched.o.before
13369 228 1204 14801 39d1 sched.o.after
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Second preparatory patch for fix-ideal runtime:
Mark prev_sum_exec_runtime at the beginning of our run, the same spot
that adds our wait period to wait_runtime. This seems a more natural
location to do this, and it also reduces the code a bit:
text data bss dec hex filename
13397 228 1204 14829 39ed sched.o.before
13391 228 1204 14823 39e7 sched.o.after
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Preparatory patch for fix-ideal-runtime:
simplify __check_preempt_curr_fair(): get rid of the integer return.
text data bss dec hex filename
13404 228 1204 14836 39f4 sched.o.before
13393 228 1204 14825 39e9 sched.o.after
functionality is unchanged.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the cfs_rq->wait_runtime debug/statistics counter was not maintained
properly - fix this.
this also removes some code:
text data bss dec hex filename
13420 228 1204 14852 3a04 sched.o.before
13404 228 1204 14836 39f4 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
fix niced_granularity(). This resulted in under-scheduling for
CPU-bound negative nice level tasks (and this in turn caused
higher than necessary latencies in nice-0 tasks).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
First fix the check
if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task)
with this
if (*imbalance < busiest_load_per_task)
As the current check is always false for nice-0 tasks (since
SCHED_LOAD_SCALE_FUZZ is the same as busiest_load_per_task for nice-0
tasks).
With the above change, imbalance was getting reset to 0 in the corner-case
condition, making the FUZZ logic fail. Fix it by not corrupting the
imbalance; change the imbalance only when it finds that the HT/MC
optimization is needed.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
sched: clean up task_new_fair()
sched: small schedstat fix
sched: fix wait_start_fair condition in update_stats_wait_end()
sched: call update_curr() in task_tick_fair()
sched: make the scheduler converge to the ideal latency
sched: fix sleeper bonus limit
Spotted by taoyue <yue.tao@windriver.com> and Jeremy Katz <jeremy.katz@windriver.com>.
	collect_signal:				sigqueue_free:

	list_del_init(&first->list);
						if (!list_empty(&q->list)) {
							// not taken
						}
						q->flags &= ~SIGQUEUE_PREALLOC;

	__sigqueue_free(first);			__sigqueue_free(q);
Now, __sigqueue_free() is called twice on the same "struct sigqueue" with the
obviously bad implications.
In particular, this double free breaks the array_cache->avail logic, so the
same sigqueue could be "allocated" twice, and the bug can manifest itself via
the "impossible" BUG_ON(!SIGQUEUE_PREALLOC) in sigqueue_free/send_sigqueue.
Hopefully this can explain these mysterious bug-reports, see
http://marc.info/?t=118766926500003
http://marc.info/?t=118466273000005
Alexey Dobriyan reports this patch makes the difference for the testcase, but
nobody has an access to the application which opened the problems originally.
Also, this patch removes tasklist lock/unlock, ->siglock is enough.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: taoyue <yue.tao@windriver.com>
Cc: Jeremy Katz <jeremy.katz@windriver.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mariusz Kozlowski reported lockdep's warning:
> =================================
> [ INFO: inconsistent lock state ]
> 2.6.23-rc2-mm1 #7
> ---------------------------------
> inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
> ifconfig/5492 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (&tp->lock){+...}, at: [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too]
> {in-hardirq-W} state was registered at:
> [<c0138eeb>] __lock_acquire+0x949/0x11ac
> [<c01397e7>] lock_acquire+0x99/0xb2
> [<c0452ff3>] _spin_lock+0x35/0x42
> [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too]
> [<c0147a5d>] handle_IRQ_event+0x28/0x59
> [<c01493ca>] handle_level_irq+0xad/0x10b
> [<c0105a13>] do_IRQ+0x93/0xd0
> [<c010441e>] common_interrupt+0x2e/0x34
...
> other info that might help us debug this:
> 1 lock held by ifconfig/5492:
> #0: (rtnl_mutex){--..}, at: [<c0451778>] mutex_lock+0x1c/0x1f
>
> stack backtrace:
...
> [<c0452ff3>] _spin_lock+0x35/0x42
> [<de8706e0>] rtl8139_interrupt+0x27/0x46b [8139too]
> [<c01480fd>] free_irq+0x11b/0x146
> [<de871d59>] rtl8139_close+0x8a/0x14a [8139too]
> [<c03bde63>] dev_close+0x57/0x74
...
This shows that a driver's irq handler was running both in hard interrupt
and process contexts with irqs enabled. The latter was done during
free_irq() call and was possible only with CONFIG_DEBUG_SHIRQ enabled.
This was fixed by another patch.
But a similar problem is possible with request_irq(): any locks taken from
an irq handler could be vulnerable - especially with soft interrupts. This
patch fixes it by disabling local interrupts during the handler's run. (It
seems disabling softirqs should be enough, but it needs more checking
on possible races or other special cases.)
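The fix brackets the test invocation with local interrupt disabling, along
these lines:

	unsigned long flags;

	/* call the handler the way a hard irq would: with local irqs off */
	local_irq_save(flags);
	handler(irq, dev_id);
	local_irq_restore(flags);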
Reported-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION introduced by commit
296699de6b "Introduce CONFIG_SUSPEND for
suspend-to-Ram and standby" are incorrect, as they don't cover the facts that
(1) not all architectures support suspend and (2) SMP hibernation is only
possible on X86 and PPC64 (if CONFIG_PPC64_SWSUSP is set).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spotted by Marcin Kowalczyk <qrczak@knm.org.pl>.
sys_setpgid(child) fails if the child was forked by sub-thread.
Fix the "is it our child" check. The previous commit
ee0acf90d3 was not complete.
(this patch asks for the new same_thread_group() helper, but mainline doesn't
have it yet).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: <stable@kernel.org>
Tested-by: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
taskstats.ac_exitcode is assigned to task_struct.exit_code in bacct_add_tsk()
through the following kernel function calls:
do_exit()
taskstats_exit()
fill_pid()
bacct_add_tsk()
The problem is that in do_exit(), task_struct.exit_code is set to 'code' only
after taskstats_exit() has been called. So we need to move the assignment
before taskstats_exit().
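Sketched as a fragment of do_exit():

	tsk->exit_code = code;		/* set before taskstats reads it */
	taskstats_exit(tsk, group_dead);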
Signed-off-by: Jonathan Lim <jlim@sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cleanup: we have the 'se' and 'curr' entity-pointers already,
no need to use p->se and current->se.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
small schedstat fix: the cfs_rq->wait_runtime 'sum of all runtimes'
statistics counters missed newly forked tasks and thus had a constant
negative skew. Fix this.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Peter Zijlstra noticed the following bug in SCHED_FEAT_SKIP_INITIAL (which
is disabled by default at the moment): it relies on se.wait_start_fair
being 0 while update_stats_wait_end() did not recognize a 0 value,
so instead of 'skipping' the initial interval we gave the new child
a maximum boost of +runtime-limit ...
(No impact on the default kernel, but nice to fix for completeness.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
update the fair-clock before using it for the key value.
[ mingo@elte.hu: small cleanups. ]
Signed-off-by: Ting Yang <tingy@cs.umass.edu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
de-HZ-ification of the granularity defaults unearthed a pre-existing
property of CFS: while it correctly converges to the granularity goal,
it does not prevent run-time fluctuations in the range of
[-gran ... 0 ... +gran].
With the increase of the granularity due to the removal of HZ
dependencies, this becomes visible in chew-max output (with 5 tasks
running):
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40
out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40
out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40
out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
average slice is the ideal 13 msecs and the period is picture-perfect 40
msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no
mechanism in CFS to keep that from happening: it's a perfectly valid
solution that CFS finds.
to fix this we add a granularity/preemption rule that knows about
the "target latency", which makes tasks that run longer than the ideal
latency run a bit less. The simplest approach is to simply decrease the
preemption granularity when a task overruns its ideal latency. For this
we have to track how much the task executed since its last preemption.
( this adds a new field to task_struct, but we can eliminate that
overhead in 2.6.24 by putting all the scheduler timestamps into an
anonymous union. )
with this change in place, chew-max output is fluctuation-less all
around:
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
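In fragment form, the new preemption rule amounts to something like this
(names illustrative of the description above, not the exact patch):

	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime)
		resched_task(curr_task);	/* overran its share: preempt sooner */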
this patch has no impact on any fastpath or on any globally observable
scheduling property. (unless you have sharp enough eyes to see
millisecond-level ruckles in glxgears smoothness :-)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
There is an Amarok song switch time increase (regression) under
hefty load.
What is happening is that sleeper_bonus is never consumed, and only
rarely goes below runtime_limit, so for the most part, Amarok isn't
getting any bonus at all. We're keeping sleeper_bonus right at
runtime_limit (sched_latency == sched_runtime_limit == 40ms) forever, ie
we don't consume if we're lower that that, and don't add if we're above
it. One Amarok thread waking (or anybody else) will push us past the
threshold, so the next thread waking gets nada, but will reap pain from
the previous thread waking until we drop back to runtime_limit. It
looks to me like under load, some random task gets a bonus, and
everybody else pays, whether deserving or not.
This diff fixed the regression for me at any load rate.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fix bogus DEBUG_PREEMPT warning on x86_64, when cpu brought online after
bootup: current_is_keventd is right to note its use of smp_processor_id
is preempt-safe, but should use raw_smp_processor_id to avoid the warning.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The runtime limit and wakeup granularity used to be a function of
granularity; that was incorrect once granularity changed to sched_latency.
Fix this to make wakeup granularity a function of min-granularity,
and the runtime limit equal to latency.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
due to adaptive granularity scheduling the role of sched_granularity
has changed to "minimum granularity", so rename the variable (and the
tunable) accordingly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Instead of specifying the preemption granularity, specify the wanted
latency. By fixing the granularity to a constant, the wakeup latency
becomes a function of the number of running tasks on the rq.
Invert this relation.
sysctl_sched_granularity becomes a minimum for the dynamic granularity
computed from the new sysctl_sched_latency.
Then use this latency to do more intelligent granularity decisions: if
there are fewer tasks running then we can schedule coarser. This helps
performance while still always keeping the latency target.
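The effective granularity then becomes a function of load, sketched as:

	/* fewer runnable tasks => coarser scheduling, bounded by the tunable */
	gran = sysctl_sched_latency / nr_running;
	if (gran < sysctl_sched_granularity)
		gran = sysctl_sched_granularity;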
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the lockdep sysctls not depend on CONFIG_SCHED_DEBUG.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>