Add a prototype for driver_init() in include/linux/device.h.
Also remove a static function of the same name in drivers/acpi/ibm_acpi.c to
ibm_acpi_driver_init() to fix the namespace collision.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since kobject_uevent() function does not return an integer value to
indicate if its operation was completed with success or not, it is worth
changing it in order to report a proper status (success or error) instead
of returning void.
[randy.dunlap@oracle.com: Fix inline kobject functions]
Cc: Mauricio Lin <mauriciolin@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since commit 368c73d4f6 the kernel will try
to update the non-writeable BAR registers 0..3 of PIIX4 IDE adapters if
pci_assign_unassigned_resources() is used to do full resource assignment of
the bus. This fails because in the PIIX4 these BAR registers have
implicitly assumed values and read back as zero; it used to work because
the kernel used to just write zero to that register the read back value did
match what was written.
The fix is a new resource flag IORESOURCE_PCI_FIXED used to mark a resource
as non-movable. This will also be useful to keep other import system
resources from being moved around - for example system consoles on PCI
busses.
[akpm@osdl.org: cleanup]
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I don't see any good reason for exporting device IDs to userspace.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch is designed to fix:
- Disk eating corruptor on KT7 after resume from RAM
- VIA IRQ handling
- VIA fixups for bus lockups after resume from RAM
The core of this is to add a table of resume fixups run at resume time.
We need to do this for a variety of boards and features, but particularly
we need to do this to get various critical VIA fixups done on resume.
The second part of the problem is to handle VIA IRQ number rules which
are a bit odd and need special handling for PIC interrupts. Various
patches broke various boxes and while this one may not be perfect
(hopefully it is) it ensures the workaround is applied to the right
devices only.
From: Jean Delvare <khali@linux-fr.org>
Now that PCI quirks are replayed on software resume, we can safely
re-enable the Asus SMBus unhiding quirk even when software suspend support
is enabled.
[akpm@osdl.org: fix const warning]
Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a few #defines for grabbing and working with the address fields
in a HT_CAPTYPE_MSI_MAPPING capability. All from the HT spec v3.00.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are already several places in the kernel that want to search a PCI
device for a given Hypertransport capability. Although this is possible
using pci_find_capability() etc., it makes sense to encapsulate that
logic in a helper - pci_find_ht_capability().
To cater for searching exhaustively for a capability, we also provide
pci_find_next_ht_capability().
We also need to cater for the fact that the HT capability fields may be
either 3 or 5 bits wide. pci_find_ht_capability() deals with this for you,
but callers using the #defines directly must handle that themselves.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This works like pci_dev_present but instead of returning boolean returns
the matching pci_device_id entry. This makes it much more useful. Code
bloat is basically nil as the old boolean function is rewritten in terms of
the new one.
This will be used by the updated VIA PCI quirks for one
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
pci: add class codes for Wireless RF controllers
Add PCI codes to include/linux/pci_ids.h for RF controllers; first
batch of these devices seem to be the Ultra-Wide-Band and Wireless USB
controllers (WHCI spec).
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently we allow any merge, even if the io originates from different
processes. This can cause really bad starvation and unfairness, if those
ios happen to be synchronous (reads or direct writes).
So add a allow_merge hook to the io scheduler ops, so an io scheduler can
help decide whether a bio/process combination may be merged with an
existing request.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Add generic abstract layer for display output switch control. The output
sysfs class driver provides an abstract video output layer that can be used to
hook platform specific methods to enable/disable video output device through
common sysfs interface.
Signed-off-by: Luming Yu <Luming.yu@intel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch set adds generic abstract layer support for acpi video driver to
have generic user interface to control backlight and output switch control by
leveraging the existing backlight sysfs class driver, and by adding a new
video output sysfs class driver.
This patch:
Add dev argument for backlight_device_register to link the class device to
real device object. The platform specific driver should find a way to get the
real device object for their video device.
[akpm@osdl.org: build fix]
[akpm@osdl.org: fix msi-laptop.c]
Signed-off-by: Luming Yu <Luming.yu@intel.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
[PATCH] Generic HID layer - update MAINTAINERS
input/hid: Supporting more keys from the HUT Consumer Page
[PATCH] Generic HID layer - build: USB_HID should select HID
The blk_rq_unmap_user() API is not very nice. It expects the caller to
know that rq->bio has to be reset to the original bio, and it will
silently do nothing if that is not done. Instead make it explicit that
we need to pass in the first bio, by expecting a bio argument.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We have full flexibility of merging parameters now, so we can remove the
hooks that define back/front/request merge strategies. Nobody is using
them anymore.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
On architectures where the atomicity of the bit operations is handled by
external means (ie a separate spinlock to protect concurrent accesses),
just doing a direct assignment on the workqueue data field (as done by
commit 4594bf159f) can cause the
assignment to be lost due to lack of serialization with the bitops on
the same word.
So we need to serialize the assignment with the locks on those
architectures (notably older ARM chips, PA-RISC and sparc32).
So rather than using an "unsigned long", let's use "atomic_long_t",
which already has a safe assignment operation (atomic_long_set()) on
such architectures.
This requires that the atomic operations use the same atomicity locks as
the bit operations do, but that is largely the case anyway. Sparc32
will probably need fixing.
Architectures (including modern ARM with LL/SC) that implement sane
atomic operations for SMP won't see any of this matter.
Cc: Russell King <rmk+lkml@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: David Miller <davem@davemloft.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Linux Arch Maintainers <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nobody uses it, but it was still wrong. Using the macro argument name
'work' meant that when we used 'work' as a member name, that would also
get replaced by the macro argument.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It has caused more problems than it ever really solved, and is
apparently not getting cleaned up and fixed. We can put it back when
it's stable and isn't likely to make warning or bug events worse.
In the meantime, enable frame pointers for more readable stack traces.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On USB keyboards lots of hot/internet keys are not working. This patch
adds support for a number of keys from the USB HID Usage Table
(http://www.usb.org/developers/devclass_docs/Hut1_12.pdf).
It also adds several new key codes. Most of them are used on real world
keyboards I know. I added some others (KEY_+ EDITOR, GRAPHICSEDITOR, DATABASE,
NEWS, VOICEMAIL, VIDEOPHONE) to avoid "holes".
I also added KEY_ZOOMRESET as it is possible to have a inet keyboard and a
remote control in parallel and it makes sense to have them behave differently.
Signed-off-by: Florian Festi <ffesti@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Remove the deferred hooks and all related code as scheduled in
feature-removal-schedule.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
platform_device_add_data() makes a copy of the data that is given to it,
and thus the parameter can be const. This removes a warning when data
from get_property() on powerpc is handed to platform_device_add_data(),
as get_property() returns a const pointer.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
To allow a more effective copy_user_highpage() on certain architectures,
a vma argument is added to the function and cow_user_page() allowing
the implementation of these functions to check for the VM_EXEC bit.
The main part of this patch was originally written by Ralf Baechle;
Atushi Nemoto did the the debugging.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Problem:
1. There is a process containing two thread (T1 and T2). The
thread T1 calls fork(). Then dup_mmap() function called on T1 context.
static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
...
flush_cache_mm(current->mm);
... /* A */
(write-protect all Copy-On-Write pages)
... /* B */
flush_tlb_mm(current->mm);
...
2. When preemption happens between A and B (or on SMP kernel), the
thread T2 can run and modify data on COW pages without page fault
(modified data will stay in cache).
3. Some time after fork() completed, the thread T2 may cause a page
fault by write-protect on a COW page.
4. Then data of the COW page will be copied to newly allocated
physical page (copy_cow_page()). It reads data via kernel mapping.
The kernel mapping can have different 'color' with user space
mapping of the thread T2 (dcache aliasing). Therefore
copy_cow_page() will copy stale data. Then the modified data in
cache will be lost.
In order to allow architecture code to deal with this problem allow
architecture code to override copy_user_highpage() by defining
__HAVE_ARCH_COPY_USER_HIGHPAGE in <asm/page.h>.
The main part of this patch was originally written by Ralf Baechle;
Atushi Nemoto did the the debugging.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6:
hwmon: Add MAINTAINERS entry for new ams driver
hwmon: New AMS hardware monitoring driver
hwmon/w83793: Add documentation and maintainer
hwmon: New Winbond W83793 hardware monitoring driver
hwmon: Update Rudolf Marek's e-mail address
hwmon/f71805f: Fix the device address decoding
hwmon/f71805f: Always create all fan inputs
hwmon/f71805f: Add support for the Fintek F71872F/FG chip
hwmon: New PC87427 hardware monitoring driver
hwmon/it87: Remove the SMBus interface support
hwmon/hdaps: Update the list of supported devices
hwmon/hdaps: Move the DMI detection data to .data
hwmon/pc87360: Autodetect the VRM version
hwmon/f71805f: Document the fan control features
hwmon/f71805f: Add support for "speed mode" fan speed control
hwmon/f71805f: Support DC fan speed control mode
hwmon/f71805f: Let the user adjust the PWM base frequency
hwmon/f71805f: Add manual fan speed control
hwmon/f71805f: Store the fan control registers
Run this:
#!/bin/sh
for f in $(grep -Erl "\([^\)]*\) *k[cmz]alloc" *) ; do
echo "De-casting $f..."
perl -pi -e "s/ ?= ?\([^\)]*\) *(k[cmz]alloc) *\(/ = \1\(/" $f
done
And then go through and reinstate those cases where code is casting pointers
to non-pointers.
And then drop a few hunks which conflicted with outstanding work.
Cc: Russell King <rmk@arm.linux.org.uk>, Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Paul Fulghum <paulkf@microgate.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Karsten Keil <kkeil@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define an op descriptor struct, use it to simplify nfsd4_proc_compound().
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tuck away the replay_owner in the cstate while we're at it.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the saved and current filehandles together into all the nfsd4 compound
operations.
I want a unified interface to these operations so we can just call them by
pointer and throw out the huge switch statement.
Also I'll eventually want a structure like this--that holds the state used
during compound processing--for deferral.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A comment here incorrectly states that "slack_space" is measured in words, not
bytes. Remove the comment, and adjust a variable name and a few comments to
clarify the situation.
This is pure cleanup; there should be no change in functionality.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the tracking of the user space watchdog process from using
a pid_t to use struct pid. This makes us safe from pid wrap around issues and
prepares the way for the pid namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
smbfs keeps track of the user space server process in conn_pid. This converts
that track to use a struct pid instead of pid_t. This keeps us safe from pid
wrap around issues and prepares the way for the pid namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently this driver tracks user space clients it should send signals to. In
the presenct of file descriptor passing this is appears susceptible to
confusion from pid wrap around issues.
Replacing this with a struct pid prevents us from getting confused, and
prepares for a pid namespace implementation.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Annotated, all places switched to keeping status net-endian.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All kcalloc() calls of the form "kcalloc(1,...)" are converted to the
equivalent kzalloc() calls, and a few kcalloc() calls with the incorrect
ordering of the first two arguments are fixed.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we print an assert due to scheduling-in-atomic bugs, and if lockdep
is enabled, then the IRQ tracing information of lockdep can be printed
to pinpoint the code location that disabled interrupts. This saved me
quite a bit of debugging time in cases where the backtrace did not
identify the irq-disabling site well enough.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most distributions enable sysrq support but set it to 0 by default. Add a
sysrq_always_enabled boot option to always-enable sysrq keys. Useful for
debugging - without having to modify the disribution's config files (which
might not be possible if the kernel is on a live CD, etc.).
Also, while at it, clean up the sysrq interfaces.
[bunk@stusta.de: make sysrq_always_enabled_setup() static]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement block device specific .direct_IO method instead of going through
generic direct_io_worker for block device.
direct_io_worker() is fairly complex because it needs to handle O_DIRECT on
file system, where it needs to perform block allocation, hole detection,
extents file on write, and tons of other corner cases. The end result is
that it takes tons of CPU time to submit an I/O.
For block device, the block allocation is much simpler and a tight triple
loop can be written to iterate each iovec and each page within the iovec in
order to construct/prepare bio structure and then subsequently submit it to
the block layer. This significantly speeds up O_D on block device.
[akpm@osdl.org: small speedup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add "relatime" (relative atime) support. Relative atime only updates the
atime if the previous atime is older than the mtime or ctime. Like
noatime, but useful for applications like mutt that need to know when a
file has been read since it was last modified.
A corresponding patch against mount(8) is available at
http://userweb.kernel.org/~akpm/mount-relative-atime.txt
Signed-off-by: Valerie Henson <val_henson@linux.intel.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, to tell a task that it should go to the refrigerator, we set the
PF_FREEZE flag for it and send a fake signal to it. Unfortunately there
are two SMP-related problems with this approach. First, a task running on
another CPU may be updating its flags while the freezer attempts to set
PF_FREEZE for it and this may leave the task's flags in an inconsistent
state. Second, there is a potential race between freeze_process() and
refrigerator() in which freeze_process() running on one CPU is reading a
task's PF_FREEZE flag while refrigerator() running on another CPU has just
set PF_FROZEN for the same task and attempts to reset PF_FREEZE for it. If
the refrigerator wins the race, freeze_process() will state that PF_FREEZE
hasn't been set for the task and will set it unnecessarily, so the task
will go to the refrigerator once again after it's been thawed.
To solve first of these problems we need to stop using PF_FREEZE to tell
tasks that they should go to the refrigerator. Instead, we can introduce a
special TIF_*** flag and use it for this purpose, since it is allowed to
change the other tasks' TIF_*** flags and there are special calls for it.
To avoid the freeze_process()-refrigerator() race we can make
freeze_process() to always check the task's PF_FROZEN flag after it's read
its "freeze" flag. We should also make sure that refrigerator() will
always reset the task's "freeze" flag after it's set PF_FROZEN for it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When some objects are allocated by one CPU but freed by another CPU we can
consume lot of cycles doing divides in obj_to_index().
(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))
Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are
quite modern cpus and the divide is much more expensive on oldest
architectures :
On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1
cycle for a multiply.
Doing some math, we can use a reciprocal multiplication instead of a divide.
If we want to compute V = (A / B) (A and B being u32 quantities)
we can instead use :
V = ((u64)A * RECIPROCAL(B)) >> 32 ;
where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
Note :
I wrote pure C code for clarity. gcc output for i386 is not optimal but
acceptable :
mull 0x14(%ebx)
mov %edx,%eax // part of the >> 32
xor %edx,%edx // useless
mov %eax,(%esp) // could be avoided
mov %edx,0x4(%esp) // useless
mov (%esp),%ebx
[akpm@osdl.org: small cleanups]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:
cpuset_zone_allowed_hardwall()
cpuset_zone_allowed_softwall()
Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.
If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.
Unfortunately, this meant that users would end up with the softwall version
without thinking about it. Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.
The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)
The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)
This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.
If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.
This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines. It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.
For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.
Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
More cleanups for slab.h
1. Remove tabs from weird locations as suggested by Pekka
2. Drop the check for NUMA and SLAB_DEBUG from the fallback section
as suggested by Pekka.
3. Uses static inline for the fallback defs as also suggested by Pekka.
4. Make kmem_ptr_valid take a const * argument.
5. Separate the NUMA fallback definitions from the kmalloc_track fallback
definitions.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a response to an earlier discussion on linux-mm about splitting
slab.h components per allocator. Patch is against 2.6.19-git11. See
http://marc.theaimsgroup.com/?l=linux-mm&m=116469577431008&w=2
This patch cleans up the slab header definitions. We define the common
functions of slob and slab in slab.h and put the extra definitions needed
for slab's kmalloc implementations in <linux/slab_def.h>. In order to get
a greater set of common functions we add several empty functions to slob.c
and also rename slob's kmalloc to __kmalloc.
Slob does not need any special definitions since we introduce a fallback
case. If there is no need for a slab implementation to provide its own
kmalloc mess^H^H^Hacros then we simply fall back to __kmalloc functions.
That is sufficient for SLOB.
Sort the function in slab.h according to their functionality. First the
functions operating on struct kmem_cache * then the kmalloc related
functions followed by special debug and fallback definitions.
Also redo a lot of comments.
Signed-off-by: Christoph Lameter <clameter@sgi.com>?
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fields of struct pipe_buf_operations have not a precise layout (ie not
optimized to fit cache lines nor reduce cache line ping pongs)
The bufs[] array is *large* and is placed near the beginning of the
structure, so all following fields have a large offset. This is
unfortunate because many archs have smaller instructions when using small
offsets relative to a base register. On x86 for example, 7 bits offsets
have smaller instruction lengths.
Moving bufs[] at the end of pipe_buf_operations permits all fields to have
small offsets, and reduce text size, and icache pressure.
# size vmlinux.pre vmlinux
text data bss dec hex filename
3268989 664356 492196 4425541 438745 vmlinux.pre
3268765 664356 492196 4425317 438665 vmlinux
So this patch reduces text size by 224 bytes on my x86_64 machine. Similar
results on ia32.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 373beb35cd.
No one is using this identifier yet. The purpose of this identifier is to
export nsproxy to user space which is wrong. nsproxy is an internal
implementation optimization, which should keep our fork times from getting
slower as we increase the number of global namespaces you don't have to
share.
Adding a global identifier like this is inappropriate because it makes
namespaces inherently non-recursive, greatly limiting what we can do with
them in the future.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- pipe/splice should use const pipe_buf_operations and file_operations
- struct pipe_inode_info has an unused field "start" : get rid of it.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
Fix inotify maintainers entry
Fix typo in new debug options.
Jon needs a new shift key.
fs: Convert kmalloc() + memset() to kzalloc() in fs/.
configfs.h: Remove dead macro definitions.
kconfig: Standardize "depends" -> "depends on" in Kconfig files
e100: replace kmalloc with kcalloc
um: replace kmalloc+memset with kzalloc
fix typo in net/ipv4/ip_fragment.c
include/linux/compiler.h: reject gcc 3 < gcc 3.2
Kconfig: fix spelling error in config KALLSYMS help text
Remove duplicate "have to" in comment
Fix small typo in drivers/serial/icom.c
Use consistent casing in help message
EXT{2,3,4}_FS: remove outdated part of the help text
Delete the __ATTR-related macro definitions since these are now
defined in include/linux/sysfs.h.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
The kernel doesn't compile with GCC <3.2, do not allow it to succeed if GCC
3.0.x or 3.1.x are used.
Signed-off-by: Alistair John Strachan <s0348365@sms.ed.ac.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Introduced in commit 7cc13edc13.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
* 'i2c-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6:
i2c: Fix OMAP clock prescaler to match the comment
i2c: Refactor a kfree in i2c-dev
i2c: Fix return value check in i2c-dev
i2c: Enable PEC on more i2c-i801 devices
i2c: Discard the i2c algo del_bus wrappers
i2c: New ARM Versatile/Realview bus driver
i2c: fix broken ds1337 initialization
i2c: i2c-i801 documentation update
i2c: Use the __ATTR macro where possible
i2c: Whitespace cleanups
i2c: Use put_user instead of copy_to_user where possible
i2c: New Atmel AT91 bus driver
i2c: Add support for nested i2c bus locking
i2c: Cleanups to the i2c-nforce2 bus driver
i2c: Add request/release_mem_region to i2c-ibm_iic bus driver
i2c: New Philips PNX bus driver
i2c: Delete the broken i2c-ite bus driver
i2c: Update the list of driver IDs
i2c: Fix documentation typos
This interface was useless as the LPC ISA-like interface is always
available, is faster, and is more reliable. This cuts the driver
size by some 20%.
This change is also required to later convert the it87 driver to a
platform driver, so that we can get rid of i2c-isa in a near future.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
seqlock_init() needs to use spin_lock_init() for dynamic locks, so that
lockdep is notified about the presence of a new lock.
(this is a fallout of the recent networking merge, which started using
the so-far unused seqlock_init() API.)
This fix solves the following lockdep-internal warning on current -git:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
__lock_acquire+0x10c/0x9f9
lock_acquire+0x56/0x72
_spin_lock+0x35/0x42
neigh_destroy+0x9d/0x12e
neigh_periodic_timer+0x10a/0x15c
run_timer_softirq+0x126/0x18e
__do_softirq+0x6b/0xe6
do_softirq+0x64/0xd2
ksoftirqd+0x82/0x138
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While working on bidi support at struct request level
I have found that blk_queue_activity_fn is actually never used.
The only user is in ide-probe.c with this code:
/* enable led activity for disk drives only */
if (drive->media == ide_disk && hwif->led_act)
blk_queue_activity_fn(q, hwif->led_act, drive);
And led_act is never initialized anywhere.
(Looking back at older kernels it was used in the PPC arch, but was removed around 2.6.18)
Unless it is all for future use off course.
(this patch is against linux-2.6-block.git as off 2006/12/4)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (36 commits)
[POWERPC] Generic BUG for powerpc
[PPC] Fix compile failure do to introduction of PHY_POLL
[POWERPC] Only export __mtdcr/__mfdcr if CONFIG_PPC_DCR is set
[POWERPC] Remove old dcr.S
[POWERPC] Fix SPU coredump code for max_fdset removal
[POWERPC] Fix irq routing on some 32-bit PowerMacs
[POWERPC] ps3: Add vuart support
[POWERPC] Support ibm,dynamic-reconfiguration-memory nodes
[POWERPC] dont allow pSeries_probe to succeed without initialising MMU
[POWERPC] micro optimise pSeries_probe
[POWERPC] Add SPURR SPR to sysfs
[POWERPC] Add DSCR SPR to sysfs
[POWERPC] Fix 440SPe CPU table entry
[POWERPC] Add support for FP emulation for the e300c2 core
[POWERPC] of_device_register: propagate device_create_file return code
[POWERPC] Fix mmap of PCI resource with hack for X
[POWERPC] iSeries: head_64.o needs to depend on lparmap.s
[POWERPC] cbe_thermal: Fix initialization of sysfs attribute_group
[POWERPC] Remove QE header files from lite5200.c
[POWERPC] of_platform_make_bus_id(): make `magic' int
...
That accumulated over the last months hackaton, shame on me for not
using git-apply whitespace helping hand, will do that from now on.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
This patch
* resolves a bug where packets smaller than 32/64 bytes resulted in sending rates of 0
* supports all sending rates from 1/64 bytes/second up to 4Gbyte/second
* simplifies the present overflow problems in calculations
Current sending rate X and the cached value X_recv of the receiver-estimated
sending rate are both scaled by 64 (2^6) in order to
* cope with low sending rates (minimally 1 byte/second)
* allow upgrading to use a packets-per-second implementation of CCID 3
* avoid calculation errors due to integer arithmetic cut-off
The patch implements a revised strategy from
http://www.mail-archive.com/dccp@vger.kernel.org/msg01040.html
The only difference with regard to that strategy is that t_ipi is already
used in the calculation of the nofeedback timeout, which saves one division.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
We should not initialize rootfs before all the core initializers have
run. So do it as a separate stage just before starting the regular
driver initializers.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As reported by Andy Whitcroft, at least the SLES9 initrd build process
depends on getting the kernel version from the kernel binary. It does
that by simply trawling the binary and looking for the signature of the
"linux_banner" string (the string "Linux version " to be exact. Which
is really broken in itself, but whatever..)
That got broken when the string was changed to allow /proc/version to
change the UTS release information dynamically, and "get_kernel_version"
thus returned "%s" (see commit a2ee8649ba:
"[PATCH] Fix linux banner utsname information").
This just restores "linux_banner" as a static string, which should fix
the version finding. And /proc/version simply uses a different string.
To avoid wasting even that miniscule amount of memory, the early boot
string should really be marked __initdata, but that just causes the same
bug in SLES9 to re-appear, since it will then find other occurrences of
"Linux version " first.
Cc: Andy Whitcroft <apw@shadowen.org>
Acked-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Steve Fox <drfickle@us.ibm.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
PHY_POLL is defined in <linux/phy.h> include it in <linux/fsl_devices.h> so
board code will have it defined.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Remove extraneous whitespace from various i2c headers and core files,
like space-before-tab and whitespace at end of line.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
This patch adds the 'level' field into the i2c_adapter structure, which is
used to represent the 'logical' level of nesting for the purposes of
lockdep. This field is then used in the i2c_transfer() function, to
acquire the per-adapter bus_lock with correct nesting level.
Signed-off-by: Jiri Kosina <jikos@jikos.cz>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
New I2C bus driver for Philips ARM boards (Philips IP3204 I2C IP
block). This I2C controller can be found on (at least) PNX010x,
PNX52xx and PNX4008 Philips boards.
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
The rest of the ITE8172 support was already removed from MIPS tree.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
* A chip driver ID was assigned to the Radeon, while it is an adapter
so it needs an i2c adapter ID.
* The SAA7191 is a video decoder, not encoder.
* The icspll driver is dead, and will never be ported to Linux 2.6.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (132 commits)
V4L/DVB 4949b: Fix container_of pointer retreival
V4L/DVB (4949a): Fix INIT_WORK
V4L/DVB (4949): Cxusb: codingstyle cleanups
V4L/DVB (4948): Cxusb: Convert tuner functions to use dvb_pll_attach
V4L/DVB (4947): Cx88: trivial cleanups
V4L/DVB (4946): Cx88: Move cx88_dvb_bus_ctrl out of the card-specific area
V4L/DVB (4945): Cx88: consolidate cx22702_config structs
V4L/DVB (4944): Cx88: Convert DViCO FusionHDTV Hybrid to use dvb_pll_attach
V4L/DVB (4943): Cx88: cleanup dvb_pll_attach for lgdt3302 tuners
V4L/DVB (4953): Usbvision minor fixes
V4L/DVB (4951): Add version.h, since it is required for VIDIOC_QUERYCAP
V4L/DVB (4940): Or51211: Changed SNR and signal strength calculations
V4L/DVB (4939): Or51132: Changed SNR and signal strength reporting
V4L/DVB (4938): Cx88: Convert lgdt3302 tuning function to use dvb_pll_attach
V4L/DVB (4941): Remove LINUX_VERSION_CODE and fix identations
V4L/DVB (4942): Whitespace cleanups
V4L/DVB (4937): Usbvision cleanup and code reorganization
V4L/DVB (4936): Make MT4049FM5 tuner to set FM Gain to Normal
V4L/DVB (4935): Added the capability of selecting fm gain by tuner
V4L/DVB (4934): Usbvision radio requires GainNormal at e register
...
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mostly changing alignment. Just some general cleanup.
[akpm@osdl.org: build fix]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a round_jiffies() function as well as a round_jiffies_relative()
function. These functions round a jiffies value to the next whole second.
The primary purpose of this rounding is to cause all "we don't care exactly
when" timers to happen at the same jiffy.
This avoids multiple timers firing within the second for no real reason;
with dynamic ticks these extra timers cause wakeups from deep sleep CPU
sleep states and thus waste power.
The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
waking up at the exact same time (and hitting the same lock/cachelines
there)
[akpm@osdl.org: fix variable type]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides an improved fdtable allocation scheme, useful for
expanding fdtable file descriptor entries. The main focus is on the fdarray,
as its memory usage grows 128 times faster than that of an fdset.
The allocation algorithm sizes the fdarray in such a way that its memory usage
increases in easy page-sized chunks. The overall algorithm expands the allowed
size in powers of two, in order to amortize the cost of invoking vmalloc() for
larger allocation sizes. Namely, the following sizes for the fdarray are
considered, and the smallest that accommodates the requested fd count is
chosen:
pagesize / 4
pagesize / 2
pagesize <- memory allocator switch point
pagesize * 2
pagesize * 4
...etc...
Unlike the current implementation, this allocation scheme does not require a
loop to compute the optimal fdarray size, and can be done in efficient
straightline code.
Furthermore, since the fdarray overflows the pagesize boundary long before any
of the fdsets do, it makes sense to optimize run-time by allocating both
fdsets in a single swoop. Even together, they will still be, by far, smaller
than the fdarray. The fdtable->open_fds is now used as the anchor for the
fdset memory allocation.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded). When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.
Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded. We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets. The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).
In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.
Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal. This
patch removes fdtable->max_fdset. As an added bonus, most of the supporting
code becomes simpler.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If a bypass-the-cache read fails, we simply try again through the cache. If
it fails again it will trigger normal recovery precedures.
update 1:
From: NeilBrown <neilb@suse.de>
1/
chunk_aligned_read and retry_aligned_read assume that
data_disks == raid_disks - 1
which is not true for raid6.
So when an aligned read request bypasses the cache, we can get the wrong data.
2/ The cloned bio is being used-after-free in raid5_align_endio
(to test BIO_UPTODATE).
3/ We forgot to add rdev->data_offset when submitting
a bio for aligned-read
4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
so we need to invalidate the segment counts.
5/ We don't de-reference the rdev when the read completes.
This means we need to record the rdev to so it is still
available in the end_io routine. Fortunately
bi_next in the original bio is unused at this point so
we can stuff it in there.
6/ We leak a cloned bio if the target rdev is not usable.
From: NeilBrown <neilb@suse.de>
update 2:
1/ When aligned requests fail (read error) they need to be retried
via the normal method (stripe cache). As we cannot be sure that
we can process a single read in one go (we may not be able to
allocate all the stripes needed) we store a bio-being-retried
and a list of bioes-that-still-need-to-be-retried.
When find a bio that needs to be retried, we should add it to
the list, not to single-bio...
2/ We were never incrementing 'scnt' when resubmitting failed
aligned requests.
[akpm@osdl.org: build fix]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ESB2 appears to emit spurious DMA interrupts when configured for native
mode and handling ATAPI devices. Stratus were able to pin this bug down and
produce a patch. This is a rework which applies the fixup only to the ESB2
(for now). We can apply it to other chips later if the same problem is found.
This code has been tested and confirmed to fix the problem on the tested
systems.
Signed-off-by: Alan Cox <alan@redhat.com>
(Most of the hard work done by Stratus however)
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove scheduler stats lb_stopbalance counter. This counter can be
calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to
create gazillion counters while we can derive the value.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently at a particular domain, each cpu in the sched group will do a
load balance at the frequency of balance_interval. More the cores and
threads, more the cpus will be in each sched group at SMP and NUMA domain.
And we endup spending quite a bit of time doing load balancing in those
domains.
Fix this by making only one cpu(first idle cpu or first cpu in the group if
all the cpus are busy) in the sched group do the load balance at that
particular sched domain and this load will slowly percolate down to the
other cpus with in that group(when they do load balancing at lower
domains).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Large sched domains can be very expensive to scan. Add an option SD_SERIALIZE
to the sched domain flags. If that flag is set then we make sure that no
other such domain is being balanced.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
softirq.
We calculate the earliest time for each layer of sched domains to be rescanned
(this is the rescan time for idle) and use the earliest of those to schedule
the softirq via a new field "next_balance" added to struct rq.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With SMT, if the logical processor is busy, load balance happens for every
8msec(min)-16msec(max). There is no need to do this often, as this is just
for fairness(to maintain uniform runqueue lengths) and default time slice
anyhow is 100msec.
Appended patch increases this interval to 64msec(min)-128msec(max) when the
logical processor is busy.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The present per-task IO accounting isn't very useful. It simply counts the
number of bytes passed into read() and write(). So if a process reads 1MB
from an already-cached file, it is accused of having performed 1MB of I/O,
which is wrong.
(David Wright had some comments on the applicability of the present logical IO accounting:
For billing purposes it is useless but for workload analysis it is very
useful
read_bytes/read_calls average read request size
write_bytes/write_calls average write request size
read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
write_bytes/write_blocks ie logical/physical guess since pdflush writes can
be missed
I often look for logical larger than physical to see filesystem cache
problems. And the bytes/cpusec can help find applications that are
dominating the cache and causing slow interactive response from page cache
contention.
I want to find the IO intensive applications and make sure they are doing
efficient IO. Thus the acctcms(sysV) or csacms command would give the high
IO commands).
This patchset adds new accounting which tries to be more accurate. We account
for three things:
reads:
attempt to count the number of bytes which this process really did cause
to be fetched from the storage layer. Done at the submit_bio() level, so it
is accurate for block-backed filesystems. I also attempt to wire up NFS and
CIFS.
writes:
attempt to count the number of bytes which this process caused to be sent
to the storage layer. This is done at page-dirtying time.
The big inaccuracy here is truncate. If a process writes 1MB to a file
and then deletes the file, it will in fact perform no writeout. But it will
have been accounted as having caused 1MB of write.
So...
cancelled_writes:
account the number of bytes which this process caused to not happen, by
truncating pagecache.
We _could_ just subtract this from the process's `write' accounting. But
that means that some processes would be reported to have done negative
amounts of write IO, which is silly.
So we just report the raw number and punt this decision up to userspace.
Now, we _could_ account for writes at the physical I/O level. But
- This would require that we track memory-dirtying tasks at the per-page
level (would require a new pointer in struct page).
- It would mean that IO statistics for a process are usually only available
long after that process has exitted. Which means that we probably cannot
communicate this info via taskstats.
This patch:
Wire up the kernel-private data structures and the accessor functions to
manipulate them.
Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are some kernel-only bits in the middle of <linux/futex.h> which
should be removed in what we export to userspace.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add rtc_merge_alarm(), which can be used by rtc drivers to turn a partially
specified alarm expiry (i.e. most significant fields set to -1, as with the
RTC_ALM_SET ioctl()) into a fully specified expiry.
If the most significant specified field is earlier than the current time, the
least significant unspecified field is incremented.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
freezer.h uses task_struct fields so it should include sched.h.
CC [M] fs/jfs/jfs_txnmgr.o
In file included from fs/jfs/jfs_txnmgr.c:49:
include/linux/freezer.h: In function 'frozen':
include/linux/freezer.h:9: error: dereferencing pointer to incomplete type
include/linux/freezer.h:9: error: 'PF_FROZEN' undeclared (first use in this function)
include/linux/freezer.h:9: error: (Each undeclared identifier is reported only once
include/linux/freezer.h:9: error: for each function it appears in.)
include/linux/freezer.h: In function 'freezing':
include/linux/freezer.h:17: error: dereferencing pointer to incomplete type
include/linux/freezer.h:17: error: 'PF_FREEZE' undeclared (first use in this function)
include/linux/freezer.h: In function 'freeze':
include/linux/freezer.h:26: error: dereferencing pointer to incomplete type
include/linux/freezer.h:26: error: 'PF_FREEZE' undeclared (first use in this function)
include/linux/freezer.h: In function 'do_not_freeze':
include/linux/freezer.h:34: error: dereferencing pointer to incomplete type
include/linux/freezer.h:34: error: 'PF_FREEZE' undeclared (first use in this function)
include/linux/freezer.h: In function 'thaw_process':
include/linux/freezer.h:43: error: dereferencing pointer to incomplete type
include/linux/freezer.h:43: error: 'PF_FROZEN' undeclared (first use in this function)
include/linux/freezer.h:44: warning: implicit declaration of function 'wake_up_process'
include/linux/freezer.h: In function 'frozen_process':
include/linux/freezer.h:55: error: dereferencing pointer to incomplete type
include/linux/freezer.h:55: error: dereferencing pointer to incomplete type
include/linux/freezer.h:55: error: 'PF_FREEZE' undeclared (first use in this function)
include/linux/freezer.h:55: error: 'PF_FROZEN' undeclared (first use in this function)
fs/jfs/jfs_txnmgr.c: In function 'freezing':
include/linux/freezer.h:18: warning: control reaches end of non-void function
make[2]: *** [fs/jfs/jfs_txnmgr.o] Error 1
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a V4L2 driver for the OmniVision OV7670 camera.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
A driver for the Marvell M88ALP01 "CAFE" CMOS integrated camera
controller. This driver has been renamed "cafe_ccic" since my previous
patch set.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Two defines for V4L2, needed by the Cafe camera driver:
1) Add the RGB444 image format
2) Add the "init" internal command which is separate from "reset".
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
GLIBC uses them etc.
They are guarded by ifndef __KERNEL__ so nobody will start
accidently using them in the kernel again, it's just for
userspace.
Signed-off-by: David S. Miller <davem@davemloft.net>
XFRMGRP_REPORT uses 0x10 which is a group that belongs
to events. The correct value is 0x20.
We should really be using xfrm_nlgroups going forward; it was tempting
to delete the definition of XFRMGRP_REPORT but it would break at
least iproute2.
Signed-off-by: J Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
hh_lock was converted from rwlock to seqlock by Stephen.
To have a 100% benefit of this change, I suggest to place read mostly fields
of hh_cache in a separate cache line, because hh_refcnt may be changed quite
frequently on some busy machines.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restore API compatibility due to bits moved from rtnetlink.h to
separate headers.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hard header cache is in the main output path, so using
seqlock instead of reader/writer lock should reduce overhead.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pb_fnmode parameter has to be passed to usbhid, both for compatibility reasons
and also because it logically belongs there.
Also removes empty hid-input.c file in drivers/usb/input.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
hid_input_report() was needlessly USB-specific in USB HID. This patch
makes the function independent of HID implementation and fixes all
the current users. Bluetooth patches comply with this prototype.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
- hiddev is USB-only (agreed with Marcel Holtmann that Bluetooth currently
doesn't need it, and future planned interface (rawhid) will be more flexible
and usable)
- both HID and USB-hid can be now compiled as modules (wasn't possible before
hiddev was fully separated from generic HID layer)
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
- 'dev' in struct hid_device changed from struct usb_device to
struct device and fixed all the users
- renamed functions which are part of USB HID API from 'hid_*' to
'usbhid_*'
- force feedback initialization moved from common part into USB-specific
driver
- added usbhid.h header for USB HID API users
- removed USB-specific fields from struct hid_device and moved them
to new usbhid_device, which is pointed to by hid_device->driver_data
- fixed all USB users to use this new structure
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The "big main" split of USB HID code into generic HID code and
USB-transport specific HID handling.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In device-mapper I/O is sometimes queued within targets for later processing.
For example the multipath target can be configured to store I/O when no paths
are available instead of returning it -EIO.
This patch allows the device-mapper core to instruct a target to transfer the
contents of any such in-target queue back into the core. This frees up the
resources used by the target so the core can replace that target with an
alternative one and then resend the I/O to it. Without this patch the only
way to change the target in such circumstances involves returning the I/O with
an error back to the filesystem/application. In the multipath case, this
patch will let us add new paths for existing I/O to try after all the existing
paths have failed.
DMF_NOFLUSH_SUSPENDING
----------------------
If the DM_NOFLUSH_FLAG ioctl option is specified at suspend time, the
DMF_NOFLUSH_SUSPENDING flag is set in md->flags during dm_suspend(). It
is always cleared before dm_suspend() returns.
The flag must be visible while the target is flushing pending I/Os so it
is set before presuspend where the flush starts and unset after the wait
for md->pending where the flush ends.
Target drivers can check this flag by calling dm_noflush_suspending().
DM_MAPIO_REQUEUE / DM_ENDIO_REQUEUE
-----------------------------------
A target's map() function can now return DM_MAPIO_REQUEUE to request the
device mapper core queue the bio.
Similarly, a target's end_io() function can return DM_ENDIO_REQUEUE to request
the same. This has been labelled 'pushback'.
The __map_bio() and clone_endio() functions in the core treat these return
values as errors and call dec_pending() to end the I/O.
dec_pending
-----------
dec_pending() saves the pushback request in struct dm_io->error. Once all
the split clones have ended, dec_pending() will put the original bio on
the md->pushback list. Note that this supercedes any I/O errors.
It is possible for the suspend with DM_NOFLUSH_FLAG to be aborted while
in progress (e.g. by user interrupt). dec_pending() checks for this and
returns -EIO if it happened.
pushdback list and pushback_lock
--------------------------------
The bio is queued on md->pushback temporarily in dec_pending(), and after
all pending I/Os return, md->pushback is merged into md->deferred in
dm_suspend() for re-issuing at resume time.
md->pushback_lock protects md->pushback.
The lock should be held with irq disabled because dec_pending() can be
called from interrupt context.
Queueing bios to md->pushback in dec_pending() must be done atomically
with the check for DMF_NOFLUSH_SUSPENDING flag. So md->pushback_lock is
held when checking the flag. Otherwise dec_pending() may queue a bio to
md->pushback after the interrupted dm_suspend() flushes md->pushback.
Then the bio would be left in md->pushback.
Flag setting in dm_suspend() can be done without md->pushback_lock because
the flag is checked only after presuspend and the set value is already
made visible via the target's presuspend function.
The flag can be checked without md->pushback_lock (e.g. the first part of
the dec_pending() or target drivers), because the flag is checked again
with md->pushback_lock held when the bio is really queued to md->pushback
as described above. So even if the flag is cleared after the lockless
checkings, the bio isn't left in md->pushback but returned to applications
with -EIO.
Other notes on the current patch
--------------------------------
- md->pushback is added to the struct mapped_device instead of using
md->deferred directly because md->io_lock which protects md->deferred is
rw_semaphore and can't be used in interrupt context like dec_pending(),
and md->io_lock protects the DMF_BLOCK_IO flag of md->flags too.
- Don't issue lock_fs() in dm_suspend() if the DM_NOFLUSH_FLAG
ioctl option is specified, because I/Os generated by lock_fs() would be
pushed back and never return if there were no valid devices.
- If an error occurs in dm_suspend() after the DMF_NOFLUSH_SUSPENDING
flag is set, md->pushback must be flushed because I/Os may be queued to
the list already. (flush_and_out label in dm_suspend())
Test results
------------
I have tested using multipath target with the next patch.
The following tests are for regression/compatibility:
- I/Os succeed when valid paths exist;
- I/Os fail when there are no valid paths and queue_if_no_path is not
set;
- I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set;
- The queued I/Os above fail when suspend is issued without the
DM_NOFLUSH_FLAG ioctl option. I/Os spanning 2 multipath targets also
fail.
The following tests are for the normal code path of new pushback feature:
- Queued I/Os in the multipath target are flushed from the target
but don't return when suspend is issued with the DM_NOFLUSH_FLAG
ioctl option;
- The I/Os above are queued in the multipath target again when
resume is issued without path recovery;
- The I/Os above succeed when resume is issued after path recovery
or table load;
- Queued I/Os in the multipath target succeed when resume is issued
with the DM_NOFLUSH_FLAG ioctl option after table load. I/Os
spanning 2 multipath targets also succeed.
The following tests are for the error paths of the new pushback feature:
- When the bdget_disk() fails in dm_suspend(), the
DMF_NOFLUSH_SUSPENDING flag is cleared and I/Os already queued to the
pushback list are flushed properly.
- When suspend with the DM_NOFLUSH_FLAG ioctl option is interrupted,
o I/Os which had already been queued to the pushback list
at the time don't return, and are re-issued at resume time;
o I/Os which hadn't been returned at the time return with EIO.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide a dm ioctl option to request noflush suspending. (See next patch for
what this is for.) As the interface is extended, the version number is
incremented.
Other than accepting the new option through the interface, There is no change
to existing behaviour.
Test results:
Confirmed the option is given from user-space correctly.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tighten the use of return values from the target map and end_io functions.
Values of 2 and above are now explictly reserved for future use. There are no
existing targets using such values.
The patch has no effect on existing behaviour.
o Reserve return values of 2 and above from target map functions.
Any positive value currently indicates "mapping complete", but all
existing drivers use the value 1. We now make that a requirement
so we can assign new meaning to higher values in future.
The new definition of return values from target map functions is:
< 0 : error
= 0 : The target will handle the io (DM_MAPIO_SUBMITTED).
= 1 : Mapping completed (DM_MAPIO_REMAPPED).
> 1 : Reserved (undefined). Previously this was the same as '= 1'.
o Reserve return values of 2 and above from target end_io functions
for similar reasons.
DM_ENDIO_INCOMPLETE is introduced for a return value of 1.
Test results:
I have tested by using the multipath target.
I/Os succeed when valid paths exist.
I/Os are queued in the multipath target when there are no valid paths and
queue_if_no_path is set.
I/Os fail when there are no valid paths and queue_if_no_path is not set.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the interface of dm_suspend() so that we can pass several options
without increasing the number of parameters. The existing 'do_lockfs' integer
parameter is replaced by a flag DM_SUSPEND_LOCKFS_FLAG.
There is no functional change to the code.
Test results:
I have tested 'dmsetup suspend' command with/without the '--nolockfs'
option and confirmed the do_lockfs value is correctly set.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Mark the default colormaps read-only, as nobody should be allowed to
modify them
- Additionally mark color values as __read_mostly since they will only be
modified (very seldom) by fb_invert_cmaps()
- Add named C99-initializers in fb_cmap structs and use the ARRAY_SIZE()
macro
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: James Simmons <jsimmons@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Assign defaults most likely to please a new user:
1) generate some logging output
(verbose=2)
2) avoid injecting failures likely to lock up UI
(ignore_gfp_wait=1, ignore_gfp_highmem=1)
Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use bool-true-false throughout.
Signed-off-by: Don Mullis <dwm@meer.net>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides stacktrace filtering feature.
The stacktrace filter allows failing only for the caller you are
interested in.
For example someone may want to inject kmalloc() failures into
only e100 module. they want to inject not only direct kmalloc() call,
but also indirect allocation, too.
- e100_poll --> netif_receive_skb --> packet_rcv_spkt --> skb_clone
--> kmem_cache_alloc
This patch enables to detect function calls like this by stacktrace
and inject failures. The script Documentaion/fault-injection/failmodule.sh
helps it.
The range of text section of loaded e100 is expected to be
[/sys/module/e100/sections/.text, /sys/module/e100/sections/.exit.text)
So failmodule.sh stores these values into /debug/failslab/address-start
and /debug/failslab/address-end. The maximum stacktrace depth is specified
by /debug/failslab/stacktrace-depth.
Please see the example that demonstrates how to inject slab allocation
failures only for a specific module
in Documentation/fault-injection/fault-injection.txt
[dwm@meer.net: reject failure if any caller lies within specified range]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Don Mullis <dwm@meer.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides process filtering feature.
The process filter allows failing only permitted processes
by /proc/<pid>/make-it-fail
Please see the example that demostrates how to inject slab allocation
failures into module init/cleanup code
in Documentation/fault-injection/fault-injection.txt
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides fault-injection capability for disk IO.
Boot option:
fail_make_request=<probability>,<interval>,<space>,<times>
<interval> -- specifies the interval of failures.
<probability> -- specifies how often it should fail in percent.
<space> -- specifies the size of free space where disk IO can be issued
safely in bytes.
<times> -- specifies how many times failures may happen at most.
Debugfs:
/debug/fail_make_request/interval
/debug/fail_make_request/probability
/debug/fail_make_request/specifies
/debug/fail_make_request/times
Example:
fail_make_request=10,100,0,-1
echo 1 > /sys/blocks/hda/hda1/make-it-fail
generic_make_request() on /dev/hda1 fails once per 10 times.
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides base functions implement to fault-injection
capabilities.
- The function should_fail() is taken from failmalloc-1.0
(http://www.nongnu.org/failmalloc/)
[akpm@osdl.org: cleanups, comments, add __init]
Cc: <okuji@enbug.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Don Mullis <dwm@meer.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- wipe gcc -W warnings by int -> uint conversion
- move 2 global variables into their local place
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use only struct <name> instead of defining a new type <name_t>.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix `gcc -W' un/signed warnings by converting some ints -> uints.
- move 3 global variables into functions, where are they used.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the grungy swap all the occurrences in the right places patch that
goes with the updates. At this point we have the same functionality as
before (except that sgttyb() returns speeds not zero) and are ready to
begin turning new stuff on providing nobody reports lots of bugs
If you are a tty driver author converting an out of tree driver the only
impact should be termios->ktermios name changes for the speed/property
setting functions from your upper layers.
If you are implementing your own TCGETS function before then your driver
was broken already and its about to get a whole lot more painful for you so
please fix it 8)
Also fill in c_ispeed/ospeed on init for most devices, although the current
code will do this for you anyway but I'd like eventually to lose that extra
paranoia
[akpm@osdl.org: bluetooth fix]
[mp3@de.ibm.com: sclp fix]
[mp3@de.ibm.com: warning fix for tty3270]
[hugh@veritas.com: fix tty_ioctl powerpc build]
[jdike@addtoit.com: uml: fix ->set_termios declaration]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@de.ibm.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the core of the switch to the new framework. I've split it from the
driver patches which are mostly search/replace and would encourage people to
give this one a good hard stare.
The references to BOTHER and ISHIFT are the termios values that must be
defined by a platform once it wants to turn on "new style" ioctl support. The
code patches here ensure that providing
1. The termios overlays the ktermios in memory
2. The only new kernel only fields are c_ispeed/c_ospeed (or none)
the existing behaviour is retained. This is true for the patches at this
point in time.
Future patches will define BOTHER, ISHIFT and enable newer termios structures
for each architecture, and once they are all done some of the ifdefs also
vanish.
[akpm@osdl.org: warning fix]
[akpm@osdl.org: IRDA fix]
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Typedefs are considered ugly in the kernel. Eliminate them.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change cloned experimental driver according to original 1.9.1 moxa driver.
Some int->ulong conversions, outb ~UART_IER_THRI constant. Remove commented
stuff.
I also added printk line with info, if somebody wants to test it, he may
contact me as I can potentially debug the driver with him or just to confirm
it works properly.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a per pid_namespace child-reaper. This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space. Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.
This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the pid namespace framework to the nsproxy object. The copy of the pid
namespace only increases the refcount on the global pid namespace,
init_pid_ns, and unshare is not implemented.
There is no configuration option to activate or deactivate this feature
because this not relevant for the moment.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename struct pspace to struct pid_namespace for consistency with other
namespaces (uts_namespace and ipc_namespace). Also rename
include/linux/pspace.h to include/linux/pid_namespace.h and variables from
pspace to pid_ns.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and
allocated nsproxies are given -1.
This identifier will be used by a new syscall sys_bind_ns.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
other namespaces being developped for the containers : pid, uts, ipc, etc.
'namespace' variables and attributes are also renamed to 'mnt_ns'
Signed-off-by: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an anonymous union and ((deprecated)) to catch direct usage of the
session field.
[akpm@osdl.org: fix various missed conversions]
[jdike@addtoit.com: fix UML bug]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace occurences of task->signal->session by a new process_session() helper
routine.
It will be useful for pid namespaces to abstract the session pid number.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Alter roundup_pow_of_two() so that it can make use of ilog2() on a constant to
produce a constant value, retaining the ability for an arch to override it in
the non-const case.
This permits the function to be used to initialise variables.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This facility provides three entry points:
ilog2() Log base 2 of unsigned long
ilog2_u32() Log base 2 of u32
ilog2_u64() Log base 2 of u64
These facilities can either be used inside functions on dynamic data:
int do_something(long q)
{
...;
y = ilog2(x)
...;
}
Or can be used to statically initialise global variables with constant values:
unsigned n = ilog2(27);
When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant. They treat negative numbers as
unsigned.
When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.
[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.
Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Moved struct path from fs/namei.c to include/linux/namei.h. This allows many
places in the VFS, as well as any stackable filesystem to easily keep track of
dentry-vfsmount pairs.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename Reiserfs's struct path to struct treepath to prevent name collision
between it and struct path from fs/namei.c.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Cc: <reiserfs-dev@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce several fsstack_copy_* functions which allow stackable filesystems
(such as eCryptfs and Unionfs) to easily copy over (currently only) inode
attributes. This prevents code duplication and allows for code reuse.
[akpm@osdl.org: Remove unneeded wrapper]
[bunk@stusta.de: fs/stack.c should #include <linux/fs_stack.h>]
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch replaces bitreverse() by bitrev32. The only users of bitreverse()
are crc32 itself and via-velocity.
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides two bit reverse functions and bit reverse table.
- reverse the order of bits in a u32 value
u8 bitrev8(u8 x);
- reverse the order of bits in a u32 value
u32 bitrev32(u32 x);
- byte reverse table
const u8 byte_rev_table[256];
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds common handling for kernel BUGs, for use by architectures as
they wish. The code is derived from arch/powerpc.
The advantages of having common BUG handling are:
- consistent BUG reporting across architectures
- shared implementation of out-of-line file/line data
- implement CONFIG_DEBUG_BUGVERBOSE consistently
This means that in inline impact of BUG is just the illegal instruction
itself, which is an improvement for i386 and x86-64.
A BUG is represented in the instruction stream as an illegal instruction,
which has file/line information associated with it. This extra information is
stored in the __bug_table section in the ELF file.
When the kernel gets an illegal instruction, it first confirms it might
possibly be from a BUG (ie, in kernel mode, the right illegal instruction).
It then calls report_bug(). This searches __bug_table for a matching
instruction pointer, and if found, prints the corresponding file/line
information. If report_bug() determines that it wasn't a BUG which caused the
trap, it returns BUG_TRAP_TYPE_NONE.
Some architectures (powerpc) implement WARN using the same mechanism; if the
illegal instruction was the result of a WARN, then report_bug(Q) returns
CONFIG_DEBUG_BUGVERBOSE; otherwise it returns BUG_TRAP_TYPE_BUG.
lib/bug.c keeps a list of loaded modules which can be searched for __bug_table
entries. The architecture must call
module_bug_finalize()/module_bug_cleanup() from its corresponding
module_finalize/cleanup functions.
Unsetting CONFIG_DEBUG_BUGVERBOSE will reduce the kernel size by some amount.
At the very least, filename and line information will not be recorded for each
but, but architectures may decide to store no extra information per BUG at
all.
Unfortunately, gcc doesn't have a general way to mark an asm() as noreturn, so
architectures will generally have to include an infinite loop (or similar) in
the BUG code, so that gcc knows execution won't continue beyond that point.
gcc does have a __builtin_trap() operator which may be useful to achieve the
same effect, unfortunately it cannot be used to actually implement the BUG
itself, because there's no way to get the instruction's address for use in
generating the __bug_table entry.
[randy.dunlap@oracle.com: Handle BUG=n, GENERIC_BUG=n to prevent build errors]
[bunk@stusta.de: include/linux/bug.h must always #include <linux/module.h]
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md_open takes ->reconfig_mutex which causes lockdep to complain. This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.
I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen. However that causes bigger problems than a deadlock
and should be fixed independently.
So we flag the lock in md_open as a nested lock. This requires defining
mutex_lock_interruptible_nested.
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the old complex and crufty bd_mutex annotation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a sysfs and debugfs interface to the pktcdvd driver.
Look into the Documentation/ABI/testing/* files in the patch for more info.
Signed-off-by: Thomas Maier <balagi@justmail.de>
Signed-off-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a bio write queue congestion control to the pktcdvd driver with
fixed on/off marks. It prevents that the driver consumes a unlimited
amount of write requests.
[akpm@osdl.org: sync with congestion_wait() renaming]
Signed-off-by: Thomas Maier <balagi@justmail.de>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make set_special_pids() static, the only caller is daemonize().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the locking of signal->tty.
Use ->sighand->siglock to protect ->signal->tty; this lock is already used
by most other members of ->signal/->sighand. And unless we are 'current'
or the tasklist_lock is held we need ->siglock to access ->signal anyway.
(NOTE: sys_unshare() is broken wrt ->sighand locking rules)
Note that tty_mutex is held over tty destruction, so while holding
tty_mutex any tty pointer remains valid. Otherwise the lifetime of ttys
are governed by their open file handles. This leaves some holes for tty
access from signal->tty (or any other non file related tty access).
It solves the tty SLAB scribbles we were seeing.
(NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
be examined by someone familiar with the security framework, I think
it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
invocations)
[schwidefsky@de.ibm.com: 3270 fix]
[akpm@osdl.org: various post-viro fixes]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (76 commits)
[ARM] 4002/1: S3C24XX: leave parent IRQs unmasked
[ARM] 4001/1: S3C24XX: shorten reboot time
[ARM] 3983/2: remove unused argument to __bug()
[ARM] 4000/1: Osiris: add third serial port in
[ARM] 3999/1: RX3715: suspend to RAM support
[ARM] 3998/1: VR1000: LED platform devices
[ARM] 3995/1: iop13xx: add iop13xx support
[ARM] 3968/1: iop13xx: add iop13xx_defconfig
[ARM] Update mach-types
[ARM] Allow gcc to optimise arm_add_memory a little more
[ARM] 3991/1: i.MX/MX1 high resolution time source
[ARM] 3990/1: i.MX/MX1 more precise PLL decode
[ARM] 3986/1: H1940: suspend to RAM support
[ARM] 3985/1: ixp4xx clocksource cleanup
[ARM] 3984/1: ixp4xx/nslu2: Fix disk LED numbering (take 2)
[ARM] 3994/1: ixp23xx: fix handling of pci master aborts
[ARM] 3981/1: sched_clock for PXA2xx
[ARM] 3980/1: extend the ARM Versatile sched_clock implementation from 32 to 63 bit
[ARM] 3979/1: extend the SA11x0 sched_clock implementation from 32 to 63 bit period
[ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter
...
* 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] replace kmalloc+memset with kzalloc
[IA64] resolve name clash by renaming is_available_memory()
[IA64] Need export for csum_ipv6_magic
[IA64] Fix DISCONTIGMEM without VIRTUAL_MEM_MAP
[PATCH] Add support for type argument in PAL_GET_PSTATE
[IA64] tidy up return value of ip_fast_csum
[IA64] implement csum_ipv6_magic for ia64.
[IA64] More Itanium PAL spec updates
[IA64] Update processor_info features
[IA64] Add se bit to Processor State Parameter structure
[IA64] Add dp bit to cache and bus check structs
[IA64] SN: Correctly update smp_affinty mask
[IA64] sparse cleanups
[IA64] IA64 Kexec/kdump
Changes and updates.
1. Remove fake rendz path and related code according to discuss with Khalid Aziz.
2. fc.i offset fix in relocate_kernel.S.
3. iospic shutdown code eoi and mask race fix from Fujitsu.
4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner.
5. Send slave to SAL slave loop patch from Jay Lan.
6. Kdump on non-recoverable MCA event patch from Jay Lan
7. Use CTL_UNNUMBERED in kdump_on_init sysctl.
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This allows workqueue users to run just their own pending work, rather
than wait for the whole workqueue to finish running. This solves the
deadlock with networking libphy that was due to other workqueue entries
possibly needing a lock that was held by the routine that wanted to
flush its own work.
It's not wonderful: if you absolutely need to synchronize with the work
function having been executed, any user strictly speaking should have
its own completion tracking logic, since when we run things explicitly
by hand, the generic workqueue layer can no longer help us synchronize.
Also, this is strictly only usable for work that has been scheduled
without any delayed timers. You can not mix the new interface with
schedule_delayed_work().
But it's better than what we had currently.
Acked-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (73 commits)
[DLM] Clean up lowcomms
[GFS2] Change gfs2_fsync() to use write_inode_now()
[GFS2] Fix indent in recovery.c
[GFS2] Don't flush everything on fdatasync
[GFS2] Add a comment about reading the super block
[GFS2] Mount problem with the GFS2 code
[GFS2] Remove gfs2_check_acl()
[DLM] fix format warnings in rcom.c and recoverd.c
[GFS2] lock function parameter
[DLM] don't accept replies to old recovery messages
[DLM] fix size of STATUS_REPLY message
[GFS2] fs/gfs2/log.c:log_bmap() fix printk format warning
[DLM] fix add_requestqueue checking nodes list
[GFS2] Fix recursive locking in gfs2_getattr
[GFS2] Fix recursive locking in gfs2_permission
[GFS2] Reduce number of arguments to meta_io.c:getbuf()
[GFS2] Move gfs2_meta_syncfs() into log.c
[GFS2] Fix journal flush problem
[GFS2] mark_inode_dirty after write to stuffed file
[GFS2] Fix glock ordering on inode creation
...
In file included from include/asm/patch.h:14,
from arch/ia64/kernel/patch.c:10:
include/linux/elf.h:375: warning: "struct file" declared inside parameter list
include/linux/elf.h:375: warning: its scope is only this definition or declaration, which is probably not what you want
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The IPMI BT subdriver has been patched to survive "long busy" timeouts seen
during firmware upgrades and resets. The patch never returns the HOSED state,
synthesizes response messages with meaningful completion codes, and recovers
gracefully when the hardware finishes the long busy. The subdriver now issues
a "Get BT Capabilities" command and properly uses those results. More
informative completion codes are returned on error from transaction starts;
this logic was propogated to the KCS and SMIC subdrivers. Finally, indent and
other style quirks were normalized.
Signed-off-by: Rocky Craig <rocky.craig@hp.com>
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some commands and operations on a BMC can cause the BMC to "go away" for a
while. This can cause the automatic flag processing and other things of that
nature to timeout and generate annoying logs, or possibly cause other bad
things to happen when in firmware update mode.
Add detection of those commands (cold reset, warm reset, and any firmware
command) and turns off automatic processing for 30 seconds. It also add a
manual override either way.
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass in the sysfs name from the lower-level IPMI driver, as the coming IPMI
serial driver will need that to link properly from the serial device sysfs
directory.
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow nbd to expose the nbd-client daemon's PID in /sys/block/nbd<x>/pid.
This is helpful for tracking connection status of a device and for
determining which nbd devices are currently in use.
Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the ki_retried member from struct kiocb. I think the idea was
bounced around a while back, but Arnaldo pointed out another reason that we
should dig it up when he pointed out that the last cacheline of struct
kiocb only contains 4 bytes. By removing the debugging member, we save
more than the 8 byte on 64 bit machines.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The elf note saving code is currently duplicated over several
architectures. This cleanup patch simply adds code to a common file and
then replaces the arch-specific code with calls to the newly added code.
The only drawback with this approach is that s390 doesn't fully support
kexec-on-panic which for that arch leads to introduction of unused code.
Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- move some file_operations structs into the .rodata section
- move static strings from policy_types[] array into the .rodata section
- fix generic seq_operations usages, so that those structs may be defined
as "const" as well
[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add 3 more files to get "unifdef"ed when creating sanitized headers with
"make headers_install".
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: "John W. Linville" <linville@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename a poorly worded PCI ID for the Geode GX and CS5535 companion chips.
The graphics processor and host bridge actually live in the northbridge on
the integrated processor, not in the companion chip.
Signed-off-by: Jordan Crouse <jordan.crouse@amd.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a proper prototype for remove_inode_dquot_ref() in
include/linux/quotaops.h
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the locking self-test failures (of 'FAILURE' type) easier to debug by
printing more information.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the carta_random32.h header file. The carta_random32() function was
was put in and removed in favor of random32(). In the removal process, the
header file was forgotten.
Signed-off-by: Stephane Eranian <eranian@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide a common interface for all the subsystems to lock and unlock their
per-subsystem hotcpu mutexes.
When CONFIG_HOTPLUG_CPU is not set, these operations would be no-ops.
[akpm@osdl.org: macros -> inlines]
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On 32bits SMP platforms, 64bits i_size is protected by a seqcount
(i_size_seqcount).
When i_size is read or written, i_size_seqcount is read/written as well, so
it make sense to group these two fields together in the same cache line.
This patch moves i_size_seqcount next to i_size, and also moves i_version
to let offsetof(struct inode, i_size) being 0x40 instead of 0x3c (for
32bits platforms).
For 64 bits platforms, i_size_seqcount doesnt exist, and the move of a
'long i_version' should not introduce a new hole because of padding.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the following needlessly global functions static:
- nlm_lookup_host()
- nsm_find()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the no longer used lockdep_internal().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.
the compiler can skip truly unused functions just fine:
text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after
[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
smp_call_function_single() needs to be visible in non-SMP builds, to fix:
arch/x86_64/kernel/vsyscall.c:283: warning: implicit declaration of function 'smp_call_function_single'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we are unregistering a kprobe-booster, we can't release its
instruction buffer immediately on the preemptive kernel, because some
processes might be preempted on the buffer. The freeze_processes() and
thaw_processes() functions can clean most of processes up from the buffer.
There are still some non-frozen threads who have the PF_NOFREEZE flag. If
those threads are sleeping (not preempted) at the known place outside the
buffer, we can ensure safety of freeing.
However, the processing of this check routine takes a long time. So, this
patch introduces the garbage collection mechanism of insn_slot. It also
introduces the "dirty" flag to free_insn_slot because of efficiency.
The "clean" instruction slots (dirty flag is cleared) are released
immediately. But the "dirty" slots which are used by boosted kprobes, are
marked as garbages. collect_garbage_slots() will be invoked to release
"dirty" slots if there are more than INSNS_PER_PAGE garbage slots or if
there are no unused slots.
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "bibo,mao" <bibo.mao@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Yumiko Sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi Oshima <soshima@redhat.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define elf_addr_t in linux/elf.h. The size of the type is determined using
ELF_CLASS. This allows us to remove the defines that today are spread all
over .c and .h files.
Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Daniel Jacobowitz <drow@false.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently we allocate 64k space on the user stack and use it the msgbuf for
sys_{msgrcv,msgsnd} for compat and the results are later copied in user [
by copy_in_user]. This patch introduces helper routines for
sys_{msgrcv,msgsnd} as below:
do_msgsnd() : Accepts the mtype and user space ptr to the buffer along with
the msqid and msgflg.
do_msgrcv() : Accepts a kernel space ptr to mtype and a userspace ptr to
the buffer. The mtype has to be copied back the user space msgbuf by the
caller.
These changes avoid the need to allocate the msgsize on the userspace (
thus removing the size limt ) and the overhead of an extra copy_in_user().
Signed-off-by: Suzuki K P <suzuki@in.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The 32 bit implementation of ktime_to_ns returns unsigned value, while the
64 bit version correctly returns an signed value. There is no current user
affected by this, but it has to be fixed, as ktime values can be negative.
Pointed-out-by: Helmut Duregger <Helmut.Duregger@student.uibk.ac.at>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It has no users and it's doubtful that we'll need it again.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Workqueue functions should not leak locks, assert so, printing the
last function ran.
Use macros in lockdep.h to avoid include dependency pains.
[akpm@osdl.org: build fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement prof=sleep profiling. TASK_UNINTERRUPTIBLE sleeps will be taken
as a profile hit, and every millisecond spent sleeping causes a profile-hit
for the call site that initiated the sleep.
Sample readprofile output on i386:
306 ps2_sendbyte 1.3973
432 call_usermodehelper_keys 1.9548
484 ps2_command 0.6453
790 __driver_attach 4.7879
1593 msleep 44.2500
3976 sync_buffer 64.1290
4076 do_lookup 12.4648
8587 sync_page 122.6714
20820 total 0.0067
(NOTE: architectures need to check whether get_wchan() can be called from
deep within the wakeup path.)
akpm: we need to mark more functions __sched. lock_sock(), msleep(), others..
akpm: the contention in do_lookup() is a surprise. Presumably doing disk
reads for directory contents while holding i_mutex.
[akpm@osdl.org: various fixes]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Name some of the remaning 'old_style_spin_init' locks
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is on our "Envoy" boxes which we have, according to the documentation, an
"Exar ST16C554/554D Quad UART with 16-byte Fifo's". The box also has two
other "on-board" serial ports and a modem chip.
The two on-board serial UARTs were being detected along with the first two
Exar UARTs. The last two Exar UARTs were not showing up and neither was the
modem.
This patch was the only way I could the kernel to see beyond the standard four
serial ports and get all four of the Exar UARTs to show up.
[akpm@osdl.org: build fix]
Signed-off-by: Paul B Schroeder <pschroeder@uplogix.com>
Cc: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of the mistakes a module_param() user can make is to supply default
value of module parameter as the last argument. module_param() accepts
permissions instead. If default value is, say, 3 (-------wx), parameter
becomes world-writeable.
So far, the only remedy was to apply grep(1) and read drivers submitted
to -mm. BTDT.
With this patch applied, compiler will finally do some job.
*) bounds checking on permissions
*) world-writeable bit checking on permissions
*) compile breakage if checks trigger
First version of this check (only "& 2" part) directly caught 4 out of 7
places during my last grep.
Subject: Neverending module_param() bugs
[X] drivers/acpi/sbs.c:101:module_param(capacity_mode, int, CAPACITY_UNIT);
[X] drivers/acpi/sbs.c:102:module_param(update_mode, int, UPDATE_MODE);
[ ] drivers/acpi/sbs.c:103:module_param(update_info_mode, int, UPDATE_INFO_MODE);
[ ] drivers/acpi/sbs.c:104:module_param(update_time, int, UPDATE_TIME);
[ ] drivers/acpi/sbs.c:105:module_param(update_time2, int, UPDATE_TIME2);
[X] drivers/char/watchdog/sbc8360.c:203:module_param(timeout, int, 27);
[X] drivers/media/video/tuner-simple.c:13:module_param(offset, int, 0666);
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allocate ->signal->stats on demand in taskstats_exit(), this allows us to
remove taskstats_tgid_alloc() (the last non-trivial inline) from taskstat's
public interface.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
do_exit:
taskstats_exit_alloc()
...
taskstats_exit_send()
taskstats_exit_free()
I think this is not good, let it be a single function exported to the core
kernel, taskstats_exit(), which does alloc + send + free itself.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
probe_kernel_address() purports to be generic, only it forgot to select
KERNEL_DS, so it presently won't work right on all architectures.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for the parallel port (implemented as separate PCI function) on
the Oxford Semiconductor OX16PCI952.
Signed-off-by: Ryan Underwood <nemesis@icequake.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
linux/cdev.h uses struct kobject and other structs and should therefore
include them. Currently, a module either needs to add the missing includes
itself, or, in case a module includes other headers already, needs to put
<linux/cdev.h> last, which goes against a alphabetically-sorted include
list.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a DESTROY operation for block device based filesystems. With the help of
this operation, such a filesystem can flush dirty data to the device
synchronously before the umount returns.
This is needed in situations where the filesystem is assumed to be clean
immediately after unmount (e.g. ejecting removable media).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for the BMAP operation for block device based filesystems. This
is needed to support swap-files and lilo.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a flag to the RELEASE message which specifies that a FLUSH operation
should be performed as well. This interface update is needed for the FreeBSD
port, and doesn't actually touch the Linux implementation at all.
Also rename the unused 'flush_flags' in the FLUSH message to 'unused'.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the signature of i_size_read(), IMINOR() and IMAJOR() because they,
or the functions they call, will never modify the argument.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a driver for the Xilinx uartlite serial controller used in boards with
the PPC405 core in the Xilinx V2P/V4 fpgas.
The hardware is very simple (baudrate/start/stopbits fixed and no break
support). See the datasheet for details:
http://www.xilinx.com/bvdocs/ipcenter/data_sheet/opb_uartlite.pdf
See http://thread.gmane.org/gmane.linux.serial/1237/ for the email thread.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the support for a large number of logical volumes. We will soon have
hardware that support up to 1024 logical volumes.
Signed-off-by: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make it possible to create a workqueue the worker thread of which will be
frozen during suspend, along with other kernel threads.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the loop from thaw_processes() to a separate function and call it
independently for kernel threads and user space processes so that the order
of thawing tasks is clearly visible.
Drop thaw_kernel_threads() which is never used.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify process thawing so that we can thaw kernel space without thawing
userspace, and thaw kernelspace first. This will be useful in later
patches, where I intend to get swsusp thawing kernel threads only before
seeking to free memory.
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg. it requires two normal pages
to be used for saving one highmem page). This may be improved by using
highmem for saving the contents of saveable highmem pages.
Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory. Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data. We use a memory bitmap to mark the allocated
free pages (ie. highmem as well as "normal" image pages).
Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages. Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.
During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames. Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image. While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.
Now, the image data are loaded, if possible, into their "original" page
frames. The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.
One list of PBEs is for the copies of "normal" suspend pages (ie. "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie. by the architecture-dependent parts of swsusp). The
other list of PBEs is for the copies of highmem suspend pages. The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make swsusp use block device offsets instead of swap offsets to identify swap
locations and make it use the same code paths for writing as well as for
reading data.
This allows us to use the same code for handling swap files and swap
partitions and to simplify the code, eg. by dropping rw_swap_page_sync().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:
(1) swap files need not be contiguous,
(2) the header of a swap file is not in the first block of the partition
that holds it. From the swsusp's point of view (1) is not a problem,
because it is already taken care of by the swap-handling code, but (2) has
to be taken into consideration.
In principle the location of a swap file's header may be determined with the
help of appropriate filesystem driver. Unfortunately, however, it requires
the filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during a resume from disk. For this reason we
need some other means by which swap areas can be identified.
For example, to identify a swap area we can use the partition that holds the
area and the offset from the beginning of this partition at which the swap
header is located.
The following patch allows swsusp to identify swap areas this way. It changes
swap_type_of() so that it takes an additional argument representing an offset
of the swap header within the partition represented by its first argument.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make radix tree lookups safe to be performed without locks. Readers are
protected against nodes being deleted by using RCU based freeing. Readers
are protected against new node insertion by using memory barriers to ensure
the node itself will be properly written before it is visible in the radix
tree.
Each radix tree node keeps a record of their height (above leaf nodes).
This height does not change after insertion -- when the radix tree is
extended, higher nodes are only inserted in the top. So a lookup can take
the pointer to what is *now* the root node, and traverse down it even if
the tree is concurrently extended and this node becomes a subtree of a new
root.
"Direct" pointers (tree height of 0, where root->rnode points directly to
the data item) are handled by using the low bit of the pointer to signal
whether rnode is a direct pointer or a pointer to a radix tree node.
When a reader wants to traverse the next branch, they will take a copy of
the pointer. This pointer will be either NULL (and the branch is empty) or
non-NULL (and will point to a valid node).
[akpm@osdl.org: cleanups]
[Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
[clameter@sgi.com: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently we we use the lru head link of the second page of a compound page
to hold its destructor. This was ok when it was purely an internal
implmentation detail. However, hugetlbfs overrides this destructor
violating the layering. Abstract this out as explicit calls, also
introduce a type for the callback function allowing them to be type
checked. For each callback we pre-declare the function, causing a type
error on definition rather than on use elsewhere.
[akpm@osdl.org: cleanups]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_DMA is an alias of GFP_DMA. This is the last one so we
remove the leftover comment too.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_KERNEL is an alias of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_ATOMIC is an alias of GFP_ATOMIC
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_USER is an alias of GFP_USER
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_NOFS is an alias of GFP_NOFS.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_NOIO is an alias of GFP_NOIO with a single instance of use.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_LEVEL_MASK is only used internally to the slab and is
and alias of GFP_LEVEL_MASK.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is only used internally in the slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NUMA node ids are passed as either int or unsigned int almost exclusivly
page_to_nid and zone_to_nid both return unsigned long. This is a throw
back to when page_to_nid was a #define and was thus exposing the real type
of the page flags field.
In addition to fixing up the definitions of page_to_nid and zone_to_nid I
audited the users of these functions identifying the following incorrect
uses:
1) mm/page_alloc.c show_node() -- printk dumping the node id,
2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
against numa_node_id() which returns an int from cpu_to_node(), and
3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
uses bit_set which in generic code takes an int.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove all uses of kmem_cache_t (the most were left in slab.h). The
typedef for kmem_cache_t is then only necessary for other kernel
subsystems. Add a comment to that effect.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The names_cachep is used for getname() and putname(). So lets put it into
fs.h near those two definitions.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fs_cachep is only used in kernel/exit.c and in kernel/fork.c.
It is used to store fs_struct items so it should be placed in linux/fs_struct.h
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
filp_cachep is only used in fs/file_table.c and in fs/dcache.c where
it is defined.
Move it to related definitions in linux/file.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Proper place is in file.h since files_cachep uses are rated to file I/O.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
vm_area_cachep is used to store vm_area_structs. So move to mm.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move sighand_cachep definitioni to linux/signal.h
The sighand cache is only used in fs/exec.c and kernel/fork.c. It is defined
in kernel/fork.c but only used in fs/exec.c.
The sighand_cachep is related to signal processing. So add the definition to
signal.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove bio_cachep from slab.h - it no longer exists.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Node-aware allocation of skbs for the receive path.
Details:
- __alloc_skb gets a new node argument and cals the node-aware
slab functions with it.
- netdev_alloc_skb passed the node number it gets from dev_to_node
to it, everyone else passes -1 (any node)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For node-aware skb allocations we need information about the node in struct
net_device or struct device. Davem suggested to put it into struct device
which this patch does.
In particular:
- struct device gets a new int numa_node member if CONFIG_NUMA is set
- there are two new helpers, dev_to_node and set_dev_node to
transparently deal with the non-numa case
- for pci devices the node-info is set to the value we get from
pcibus_to_node.
Note that for some architectures pcibus_to_node doesn't work yet at the time
we call it currently. This is harmless and will just mean skb allocations
aren't node-local on this architectures until the implementation of
pcibus_to_node on these architectures have been updated (There are patches for
x86 and x86_64 floating around)
[akpm@osdl.org: cleanup]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We have variants of kmalloc and kmem_cache_alloc that leave leak tracking to
the caller. This is used for subsystem-specific allocators like skb_alloc.
To make skb_alloc node-aware we need similar routines for the node-aware slab
allocator, which this patch adds.
Note that the code is rather ugly, but it mirrors the non-node-aware code 1:1:
[akpm@osdl.org: add module export]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make kmap_atomic/kunmap_atomic denote a pagefault disabled scope. All non
trivial implementations already do this anyway.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce pagefault_{disable,enable}() and use these where previously we did
manual preempt increments/decrements to make the pagefault handler do the
atomic thing.
Currently they still rely on the increased preempt count, but do not rely on
the disabled preemption, this might go away in the future.
(NOTE: the extra barrier() in pagefault_disable might fix some holes on
machines which have too many registers for their own good)
[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following up with the work on shared page table done by Dave McCracken. This
set of patch target shared page table for hugetlb memory only.
The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments. In the normal
page case, the amount of memory saved from process' page table is quite
significant. For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.
With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio. One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency. These two effects contribute to higher
application performance.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an arch_alloc_page to match arch_free_page.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The new swap token patches replace the current token traversal algo. The old
algo had a crude timeout parameter that was used to handover the token from
one task to another. This algo, transfers the token to the tasks that are in
need of the token. The urgency for the token is based on the number of times
a task is required to swap-in pages. Accordingly, the priority of a task is
incremented if it has been badly affected due to swap-outs. To ensure that
the token doesnt bounce around rapidly, the token holders are given a priority
boost. The priority of tasks is also decremented, if their rate of swap-in's
keeps reducing. This way, the condition to check whether to pre-empt the swap
token, is a matter of comparing two task's priority fields.
[akpm@osdl.org: cleanups]
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rearrange the struct members in the 'struct zonelist_cache' structure, so
as to put the readonly (once initialized) z_to_n[] array first, where it
will come right after the zones[] array in struct zonelist.
This pretty much eliminates the chance that the two frequently written
elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
times, will end up on the same cache line as the performance sensitive,
frequently read, never (after init) written zones[] array.
Keeping frequently written data off frequently read cache lines is good for
performance.
Thanks to Rohit Seth for the suggestion.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.
Remembers the zones in a zonelist that were short of free memory in the
last second. And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)
Recent changes:
This differs in a significant way from a similar patch that I
posted a week ago. Now, instead of having a nodemask_t of
recently full nodes, I have a bitmask of recently full zones.
This solves a problem that last weeks patch had, which on
systems with multiple zones per node (such as DMA zone) would
take seeing any of these zones full as meaning that all zones
on that node were full.
Also I changed names - from "zonelist faster" to "zonelist cache",
as that seemed to better convey what we're doing here - caching
some of the key zonelist state (for faster access.)
See below for some performance benchmark results. After all that
discussion with David on why I didn't need them, I went and got
some ;). I wanted to verify that I had not hurt the normal case
of memory allocation noticeably. At least for my one little
microbenchmark, I found (1) the normal case wasn't affected, and
(2) workloads that forced scanning across multiple nodes for
memory improved up to 10% fewer System CPU cycles and lower
elapsed clock time ('sys' and 'real'). Good. See details, below.
I didn't have the logic in get_page_from_freelist() for various
full nodes and zone reclaim failures correct. That should be
fixed up now - notice the new goto labels zonelist_scan,
this_zone_full, and try_next_zone, in get_page_from_freelist().
There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:
1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
have seen real customer loads where the cost to scan the zonelist
was a problem, due to many nodes being full of memory before
we got to a node we could use. Or at least, I think we have.
This was related to me by another engineer, based on experiences
from some time past. So this is not guaranteed. Most likely, though.
The following approach should help such real numa systems just as
much as it helps fake numa systems, or any combination thereof.
2) The effort to distinguish fake from real numa, using node_distance,
so that we could cache a fake numa node and optimize choosing
it over equivalent distance fake nodes, while continuing to
properly scan all real nodes in distance order, was going to
require a nasty blob of zonelist and node distance munging.
The following approach has no new dependency on node distances or
zone sorting.
See comment in the patch below for a description of what it actually does.
Technical details of note (or controversy):
- See the use of "zlc_active" and "did_zlc_setup" below, to delay
adding any work for this new mechanism until we've looked at the
first zone in zonelist. I figured the odds of the first zone
having the memory we needed were high enough that we should just
look there, first, then get fancy only if we need to keep looking.
- Some odd hackery was needed to add items to struct zonelist, while
not tripping up the custom zonelists built by the mm/mempolicy.c
code for MPOL_BIND. My usual wordy comments below explain this.
Search for "MPOL_BIND".
- Some per-node data in the struct zonelist is now modified frequently,
with no locking. Multiple CPU cores on a node could hit and mangle
this data. The theory is that this is just performance hint data,
and the memory allocator will work just fine despite any such mangling.
The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
(a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
all be self correcting after at most a one second delay.
- This still does a linear scan of the same lengths as before. All
I've optimized is making the scan faster, not algorithmically
shorter. It is now able to scan a compact array of 'unsigned
short' in the case of many full nodes, so one cache line should
cover quite a few nodes, rather than each node hitting another
one or two new and distinct cache lines.
- If both Andi and Nick don't find this too complicated, I will be
(pleasantly) flabbergasted.
- I removed the comment claiming we only use one cachline's worth of
zonelist. We seem, at least in the fake numa case, to have put the
lie to that claim.
- I pay no attention to the various watermarks and such in this performance
hint. A node could be marked full for one watermark, and then skipped
over when searching for a page using a different watermark. I think
that's actually quite ok, as it will tend to slightly increase the
spreading of memory over other nodes, away from a memory stressed node.
===============
Performance - some benchmark results and analysis:
This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.
Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.) System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.
Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added. The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.
number 2.6.18-mm3 zonelist-cache delta (< 0 good) percent
GBs N ------------ -------------- ---------------- systime
mem threads sys user real sys user real sys user real better
12 1 153 24 177 151 24 176 -2 0 -1 1%
12 19 99 22 8 99 22 8 0 0 0 0%
12 37 111 25 6 112 25 6 1 0 0 -0%
12 55 115 25 5 110 23 5 -5 -2 0 4%
38 1 502 74 576 497 73 570 -5 -1 -6 0%
38 19 426 78 48 373 76 39 -53 -2 -9 12%
38 37 544 83 36 547 82 36 3 -1 0 -0%
38 55 501 77 23 511 80 24 10 3 1 -1%
64 1 917 125 1042 890 124 1014 -27 -1 -28 2%
64 19 1118 138 119 965 141 103 -153 3 -16 13%
64 37 1202 151 94 1136 150 81 -66 -1 -13 5%
64 55 1118 141 61 1072 140 58 -46 -1 -3 4%
90 1 1342 177 1519 1275 174 1450 -67 -3 -69 4%
90 19 2392 199 192 2116 189 176 -276 -10 -16 11%
90 37 3313 238 175 2972 225 145 -341 -13 -30 10%
90 55 1948 210 104 1843 213 100 -105 3 -4 5%
Notes:
1) This test ran a memory hog program that started a specified number N of
threads, and had each thread allocate and touch 1/N'th of
the total memory to be used in the test run in a single loop,
writing a constant word to memory, one store every 4096 bytes.
Watching this test during some earlier trial runs, I would see
each of these threads sit down on one CPU and stay there, for
the remainder of the pass, a different CPU for each thread.
2) The 'real' column is not comparable to the 'sys' or 'user' columns.
The 'real' column is seconds wall clock time elapsed, from beginning
to end of that test pass. The 'sys' and 'user' columns are total
CPU seconds spent on that test pass. For a 19 thread test run,
for example, the sum of 'sys' and 'user' could be up to 19 times the
number of 'real' elapsed wall clock seconds.
3) Tests were run on a fresh, single-user boot, to minimize the amount
of memory already in use at the start of the test, and to minimize
the amount of background activity that might interfere.
4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
5) Notice that the 'real' time gets large for the single thread runs, even
though the measured 'sys' and 'user' times are modest. I'm not sure what
that means - probably something to do with it being slow for one thread to
be accessing memory along ways away. Perhaps the fake numa system, running
ostensibly the same workload, would not show this substantial degradation
of 'real' time for one thread on many nodes -- lets hope not.
6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
ran quite efficiently, as one might expect. Each pair of threads needed
to allocate and touch the memory on the node the two threads shared, a
pleasantly parallizable workload.
7) The intermediate thread count passes, when asking for alot of memory forcing
them to go to a few neighboring nodes, improved the most with this zonelist
caching patch.
Conclusions:
* This zonelist cache patch probably makes little difference one way or the
other for most workloads on real numa hardware, if those workloads avoid
heavy off node allocations.
* For memory intensive workloads requiring substantial off-node allocations
on real numa hardware, this patch improves both kernel and elapsed timings
up to ten per-cent.
* For fake numa systems, I'm optimistic, but will have to leave that up to
Rohit Seth to actually test (once I get him a 2.6.18 backport.)
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zone table is mostly not needed. If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.
In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field. In all of
the above cases there will be no need at all for the zone table.
The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags. In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.
For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table. The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.
[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>