Currently the fb_find_nearest_mode() function finds a mode with screen
resolution closest to that described by the 'var' argument and with some
arbitrary refresh rate (eg. in the following sequence of refresh rates: 70 60
53 85 75, 53 is selected).
This patch fixes the function so that it looks for the closest mode as far as
both resolution and refresh rate are concerned. The function's first argument
is changed to fb_videomode so that the refresh rate can be specified by the
caller, as fb_var_screeninfo doesn't have any fields that could directly hold
this data.
Signed-off-by: Michal Januszewski <spock@gentoo.org>
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
According to Jon Smirl, filling in the field fb_cursor with soft_cursor for
drivers that do not support hardware cursors is redundant. The soft_cursor
function is usable by all drivers because it is just a wrapper around
fb_imageblit. And because soft_cursor is an fbcon-specific hook, the file is
moved to the console directory.
Thus, drivers that do not support hardware cursors can leave the fb_cursor
field blank. For drivers that do, they can fill up this field with their own
version.
The end result is a smaller code size. And if the framebuffer console is not
loaded, module/kernel size is also reduced because the soft_cursor module will
also not be loaded.
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a couple of tests which could possibly be confused by extremely
large numbers appearing in 'xdr' packets. I think the closest to an exploit
you could get would be writing random data from a free page into a file - i.e.
leak data out of kernel space.
I'm fairly sure they cannot be used for remote compromise.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide a file in the NFSD filesystem that allows setting and querying of
which version of NFS are being exported. Changes are only allowed while no
server is running.
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Updates the RIO messaging interface to pass a device instance into the
event registeration and callbacks.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Addresses issues raised with the 2.6.12-rc6-mm1 RIO support. Fix dma_mask
init, shrink some code, general cleanup.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add RapidIO core include files.
The core code implements enumeration/discovery, management of
devices/resources, and interfaces for RIO drivers.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reorganize the preempt_disable/enable calls to eliminate the extra preempt
depth. Changes based on Paul McKenney's review suggestions for the kprobes
RCU changeset.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the base kprobes infrastructure to use RCU for synchronization
during kprobe registration and unregistration. These changes coupled with the
arch kprobe changes (next in series):
a. serialize registration and unregistration of kprobes.
b. enable lockless execution of handlers. Handlers can now run in parallel.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the base kprobe infrastructure to track kprobe execution on a
per-cpu basis.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is rather large, but it really can't be done in smaller chunks
easily and I believe it is an important change. This has been out and tested
for a while in the latest IPMI driver release. There are no functional
changes, just changes as necessary to convert the locking over (and a few
minor style updates).
The IPMI driver uses read/write locks to ensure that things exist while they
are in use. This is bad from a number of points of view. This patch removes
the rwlocks and uses refcounts and RCU lists to manage what the locks did.
Signed-off-by: Corey Minyard <minyard@acm.org>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch passes the file handle supplied in iattr to userspace, in case the
->setattr() was invoked from sys_ftruncate(). This solves the permission
checking (or lack thereof) in ftruncate() for the class of filesystems served
by an unprivileged userspace process.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds an atomic create+open operation. This does not yet work if
the file type changes between lookup and create+open, but solves the
permission checking problems for the separte create and open methods.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new access call, which will only be called if ->permission is invoked
from sys_access(). In all other cases permission checking is delayed until
the actual filesystem operation.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Though the following changes are all backward compatible (from the kernel's as
well as the library's POV) change the minor version, so interested
applications can detect new features.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch extends the iattr structure with a file pointer memeber, and adds
an ATTR_FILE validity flag for this member.
This is set if do_truncate() is invoked from ftruncate() or from
do_coredump().
The change is source and binary compatible.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sys_ptrace boilerplate code (everything outside the big switch
statement for the arch-specific requests) is shared by most architectures.
This patch moves it to kernel/ptrace.c and leaves the arch-specific code as
arch_ptrace.
Some architectures have a too different ptrace so we have to exclude them.
They continue to keep their implementations. For sh64 I had to add a
sh64_ptrace wrapper because it does some initialization on the first call.
For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but
SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-By: David Howells <dhowells@redhat.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- "extern inline" -> "static inline"
- every file should #include the headers containing the prototypes for
it's global functions
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AIO was adding a new context's max requests to the global total before
testing if that resulting total was over the global limit. This let
innocent tasks get their new limit tested along with a racing guilty task
that was crossing the limit. This serializes the _nr accounting with a
spinlock It also switches to using unsigned long for the global totals.
Individual contexts are still limited to an unsigned int's worth of
requests by the syscall interface.
The problem and fix were verified with a simple program that spun creating
and destroying a context while holding on to another long lived context.
Before the patch a task creating a tiny context could get a spurious EAGAIN
if it raced with a task creating a very large context that overran the
limit.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reiser4 uses radix trees to solve a trouble reiser4_readdir has serving nfs
requests.
Unfortunately, radix tree api lacks an operation suitable for modifying
existing entry. This patch adds radix_tree_lookup_slot which returns pointer
to found item within the tree. That location can be then updated.
Both Nick and Christoph Lameter have patches which need this as well.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a few comments surrounding the generic readahead API.
Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE
offsets into pagecache.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add SHM_NORESERVE functionality similar to MAP_NORESERVE for shared memory
segments.
This is mainly to avoid abuse of OVERCOMMIT_ALWAYS and this flag is ignored
for OVERCOMMIT_NEVER.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the hlist_for_each_rcu() API, which is used only in one place, and
is trivially converted to hlist_for_each_entry_rcu(), making the code
shorter and more readable. Any out-of-tree uses may be similarly
converted.
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a connector that reports fork, exec, id change, and exit
events for all processes to userspace. It replaces the fork_advisor patch
that ELSA is currently using. Applications that may find these events
useful include accounting/auditing (e.g. ELSA), system activity monitoring
(e.g. top), security, and resource management (e.g. CKRM).
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This extends the API somewhat to allow for platform-specific VCR reading and
writing. Some platforms (like SH4-202) implement the VCR in a split VCRL and
VCRH, but end up being in reverse order or have other quirks that need to be
dealt with, so we add a set of superhyway_ops per-bus to accomodate this.
We also have to extend the per-device resources somewhat, as some devices now
conveniently split control and data blocks. So we allow a platform to
register its set of SuperHyway devices via superhyway_add_devices() with the
control block always ordered as the first resource (as this is the one that
userspace cares about).
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If they get inlined into non __xipram functions we're screwed.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add density mask for better support of DDP chips.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Simple bad block table source and header files
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Simplify the debugging code further.
Update the TODO list
Signed-off-by: Artem B. Bityutskiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The changes introduced allow to suspend/resume NAND flash.
A new state (FL_PM_SUSPENDED) is introduced, as well as
routines for mtd->suspend and mtd->resume to put the flash in
suspended state from software pov.
Signed-off-by: Vitaly Wool <vwool@ru.mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The goal of summary is to speed up the mount time. Erase block summary (EBS)
stores summary information at the end of every (closed) erase block. It is
no longer necessary to scan all nodes separetly (and read all pages of them)
just read this "small" summary, where every information is stored which is
needed at mount time.
This summary information is stored in a JFFS2_FEATURE_RWCOMPAT_DELETE. During
the mount process if there is no summary info the orignal scan process will
be executed. EBS works with NAND and NOR flashes, too.
There is a user space tool called sumtool to generate this summary
information for a JFFS2 image.
Signed-off-by: Ferenc Havasi <havasi@inf.u-szeged.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- Update OMAP OneNAND mapping file using device driver model
- Remove board specific macro and values.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add OneNAND Sync. Burst Read support
Tested with OMAP platform
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
OneNAND is a new flash technology from Samsung with integrated SRAM
buffers and logic interface.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This updates the Primary Vendor-Specific Extended Query parsing to
version 1.4 in order to get the information about the Configurable
Programming Mode regions implemented in the Sibley flash, as well as
selecting the appropriate write command code.
This flash does not behave like traditional NOR flash when writing data.
While mtdblock should just work, further changes are needed for JFFS2 use.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Adds some new PCI IDs for new IPR adapters.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
- Update raid class to use nested classes for raid components (this will
allow us to move to a component control model now)
- Make the raid level an enumeration rather than and int.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Rename functions to a name matching the functionality.
Remove stall debug code
Signed-off-by: Artem B. Bityutskiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
JFFS2 uses f->dents to store the pointer to the symlink target string (in case
the inode is symlink). This is somewhat ugly to use the same field for
different reasons. Introduce distinct field f->target for this purpose.
Note, f->fragtree, f->dents, f->target may probably be put in a union.
Signed-off-by: Artem B. Bityutskiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Include autoconf.h into every kernel compilation via the gcc command line
using -imacros. This ensures that we have the kernel configuration
included from the start, rather than relying on each file having #include
<linux/config.h> as appropriate. History has shown that this is something
which is difficult to get right.
Since we now include the kernel configuration automatically, make
configcheck becomes meaningless, so remove it.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
The offsets of the registers are in a different place, and
some parts cannot handle a full set of modem control signals.
Signed-off-by: Pantelis Antoniou <pantelis@embeddedalley.ocm>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Add 5708 copper and serdes basic support, including 2.5 Gbps support
on 5708 serdes. SPEED_2500 is also added to ethtool.h
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Change netem to support packets getting reordered because of variations in
delay. Introduce a special case version of FIFO that queues packets in order
based on the netem delay.
Since netem is classful, those users that don't want jitter based reordering
can just insert a pfifo instead of the default.
This required changes to generic skbuff code to allow finer grain manipulation
of sk_buff_head. Insertion into the middle and reverse walk.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Re-jig the simple platform device support to allow private data
to be attached to a platform device, as well as allowing the
parent device to be set.
Example usage:
pdev = platform_device_alloc("mydev", id);
if (pdev) {
err = platform_device_add_resources(pdev, &resources,
ARRAY_SIZE(resources));
if (err == 0)
err = platform_device_add_data(pdev, &platform_data,
sizeof(platform_data));
if (err == 0)
err = platform_device_add(pdev);
} else {
err = -ENOMEM;
}
if (err)
platform_device_put(pdev);
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Introduces a new flag TC_RED_HARDDROP which specifies that if ECN
marking is enabled packets should still be dropped once the
average queue length exceeds the maximum threshold.
This _may_ help to avoid global synchronisation during small
bursts of peers advertising but not caring about ECN. Use this
option very carefully, it does more harm than good if
(qth_max - qth_min) does not cover at least two average burst
cycles.
The difference to the current behaviour, in which we'd run into
the hard queue limit, is that due to the low pass filter of RED
short bursts are less likely to cause a global synchronisation.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Adds a new u8 flags in a unused padding area of the netlink
message. Adds ECN marking support to be used instead of dropping
packets immediately.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Removes unnecessary includes, initializers, and simplifies
the code a bit.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Adds a phy_mask field to struct mii_bus and uses it. This field
indicates each phy address to be ignored when probing the mdio bus.
This support is needed for the fs_enet and ibm_emac drivers to be
converted to the generic phy layer among other drivers. Many systems
lock up on probing certain phy addresses or probing doesn't return
0xffff when nothing is found at the address. A new driver I'm
working on also makes use of this mask.
Signed-off-by: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Use ata_pad_{alloc,free} in two drivers, to factor out common code.
Add ata_pad_{alloc,free} to two other drivers, which needed the padding
but had not been updated.
This adds support for the Nvidia Geforce 7800 series of cards to the
nvidiafb framebuffer driver. All it does is add the PCI device id for
the 7800, 7800 GTX, 7800 GO, and 7800 GTX GO cards to the module device
table for the nvidiafb.ko driver, so that nvidiafb.ko will actually work
on these cards.
I also added the relevant PCI device ids to linux/pci_ids.h
I tested it on my 7800 GTX here and it works like a charm. I now can
get framebuffer support on this card! Woo hoo!! Nothing like 200x75 text
mode to make your eyes BLEED. ;)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
RFC 3530 states that for OPEN_DOWNGRADE "The share_access and share_deny
bits specified must be exactly equal to the union of the share_access and
share_deny bits specified for some subset of the OPENs in effect for
current openowner on the current file.
Setattr is currently violating the NFSv4 rules for OPEN_DOWNGRADE in that
it may cause a downgrade from OPEN4_SHARE_ACCESS_BOTH to
OPEN4_SHARE_ACCESS_WRITE despite the fact that there exists no open file
with O_WRONLY access mode.
Fix the problem by replacing nfs4_find_state() with a modified version of
nfs_find_open_context().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not register statically allocated input devices to prevent
OOPS when attaching input interfaces since it requires class
device to be properly initialized.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
This is now used to issue a delayed allocation flush before reporting
quota, which allows the used space quota report to match reality.
Signed-off-by: Nathan Scott <nathans@sgi.com>
Fix up etherdevice docbook comments and make them (and other networking stuff)
get dragged into the kernel-api. Delete the old 8390 stuff, it really isn't
interesting anymore.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Optimize the match for broadcast address by using bit operations instead
of comparison. This saves a number of conditional branches, and generates
smaller code.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
several places in the io path, since we don't have to care for what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).
Signed-off-by: Jens Axboe <axboe@suse.de>
Like ip_tables already has it for some time, this adds support for
having multiple revisions for each match/target. We steal one byte from
the name in order to accomodate a 8 bit version number.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
I recently picked up my older work to remove unnecessary #includes of
sched.h, starting from a patch by Dave Jones to not include sched.h
from module.h. This reduces the number of indirect includes of sched.h
by ~300. Another ~400 pointless direct includes can be removed after
this disentangling (patch to follow later).
However, quite a few indirect includes need to be fixed up for this.
In order to feed the patches through -mm with as little disturbance as
possible, I've split out the fixes I accumulated up to now (complete for
i386 and x86_64, more archs to follow later) and post them before the real
patch. This way this large part of the patch is kept simple with only
adding #includes, and all hunks are independent of each other. So if any
hunk rejects or gets in the way of other patches, just drop it. My scripts
will pick it up again in the next round.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch simplifies some checks for magic siginfo values. It should not
change the behaviour in any way.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Simplify the UP (1 CPU) implementatin of set_cpus_allowed.
The one CPU is hardcoded to be cpu 0 - so just test for that bit, and avoid
having to pick up the cpu_online_map.
Also, unexport cpu_online_map: it was only needed for set_cpus_allowed().
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is a rewrite of the one submitted on October 1st, using modules
(http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2).
This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an
intense torture test of the RCU infratructure. This is needed due to the
continued changes to the RCU infrastructure to accommodate dynamic ticks,
CPU hotplug, realtime, and so on. Most of the code is in a separate file
that is compiled only if the CONFIG variable is set. Documentation on how
to run the test and interpret the output is also included.
This code has been tested on i386 and ppc64, and an earlier version of the
code has received extensive testing on a number of architectures as part of
the PREEMPT_RT patchset.
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sparse complains about every MODULE_PARM used in a module: warning: symbol
'__parm_foo' was not declared. Should it be static?
The fix is to split declaration and initialization. While MODULE_PARM is
obsolete, it's not something sparse should report.
Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Setting ctime is implicit in all setattr cases, so the FATTR_CTIME
definition is unnecessary.
It is used by neither the kernel nor by userspace.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch adds LSM hooks for key management facilities. The notable
changes are:
(1) The key struct now supports a security pointer for the use of security
modules. This will permit key labelling and restrictions on which
programs may access a key.
(2) Security modules get a chance to note (or abort) the allocation of a key.
(3) The key permission checking can now be enhanced by the security modules;
the permissions check consults LSM if all other checks bear out.
(4) The key permissions checking functions now return an error code rather
than a boolean value.
(5) An extra permission has been added to govern the modification of
attributes (UID, GID, permissions).
Note that there isn't an LSM hook specifically for each keyctl() operation,
but rather the permissions hook allows control of individual operations based
on the permission request bits.
Key management access control through LSM is enabled by automatically if both
CONFIG_KEYS and CONFIG_SECURITY are enabled.
This should be applied on top of the patch ensubjected:
[PATCH] Keys: Possessor permissions should be additive
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch automatically updates a tasks NUMA mempolicy when its cpuset
memory placement changes. It does so within the context of the task,
without any need to support low level external mempolicy manipulation.
If a system is not using cpusets, or if running on a system with just the
root (all-encompassing) cpuset, then this remap is a no-op. Only when a
task is moved between cpusets, or a cpusets memory placement is changed
does the following apply. Otherwise, the main routine below,
rebind_policy() is not even called.
When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
essential role of cpusets is to place jobs (several related tasks) on a set
of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
manage a jobs processor placement within its allowed cpuset, and the
essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
memory placement within its allowed cpuset.
However, CPU affinity and NUMA memory placement are managed within the
kernel using absolute system wide numbering, not cpuset relative numbering.
This is ok until a job is migrated to a different cpuset, or what's the
same, a jobs cpuset is moved to different CPUs and Memory Nodes.
Then the CPU affinity and NUMA memory placement of the tasks in the job
need to be updated, to preserve their cpuset-relative position. This can
be done for CPU affinity using sched_setaffinity() from user code, as one
task can modify anothers CPU affinity. This cannot be done from an
external task for NUMA memory placement, as that can only be modified in
the context of the task using it.
However, it easy enough to remap a tasks NUMA mempolicy automatically when
a task is migrated, using the existing cpuset mechanism to trigger a
refresh of a tasks memory placement after its cpuset has changed. All that
is needed is the old and new nodemask, and notice to the task that it needs
to rebind its mempolicy. The tasks mems_allowed has the old mask, the
tasks cpuset has the new mask, and the existing
cpuset_update_current_mems_allowed() mechanism provides the notice. The
bitmap/cpumask/nodemask remap operators provide the cpuset relative
calculations.
This patch leaves open a couple of issues:
1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
These mempolicies may reference nodes outside of those allowed to
the current task by its cpuset. Tasks are migrated as part of jobs,
which reside on what might be several cpusets in a subtree. When such
a job is migrated, all NUMA memory policy references to nodes within
that cpuset subtree should be translated, and references to any nodes
outside that subtree should be left untouched. A future patch will
provide the cpuset mechanism needed to mark such subtrees. With that
patch, we will be able to correctly migrate these other memory policies
across a job migration.
2) Updating cpuset, affinity and memory policies in user space:
This is harder. Any placement state stored in user space using
system-wide numbering will be invalidated across a migration. More
work will be required to provide user code with a migration-safe means
to manage its cpuset relative placement, while preserving the current
API's that pass system wide numbers, not cpuset relative numbers across
the kernel-user boundary.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the forthcoming task migration support, a key calculation will be
mapping cpu and node numbers from the old set to the new set while
preserving cpuset-relative offset.
For example, if a task and its pages on nodes 8-11 are being migrated to
nodes 24-27, then pages on node 9 (the 2nd node in the old set) should be
moved to node 25 (the 2nd node in the new set.)
As with other bitmap operations, the proper way to code this is to provide
the underlying calculation in lib/bitmap.c, and then to provide the usual
cpumask and nodemask wrappers.
This patch provides that. These operations are termed 'remap' operations.
Both remapping a single bit and a set of bits is supported.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Overhaul cpuset locking. Replace single semaphore with two semaphores.
The suggestion to use two locks was made by Roman Zippel.
Both locks are global. Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers. Before making actual
changes, the second semaphore, callback_sem must be acquired as well. Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.
The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.
This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention. Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)
This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman). I've never
seen these races fail in any use or test.
See further the comments in the code.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the recent timer rework we lost the check for an add_timer() of an
already-pending timer. That check was useful for networking, so put it back.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure we always return, as all syscalls should. Also move the common
prototype to <linux/syscalls.h>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This lock is used in sigqueue_free(), but it is always equal to
current->sighand->siglock, so we don't need to keep it in the struct
sigqueue.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead
in a memory location that is not anymore used at call_rcu(&f->f_rcuhead,
file_free_rcu) time, to reduce the size of this critical kernel object.
The trick I used is to move f_rcuhead and f_list in an union called f_u
The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list
becomes f_u.f_list
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove timer_list.magic and associated debugging code.
I originally added this when a spinlock was added to timer_list - this meant
that an all-zeroes timer became illegal and init_timer() was required.
That spinlock isn't even there any more, although timer.base must now be
initialised.
I'll keep this debugging code in -mm.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create a macro shift_right() that avoids the numerous ugly conditionals in the
NTP code that look like:
if(a < 0)
b = -(-a >> shift);
else
b = a >> shift;
Replacing it with:
b = shift_right(a, shift);
This should have zero effect on the logic, however it should probably have
a bit of testing just to be sure.
Also replace open-coded min/max with the macros.
Signed-off-by : John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Enhance the kthread API by adding kthread_stop_sem, for use in stopping
threads that spend their idle time waiting on a semaphore.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Every user of init_timer() also needs to initialize ->function and ->data
fields. This patch adds a simple setup_timer() helper for that.
The schedule_timeout() is patched as an example of usage.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the problem (BUG 4964) with unmapped buffers in transaction's
t_sync_data list. The problem is we need to call filesystem's own
invalidatepage() from block_write_full_page().
block_write_full_page() must call filesystem's invalidatepage(). Otherwise
following nasty race can happen:
proc 1 proc 2
------ ------
- write some new data to 'offset'
=> bh gets to the transactions data list
- starts truncate
=> i_size set to new size
- mpage_writepages()
- ext3_ordered_writepage() to 'offset'
- block_write_full_page()
- page->index > end_index+1
- block_invalidatepage()
- discard_buffer()
- clear_buffer_mapped()
- commit triggers and finds unmapped buffer - BOOM!
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add pm_ops.valid callback, so only the available pm states show in
/sys/power/state. And this also makes an earlier states error report at
enter_state before we do actual suspend/resume.
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Acked-by: Pavel Machek<pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to
mark pages that should be freed in case of an error during resume.
This allows us to simplify the code and to use swsusp_free() in all of the
swsusp's resume error paths, which makes them actually work.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch moves the functionality of swsusp related to creating and
handling the snapshot of memory to a separate file, snapshot.c
This should enable us to untangle the code in the future and eventually to
implement some parts of swsusp.c in the user space.
The patch does not change the code.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some modules creating sysfs entries under /sys/devices/system/cpu/cpuX/
need to know the parent sysfs entry to make devices under them. This will
just return the sysfs entry for a given cpu.
sysfs entries showing under each cpu sysfs can be easily created if such
entries can be created by registering a sysfs driver for cpuclass. The
issue is when the entry is created the CPU may not be online, hence we
would need to defer the creation until the online notification comes.
Current users: cache entries for Intel CPU's and cpufreq subsystem.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Zwane Mwaikambo <zwane@holomorphy.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch allows SELinux to canonicalize the value returned from
getxattr() via the security_inode_getsecurity() hook, which is called after
the fs level getxattr() function.
The purpose of this is to allow the in-core security context for an inode
to override the on-disk value. This could happen in cases such as
upgrading a system to a different labeling form (e.g. standard SELinux to
MLS) without needing to do a full relabel of the filesystem.
In such cases, we want getxattr() to return the canonical security context
that the kernel is using rather than what is stored on disk.
The implementation hooks into the inode_getsecurity(), adding another
parameter to indicate the result of the preceding fs-level getxattr() call,
so that SELinux knows whether to compare a value obtained from disk with
the kernel value.
We also now allow getxattr() to work for mountpoint labeled filesystems
(i.e. mount with option context=foo_t), as we are able to return the
kernel value to the user.
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add CONFIG_X86_32 for i386. This allows selecting options that only apply
to 32-bit systems.
(X86 && !X86_64) becomes X86_32
(X86 || X86_64) becomes X86
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The second argument to ata_qc_complete() was being used for two
purposes: communicate the ATA Status register to the completion
function, and indicate an error. On legacy PCI IDE hardware, the latter
is often implicit in the former. On more modern hardware, the driver
often completely emulated a Status register value, passing ATA_ERR as an
indication that something went wrong.
Now that previous code changes have eliminated the need to use drv_stat
arg to communicate the ATA Status register value, we can convert it to a
mask of possible error classes.
This will lead to more flexible error handling in the future.
This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.
Individual architecture patches will follow.
For now, disable memory hotplug when swsusp is enabled. There's a lot of
churn there right now. We'll fix it up properly once it calms down.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.
Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem(). However, they are all included for completeness. This
should also make any future consolidation of all of the implementations a
little more straightforward.
This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated. This is the place there pfn_valid() is made false
for a memory area that's being removed. The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A little helper that we use in the hotplug code.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Updated several references to page_table_lock in common code comments.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A couple of oddities were guarded by page_table_lock, no longer properly
guarded when that is split.
The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
atomic64_t if the architecture supports it, in such a case. Definitions by
courtesy of Christoph Lameter: who spent considerable effort on more scalable
ways of counting, but found insufficient benefit in practice.
And adding an mm with swap to the mmlist for swapoff: the list is well-
guarded by its own lock, but the list_empty check now has to be repeated
inside it.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Final step in pushing down common core's page_table_lock. follow_page no
longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
and so no page_table_lock is taken in get_user_pages itself.
But get_user_pages (and get_futex_key) do then need follow_page to pin the
page for them: take Daniel's suggestion of bitflags to follow_page.
Need one for WRITE, another for TOUCH (it was the accessed flag before:
vanished along with check_user_page_readable, but surely get_numa_maps is
wrong to mark every page it finds as accessed), another for GET.
And another, ANON to dispose of untouched_anonymous_page: it seems silly for
that to descend a second time, let follow_page observe if there was no page
table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED -
make_pages_present ought to make readonly anonymous present.
Give get_numa_maps a cond_resched while we're there.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
check_user_page_readable is a problematic variant of follow_page. It's used
only by oprofile's i386 and arm backtrace code, at interrupt time, to
establish whether a userspace stackframe is currently readable.
This is problematic, because we want to push the page_table_lock down inside
follow_page, and later split it; whereas oprofile is doing a spin_trylock on
it (in the i386 case, forgotten in the arm case), and needs that to pin
perhaps two pages spanned by the stackframe (which might be covered by
different locks when we split).
I think oprofile is going about this in the wrong way: it doesn't need to know
the area is readable (neither i386 nor arm uses read protection of user
pages), it doesn't need to pin the memory, it should simply
__copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've
not got around to devising the sparse __user annotations for this.
Then we can eliminate check_user_page_readable, and return to a single
follow_page without the __follow_page variants.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
rmap's page_check_address descend without page_table_lock. First just
pte_offset_map in case there's no pte present worth locking for, then take
page_table_lock for the full check, and pass ptl back to caller in the same
style as pte_offset_map_lock. __xip_unmap, page_referenced_one and
try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also.
page_check_address reformatted to avoid progressive indentation. No use is
made of its one error code, return NULL when it fails.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the page_table_lock from around the calls to unmap_vmas, and replace
the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
now safe to descend without page_table_lock.
Don't attempt fancy locking for hugepages, just take page_table_lock in
unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
does unmap_vmas have much use for its mm arg now.
The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
page_table_lock: if they're implemented at all, they typically come down to
flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
(which we already audited for the mprotect case).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Second step in pushing down the page_table_lock. Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.
Convert their callers from common code. But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.
These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate. It appears that on
all arches we can safely descend without page_table_lock.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
add a little to kernel size, change them to macros testing inline, calling
__pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
as fastcalls, leave that to CONFIG_REGPARM or not.
It also seems more natural for the out-of-line functions to leave the offset
calculation and map to the inline, which has to do it anyway for the common
case. At least mremap move wants __pte_alloc without _map.
Macros rather than inline functions, certainly to avoid the header file issues
which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
architectures I haven't built would have other such problems.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
First step in pushing down the page_table_lock. init_mm.page_table_lock has
been used throughout the architectures (usually for ioremap): not to serialize
kernel address space allocation (that's usually vmlist_lock), but because
pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.
Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
and drop it when allocating a new one, to check lest a racing task already
did. Similarly no page_table_lock in vmalloc's map_vm_area.
Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
user mms, which are converted only by a later patch, for now they have to lock
differently according to whether or not it's init_mm.
If sources get muddled, there's a danger that an arch source taking
init_mm.page_table_lock will be mixed with common source also taking it (or
neither take it). So break the rules and make another change, which should
break the build for such a mismatch: remove the redundant mm arg from
pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).
Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
took page_table_lock for no good reason.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ia64 has expand_backing_store function for growing its Register Backing Store
vma upwards. But more complete code for this purpose is found in the
CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide
expand_upwards for ia64 as well as expand_stack for parisc.
The Register Backing Store vma should be marked VM_ACCOUNT. Implement the
intention of growing it only a page at a time, instead of passing an address
outside of the vma to handle_mm_fault, with unknown consequences.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
and total_vm, in the same cacheline on those arches verified.
There are likely to be more profitable rearrangements, but less obvious (is it
good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
from many others?), needing serious instrumentation.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.
Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).
There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.
What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove PageReserved() calls from core code by tightening VM_RESERVED
handling in mm/ to cover PageReserved functionality.
PageReserved special casing is removed from get_page and put_page.
All setting and clearing of PageReserved is retained, and it is now flagged
in the page_alloc checks to help ensure we don't introduce any refcount
based freeing of Reserved pages.
MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
deprecated. We never completely handled it correctly anyway, and is be
reintroduced in future if required (Hugh has a proof of concept).
Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
be trivially removed.
Last real user of PageReserved is swsusp, which uses PageReserved to
determine whether a struct page points to valid memory or not. This still
needs to be addressed (a generic page_is_ram() should work).
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
thus mapcounted and count towards shared rss). These writes to the struct
page could cause excessive cacheline bouncing on big systems. There are a
number of ways this could be addressed if it is an issue.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Refcount bug fix for filemap_xip.c
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I was lazy when we added anon_rss, and chose to change as few places as
possible. So currently each anonymous page has to be counted twice, in rss
and in anon_rss. Which won't be so good if those are atomic counts in some
configurations.
Change that around: keep file_rss and anon_rss separately, and add them
together (with get_mm_rss macro) when the total is needed - reading two
atomics is much cheaper than updating two atomics. And update anon_rss
upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Divide remove_vm_struct into two parts: first anon_vma_unlink plus
unlink_file_vma, to unlink the vma from the list and tree by which rmap or
vmtruncate might find it; then remove_vma to close, fput and free.
The intention here is to do the anon_vma_unlink and unlink_file_vma earlier,
in free_pgtables before freeing any page tables: so we can be sure that any
page tables traversed by rmap and vmtruncate are stable (and other, ordinary
cases are stabilized by holding mmap_sem).
This will be crucial to traversing pgd,pud,pmd without page_table_lock. But
testing the split-out patch showed that lifting the page_table_lock is
symbiotically necessary to make this change - the lock ordering is wrong to
move those unlinks into free_pgtables while it's under ptlock.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The original vm_stat_account has fallen into disuse, with only one user, and
only one user of vm_stat_unaccount. It's easier to keep track if we convert
them all to __vm_stat_account, then free it from its __shackles.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NUMA policy code predated nodemask_t so it used open coded bitmaps.
Convert everything to nodemask_t. Big patch, but shouldn't have any actual
behaviour changes (except I removed one unnecessary check against
node_online_map and one unnecessary BUG_ON)
Signed-off-by: "Andi Kleen" <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add sem_is_read/write_locked functions to the read/write semaphores, along the
same lines of the *_is_locked spinlock functions. The swap token tuning patch
uses sem_is_read_locked; sem_is_write_locked is added for completeness.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds
vmalloc_node(size, node) -> Allocate necessary memory on the specified node
and
get_vm_area_node(size, flags, node)
and the other functions that it depends on.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sg_init_one is a nice tool for the block layer. However, users
of struct scatterlist in other subsystems don't usually need the
DMA attributes. For them it's a waste of time and space to
initialise the whole struct scatterlist structure.
Therefore this patch adds a new function sg_set_buf to initialise
a scatterlist without zeroing the DMA attributes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- converted to platform bus
- removed pci dependencies
- removed virt_to_phys/phys_to_virt calls
System now can root off of a disk.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
diff --git a/Documentation/mips/AU1xxx_IDE.README b/Documentation/mips/AU1xxx_IDE.README
new file mode 100644