Commit Graph

2920 Commits

Author SHA1 Message Date
Paul Jackson
053199edf5 [PATCH] cpusets: dual semaphore locking overhaul
Overhaul cpuset locking.  Replace single semaphore with two semaphores.

The suggestion to use two locks was made by Roman Zippel.

Both locks are global.  Code that wants to modify cpusets must first
acquire the exclusive manage_sem, which allows them read-only access to
cpusets, and holds off other would-be modifiers.  Before making actual
changes, the second semaphore, callback_sem must be acquired as well.  Code
that needs only to query cpusets must acquire callback_sem, which is also a
global exclusive lock.

The earlier problems with double tripping are avoided, because it is
allowed for holders of manage_sem to nest the second callback_sem lock, and
only callback_sem is needed by code called from within __alloc_pages(),
where the double tripping had been possible.

This is not quite the same as a normal read/write semaphore, because
obtaining read-only access with intent to change must hold off other such
attempts, while allowing read-only access w/o such intention.  Changing
cpusets involves several related checks and changes, which must be done
while allowing read-only queries (to avoid the double trip), but while
ensuring nothing changes (holding off other would be modifiers.)

This overhaul of cpuset locking also makes careful use of task_lock() to
guard access to the task->cpuset pointer, closing a couple of race
conditions noticed while reading this code (thanks, Roman).  I've never
seen these races fail in any use or test.

See further the comments in the code.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Andrew Morton
15d2bace5e [PATCH] add_timer() of a pending timer is illegal
In the recent timer rework we lost the check for an add_timer() of an
already-pending timer.  That check was useful for networking, so put it back.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:21 -08:00
Christoph Hellwig
dfb7dac3af [PATCH] unify sys_ptrace prototype
Make sure we always return, as all syscalls should.  Also move the common
prototype to <linux/syscalls.h>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:20 -08:00
Christoph Hellwig
7024a9b884 [PATCH] adjust parisc sys_ptrace prototype
Make the pid argument a long as on every other arcihtecture.  Despite pid_t
beeing a 32bit type even on 64bit parisc this is not an ABI change due to
the parisc calling conventions.  And even if it did it wouldn't matter too
much because 64bit userspace on parisc is in an embrionic stage.

Acked-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:20 -08:00
Oleg Nesterov
19a4fcb531 [PATCH] kill sigqueue->lock
This lock is used in sigqueue_free(), but it is always equal to
current->sighand->siglock, so we don't need to keep it in the struct
sigqueue.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:19 -08:00
Eric Dumazet
2f51201662 [PATCH] reduce sizeof(struct file)
Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead
in a memory location that is not anymore used at call_rcu(&f->f_rcuhead,
file_free_rcu) time, to reduce the size of this critical kernel object.

The trick I used is to move f_rcuhead and f_list in an union called f_u

The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list
becomes f_u.f_list

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:19 -08:00
Andrew Morton
dfc4f94d2f [PATCH] remove timer debug field
Remove timer_list.magic and associated debugging code.

I originally added this when a spinlock was added to timer_list - this meant
that an all-zeroes timer became illegal and init_timer() was required.

That spinlock isn't even there any more, although timer.base must now be
initialised.

I'll keep this debugging code in -mm.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
Jeff Garzik
d61780c0d3 [PATCH] remove some more check_region stuff
Removed some more references to check_region().

I checked these changes into the 'checkreg' branch of
rsync://rsync.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git

The only valid references remaining are in:
drivers/scsi/advansys.c
drivers/scsi/BusLogic.c
drivers/cdrom/sbpcd.c
sound/oss/pss.c

  Remove last vestiges of ide_check_region()
  drivers/char/specialix: trim trailing whitespace
  drivers/char/specialix: eliminate use of check_region()
  Remove outdated and unused references to check_region()
  [sound oss] remove check_region() usage from cs4232, wavfront
  [netdrvr eepro] trim trailing whitespace
  [netdrvr eepro] remove check_region() usage

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
john stultz
1bb34a4127 [PATCH] NTP shift_right cleanup
Create a macro shift_right() that avoids the numerous ugly conditionals in the
NTP code that look like:

        if(a < 0)
                b = -(-a >> shift);
        else
                b = a >> shift;

Replacing it with:

        b = shift_right(a, shift);

This should have zero effect on the logic, however it should probably have
a bit of testing just to be sure.

Also replace open-coded min/max with the macros.

Signed-off-by : John Stultz <johnstul@us.ibm.com>

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:18 -08:00
Alan Stern
61e1a9ea4b [PATCH] Add kthread_stop_sem()
Enhance the kthread API by adding kthread_stop_sem, for use in stopping
threads that spend their idle time waiting on a semaphore.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Oleg Nesterov
a8db2db1e6 [PATCH] introduce setup_timer() helper
Every user of init_timer() also needs to initialize ->function and ->data
fields.  This patch adds a simple setup_timer() helper for that.

The schedule_timeout() is patched as an example of usage.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Jan Kara
aaa4059bc2 [PATCH] ext3: Fix unmapped buffers in transaction's lists
Fix the problem (BUG 4964) with unmapped buffers in transaction's
t_sync_data list.  The problem is we need to call filesystem's own
invalidatepage() from block_write_full_page().

block_write_full_page() must call filesystem's invalidatepage().  Otherwise
following nasty race can happen:

   proc 1                                        proc 2
   ------                                        ------
- write some new data to 'offset'
  => bh gets to the transactions data list
                                              - starts truncate
                                                => i_size set to new size
- mpage_writepages()
  - ext3_ordered_writepage() to 'offset'
    - block_write_full_page()
      - page->index > end_index+1
        - block_invalidatepage()
          - discard_buffer()
            - clear_buffer_mapped()

- commit triggers and finds unmapped buffer - BOOM!

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:17 -08:00
Heiko Carstens
1e8e338325 [PATCH] s390: export ipl device parameters
Sysfs interface to export ipl device parameters.  Dependent on the ipl type
the interface will look like this:

- ccw ipl:

/sys/firmware/ipl/device
		 /ipl_type

- fcp ipl:

/sys/firmware/ipl/binary_parameter
		 /bootprog
		 /br_lba
		 /device
		 /ipl_type
		 /lun
		 /scp_data
		 /wwpn

- otherwise (unknown that is):

/sys/firmware/ipl/ipl_type

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:16 -08:00
Paolo 'Blaisorblade' Giarrusso
6a351cfead [PATCH] uml: remove old UM_FASTCALL, and make the thing work again
This was used in the old dark age of 2.4, ARCH_CFLAGS doesn't work any more
since some time, and UM_FASTCALL was never used in 2.6.

Instead, reintroduce the thing more properly now, directly in
include/asm-um/linkage.h.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:16 -08:00
Paolo 'Blaisorblade' Giarrusso
96d55b882b [PATCH] uml: reuse i386 cpu-specific tuning
Make UML share the underlying cpu-specific tuning done on i386.

Actually, for now many config options aren't used a lot - but that can be done
later.  Also, UML relies on GCC optimization for things like memcpy and such
more than i386, so specifying the correct -march and -mtune should be enough.
Later, we may want to correct some other stuff.

For instance, since FPU context switching, for us, is done (at least
partially, i.e.  between our kernelspace and userspace) by the host, we may
allow usage of FPU operations by GCC.  This doesn't hold for kernelspace vs.
kernelspace, but we don't support preemption.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:16 -08:00
Hirokazu Takata
c978b0179b [PATCH] m32r: fix #if warnings
Fix warnings for #if directives.

Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:15 -08:00
Shaohua Li
eb9289eb20 [PATCH] introduce .valid callback for pm_ops
Add pm_ops.valid callback, so only the available pm states show in
/sys/power/state.  And this also makes an earlier states error report at
enter_state before we do actual suspend/resume.

Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Acked-by: Pavel Machek<pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:15 -08:00
Rafael J. Wysocki
2c1b4a5ca4 [PATCH] swsusp: rework memory freeing on resume
The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to
mark pages that should be freed in case of an error during resume.

This allows us to simplify the code and to use swsusp_free() in all of the
swsusp's resume error paths, which makes them actually work.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Rafael J. Wysocki
25761b6eb7 [PATCH] swsusp: move snapshot functionality to separate file
The following patch moves the functionality of swsusp related to creating and
handling the snapshot of memory to a separate file, snapshot.c

This should enable us to untangle the code in the future and eventually to
implement some parts of swsusp.c in the user space.

The patch does not change the code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Ashok Raj
ad74557a49 [PATCH] introduce get_cpu_sysdev() to retrieve a sysfs entry for a cpu.
Some modules creating sysfs entries under /sys/devices/system/cpu/cpuX/
need to know the parent sysfs entry to make devices under them.  This will
just return the sysfs entry for a given cpu.

sysfs entries showing under each cpu sysfs can be easily created if such
entries can be created by registering a sysfs driver for cpuclass.  The
issue is when the entry is created the CPU may not be online, hence we
would need to defer the creation until the online notification comes.

Current users: cache entries for Intel CPU's and cpufreq subsystem.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Zwane Mwaikambo <zwane@holomorphy.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:14 -08:00
Ingo Molnar
bda98685b8 [PATCH] x86: inline spin_unlock if !CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPT
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:13 -08:00
Brian Gerst
c531178157 [PATCH] Clean up mtrr compat ioctl code
Handle 32-bit mtrr ioctls in the mtrr driver instead of the ia32
compatability layer.

Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:13 -08:00
Eric W. Biederman
f2b36db692 [PATCH] i386: move apic init in init_IRQs
All kinds of ugliness exists because we don't initialize
the apics during init_IRQs.
- We calibrate jiffies in non apic mode even when we are using apics.
- We have to have special code to initialize the apics when non-smp.
- The legacy i8259 must exist and be setup correctly, even
  when we won't use it past initialization.
- The kexec on panic code must restore the state of the io_apics.
- init/main.c needs a special case for !smp smp_init on x86

In addition to pure code movement I needed a couple
of non-obvious changes:
- Move setup_boot_APIC_clock into APIC_late_time_init for
  simplicity.
- Use cpu_khz to generate a better approximation of loops_per_jiffies
  so I can verify the timer interrupt is working.
- Call setup_apic_nmi_watchdog again after cpu_khz is initialized on
  the boot cpu.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:13 -08:00
Natalie.Protasevich@unisys.com
9338316c93 [PATCH] ES7000 platform update
This is platform code update for ES7000: disables IRQ overrides for the
recent ES7000 (Rascal/Zorro), cleans up the compile warning.  The patch
only affects the ES7000 subarch.

Signed-off-by: <Natalie.Protasevich@unisys.com>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:12 -08:00
Zachary Amsden
251e6912df [PATCH] x86: add an accessor function for getting the per-CPU gdt
Add an accessor function for getting the per-CPU gdt.  Callee must already
have the CPU.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:12 -08:00
Paolo 'Blaisorblade' Giarrusso
ca140fdadb [PATCH] i386: little pgtable.h consolidation vs 2/3level
Join together some common functions (pmd_page{,_kernel}) over 2level and
3level pages.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:12 -08:00
Jan Beulich
8896fab35e [PATCH] x86: cmpxchg improvements
This adjusts i386's cmpxchg patterns so that

- for word and long cmpxchg-es the compiler can utilize all possible
  registers

- cmpxchg8b gets disabled when the minimum specified hardware architectur
  doesn't support it (like was already happening for the byte, word, and
  long ones).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:11 -08:00
James Morris
d381d8a9a0 [PATCH] SELinux: canonicalize getxattr()
This patch allows SELinux to canonicalize the value returned from
getxattr() via the security_inode_getsecurity() hook, which is called after
the fs level getxattr() function.

The purpose of this is to allow the in-core security context for an inode
to override the on-disk value.  This could happen in cases such as
upgrading a system to a different labeling form (e.g.  standard SELinux to
MLS) without needing to do a full relabel of the filesystem.

In such cases, we want getxattr() to return the canonical security context
that the kernel is using rather than what is stored on disk.

The implementation hooks into the inode_getsecurity(), adding another
parameter to indicate the result of the preceding fs-level getxattr() call,
so that SELinux knows whether to compare a value obtained from disk with
the kernel value.

We also now allow getxattr() to work for mountpoint labeled filesystems
(i.e.  mount with option context=foo_t), as we are able to return the
kernel value to the user.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:11 -08:00
Brian Gerst
0d078f6f96 [PATCH] CONFIG_IA32
Add CONFIG_X86_32 for i386.  This allows selecting options that only apply
to 32-bit systems.

(X86 && !X86_64) becomes X86_32
(X86 ||  X86_64) becomes X86

Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:10 -08:00
Jeff Garzik
a7dac447bb [libata] change ata_qc_complete() to take error mask as second arg
The second argument to ata_qc_complete() was being used for two
purposes: communicate the ATA Status register to the completion
function, and indicate an error.  On legacy PCI IDE hardware, the latter
is often implicit in the former.  On more modern hardware, the driver
often completely emulated a Status register value, passing ATA_ERR as an
indication that something went wrong.

Now that previous code changes have eliminated the need to use drv_stat
arg to communicate the ATA Status register value, we can convert it to a
mask of possible error classes.

This will lead to more flexible error handling in the future.
2005-10-30 04:44:42 -05:00
Jeff Garzik
81cfb8864c Merge branch 'master' 2005-10-30 01:56:31 -05:00
Linus Torvalds
9f75e1eff3 Merge master.kernel.org:/pub/scm/linux/kernel/git/herbert/crypto-2.6 2005-10-29 21:48:06 -07:00
Dave Hansen
3947be1969 [PATCH] memory hotplug: sysfs and add/remove functions
This adds generic memory add/remove and supporting functions for memory
hotplug into a new file as well as a memory hotplug kernel config option.

Individual architecture patches will follow.

For now, disable memory hotplug when swsusp is enabled.  There's a lot of
churn there right now.  We'll fix it up properly once it calms down.

Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
bdc8cb9845 [PATCH] memory hotplug locking: zone span seqlock
See the "fixup bad_range()" patch for more information, but this actually
creates a the lock to protect things making assumptions about a zone's size
staying constant at runtime.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
208d54e551 [PATCH] memory hotplug locking: node_size_lock
pgdat->node_size_lock is basically only neeeded in one place in the normal
code: show_mem(), which is the arch-specific sysrq-m printing function.

Strictly speaking, the architectures not doing memory hotplug do no need this
locking in show_mem().  However, they are all included for completeness.  This
should also make any future consolidation of all of the implementations a
little more straightforward.

This lock is also held in the sparsemem code during a memory removal, as
sections are invalidated.  This is the place there pfn_valid() is made false
for a memory area that's being removed.  The lock is only required when doing
pfn_valid() operations on memory which the user does not already have a
reference on the page, such as in show_mem().

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
4ca644d970 [PATCH] memory hotplug prep: __section_nr helper
A little helper that we use in the hotplug code.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Dave Hansen
2774812f41 [PATCH] memory hotplug prep: kill local_mapnr
The following series implements memory hot-add for ppc64 and i386.  There are
x86_64 and ia64 implementations that will be submitted shortly as well,
through the normal maintainers.

This patch:

local_mapnr is unused, except for in an alpha header.  Keep the alpha one,
kill the rest.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:44 -07:00
Hugh Dickins
b8072f099b [PATCH] mm: update comments to pte lock
Updated several references to page_table_lock in common code comments.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins
f412ac08c9 [PATCH] mm: fix rss and mmlist locking
A couple of oddities were guarded by page_table_lock, no longer properly
guarded when that is split.

The mm_counters of file_rss and anon_rss: make those an atomic_t, or an
atomic64_t if the architecture supports it, in such a case.  Definitions by
courtesy of Christoph Lameter: who spent considerable effort on more scalable
ways of counting, but found insufficient benefit in practice.

And adding an mm with swap to the mmlist for swapoff: the list is well-
guarded by its own lock, but the list_empty check now has to be repeated
inside it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins
4c21e2f244 [PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.

This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock.  (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)

In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.

Splitting the lock is not quite for free: another cacheline access.  Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS.  But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.

There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins
92dc6fcc84 [PATCH] mm: parisc pte atomicity
There's a worrying function translation_exists in parisc cacheflush.h,
unaffected by split ptlock since flush_dcache_page is using it on some other
mm, without any relevant lock.  Oh well, make it a slightly more robust by
factoring the pfn check within it.  And it looked liable to confuse a
camouflaged swap or file entry with a good pte: fix that too.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:42 -07:00
Hugh Dickins
deceb6cd17 [PATCH] mm: follow_page with inner ptlock
Final step in pushing down common core's page_table_lock.  follow_page no
longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
and so no page_table_lock is taken in get_user_pages itself.

But get_user_pages (and get_futex_key) do then need follow_page to pin the
page for them: take Daniel's suggestion of bitflags to follow_page.

Need one for WRITE, another for TOUCH (it was the accessed flag before:
vanished along with check_user_page_readable, but surely get_numa_maps is
wrong to mark every page it finds as accessed), another for GET.

And another, ANON to dispose of untouched_anonymous_page: it seems silly for
that to descend a second time, let follow_page observe if there was no page
table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
make_pages_present ought to make readonly anonymous present.

Give get_numa_maps a cond_resched while we're there.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
c34d1b4d16 [PATCH] mm: kill check_user_page_readable
check_user_page_readable is a problematic variant of follow_page.  It's used
only by oprofile's i386 and arm backtrace code, at interrupt time, to
establish whether a userspace stackframe is currently readable.

This is problematic, because we want to push the page_table_lock down inside
follow_page, and later split it; whereas oprofile is doing a spin_trylock on
it (in the i386 case, forgotten in the arm case), and needs that to pin
perhaps two pages spanned by the stackframe (which might be covered by
different locks when we split).

I think oprofile is going about this in the wrong way: it doesn't need to know
the area is readable (neither i386 nor arm uses read protection of user
pages), it doesn't need to pin the memory, it should simply
__copy_from_user_inatomic, and see if that succeeds or not.  Sorry, but I've
not got around to devising the sparse __user annotations for this.

Then we can eliminate check_user_page_readable, and return to a single
follow_page without the __follow_page variants.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
c0718806cf [PATCH] mm: rmap with inner ptlock
rmap's page_check_address descend without page_table_lock.  First just
pte_offset_map in case there's no pte present worth locking for, then take
page_table_lock for the full check, and pass ptl back to caller in the same
style as pte_offset_map_lock.  __xip_unmap, page_referenced_one and
try_to_unmap_one use pte_unmap_unlock.  try_to_unmap_cluster also.

page_check_address reformatted to avoid progressive indentation.  No use is
made of its one error code, return NULL when it fails.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
508034a32b [PATCH] mm: unmap_vmas with inner ptlock
Remove the page_table_lock from around the calls to unmap_vmas, and replace
the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
now safe to descend without page_table_lock.

Don't attempt fancy locking for hugepages, just take page_table_lock in
unmap_hugepage_range.  Which makes zap_hugepage_range, and the hugetlb test in
zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway.  Nor
does unmap_vmas have much use for its mm arg now.

The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
page_table_lock: if they're implemented at all, they typically come down to
flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
(which we already audited for the mprotect case).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:41 -07:00
Hugh Dickins
663b97f7ef [PATCH] mm: flush_tlb_range outside ptlock
There was one small but very significant change in the previous patch:
mprotect's flush_tlb_range fell outside the page_table_lock: as it is in 2.4,
but that doesn't prove it safe in 2.6.

On some architectures flush_tlb_range comes to the same as flush_tlb_mm, which
has always been called from outside page_table_lock in dup_mmap, and is so
proved safe.  Others required a deeper audit: I could find no reliance on
page_table_lock in any; but in ia64 and parisc found some code which looks a
bit as if it might want preemption disabled.  That won't do any actual harm,
so pending a decision from the maintainers, disable preemption there.

Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range and
flush_tlb_page entries in cachetlb.txt: they were rather misleading (what
generic code does is different from what usually happens), the rules are now
changing, and it's not yet clear where we'll end up (will the generic
tlb_flush_mmu happen always under lock?  never under lock?  or sometimes under
and sometimes not?).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
705e87c0c3 [PATCH] mm: pte_offset_map_lock loops
Convert those common loops using page_table_lock on the outside and
pte_offset_map within to use just pte_offset_map_lock within instead.

These all hold mmap_sem (some exclusively, some not), so at no level can a
page table be whipped away from beneath them.  But whereas pte_alloc loops
tested with the "atomic" pmd_present, these loops are testing with pmd_none,
which on i386 PAE tests both lower and upper halves.

That's now unsafe, so add a cast into pmd_none to test only the vital lower
half: we lose a little sensitivity to a corrupt middle directory, but not
enough to worry about.  It appears that i386 and UML were the only
architectures vulnerable in this way, and pgd and pud no problem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
c74df32c72 [PATCH] mm: ptd_alloc take ptlock
Second step in pushing down the page_table_lock.  Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code.  But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate.  It appears that on
all arches we can safely descend without page_table_lock.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
1bb3630e89 [PATCH] mm: ptd_alloc inline and out
It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
add a little to kernel size, change them to macros testing inline, calling
__pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
as fastcalls, leave that to CONFIG_REGPARM or not.

It also seems more natural for the out-of-line functions to leave the offset
calculation and map to the inline, which has to do it anyway for the common
case.  At least mremap move wants __pte_alloc without _map.

Macros rather than inline functions, certainly to avoid the header file issues
which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
architectures I haven't built would have other such problems.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00
Hugh Dickins
872fec16d9 [PATCH] mm: init_mm without ptlock
First step in pushing down the page_table_lock.  init_mm.page_table_lock has
been used throughout the architectures (usually for ioremap): not to serialize
kernel address space allocation (that's usually vmlist_lock), but because
pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.

Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
and drop it when allocating a new one, to check lest a racing task already
did.  Similarly no page_table_lock in vmalloc's map_vm_area.

Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
user mms, which are converted only by a later patch, for now they have to lock
differently according to whether or not it's init_mm.

If sources get muddled, there's a danger that an arch source taking
init_mm.page_table_lock will be mixed with common source also taking it (or
neither take it).  So break the rules and make another change, which should
break the build for such a mismatch: remove the redundant mm arg from
pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).

Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
took page_table_lock for no good reason.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:40 -07:00