Commit Graph

394 Commits

Author SHA1 Message Date
David S. Miller
1339713a32 [SPARC]: Wire up sys_splice() into the syscall tables.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-31 23:03:38 -08:00
Linus Torvalds
78cd9e04e0 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Implement futex_atomic_cmpxchg_inatomic().
2006-03-28 09:25:22 -08:00
Alexey Dobriyan
7f927fcc2f [PATCH] Typo fixes
Fix a lot of typos.  Eyeballed by jmc@ in OpenBSD.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:08 -08:00
David S. Miller
6e57a3a897 [SPARC64]: Implement futex_atomic_cmpxchg_inatomic().
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-28 01:00:08 -08:00
Alan Stern
e041c68341 [PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar
e9056f13bf [PATCH] lightweight robust futexes: arch defaults
This patchset provides a new (written from scratch) implementation of robust
futexes, called "lightweight robust futexes".  We believe this new
implementation is faster and simpler than the vma-based robust futex solutions
presented before, and we'd like this patchset to be adopted in the upstream
kernel.  This is version 1 of the patchset.

  Background
  ----------

What are robust futexes?  To answer that, we first need to understand what
futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having to
enter the kernel.

A futex is in essence a user-space address, e.g.  a 32-bit lock variable
field.  If userspace notices contention (the lock is already owned and someone
else wants to grab it too) then the lock is marked with a value that says
"there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to
wait for the other guy to release it.  The kernel creates a 'futex queue'
internally, so that it can later on match up the waiter with the waker -
without them having to know about each other.  When the owner thread releases
the futex, it notices (via the variable value) that there were waiter(s)
pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up.  Once all
waiters have taken and released the lock, the futex is again back to
'uncontended' state, and there's no in-kernel state associated with it.  The
kernel completely forgets that there ever was a futex at that address.  This
method makes futexes very lightweight and scalable.

"Robustness" is about dealing with crashes while holding a lock: if a process
exits prematurely while holding a pthread_mutex_t lock that is also shared
with some other process (e.g.  yum segfaults while holding a pthread_mutex_t,
or yum is kill -9-ed), then waiters for that lock need to be notified that the
last owner of the lock exited in some irregular way.

To solve such types of problems, "robust mutex" userspace APIs were created:
pthread_mutex_lock() returns an error value if the owner exits prematurely -
and the new owner can decide whether the data protected by the lock can be
recovered safely.

There is a big conceptual problem with futex based mutexes though: it is the
kernel that destroys the owner task (e.g.  due to a SEGFAULT), but the kernel
cannot help with the cleanup: if there is no 'futex queue' (and in most cases
there is none, futexes being fast lightweight locks) then the kernel has no
information to clean up after the held lock!  Userspace has no chance to clean
up after the lock either - userspace is the one that crashes, so it has no
opportunity to clean up.  Catch-22.

In practice, when e.g.  yum is kill -9-ed (or segfaults), a system reboot is
needed to release that futex based lock.  This is one of the leading
bugreports against yum.

To solve this problem, 'Robust Futex' patches were created and presented on
lkml: the one written by Todd Kneisel and David Singleton is the most advanced
at the moment.  These patches all tried to extend the futex abstraction by
registering futex-based locks in the kernel - and thus give the kernel a
chance to clean up.

E.g.  in David Singleton's robust-futex-6.patch, there are 3 new syscall
variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER.
The kernel attaches such robust futexes to vmas (via
vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are
searched to see whether they have a robust_head set.

Lots of work went into the vma-based robust-futex patch, and recently it has
improved significantly, but unfortunately it still has two fundamental
problems left:

 - they have quite complex locking and race scenarios.  The vma-based
   patches had been pending for years, but they are still not completely
   reliable.

 - they have to scan _every_ vma at sys_exit() time, per thread!

The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas every
pthread_exit() takes a millisecond or more, also totally destroying the CPU's
L1 and L2 caches!

This is very much noticeable even for normal process sys_exit_group() calls:
the kernel has to do the vma scanning unconditionally!  (this is because the
kernel has no knowledge about how many robust futexes there are to be cleaned
up, because a robust futex might have been registered in another task, and the
futex variable might have been simply mmap()-ed into this process's address
space).

This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than
that: the overhead makes robust futexes impractical for any type of generic
Linux distribution.

So it became clear to us, something had to be done.  Last week, when Thomas
Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he
found a handful of new races and we were talking about it and were analyzing
the situation.  At that point a fundamentally different solution occured to
me.  This patchset (written in the past couple of days) implements that new
solution.  Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing.  Parental advice
recommended ;-)

  New approach to robust futexes
  ------------------------------

At the heart of this new approach there is a per-thread private list of robust
locks that userspace is holding (maintained by glibc) - which userspace list
is registered with the kernel via a new syscall [this registration happens at
most once per thread lifetime].  At do_exit() time, the kernel checks this
user-space list: are there any robust futex locks to be cleaned up?

In the common case, at do_exit() time, there is no list registered, so the
cost of robust futexes is just a simple current->robust_list != NULL
comparison.  If the thread has registered a list, then normally the list is
empty.  If the thread/process crashed or terminated in some incorrect way then
the list might be non-empty: in this case the kernel carefully walks the list
[not trusting it], and marks all locks that are owned by this thread with the
FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any).

The list is guaranteed to be private and per-thread, so it's lockless.  There
is one race possible though: since adding to and removing from the list is
done after the futex is acquired by glibc, there is a few instructions window
for the thread (or process) to die there, leaving the futex hung.  To protect
against this possibility, userspace (glibc) also maintains a simple per-thread
'list_op_pending' field, to allow the kernel to clean up if the thread dies
after acquiring the lock, but just before it could have added itself to the
list.  Glibc sets this list_op_pending field before it tries to acquire the
futex, and clears it after the list-add (or list-remove) has finished.

That's all that is needed - all the rest of robust-futex cleanup is done in
userspace [just like with the previous patches].

Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes.  (Ulrich plans to commit these
changes to glibc-HEAD later today.)

Key differences of this userspace-list based approach, compared to the vma
based method:

 - it's much, much faster: at thread exit time, there's no need to loop
   over every vma (!), which the VM-based method has to do.  Only a very
   simple 'is the list empty' op is done.

 - no VM changes are needed - 'struct address_space' is left alone.

 - no registration of individual locks is needed: robust mutexes dont need
   any extra per-lock syscalls.  Robust mutexes thus become a very lightweight
   primitive - so they dont force the application designer to do a hard choice
   between performance and robustness - robust mutexes are just as fast.

 - no per-lock kernel allocation happens.

 - no resource limits are needed.

 - no kernel-space recovery call (FUTEX_RECOVER) is needed.

 - the implementation and the locking is "obvious", and there are no
   interactions with the VM.

  Performance
  -----------

I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:

 - with FUTEX_WAIT set [contended mutex]: 130 msecs
 - without FUTEX_WAIT set [uncontended mutex]: 30 msecs

I have also measured an approach where glibc does the lock notification [which
it currently does for !pshared robust mutexes], and that took 256 msecs -
clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do.

(1 million held locks are unheard of - we expect at most a handful of locks to
be held at a time.  Nevertheless it's nice to know that this approach scales
nicely.)

  Implementation details
  ----------------------

The patch adds two new syscalls: one to register the userspace list, and one
to query the registered list pointer:

 asmlinkage long
 sys_set_robust_list(struct robust_list_head __user *head,
                     size_t len);

 asmlinkage long
 sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                     size_t __user *len_ptr);

List registration is very fast: the pointer is simply stored in
current->robust_list.  [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head for new
threads, without the need of another syscall.]

So there is virtually zero overhead for tasks not using robust futexes, and
even for robust futex users, there is only one extra syscall per thread
lifetime, and the cleanup operation, if it happens, is fast and
straightforward.  The kernel doesnt have any internal distinction between
robust and normal futexes.

If a futex is found to be held at exit time, the kernel sets the highest bit
of the futex word:

	#define FUTEX_OWNER_DIED        0x40000000

and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.

Otherwise, robust futexes are acquired by glibc by putting the TID into the
futex field atomically.  Waiters set the FUTEX_WAITERS bit:

	#define FUTEX_WAITERS           0x80000000

and the remaining bits are for the TID.

  Testing, architecture support
  -----------------------------

I've tested the new syscalls on x86 and x86_64, and have made sure the parsing
of the userspace list is robust [ ;-) ] even if the list is deliberately
corrupted.

i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the
new glibc code (on x86_64 and i386), and it works for his robust-mutex
testcases.

All other architectures should build just fine too - but they wont have the
new syscalls yet.

Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline
function before writing up the syscalls (that function returns -ENOSYS right
now).

This patch:

Add placeholder futex_atomic_cmpxchg_inuser() implementations to every
architecture that supports futexes.  It returns -ENOSYS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
KAMEZAWA Hiroyuki
a117e66ed4 [PATCH] unify pfn_to_page: generic functions
There are 3 memory models, FLATMEM, DISCONTIGMEM, SPARSEMEM.
Each arch has its own page_to_pfn(), pfn_to_page() for each models.
But most of them can use the same arithmetic.

This patch adds asm-generic/memory_model.h, which includes generic
page_to_pfn(), pfn_to_page() definitions for each memory model.

When CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y, out-of-line functions are
used instead of macro. This is enabled by some archs and  reduces
text size.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:44 -08:00
Akinobu Mita
2d78d4beb6 [PATCH] bitops: sparc64: use generic bitops
- remove __{,test_and_}{set,clear,change}_bit() and test_bit()
- remove ffz()
- remove __ffs()
- remove generic_fls()
- remove generic_fls64()
- remove sched_find_first_bit()
- remove ffs()

- unless defined(ULTRA_HAS_POPULATION_COUNT)

  - remove generic_hweight{64,32,16,8}()

- remove find_{next,first}{,_zero}_bit()
- remove ext2_{set,clear,test,find_first_zero,find_next_zero}_bit()
- remove minix_{test,set,test_and_clear,test,find_first_zero}_bit()

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:14 -08:00
Akinobu Mita
67b0ad574b [PATCH] bitops: use non atomic operations for minix_*_bit() and ext2_*_bit()
Bitmap functions for the minix filesystem and the ext2 filesystem except
ext2_set_bit_atomic() and ext2_clear_bit_atomic() do not require the atomic
guarantees.

But these are defined by using atomic bit operations on several architectures.
 (cris, frv, h8300, ia64, m32r, m68k, m68knommu, mips, s390, sh, sh64, sparc,
sparc64, v850, and xtensa)

This patch switches to non atomic bit operation.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:10 -08:00
Davide Libenzi
f348d70a32 [PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications
Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense.  The same thing was discussed and conceptually
agreed quite some time ago:

http://lkml.org/lkml/2003/7/12/116

Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it.  As far
as the existing POLLHUP handling, the patch leaves it as is.  The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files.  The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.

There is "a stupid program" to test POLLRDHUP delivery here:

 http://www.xmailserver.org/pollrdhup-test.c

It tests poll(2), but since the delivery is same epoll(2) will work equally.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:56 -08:00
Andrew Morton
394e3902c5 [PATCH] more for_each_cpu() conversions
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all.  The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().

This patch is a kernel-wide sweep of all instances of NR_CPUS.  I found very
few instances of this bug, if any.  But the patch converts lots of open-coded
test to use the preferred helper macros.

Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:17 -08:00
Nick Piggin
0b2fcfdb8b [PATCH] atomic: add_unless cmpxchg optimise
Without branch hints, the very unlikely chance of the loop repeating due to
cmpxchg failure is unrolled with gcc-4 that I have tested.

Improve this for architectures with a native cas/cmpxchg.  llsc archs
should try to implement this natively.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:17 -08:00
Kyle McMartin
804f1594cc [PATCH] Move read_mostly definition to asm/cache.h
Seems like needless clutter having a bunch of #if defined(CONFIG_$ARCH) in
include/linux/cache.h.  Move the per architecture section definition to
asm/cache.h, and keep the if-not-defined dummy case in linux/cache.h to
catch architectures which don't implement the section.

Verified that symbols still go in .data.read_mostly on parisc,
and the compile doesn't break.

Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
David S. Miller
dcc1e8dd88 [SPARC64]: Add a secondary TSB for hugepage mappings.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-22 01:15:14 -08:00
David S. Miller
f6b83f070e [SPARC64]: Fix 2 bugs in huge page support.
1) huge_pte_offset() did not check the page table hierarchy
   elements as being empty correctly, resulting in an OOPS

2) Need platform specific hugetlb_get_unmapped_area() to handle
   the top-down vs. bottom-up address space allocation strategies.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:17:17 -08:00
David S. Miller
bb8646d834 [SPARC64]: Optimized TSB table initialization.
We only need to write an invalid tag every 16 bytes,
so taking advantage of this can save many instructions
compared to the simple memset() call we make now.

A prefetching implementation is implemented for sun4u
and a block-init store version if implemented for Niagara.

The next trick is to be able to perform an init and
a copy_tsb() in parallel when growing a TSB table.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:16:41 -08:00
David S. Miller
d61e16df94 [SPARC64]: Increase top of 32-bit process stack.
Put it one page below the top of the 32-bit address space.
This gives us ~16MB more address space to work with.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:16:36 -08:00
David S. Miller
a91690ddd0 [SPARC64]: Top-down address space allocation for 32-bit tasks.
Currently allocations are very constrained for 32-bit processes.
It grows down-up from 0x70000000 to 0xf0000000 which gives about
2GB of stack + dynamic mmap() space.

So support the top-down method, and we need to override the
generic helper function in order to deal with D-cache coloring.

With these changes I was able to squeeze out a mmap() just over
3.6GB in size in a 32-bit process.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:16:35 -08:00
David S. Miller
7a1ac52641 [SPARC64]: Fix and re-enable dynamic TSB sizing.
This is good for up to %50 performance improvement of some test cases.
The problem has been the race conditions, and hopefully I've plugged
them all up here.

1) There was a serious race in switch_mm() wrt. lazy TLB
   switching to and from kernel threads.

   We could erroneously skip a tsb_context_switch() and thus
   use a stale TSB across a TSB grow event.

   There is a big comment now in that function describing
   exactly how it can happen.

2) All code paths that do something with the TSB need to be
   guarded with the mm->context.lock spinlock.  This makes
   page table flushing paths properly synchronize with both
   TSB growing and TLB context changes.

3) TSB growing events are moved to the end of successful fault
   processing.  Previously it was in update_mmu_cache() but
   that is deadlock prone.  At the end of do_sparc64_fault()
   we hold no spinlocks that could deadlock the TSB grow
   sequence.  We also have dropped the address space semaphore.

While we're here, add prefetching to the copy_tsb() routine
and put it in assembler into the tsb.S file.  This piece of
code is quite time critical.

There are some small negative side effects to this code which
can be improved upon.  In particular we grab the mm->context.lock
even for the tsb insert done by update_mmu_cache() now and that's
a bit excessive.  We can get rid of that locking, and the same
lock taking in flush_tsb_user(), by disabling PSTATE_IE around
the whole operation including the capturing of the tsb pointer
and tsb_nentries value.  That would work because anyone growing
the TSB won't free up the old TSB until all cpus respond to the
TSB change cross call.

I'm not quite so confident in that optimization to put it in
right now, but eventually we might be able to and the description
is here for reference.

This code seems very solid now.  It passes several parallel GCC
bootstrap builds, and our favorite "nut cruncher" stress test which is
a full "make -j8192" build of a "make allmodconfig" kernel.  That puts
about 256 processes on each cpu's run queue, makes lots of process cpu
migrations occur, causes lots of page table and TLB flushing activity,
incurs many context version number changes, and it swaps the machine
real far out to disk even though there is 16GB of ram on this test
system. :-)

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:16:33 -08:00
David S. Miller
90a6646bf6 [SPARC64]: Fix system type in /proc/cpuinfo and remove bogus OBP check.
Report 'sun4v' when appropriate in /proc/cpuinfo

Remove all the verifications of the OBP version string.  Just
make sure it's there, and report it raw in the bootup logs and
via /proc/cpuinfo.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:25 -08:00
David S. Miller
8935dced54 [SPARC64]: Add SMT scheduling support for Niagara.
The mapping is a simple "(cpuid >> 2) == core" for now.
Later we'll add more sophisticated code that will walk
the sun4v machine description and figure this out from
there.

We should also add core mappings for jaguar and panther
processors.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:24 -08:00
David S. Miller
d1112018b4 [SPARC64]: Move over to sparsemem.
This has been pending for a long time, and the fact
that we waste a ton of ram on some configurations
kind of pushed things over the edge.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:22 -08:00
David S. Miller
ee29074d3b [SPARC64]: Fix new context version SMP handling.
Don't piggy back the SMP receive signal code to do the
context version change handling.

Instead allocate another fixed PIL number for this
asynchronous cross-call.  We can't use smp_call_function()
because this thing is invoked with interrupts disabled
and a few spinlocks held.

Also, fix smp_call_function_mask() to count "cpus" correctly.
There is no guarentee that the local cpu is in the mask
yet that is exactly what this code was assuming.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:21 -08:00
David S. Miller
a77754b4d0 [SPARC64]: Bulletproof MMU context locking.
1) Always spin_lock_init() in init_context().  The caller essentially
   clears it out, or copies the mm info from the parent.  In both
   cases we need to explicitly initialize the spinlock.

2) Always do explicit IRQ disabling while taking mm->context.lock
   and ctx_alloc_lock.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:20 -08:00
David S. Miller
8bcd174116 [SPARC64]: Do not allow mapping pages within 4GB of 64-bit VA hole.
The UltraSPARC T1 manual recommends this because the chip
could instruction prefetch into the VA hole, and this would
also make decoding  certain kinds of memory access traps
more difficult (because the chip sign extends certain pieces
of trap state).

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:14 -08:00
David S. Miller
e22990451a [SPARC64]: Kill bogus function externs in asm/pgtable.h
These are all implemented inline earlier in the file.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:11 -08:00
David S. Miller
b830ab665a [SPARC64]: Fix bugs in SUN4V cpu mondo dispatch.
There were several bugs in the SUN4V cpu mondo dispatch code.

In fact, if we ever got a EWOULDBLOCK or other error from
the hypervisor call, we'd potentially send a cpu mondo multiple
times to the same cpu and even worse we could loop until the
timeout resending the same mondo over and over to such cpus.

So let's bulletproof this thing as follows:

1) Implement cpu_mondo_send() and cpu_state() hypervisor calls
   in arch/sparc64/kernel/entry.S, add prototypes to asm/hypervisor.h

2) Don't build and update the cpulist using inline functions, this
   was causing the cpu mask to not get updated in the caller.

3) Disable interrupts during the entire mondo send, otherwise our
   cpu list and/or mondo block could get overwritten if we take
   an interrupt and do a cpu mondo send on the current cpu.

4) Check for all possible error return types from the cpu_mondo_send()
   hypervisor call.  In particular:

   HV_EOK) Our work is done, all cpus have received the mondo.
   HV_CPUERROR) One or more of the cpus in the cpu list we passed
                to the hypervisor are in error state.  Use cpu_state()
                calls over the entries in the cpu list to see which
		ones.  Record them in "error_mask" and report this
		after we are done sending the mondo to cpus which are
		not in error state.
   HV_EWOULDBLOCK) We need to keep trying.

   Any other error we consider fatal, we report the event and exit
   immediately.

5) We only timeout if forward progress is not made.  Forward progress
   is defined as having at least one cpu get the mondo successfully
   in a given cpu_mondo_send() call.  Otherwise we bump a counter
   and delay a little.  If the counter hits a limit, we signal an
   error and report the event.

Also, smp_call_function_mask() error handling reports the number
of cpus incorrectly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:09 -08:00
David S. Miller
97c4b6f95a [SPARC64]: Use 13-bit context size always.
We no longer have the problems that require using the smaller
sizes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:06 -08:00
David S. Miller
3634476239 [SPARC64]: Niagara optimized XOR functions for RAID.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:03 -08:00
David S. Miller
a0663a79ad [SPARC64]: Fix TLB context allocation with SMT style shared TLBs.
The context allocation scheme we use depends upon there being a 1<-->1
mapping from cpu to physical TLB for correctness.  Chips like Niagara
break this assumption.

So what we do is notify all cpus with a cross call when the context
version number changes, and if necessary this makes them allocate
a valid context for the address space they are running at the time.

Stress tested with make -j1024, make -j2048, and make -j4096 kernel
builds on a 32-strand, 8 core, T2000 with 16GB of ram.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:14:00 -08:00
David S. Miller
0f05da6d57 [SPARC64]: Fix %tstate ASI handling in start_thread{,32}()
Niagara helps us find a ancient bug in the sparc64 port :-)

The ASI_* values are plain constant defines, thus signed 32-bit
on sparc64.  To put shift this into the regs->tstate value we were
doing or'ing "(ASI_PNF << 24)" into there.

ASI_PNF is 0x82 and shifted left by 24 makes that topmost bit the
sign bit in a 32-bit value.  This would get sign extended to 64-bits
and thus corrupt the top-half of the reg->tstate value.

This never caused problems in pre-Niagara cpus because the only thing
up there were the condition code values.  But Niagara has the global
register level field, and this all 1's value is illegal there so
Niagara gives an illegal instruction trap due to this bug.

I'm pretty sure this bug is about as old as the sparc64 port itself.

This also points out that we weren't setting ASI_PNF for 32-bit tasks.
We should, so fix that while we're here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:57 -08:00
David S. Miller
d7744a0950 [SPARC64]: Create a seperate kernel TSB for 4MB/256MB mappings.
It can map all of the linear kernel mappings with zero TSB hash
conflicts for systems with 16GB or less ram.  In such cases, on
SUN4V, once we load up this TSB the first time with all the
mappings, we never take a linear kernel mapping TLB miss ever
again, the hypervisor handles them all.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:56 -08:00
David S. Miller
6f5374c91f [SPARC64]: Add sun4v_cpu_yield().
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:52 -08:00
David S. Miller
1bd0cd74d1 [SPARC64]: Kill cpudata->idle_volume.
Set, but never used.

We used to use this for dynamic IRQ retargetting, but that
code died a long time ago.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:51 -08:00
David S. Miller
0f15952ac8 [SPARC64]: Export a PAGE_SHARED symbol.
For drivers/media/*, noticed by Fabbione.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:36 -08:00
Fabio M. Di Nitto
f6c1fe5292 [SPARC64] Fix build if CONFIG_HUGETLB_PAGE is not set
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:35 -08:00
David S. Miller
8b23427441 [SPARC64]: More TLB/TSB handling fixes.
The SUN4V convention with non-shared TSBs is that the context
bit of the TAG is clear.  So we have to choose an "invalid"
bit and initialize new TSBs appropriately.  Otherwise a zero
TAG looks "valid".

Make sure, for the window fixup cases, that we use the right
global registers and that we don't potentially trample on
the live global registers in etrap/rtrap handling (%g2 and
%g6) and that we put the missing virtual address properly
in %g5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:34 -08:00
David S. Miller
3763be32d5 [SPARC64]: Define ARCH_HAS_READ_CURRENT_TIMER.
This gives more consistent bogomips and delay() semantics,
especially on sun4v.  It gives weird looking values though...

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:29 -08:00
David S. Miller
c857e3fdbc [SPARC64]: __bzero_noasi --> __clear_user
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:28 -08:00
David S. Miller
97532f5982 [SPARC64]: Add HWCAP_SPARC_BLKINIT elf capability flag for Niagara.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:26 -08:00
David S. Miller
ebd8c56c5a [SPARC64]: Fix uniprocessor IRQ targetting on SUN4V.
We need to use the real hardware processor ID when
targetting interrupts, not the "define to 0" thing
the uniprocessor build gives us.

Also, fill in the Node-ID and Agent-ID fields properly
on sun4u/Safari.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:24 -08:00
David S. Miller
72aff53f1f [SPARC64]: Get SUN4V SMP working.
The sibling cpu bringup is extremely fragile.  We can only
perform the most basic calls until we take over the trap
table from the firmware/hypervisor on the new cpu.

This means no accesses to %g4, %g5, %g6 since those can't be
TLB translated without our trap handlers.

In order to achieve this:

1) Change sun4v_init_mondo_queues() so that it can operate in
   several modes.

   It can allocate the queues, or install them in the current
   processor, or both.

   The boot cpu does both in it's call early on.

   Later, the boot cpu allocates the sibling cpu queue, starts
   the sibling cpu, then the sibling cpu loads them in.

2) init_cur_cpu_trap() is changed to take the current_thread_info()
   as an argument instead of reading %g6 directly on the current
   cpu.

3) Create a trampoline stack for the sibling cpus.  We do our basic
   kernel calls using this stack, which is locked into the kernel
   image, then go to our proper thread stack after taking over the
   trap table.

4) While we are in this delicate startup state, we put 0xdeadbeef
   into %g4/%g5/%g6 in order to catch accidental accesses.

5) On the final prom_set_trap_table*() call, we put &init_thread_union
   into %g6.  This is a hack to make prom_world(0) work.  All that
   wants to do is restore the %asi register using
   get_thread_current_ds().

Longer term we should just do the OBP calls to set the trap table by
hand just like we do for everything else.  This would avoid that silly
prom_world(0) issue, then we can remove the init_thread_union hack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:22 -08:00
David S. Miller
4ff7ac417d [SPARC64]: Add GET_GL_GLOBAL() macro for SUN4V.
So we can read the %gl register for debugging.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:18 -08:00
David S. Miller
94f8762db9 [SPARC64]: Add sun4v_cpu_qconf() hypervisor call.
Call it from register_one_mondo().

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:16 -08:00
David S. Miller
bc45e32e0f [SPARC]: Kill off these __put_user_ret things.
They are bogus and haven't been referenced in years.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:15 -08:00
David S. Miller
9d29a3fafd [SPARC64]: Decode virtual-devices interrupts correctly.
Need to translate through the interrupt-map{,-mask] properties.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:05 -08:00
David S. Miller
7890f794e0 [SPARC64]: Add prom_{start,stop}cpu_cpuid().
Use prom_startcpu_cpuid() on SUN4V instead of prom_startcpu().

We should really test for "SUNW,start-cpu-by-cpuid" presence
and use it if present even on SUN4U.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:04 -08:00
David S. Miller
7c3514e450 [SPARC64]: Fixup TSTATE layout diagram in asm/pstate.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:13:02 -08:00
David S. Miller
50f4f23c3b [SPARC64]: Fix gcc-3.3.x warnings.
It doesn't like const variables being passed into
"i" constraing asm operations.  It's a bug, but
there is nothing we can really do but work around
it.

Based upon a report from Andrew Morton.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:53 -08:00
David S. Miller
c4bea28839 [SPARC64]: Make error codes available from sun4v_intr_get*().
And check for errors at call sites.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:51 -08:00
David S. Miller
5259d5bfaf [SPARC64]: Fix comment typo in asm/hypervisor.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:46 -08:00
David S. Miller
e77227eb4e [SPARC64]: Probe virtual-devices root node on sun4v.
This is where we learn how to get the interrupts
for things like the hypervisor console device.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:44 -08:00
David S. Miller
e3999574b4 [SPARC64]: Generic sun4v_build_irq().
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:40 -08:00
David S. Miller
6c0f402f6c [SPARC64]: Implement rest of generic interrupt hypervisor calls.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:37 -08:00
David S. Miller
85dfa19ba9 [SPARC64]: Move devino_to_sysino out of pci_sun4v_asm.S
It is not PCI specific, it is for all system interrupts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:36 -08:00
David S. Miller
cf627156c4 [SPARC64]: Use inline patching for critical PTE operations.
This handles the SUN4U vs SUN4V PTE layout differences
with near zero performance cost.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:32 -08:00
David S. Miller
ff02e0d26f [SPARC64]: Move PTE field definitions back into asm/pgtable.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:31 -08:00
David S. Miller
1a7a242c89 [SPARC64]: Recognize "virtual-console" as input and output console device.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:28 -08:00
David S. Miller
c4bce90ea2 [SPARC64]: Deal with PTE layout differences in SUN4V.
Yes, you heard it right, they changed the PTE layout for
SUN4V.  Ho hum...

This is the simple and inefficient way to support this.
It'll get optimized, don't worry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:25 -08:00
David S. Miller
490384e752 [SPARC64]: Register kernel TSB with hypervisor.
We do this right after we take over the trap table from OBP.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:23 -08:00
David S. Miller
459b6e621e [SPARC64]: Fix some SUN4V TLB miss bugs.
Code patching did not sign extend negative branch
offsets correctly.

Kernel TLB miss path needs patching and %g4 register
preservation in order to handle SUN4V correctly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:23 -08:00
David S. Miller
12eaa328f9 [SPARC64]: Use ASI_SCRATCHPAD address 0x0 properly.
This is where the virtual address of the fault status
area belongs.

To set it up we don't make a hypervisor call, instead
we call OBP's SUNW,set-trap-table with the real address
of the fault status area as the second argument.  And
right before that call we write the virtual address into
ASI_SCRATCHPAD vaddr 0x0.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:15 -08:00
David S. Miller
dedacf6232 [SPARC64]: Add HV_PCI_TSBID() macro.
For constructing hypervisor PCI TSB IDs.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:13 -08:00
David S. Miller
bade562216 [SPARC64]: More SUN4V PCI controller work.
Add assembler file for PCI hypervisor calls.
Setup basic skeleton of SUN4V PCI controller driver.

Add 32-bit devhandle to PBM struct, as this is needed for
hypervisor calls.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:11 -08:00
David S. Miller
8f6a93a196 [SPARC64]: Beginnings of SUN4V PCI controller support.
Abstract out IOMMU operations so that we can have a different
set of calls on sun4v, which needs to do things through
hypervisor calls.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:10 -08:00
David S. Miller
5fe91cf625 [SPARC]: Clean up idprom header files.
Delete unused macros, and use fixed sized types in
sparc32 header.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:08 -08:00
David S. Miller
618e9ed98a [SPARC64]: Hypervisor TSB context switching.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:06 -08:00
David S. Miller
aa9143b971 [SPARC64]: Implement sun4v TSB miss handlers.
When we register a TSB with the hypervisor, so that it or hardware can
handle TLB misses and do the TSB walk for us, the hypervisor traps
down to these trap when it incurs a TSB miss.

Processing is simple, we load the missing virtual address and context,
and do a full page table walk.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:05 -08:00
David S. Miller
d82ace7dc4 [SPARC64]: Detect sun4v early in boot process.
We look for "SUNW,sun4v" in the 'compatible' property
of the root OBP device tree node.

Protect every %ver register access, to make sure it is
not touched on sun4v, as %ver is hyperprivileged there.

Lock kernel TLB entries using hypervisor calls instead of
calls into OBP.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:03 -08:00
David S. Miller
1d2f1f90a1 [SPARC64]: Sun4v cross-call sending support.
Technically the hypervisor call supports sending in a list
of all cpus to get the cross-call, but I only pass in one
cpu at a time for now.

The multi-cpu support is there, just ifdef'd out so it's easy to
enable or delete it later.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:02 -08:00
David S. Miller
5b0c0572fc [SPARC64]: Sun4v interrupt handling.
Sun4v has 4 interrupt queues: cpu, device, resumable errors,
and non-resumable errors.  A set of head/tail offset pointers
help maintain a work queue in physical memory.  The entries
are 64-bytes in size.

Each queue is allocated then registered with the hypervisor
as we bring cpus up.

The two error queues each get a kernel side buffer that we
use to quickly empty the main interrupt queue before we
call up to C code to log the event and possibly take evasive
action.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:12:01 -08:00
David S. Miller
7202c55c5c [SPARC64]: Add sun4v mondo queue bases to struct trap_per_cpu.
Also, correct TRAP_PER_CPU_FAULT_INFO define, it should be
0x40 not 0x20.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:58 -08:00
David S. Miller
3bfd6f3e77 [SPARC64]: Fix some comment typos in asm/hypervisor.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:57 -08:00
David S. Miller
8b11bd12af [SPARC64]: Patch up mmu context register writes for sun4v.
sun4v uses ASI_MMU instead of ASI_DMMU

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:56 -08:00
David S. Miller
481295f982 [SPARC64]: Register per-cpu fault status area with sun4v hypervisor.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:55 -08:00
David S. Miller
89a5264f06 [SPARC64]: asm/cpudata.h needs asm/asi.h
For the expansion of __GET_CPUID() on SMP.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:55 -08:00
David S. Miller
df7d6aec96 [SPARC64]: Rename gl_{1,2}insn_patch --> sun4v_{1,2}insn_patch
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:53 -08:00
David S. Miller
d257d5da39 [SPARC64]: Initial sun4v TLB miss handling infrastructure.
Things are a little tricky because, unlike sun4u, we have
to:

1) do a hypervisor trap to do the TLB load.
2) do the TSB lookup calculations by hand

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:52 -08:00
David S. Miller
45fec05f80 [SPARC64]: Sanitize %pstate writes for sun4v.
If we're just switching between different alternate global
sets, nop it out on sun4v.  Also, get rid of all of the
alternate global save/restore in the OBP CIF trampoline code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:50 -08:00
David S. Miller
314981ac71 [SPARC64]: Kill all %pstate changes in context switch code.
They are totally unnecessary because:

1) Interrupts are already disabled when switch_to()
   runs.

2) We don't use hard-coded alternate globals any longer.

This found a case in rtrap, which still assumed alternate
global %g6 was current_thread_info(), and that is fixed
by this changeset as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:49 -08:00
David S. Miller
936f482af1 [SPARC64]: Add initial code to twiddle %gl on trap entry/exit.
Instead of setting/clearing PSTATE_AG we have to change
the %gl register value on sun4v.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:48 -08:00
David S. Miller
d619d7f116 [SPARC64]: Add define for "GL" field of sun4v %tstate register.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:46 -08:00
David S. Miller
d96b81533b [SPARC64]: Add sun4v case to __GET_CPUID() patch tables.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:45 -08:00
David S. Miller
e1c21c4f47 [SPARC64]: Sun4v interrupt queue register definitions.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:44 -08:00
David S. Miller
277b6dd960 [SPARC64]: Sun4v scratchpad register layout.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:44 -08:00
David S. Miller
d398ee230f [SPARC64]: Sun4v specific ASI defines.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:43 -08:00
David S. Miller
30ddbdb033 [SPARC64]: Add Niagara init-store twin-load ASI defines.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:41 -08:00
David S. Miller
1633a53c79 [SPARC64]: Add 'hypervisor' to ultra_tlb_type enumeration.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:39 -08:00
David S. Miller
766f861fbb [SPARC64]: SUN4V hypervisor interface defines.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:37 -08:00
David S. Miller
314ef68597 [SPARC64]: Refine register window trap handling.
When saving and restoing trap state, do the window spill/fill
handling inline so that we never trap deeper than 2 trap levels.
This is important for chips like Niagara.

The window fixup code is massively simplified, and many more
improvements are now possible.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:36 -08:00
David S. Miller
ffe483d552 [SPARC64]: Add explicit register args to trap state loading macros.
This, as well as making the code cleaner, allows a simplification in
the TSB miss handling path.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:35 -08:00
David S. Miller
92704a1c63 [SPARC64]: Refine code sequences to get the cpu id.
On uniprocessor, it's always zero for optimize that.

On SMP, the jmpl to the stub kills the return address stack in the cpu
branch prediction logic, so expand the code sequence inline and use a
code patching section to fix things up.  This also always better and
explicit register selection, which will be taken advantage of in a
future changeset.

The hard_smp_processor_id() function is big, so do not inline it.

Fix up tests for Jalapeno to also test for Serrano chips too.  These
tests want "jbus Ultra-IIIi" cases to match, so that is what we should
test for.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:35 -08:00
David S. Miller
7bec08e38a [SPARC64]: Correctable ECC errors cannot occur at trap level > 0.
The are distrupting, which by the sparc v9 definition means they
can only occur when interrupts are enabled in the %pstate register.
This never occurs in any of the trap handling code running at
trap levels > 0.

So just mark it as an unexpected trap.

This allows us to kill off the cee_stuff member of struct thread_info.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:33 -08:00
David S. Miller
517af33237 [SPARC64]: Access TSB with physical addresses when possible.
This way we don't need to lock the TSB into the TLB.
The trick is that every TSB load/store is registered into
a special instruction patch section.  The default uses
virtual addresses, and the patch instructions use physical
address load/stores.

We can't do this on all chips because only cheetah+ and later
have the physical variant of the atomic quad load.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:32 -08:00
David S. Miller
b0fd4e49ae [SPARC64]: Kill out-of-date commentary in asm-sparc64/tsb.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:31 -08:00
David S. Miller
86b818687d [SPARC64]: Fix race in LOAD_PER_CPU_BASE()
Since we use %g5 itself as a temporary, it can get clobbered
if we take an interrupt mid-stream and thus cause end up with
the final %g5 value too early as a result of rtrap processing.

Set %g5 at the very end, atomically, to avoid this problem.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:29 -08:00
David S. Miller
2f7ee7c63f [SPARC64]: Increase swapper_tsb size to 32K.
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:26 -08:00
David S. Miller
a8b900d801 [SPARC64]: Kill sole argument passed to setup_tba().
No longer used, and move extern declaration to a header file.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:25 -08:00
David S. Miller
4753eb2ac7 [SPARC64]: Fix incorrect TSB lock bit handling.
The TSB_LOCK_BIT define is actually a special
value shifted down by 32-bits for the assembler
code macros.

In C code, this isn't what we want.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:21 -08:00
David S. Miller
b70c0fa161 [SPARC64]: Preload TSB entries from update_mmu_cache().
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:19 -08:00
David S. Miller
bd40791e1d [SPARC64]: Dynamically grow TSB in response to RSS growth.
As the RSS grows, grow the TSB in order to reduce the likelyhood
of hash collisions and thus poor hit rates in the TSB.

This definitely needs some serious tuning.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:18 -08:00
David S. Miller
98c5584cfc [SPARC64]: Add infrastructure for dynamic TSB sizing.
This also cleans up tsb_context_switch().  The assembler
routine is now __tsb_context_switch() and the former is
an inline function that picks out the bits from the mm_struct
and passes it into the assembler code as arguments.

setup_tsb_parms() computes the locked TLB entry to map the
TSB.  Later when we support using the physical address quad
load instructions of Cheetah+ and later, we'll simply use
the physical address for the TSB register value and set
the map virtual and PTE both to zero.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:17 -08:00
David S. Miller
09f94287f7 [SPARC64]: TSB refinements.
Move {init_new,destroy}_context() out of line.

Do not put huge pages into the TSB, only base page size translations.
There are some clever things we could do here, but for now let's be
correct instead of fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:16 -08:00
David S. Miller
56fb4df6da [SPARC64]: Elminate all usage of hard-coded trap globals.
UltraSPARC has special sets of global registers which are switched to
for certain trap types.  There is one set for MMU related traps, one
set of Interrupt Vector processing, and another set (called the
Alternate globals) for all other trap types.

For what seems like forever we've hard coded the values in some of
these trap registers.  Some examples include:

1) Interrupt Vector global %g6 holds current processors interrupt
   work struct where received interrupts are managed for IRQ handler
   dispatch.

2) MMU global %g7 holds the base of the page tables of the currently
   active address space.

3) Alternate global %g6 held the current_thread_info() value.

Such hardcoding has resulted in some serious issues in many areas.
There are some code sequences where having another register available
would help clean up the implementation.  Taking traps such as
cross-calls from the OBP firmware requires some trick code sequences
wherein we have to save away and restore all of the special sets of
global registers when we enter/exit OBP.

We were also using the IMMU TSB register on SMP to hold the per-cpu
area base address, which doesn't work any longer now that we actually
use the TSB facility of the cpu.

The implementation is pretty straight forward.  One tricky bit is
getting the current processor ID as that is different on different cpu
variants.  We use a stub with a fancy calling convention which we
patch at boot time.  The calling convention is that the stub is
branched to and the (PC - 4) to return to is in register %g1.  The cpu
number is left in %g6.  This stub can be invoked by using the
__GET_CPUID macro.

We use an array of per-cpu trap state to store the current thread and
physical address of the current address space's page tables.  The
TRAP_LOAD_THREAD_REG loads %g6 with the current thread from this
table, it uses __GET_CPUID and also clobbers %g1.

TRAP_LOAD_IRQ_WORK is used by the interrupt vector processing to load
the current processor's IRQ software state into %g6.  It also uses
__GET_CPUID and clobbers %g1.

Finally, TRAP_LOAD_PGD_PHYS loads the physical address base of the
current address space's page tables into %g7, it clobbers %g1 and uses
__GET_CPUID.

Many refinements are possible, as well as some tuning, with this stuff
in place.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:16 -08:00
David S. Miller
3c93646524 [SPARC64]: Kill pgtable quicklists and use SLAB.
Taking a nod from the powerpc port.

With the per-cpu caching of both the page allocator and SLAB, the
pgtable quicklist scheme becomes relatively silly and primitive.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:14 -08:00
David S. Miller
05e28f9de6 [SPARC64]: No need to D-cache color page tables any longer.
Unlike the virtual page tables, the new TSB scheme does not
require this ugly hack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:13 -08:00
David S. Miller
74bf4312ff [SPARC64]: Move away from virtual page tables, part 1.
We now use the TSB hardware assist features of the UltraSPARC
MMUs.

SMP is currently knowingly broken, we need to find another place
to store the per-cpu base pointers.  We hid them away in the TSB
base register, and that obviously will not work any more :-)

Another known broken case is non-8KB base page size.

Also noticed that flush_tlb_all() is not referenced anywhere, only
the internal __flush_tlb_all() (local cpu only) is used by the
sparc64 port, so we can get rid of flush_tlb_all().

The kernel gets it's own 8KB TSB (swapper_tsb) and each address space
gets it's own private 8K TSB.  Later we can add code to dynamically
increase the size of per-process TSB as the RSS grows.  An 8KB TSB is
good enough for up to about a 4MB RSS, after which the TSB starts to
incur many capacity and conflict misses.

We even accumulate OBP translations into the kernel TSB.

Another area for refinement is large page size support.  We could use
a secondary address space TSB to handle those.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 01:11:13 -08:00
David S. Miller
4d000d5b96 [SPARC64]: Mark __ex_table section correctly.
We must use the "a" (allocate) attribute every time we
emit an entry into the __ex_table section.

For consistency, use "a" instead of #alloc which is some
Solaris compat cruft GNU as provides on Sparc.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-04 23:23:56 -08:00
David S. Miller
7abea92145 [SPARC64]: Make cpu_present_map available earlier.
The change to kernel/sched.c's init code to use for_each_cpu()
requires that the cpu_possible_map be setup much earlier.

Set it up via setup_arch(), constrained to NR_CPUS, and later
constrain it to max_cpus in smp_prepare_cpus().

This fixes SMP booting on sparc64.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-26 19:36:00 -08:00
David S. Miller
043df59eb3 [SPARC64]: Implement futex_atomic_op_inuser().
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-26 19:35:58 -08:00
Michael S. Tsirkin
5f6164f309 [PATCH] add asm-generic/mman.h
Make new MADV_REMOVE, MADV_DONTFORK, MADV_DOFORK consistent across all
arches.  The idea is to make it possible to use them portably even before
distros include them in libc headers.

Move common flags to asm-generic/mman.h

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15 15:32:22 -08:00
Michael S. Tsirkin
f822566165 [PATCH] madvise MADV_DONTFORK/MADV_DOFORK
Currently, copy-on-write may change the physical address of a page even if the
user requested that the page is pinned in memory (either by mlock or by
get_user_pages).  This happens if the process forks meanwhile, and the parent
writes to that page.  As a result, the page is orphaned: in case of
get_user_pages, the application will never see any data hardware DMA's into
this page after the COW.  In case of mlock'd memory, the parent is not getting
the realtime/security benefits of mlock.

In particular, this affects the Infiniband modules which do DMA from and into
user pages all the time.

This patch adds madvise options to control whether memory range is inherited
across fork.  Useful e.g.  for when hardware is doing DMA from/into these
pages.  Could also be useful to an application wanting to speed up its forks
by cutting large areas out of consideration.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-14 16:09:34 -08:00
David S. Miller
40ad7a6afc [SPARC]: sys_newfstatat --> sys_fstatat64
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-12 23:30:11 -08:00
David S. Miller
1b9a428901 [SPARC]: Wire up sys_unshare().
Also, the Solaris syscall table is sized differrently,
and does not go beyond entry 255, so trim off the excess
entries.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-07 18:11:24 -08:00
David S. Miller
d3ed309a71 [SPARC64]: Implement __raw_read_trylock()
generic__raw_read_trylock() just does a raw_read_lock() so that
isn't very useful.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-23 21:03:56 -08:00
David S. Miller
2d7d5f0511 [SPARC]: Add support for *at(), ppoll, and pselect syscalls.
This also includes by necessity _TIF_RESTORE_SIGMASK support,
which actually resulted in a lot of cleanups.

The sparc signal handling code is quite a mess and I should
clean it up some day.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-19 02:42:49 -08:00
Eddie C. Dost
c126cf80d4 [SPARC64]: Serial Console for E250 Patch
From: Eddie C. Dost <ecd@brainaid.de>

I have the following patch for serial console over the RSC
(remote system controller) on my E250 machine. It basically adds
support for input-device=rsc and output-device=rsc from OBP, and
allows 115200,8,n,1,- serial mode setting.

Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-18 14:54:31 -08:00
Al Viro
26ecbdea4b [PATCH] sparc64: task_pt_regs()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:52 -08:00
Al Viro
f3169641c1 [PATCH] sparc64: task_thread_info()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:52 -08:00
Ingo Molnar
4dc7a0bbeb [PATCH] sched: add cacheflush() asm
Add per-arch sched_cacheflush() which is a write-back cacheflush used by
the migration-cost calibration code at bootup time.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-12 09:08:49 -08:00
Ananth N Mavinakayanahalli
0498b63504 [PATCH] kprobes: fix build breakage
The following patch (against 2.6.15-rc5-mm3) fixes a kprobes build break
due to changes introduced in the kprobe locking in 2.6.15-rc5-mm3.  In
addition, the patch reverts back the open-coding of kprobe_mutex.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Anil S Keshavamurthy
e597c2984c [PATCH] kprobes: arch_remove_kprobe
Currently arch_remove_kprobes() is only implemented/required for x86_64 and
powerpc.  All other architecture like IA64, i386 and sparc64 implementes a
dummy function which is being called from arch independent kprobes.c file.

This patch removes the dummy functions and replaces it with
#define arch_remove_kprobe(p, s)	do { } while(0)

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Anil S Keshavamurthy
41dead49cc [PATCH] kprobes: cleanup include/asm/kprobes.h
The arch specific kprobes.h files never gets included when CONFIG_KPROBES is
turned off.  Hence check for CONFIG_KPROBES is not appropriate here in this
arch specific kprobes.h files.

Also the below defined function kprobes_exception_notify() is not needed when
CONFIG_KPROBES is off.

Compile tested for both CONFIG_KPROBES=y and N.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10 08:01:40 -08:00
Arjan van de Ven
2acbb8c657 [PATCH] mutex subsystem, add default include/asm-*/mutex.h files
add the per-arch mutex.h files for the remaining architectures.

We default to asm-generic/mutex-dec.h, because that performs
quite well on most arches. Arches that do not have atomic
decrement/increment instructions should switch to mutex-xchg.h
instead. Arches can also provide their own implementation for
the mutex fastpath primitives.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2006-01-09 15:59:19 -08:00
Ingo Molnar
ffbf670f5c [PATCH] mutex subsystem, add atomic_xchg() to all arches
add atomic_xchg() to all the architectures. Needed by the new mutex code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
2006-01-09 15:59:17 -08:00
Andrew Morton
a136564702 [PATCH] remove gcc-2 checks
Remove various things which were checking for gcc-1.x and gcc-2.x compilers.

From: Adrian Bunk <bunk@stusta.de>

    Some documentation updates and removes some code paths for gcc < 3.2.

Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:14:02 -08:00
Jeff Dike
f8aaeacec1 [PATCH] consolidate asm/futex.h
Most of the architectures have the same asm/futex.h.  This consolidates them
into asm-generic, with the arches including it from their own asm/futex.h.

In the case of UML, this reverts the old broken futex.h and goes back to using
the same one as almost everyone else.

Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:39 -08:00
Ravikiran G Thirumalai
1fd73c6b67 [PATCH] Kill L1_CACHE_SHIFT_MAX
Kill L1_CACHE_SHIFT from all arches.  Since L1_CACHE_SHIFT_MAX is not used
anymore with the introduction of INTERNODE_CACHE, kill L1_CACHE_SHIFT_MAX.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:13:39 -08:00
Christoph Lameter
d3cb487149 [PATCH] atomic_long_t & include/asm-generic/atomic.h V2
Several counters already have the need to use 64 atomic variables on 64 bit
platforms (see mm_counter_t in sched.h).  We have to do ugly ifdefs to fall
back to 32 bit atomic on 32 bit platforms.

The VM statistics patch that I am working on will also make more extensive
use of atomic64.

This patch introduces a new type atomic_long_t by providing definitions in
asm-generic/atomic.h that works similar to the c "long" type.  Its 32 bits
on 32 bit platforms and 64 bits on 64 bit platforms.

Also cleans up the determination of the mm_counter_t in sched.h.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:29 -08:00
Badari Pulavarty
f6b3ec238d [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store
Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
given range of pages & its associated backing store.  Current
implementation supports only shmfs/tmpfs and other filesystems return
-ENOSYS.

"Some app allocates large tmpfs files, then when some task quits and some
client disconnect, some memory can be released.  However the only way to
release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli

Databases want to use this feature to drop a section of their bufferpool
(shared memory segments) - without writing back to disk/swap space.

This feature is also useful for supporting hot-plug memory on UML.

Concerns raised by Andrew Morton:

- "We have no plan for holepunching!  If we _do_ have such a plan (or
  might in the future) then what would the API look like?  I think
  sys_holepunch(fd, start, len), so we should start out with that."

- Using madvise is very weird, because people will ask "why do I need to
  mmap my file before I can stick a hole in it?"

- None of the other madvise operations call into the filesystem in this
  manner.  A broad question is: is this capability an MM operation or a
  filesytem operation?  truncate, for example, is a filesystem operation
  which sometimes has MM side-effects.  madvise is an mm operation and with
  this patch, it gains FS side-effects, only they're really, really
  significant ones."

Comments:

- Andrea suggested the fs operation too but then it's more efficient to
  have it as a mm operation with fs side effects, because they don't
  immediatly know fd and physical offset of the range.  It's possible to
  fixup in userland and to use the fs operation but it's more expensive,
  the vmas are already in the kernel and we can use them.

Short term plan &  Future Direction:

- We seem to need this interface only for shmfs/tmpfs files in the short
  term.  We have to add hooks into the filesystem for correctness and
  completeness.  This is what this patch does.

- In the future, plan is to support both fs and mmap apis also.  This
  also involves (other) filesystem specific functions to be implemented.

- Current patch doesn't support VM_NONLINEAR - which can be addressed in
  the future.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:33:22 -08:00
Stephen Hemminger
3821af2fe1 [FLS64]: generic version
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:06 -08:00
David S. Miller
5cd9194a1b [PATCH] sparc: convert IO remapping to VM_PFNMAP
Here are the Sparc bits.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-28 14:35:36 -08:00
Hugh Dickins
7c72aaf296 [PATCH] mm: fill arch atomic64 gaps
alpha, sparc64, x86_64 are each missing some primitives from their atomic64
support: fill in the gaps I've noticed by extrapolating asm, follow the
groupings in each file.  But powerpc and parisc still lack atomic64.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23 16:08:39 -08:00
Nick Piggin
8426e1f6af [PATCH] atomic: inc_not_zero
Introduce an atomic_inc_not_zero operation.  Make this a special case of
atomic_add_unless because lockless pagecache actually wants
atomic_inc_not_negativeone due to its offset refcount.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:16 -08:00
Nick Piggin
4a6dae6d38 [PATCH] atomic: cmpxchg
Introduce an atomic_cmpxchg operation.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-13 18:14:16 -08:00
Hugh Dickins
59871bcd11 [SPARC64] mm: simpler tlb_flush_mmu
Minor simplification to the sparc64 tlb_flush_mmu: tlb_remove_page
set need_flush only after handling the tlb_fast_mode case, then
tlb_flush_mmu need not consider whether it's tlb_fast_mode.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:12:08 -08:00
Christoph Hellwig
59f85dc95e [SPARC]: remove vuid_event.h
I don't know if we ever implemented this, but the only user in any 2.6
tree are the compat ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:11:38 -08:00
Christoph Hellwig
e1413315b8 [SPARC]: remove kbio.h
The old keyboard driver is gone in 2.6, so the only user left are the
compat ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:11:25 -08:00
Christoph Hellwig
9d3c7d1bfd [SPARC]: remove audioio.h
The old sound drivers are gone in 2.6, so the only user left are the
compat ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:11:14 -08:00
Stephen Rothwell
d16436e686 [SPARC]: remove duplicate TIOCPKT_ definitions
The TIOCPKT_ macros are defined by all other architectures in asm/ioctls.h
and so does sparc and sparc64, so reomve the duplicates in asm/termios.h.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:10:42 -08:00
David S. Miller
62dbec78be [SPARC64] mm: Do not flush TLB mm in tlb_finish_mmu()
It isn't needed any longer, as noted by Hugh Dickins.

We still need the flush routines, due to the one remaining
call site in hugetlb_prefault_arch_hook().  That can be
eliminated at some later point, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:09:58 -08:00
Georg Chini
b128254fdb [SPARC]: More abstractions and cleanups of dma handling in cs4231.
From: Georg Chini <georg.chini@triaton-webhosting.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:09:19 -08:00
Hugh Dickins
dedeb0029b [SPARC64] mm: context switch ptlock
sparc64 is unique among architectures in taking the page_table_lock in
its context switch (well, cris does too, but erroneously, and it's not
yet SMP anyway).

This seems to be a private affair between switch_mm and activate_mm,
using page_table_lock as a per-mm lock, without any relation to its uses
elsewhere.  That's fine, but comment it as such; and unlock sooner in
switch_mm, more like in activate_mm (preemption is disabled here).

There is a block of "if (0)"ed code in smp_flush_tlb_pending which would
have liked to rely on the page_table_lock, in switch_mm and elsewhere;
but its comment explains how dup_mmap's flush_tlb_mm defeated it.  And
though that could have been changed at any time over the past few years,
now the chance vanishes as we push the page_table_lock downwards, and
perhaps split it per page table page.  Just delete that block of code.

Which leaves the mysterious spin_unlock_wait(&oldmm->page_table_lock)
in kernel/fork.c copy_mm.  Textual analysis (supported by Nick Piggin)
suggests that the comment was written by DaveM, and that it relates to
the defeated approach in the sparc64 smp_flush_tlb_pending.  Just delete
this block too.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-07 14:09:01 -08:00
Ananth N Mavinakayanahalli
f215d985e9 [PATCH] Kprobes: Track kprobe on a per_cpu basis - sparc64 changes
Sparc64 changes to track kprobe execution on a per-cpu basis.  We now track
the kprobe state machine independently on each cpu using an arch specific
kprobe control block.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:46 -08:00
Christoph Hellwig
481bed4542 [PATCH] consolidate sys_ptrace()
The sys_ptrace boilerplate code (everything outside the big switch
statement for the arch-specific requests) is shared by most architectures.
This patch moves it to kernel/ptrace.c and leaves the arch-specific code as
arch_ptrace.

Some architectures have a too different ptrace so we have to exclude them.
They continue to keep their implementations.  For sh64 I had to add a
sh64_ptrace wrapper because it does some initialization on the first call.
For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but
SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-By: David Howells <dhowells@redhat.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07 07:53:42 -08:00
Arthur Othieno
727a53bd53 [PATCH] semaphore: Remove __MUTEX_INITIALIZER()
__MUTEX_INITIALIZER() has no users, and equates to the more commonly used
DECLARE_MUTEX(), thus making it pretty much redundant.  Remove it for good.

Signed-off-by: Arthur Othieno <a.othieno@bluewin.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:27 -08:00
Tejun Heo
1426d7a81d [PATCH] vm: remove unused/broken page_pte[_prot] macros
This patch removes page_pte_prot and page_pte macros from all
architectures.  Some architectures define both, some only page_pte (broken)
and others none.  These macros are not used anywhere.

page_pte_prot(page, prot) is identical to mk_pte(page, prot) and
page_pte(page) is identical to page_pte_prot(page, __pgprot(0)).

* The following architectures define both page_pte_prot and page_pte

  arm, arm26, ia64, sh64, sparc, sparc64

* The following architectures define only page_pte (broken)

  frv, i386, m32r, mips, sh, x86-64

* All other architectures define neither

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:22 -08:00
Hugh Dickins
fc2acab31b [PATCH] mm: tlb_finish_mmu forget rss
zap_pte_range has been counting the pages it frees in tlb->freed, then
tlb_finish_mmu has used that to update the mm's rss.  That got stranger when I
added anon_rss, yet updated it by a different route; and stranger when rss and
anon_rss became mm_counters with special access macros.  And it would no
longer be viable if we're relying on page_table_lock to stabilize the
mm_counter, but calling tlb_finish_mmu outside that lock.

Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
business, just decrement the rss mm_counter in zap_pte_range (yes, there was
some point to batching the update, and a subsequent patch restores that).  And
forget the anal paranoia of first reading the counter to avoid going negative
- if rss does go negative, just fix that bug.

Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
was being made of them.  But arm26 alone was actually using the freed, in the
way some others use need_flush: give it a need_flush.  arm26 seems to prefer
spaces to tabs here: respect that.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:37 -07:00
Hugh Dickins
4d6ddfa924 [PATCH] mm: tlb_is_full_mm was obscure
tlb_is_full_mm?  What does that mean?  The TLB is full?  No, it means that the
mm's last user has gone and the whole mm is being torn down.  And it's an
inline function because sparc64 uses a different (slightly better)
"tlb_frozen" name for the flag others call "fullmm".

And now the ptep_get_and_clear_full macro used in zap_pte_range refers
directly to tlb->fullmm, which would be wrong for sparc64.  Rather than
correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
sparc64 to just use the same poor name as everyone else - is that okay?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:37 -07:00
Hugh Dickins
15a23ffa2f [PATCH] mm: tlb_gather_mmu get_cpu_var
tlb_gather_mmu dates from before kernel preemption was allowed, and uses
smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather.  That works
because it's currently only called after getting page_table_lock, which is not
dropped until after the matching tlb_finish_mmu.  But don't rely on that, it
will soon change: now disable preemption internally by proper get_cpu_var in
tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:37 -07:00
Rik Van Riel
eb92f4ef32 [PATCH] add sem_is_read/write_locked()
Add sem_is_read/write_locked functions to the read/write semaphores, along the
same lines of the *_is_locked spinlock functions.  The swap token tuning patch
uses sem_is_read_locked; sem_is_write_locked is added for completeness.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 21:40:35 -07:00
Al Viro
970a9e73f9 [PATCH] gfp_t: dma-mapping (simple cases)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28 08:16:49 -07:00
David S. Miller
688cb30bdc [SPARC64]: Eliminate PCI IOMMU dma mapping size limit.
The hairy fast allocator in the sparc64 PCI IOMMU code
has a hard limit of 256 pages.  Certain devices can
exceed this when performing very large I/Os.

So replace with a more simple allocator, based largely
upon the arch/ppc64/kernel/iommu.c code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13 22:15:24 -07:00
David S. Miller
51e8513615 [SPARC64]: Consolidate common PCI IOMMU init code.
All the PCI controller drivers were doing the same thing
setting up the IOMMU software state, put it all in one spot.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-10-13 21:10:08 -07:00
David S. Miller
4cb29d1812 [SPARC64]: Kill arch/sparc64/prom/memory.c
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29 18:05:28 -07:00
David S. Miller
13edad7a5c [SPARC64]: Rewrite convoluted physical memory probing.
Delete all of the code working with sp_banks[] and replace
with clean acquisition and sorting of physical memory
parameters from the firmware.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-29 17:58:26 -07:00
David S. Miller
10147570f9 [SPARC64]: Kill all external references to sp_banks[]
Thus, we can mark sp_banks[] static in arch/sparc64/mm/init.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 21:46:43 -07:00
David S. Miller
801ab3c731 [SPARC]: Declare paging_init() in asm/pgtable.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 21:31:25 -07:00
David S. Miller
efdc1e2083 [SPARC64]: Simplify user fault fixup handling.
Instead of doing byte-at-a-time user accesses to figure
out where the fault occurred, read the saved fault_address
from the current thread structure.

For the sake of defensive programming, if the fault_address
does not fall into the user buffer range, simply assume the
whole area faulted.  This will cause the fixup for
copy_from_user() to clear the entire kernel side buffer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 21:06:47 -07:00
David S. Miller
5fd29752f0 [SPARC64]: Fix fault handling in unaligned trap handler.
We were not calling kernel_mna_trap_fault() correctly.
Instead of being fancy, just return 0 vs. -EFAULT from
the assembler stubs, and handle that return value as
appropriate.

Create an "__retl_efault" stub for assembler exception
table entries and use it where possible.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 20:41:45 -07:00
David S. Miller
8cf14af0a7 [SPARC64]: Convert to use generic exception table support.
The funny "range" exception table entries we had were only
used by the compat layer socketcall assembly, and it wasn't
even needed there.

For free we now get proper exception table sorting and fast
binary searching.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-28 20:21:11 -07:00
David S. Miller
d2212bc7db [SPARC64]: Add missing IDs for newer cpus.
Also, the us3_cpufreq driver can work on Ultra-IV and IV+.
They use the SAFARI bus register to control the clock divider
just like Ultra-III and III+ do.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-27 22:50:06 -07:00
David S. Miller
f16af555cc [SPARC64]: Add defines for 32MB/256MB PTE page size on Ultra-IV+.
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-27 22:37:08 -07:00
David S. Miller
80dc0d6b44 [SPARC64]: Probe D/I/E-cache config and use.
At boot time, determine the D-cache, I-cache and E-cache size and
line-size.  Use them in cache flushes when appropriate.

This change was motivated by discovering that the D-cache on
UltraSparc-IIIi and later are 64K not 32K, and the flushes done by the
Cheetah error handlers were assuming a 32K size.

There are still some pieces of code that are hard coding things and
will need to be fixed up at some point.

While we're here, fix the D-cache and I-cache parity error handlers
to run with interrupts disabled, and when the trap occurs at trap
level > 1 log the event via a counter displayed in /proc/cpuinfo.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-26 00:32:17 -07:00
David S. Miller
5642530651 [SPARC64]: Add CONFIG_DEBUG_PAGEALLOC support.
The trick is that we do the kernel linear mapping TLB miss starting
with an instruction sequence like this:

	ba,pt		%xcc, kvmap_load
	 xor		%g2, %g4, %g5

succeeded by an instruction sequence which performs a full page table
walk starting at swapper_pg_dir.

We first take over the trap table from the firmware.  Then, using this
constant PTE generation for the linear mapping area above, we build
the kernel page tables for the linear mapping.

After this is setup, we patch that branch above into a "nop", which
will cause TLB misses to fall through to the full page table walk.

With this, the page unmapping for CONFIG_DEBUG_PAGEALLOC is trivial.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-25 16:46:57 -07:00
David S. Miller
bff06d5522 [SPARC64]: Rewrite bootup sequence.
Instead of all of this cpu-specific code to remap the kernel
to the correct location, use portable firmware calls to do
this instead.

What we do now is the following in position independant
assembler:

	chosen_node = prom_finddevice("/chosen");
	prom_mmu_ihandle_cache = prom_getint(chosen_node, "mmu");
	vaddr = 4MB_ALIGN(current_text_addr());
	prom_translate(vaddr, &paddr_high, &paddr_low, &mode);
	prom_boot_mapping_mode = mode;
	prom_boot_mapping_phys_high = paddr_high;
	prom_boot_mapping_phys_low = paddr_low;
	prom_map(-1, 8 * 1024 * 1024, KERNBASE, paddr_low);

and that replaces the massive amount of by-hand TLB probing and
programming we used to do here.

The new code should also handle properly the case where the kernel
is mapped at the correct address already (think: future kexec
support).

Consequently, the bulk of remap_kernel() dies as does the entirety
of arch/sparc64/prom/map.S

We try to share some strings in the PROM library with the ones used
at bootup, and while we're here mark input strings to oplib.h routines
with "const" when appropriate.

There are many more simplifications now possible.  For one thing, we
can consolidate the two copies we now have of a lot of cpu setup code
sitting in head.S and trampoline.S.

This is a significant step towards CONFIG_DEBUG_PAGEALLOC support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-22 20:11:33 -07:00
Paolo 'Blaisorblade' Giarrusso
676067cfea [PATCH] Remove unused var from asm/futex.h
As recently done by Russell King for ARM, commit
4732efbeb9 introduces a generic asm/futex.h copied
along most arches, which includes a "-ENOSYS support" to be changed if needed.
However, it includes an unused var (taken from the "real" version) which GCC
warns about.

Remove it from all arches having that file version (i.e. same GIT id).
$ git-diff-tree -r HEAD
and
$ git-ls-tree  -r HEAD include/|grep 9feff4ce14

may be more interesting than looking at the patch itself, to make sure I've
just copied the arm header to all other archs having the original dummy version
of this file.

Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-21 16:16:29 -07:00
David S. Miller
729b4f7de6 [SPARC64]: Verify vmalloc TLB misses more strictly.
Arrange the modules, OBP, and vmalloc areas such that a range
verification can be done quite minimally.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-20 12:18:38 -07:00
David S. Miller
6a9b490d5f [SPARC64]: Move DCACHE_ALIASING_POSSIBLE define to asm/page.h
This showed that arch/sparc64/kernel/ptrace.c was not getting
the define properly, and thus the code protected by this ifdef
was never actually compiled before.  So fix that too.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-19 20:11:57 -07:00
Ingo Molnar
fb1c8f93d8 [PATCH] spinlock consolidation
This patch (written by me and also containing many suggestions of Arjan van
de Ven) does a major cleanup of the spinlock code.  It does the following
things:

 - consolidates and enhances the spinlock/rwlock debugging code

 - simplifies the asm/spinlock.h files

 - encapsulates the raw spinlock type and moves generic spinlock
   features (such as ->break_lock) into the generic code.

 - cleans up the spinlock code hierarchy to get rid of the spaghetti.

Most notably there's now only a single variant of the debugging code,
located in lib/spinlock_debug.c.  (previously we had one SMP debugging
variant per architecture, plus a separate generic one for UP builds)

Also, i've enhanced the rwlock debugging facility, it will now track
write-owners.  There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which will work for both soft and hard
spin/rwlock lockups.

The arch-level include files now only contain the minimally necessary
subset of the spinlock code - all the rest that can be generalized now
lives in the generic headers:

 include/asm-i386/spinlock_types.h       |   16
 include/asm-x86_64/spinlock_types.h     |   16

I have also split up the various spinlock variants into separate files,
making it easier to see which does what. The new layout is:

   SMP                         |  UP
   ----------------------------|-----------------------------------
   asm/spinlock_types_smp.h    |  linux/spinlock_types_up.h
   linux/spinlock_types.h      |  linux/spinlock_types.h
   asm/spinlock_smp.h          |  linux/spinlock_up.h
   linux/spinlock_api_smp.h    |  linux/spinlock_api_up.h
   linux/spinlock.h            |  linux/spinlock.h

/*
 * here's the role of the various spinlock/rwlock related include files:
 *
 * on SMP builds:
 *
 *  asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
 *                        initializers
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  asm/spinlock.h:       contains the __raw_spin_*()/etc. lowlevel
 *                        implementations, mostly inline assembly code
 *
 *   (also included on UP-debug builds:)
 *
 *  linux/spinlock_api_smp.h:
 *                        contains the prototypes for the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 *
 * on UP builds:
 *
 *  linux/spinlock_type_up.h:
 *                        contains the generic, simplified UP spinlock type.
 *                        (which is an empty structure on non-debug builds)
 *
 *  linux/spinlock_types.h:
 *                        defines the generic type and initializers
 *
 *  linux/spinlock_up.h:
 *                        contains the __raw_spin_*()/etc. version of UP
 *                        builds. (which are NOPs on non-debug, non-preempt
 *                        builds)
 *
 *   (included on UP-non-debug builds:)
 *
 *  linux/spinlock_api_up.h:
 *                        builds the _spin_*() APIs.
 *
 *  linux/spinlock.h:     builds the final spin_*() APIs.
 */

All SMP and UP architectures are converted by this patch.

arm, i386, ia64, ppc, ppc64, s390/s390x, x64 was build-tested via
crosscompilers.  m32r, mips, sh, sparc, have not been tested yet, but should
be mostly fine.

From: Grant Grundler <grundler@parisc-linux.org>

  Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
  Builds 32-bit SMP kernel (not booted or tested).  I did not try to build
  non-SMP kernels.  That should be trivial to fix up later if necessary.

  I converted bit ops atomic_hash lock to raw_spinlock_t.  Doing so avoids
  some ugly nesting of linux/*.h and asm/*.h files.  Those particular locks
  are well tested and contained entirely inside arch specific code.  I do NOT
  expect any new issues to arise with them.

 If someone does ever need to use debug/metrics with them, then they will
  need to unravel this hairball between spinlocks, atomic ops, and bit ops
  that exist only because parisc has exactly one atomic instruction: LDCW
  (load and clear word).

From: "Luck, Tony" <tony.luck@intel.com>

   ia64 fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjanv@infradead.org>
Signed-off-by: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Mikael Pettersson <mikpe@csd.uu.se>
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10 10:06:21 -07:00
Linus Torvalds
7bbedd5213 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6 2005-09-08 15:55:23 -07:00
David S. Miller
085ae41f66 [PATCH] Make sparc64 use setup-res.c
There were three changes necessary in order to allow
sparc64 to use setup-res.c:

1) Sparc64 roots the PCI I/O and MEM address space using
   parent resources contained in the PCI controller structure.
   I'm actually surprised no other platforms do this, especially
   ones like Alpha and PPC{,64}.  These resources get linked into the
   iomem/ioport tree when PCI controllers are probed.

   So the hierarchy looks like this:

   iomem --|
	   PCI controller 1 MEM space --|
				        device 1
					device 2
					etc.
	   PCI controller 2 MEM space --|
				        ...
   ioport --|
            PCI controller 1 IO space --|
					...
            PCI controller 2 IO space --|
					...

   You get the idea.  The drivers/pci/setup-res.c code allocates
   using plain iomem_space and ioport_space as the root, so that
   wouldn't work with the above setup.

   So I added a pcibios_select_root() that is used to handle this.
   It uses the PCI controller struct's io_space and mem_space on
   sparc64, and io{port,mem}_resource on every other platform to
   keep current behavior.

2) quirk_io_region() is buggy.  It takes in raw BUS view addresses
   and tries to use them as a PCI resource.

   pci_claim_resource() expects the resource to be fully formed when
   it gets called.  The sparc64 implementation would do the translation
   but that's absolutely wrong, because if the same resource gets
   released then re-claimed we'll adjust things twice.

   So I fixed up quirk_io_region() to do the proper pcibios_bus_to_resource()
   conversion before passing it on to pci_claim_resource().

3) I was mistakedly __init'ing the function methods the PCI controller
   drivers provide on sparc64 to implement some parts of these
   routines.  This was, of course, easy to fix.

So we end up with the following, and that nasty SPARC64 makefile
ifdef in drivers/pci/Makefile is finally zapped.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2005-09-08 14:57:25 -07:00
David S. Miller
4d803fcdcd [SPARC64]: Inline membar()'s again.
Since GCC has to emit a call and a delay slot to the
out-of-line "membar" routines in arch/sparc64/lib/mb.S
it is much better to just do the necessary predicted
branch inline instead as:

	ba,pt	%xcc, 1f
	 membar	#whatever
1:

instead of the current:

	call	membar_foo
	 dslot

because this way GCC is not required to allocate a stack
frame if the function can be a leaf function.

This also makes this bug fix easier to backport to 2.4.x

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-08 14:37:53 -07:00
Stephen Rothwell
5ac353f9ba [PATCH] Clean up struct flock definitions
This patch just gathers together all the struct flock definitions except
xtensa into asm-generic/fcntl.h.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:38 -07:00
Stephen Rothwell
1abf62afb6 [PATCH] Clean up the fcntl operations
This patch puts the most popular of each fcntl operation/flag into
asm-generic/fcntl.h and cleans up the arch files.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:38 -07:00
Stephen Rothwell
e64ca97fd8 [PATCH] Clean up the open flags
This patch puts the most popular of each open flag into asm-generic/fcntl.h
and cleans up the arch files.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:38 -07:00
Stephen Rothwell
9317259ead [PATCH] Create asm-generic/fcntl.h
This set of patches creates asm-generic/fcntl.h and consolidates as much as
possible from the asm-*/fcntl.h files into it.

This patch just gathers all the identical bits of the asm-*/fcntl.h files into
asm-generic/fcntl.h.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:37 -07:00
Jesper Juhl
97de50c0ad [PATCH] remove verify_area(): remove verify_area() from various uaccess.h headers
Remove the deprecated (and unused) verify_area() from various uaccess.h
headers.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:35 -07:00
Christoph Hellwig
c8d127418d [PATCH] remove asm-*/hdreg.h
unused and useless..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:30 -07:00
H. J. Lu
36d57ac4a8 [PATCH] auxiliary vector cleanups
The size of auxiliary vector is fixed at 42 in linux/sched.h.  But it isn't
very obvious when looking at linux/elf.h.  This patch adds AT_VECTOR_SIZE
so that we can change it if necessary when a new vector is added.

Because of include file ordering problems, doing this necessitated the
extraction of the AT_* symbols into a standalone header file.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:21 -07:00
Stephen Rothwell
202e5979af [PATCH] compat: be more consistent about [ug]id_t
When I first wrote the compat layer patches, I was somewhat cavalier about
the definition of compat_uid_t and compat_gid_t (or maybe I just
misunderstood :-)).  This patch makes the compat types much more consistent
with the types we are being compatible with and hopefully will fix a few
bugs along the way.

	compat type		type in compat arch
	__compat_[ug]id_t	__kernel_[ug]id_t
	__compat_[ug]id32_t	__kernel_[ug]id32_t
	compat_[ug]id_t		[ug]id_t

The difference is that compat_uid_t is always 32 bits (for the archs we
care about) but __compat_uid_t may be 16 bits on some.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:19 -07:00
Jakub Jelinek
4732efbeb9 [PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedup
ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter
(which at least on UP usually means an immediate context switch to one of
the waiter threads).  This waiter wakes up and after a few instructions it
attempts to acquire the cv internal lock, but that lock is still held by
the thread calling pthread_cond_signal.  So it goes to sleep and eventually
the signalling thread is scheduled in, unlocks the internal lock and wakes
the waiter again.

Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal
to avoid this performance issue, but it was removed when locks were
redesigned to the 3 state scheme (unlocked, locked uncontended, locked
contended).

Following scenario shows why simply using FUTEX_REQUEUE in
pthread_cond_signal together with using lll_mutex_unlock_force in place of
lll_mutex_unlock is not enough and probably why it has been disabled at
that time:

The number is value in cv->__data.__lock.
        thr1            thr2            thr3
0       pthread_cond_wait
1       lll_mutex_lock (cv->__data.__lock)
0       lll_mutex_unlock (cv->__data.__lock)
0       lll_futex_wait (&cv->__data.__futex, futexval)
0                       pthread_cond_signal
1                       lll_mutex_lock (cv->__data.__lock)
1                                       pthread_cond_signal
2                                       lll_mutex_lock (cv->__data.__lock)
2                                         lll_futex_wait (&cv->__data.__lock, 2)
2                       lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock)
                          # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE
2                       lll_mutex_unlock_force (cv->__data.__lock)
0                         cv->__data.__lock = 0
0                         lll_futex_wake (&cv->__data.__lock, 1)
1       lll_mutex_lock (cv->__data.__lock)
0       lll_mutex_unlock (cv->__data.__lock)
          # Here, lll_mutex_unlock doesn't know there are threads waiting
          # on the internal cv's lock

Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal,
but it will cost us not one, but 2 extra syscalls and, what's worse, one of
these extra syscalls will be done for every single waiting loop in
pthread_cond_*wait.

We would need to use lll_mutex_unlock_force in pthread_cond_signal after
requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait.

Another alternative is to do the unlocking pthread_cond_signal needs to do
(the lock can't be unlocked before lll_futex_wake, as that is racy) in the
kernel.

I have implemented both variants, futex-requeue-glibc.patch is the first
one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel.
 The kernel interface allows userland to specify how exactly an unlocking
operation should look like (some atomic arithmetic operation with optional
constant argument and comparison of the previous futex value with another
constant).

It has been implemented just for ppc*, x86_64 and i?86, for other
architectures I'm including just a stub header which can be used as a
starting point by maintainers to write support for their arches and ATM
will just return -ENOSYS for FUTEX_WAKE_OP.  The requeue patch has been
(lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running
32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL.

With the following benchmark on UP x86-64 I get:

for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \
for j in 1 2; do echo ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done
time elf/ld.so --library-path .:nptl-orig /tmp/bench
real 0m0.655s user 0m0.253s sys 0m0.403s
real 0m0.657s user 0m0.269s sys 0m0.388s
time elf/ld.so --library-path .:nptl-requeue /tmp/bench
real 0m0.496s user 0m0.225s sys 0m0.271s
real 0m0.531s user 0m0.242s sys 0m0.288s
time elf/ld.so --library-path .:nptl-wake_op /tmp/bench
real 0m0.380s user 0m0.176s sys 0m0.204s
real 0m0.382s user 0m0.175s sys 0m0.207s

The benchmark is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt
Older futex-requeue-glibc.patch version is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt
Older futex-wake_op-glibc.patch version is at:
http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt
Will post a new version (just x86-64 fixes so that the patch
applies against pthread_cond_signal.S) to libc-hacker ml soon.

Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded
testcase that will not test the atomicity of the operation, but at least
check if the threads that should have been woken up are woken up and
whether the arithmetic operation in the kernel gave the expected results.

Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jamie Lokier <jamie@shareable.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07 16:57:17 -07:00
Linus Torvalds
e766f1cc59 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 2005-09-05 00:12:58 -07:00
Kyle Moffett
fa5b08d5f8 [PATCH] sab: consolidate kmem_bufctl_t
This is used only in slab.c and each architecture gets to define whcih
underlying type is to be used.

Seems a bit silly - move it to slab.c and use the same type for all
architectures: unsigned int.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:48 -07:00
Stephen Rothwell
fd4fd5aac1 [PATCH] mm: consolidate get_order
Someone mentioned that almost all the architectures used basically the same
implementation of get_order.  This patch consolidates them into
asm-generic/page.h and includes that in the appropriate places.  The
exceptions are ia64 and ppc which have their own (presumably optimised)
versions.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05 00:05:39 -07:00
David S. Miller
a7a6cac204 [SPARC]: Kill io_remap_page_range()
It's been deprecated long enough and there are no in-tree
users any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-09-01 21:51:26 -07:00
David S. Miller
8a36895c0d [SPARC64]: Use 'unsigned long' for port argument to I/O string ops.
This kills warnings when building drivers/ide/ide-iops.c
and puts us in-line with what other platforms do here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-31 15:01:33 -07:00
David S. Miller
d7ce78fd9a [SPARC64]: Eliminate irq_cpustat_t.
We can put the __softirq_pending mask in the cpudata,
no need for the silly NR_CPUS array in kernel/softirq.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 22:46:43 -07:00
Patrick McHardy
b0573dea1f [NET]: Introduce SO_{SND,RCV}BUFFORCE socket options
Allows overriding of sysctl_{wmem,rmrm}_max

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 15:31:35 -07:00
David S. Miller
4f07118f65 [SPARC64]: More fully work around Spitfire Errata 51.
It appears that a memory barrier soon after a mispredicted
branch, not just in the delay slot, can cause the hang
condition of this cpu errata.

So move them out-of-line, and explicitly put them into
a "branch always, predict taken" delay slot which should
fully kill this problem.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 12:46:22 -07:00
David S. Miller
442464a500 [SPARC64]: Make debugging spinlocks usable again.
When the spinlock routines were moved out of line into
kernel/spinlock.c this made it so that the debugging
spinlocks record lock acquisition program counts in the
kernel/spinlock.c functions not in their callers.
This makes the debugging info kind of useless.

So record the correct caller's program counter and
now this feature is useful once more.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 12:46:07 -07:00
Kumar Gala
3d6364abcf [SPARC64]: remove use of asm/segment.h
Removed sparc64 architecture specific users of asm/segment.h and
asm-sparc64/segment.h itself

Signed-off-by: Kumar Gala <kumar.gala@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 12:45:30 -07:00
David S. Miller
6c52a96e6c [SPARC64]: Revamp Spitfire error trap handling.
Current uncorrectable error handling was poor enough
that the processor could just loop taking the same
trap over and over again.  Fix things up so that we
at least get a log message and perhaps even some register
state.

In the process, much consolidation became possible,
particularly with the correctable error handler.

Prefix assembler and C function names with "spitfire"
to indicate that these are for Ultra-I/II/IIi/IIe only.

More work is needed to make these routines robust and
featureful to the level of the Ultra-III error handlers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29 12:45:11 -07:00
David S. Miller
a3f9985843 [SPARC64]: Move kernel unaligned trap handlers into assembler file.
GCC 4.x really dislikes the games we are playing in
unaligned.c, and the cleanest way to fix this is to
move things into assembler.

Noted by Al Viro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-19 15:55:33 -07:00
David S. Miller
40a085c41d [SPARC]: Add inotify syscall entries.
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-27 14:14:39 -07:00
Eric W. Biederman
7c9034735e [PATCH] Add emergency_restart()
When the kernel is working well and we want to restart cleanly
kernel_restart is the function to use.   But in many instances
the kernel wants to reboot when thing are expected to be working
very badly such as from panic or a software watchdog handler.

This patch adds the function emergency_restart() so that
callers can be clear what semantics they expect when calling
restart.  emergency_restart() is expected to be callable
from interrupt context and possibly reliable in even more
trying circumstances.

This is an initial generic implementation for all architectures.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-26 14:35:41 -07:00
David S. Miller
db7d9a4eb7 [SPARC64]: Move syscall success and newchild state out of thread flags.
These two bits were accesses non-atomically from assembler
code.  So, in order to eliminate any potential races resulting
from that, move these pieces of state into two bytes elsewhere
in struct thread_info.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-24 19:36:26 -07:00
David S. Miller
cdd5186f75 [SPARC64]: Privatize sun5_timer.
It is only used by some localized code in irq.c, and also
delete enable_prom_timer() as that is totally unused.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-24 19:36:13 -07:00
David S. Miller
c5019a578f [SPARC64]: Kill totally unused inline functions from asm/spitfire.h
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-24 19:35:56 -07:00
David S. Miller
620de54675 [SPARC64]: Simplify asm/rwsem.h slightly.
rwsem_atomic_update and rwsem_atomic_add can be implemented
straightly using atomic_*() routines.

Also, rwsem_cmpxchgw() is totally unused, kill it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-24 19:35:42 -07:00