Extend the parameters of migrate_pages() to allow the caller control over the
fate of successfully migrated or impossible to migrate pages.
Swap migration and direct migration will have the same interface after this
patch so that patches can be independently applied to the policy layer and the
core migration code.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add gfp_mask to add_to_swap
add_to_swap does allocations with GFP_ATOMIC in order not to interfere with
swapping. During migration we may have use add_to_swap extensively which may
lead to out of memory errors.
This patch makes add_to_swap take a parameter that specifies the gfp mask.
The page migration code can then make add_to_swap use GFP_KERNEL.
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move move_to_lru, putback_lru_pages and isolate_lru in section surrounded by
CONFIG_MIGRATION saving some codesize for single processor kernels.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_migrate_pages implementation using swap based page migration
This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
The intent of sys_migrate is to migrate memory of a process. A process may
have migrated to another node. Memory was allocated optimally for the prior
context. sys_migrate_pages allows to shift the memory to the new node.
sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory. Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed. However, a user may decide
to manually control the migration.
This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory. sys_migrate_pages does not modify policies in contrast to Ray's
implementation.
The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset). When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets. The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.
Patch supports ia64, i386 and x86_64.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add page migration support via swap to the NUMA policy layer
This patch adds page migration support to the NUMA policy layer. An
additional flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is
specified then pages that do not conform to the memory policy will be evicted
from memory. When they get pages back in new pages will be allocated
following the numa policy.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Include page migration if the system is NUMA or having a memory model that
allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM).
And:
- Only include lru_add_drain_per_cpu if building for an SMP system.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds the basic page migration function with a minimal implementation that
only allows the eviction of pages to swap space.
Page eviction and migration may be useful to migrate pages, to suspend
programs or for remapping single pages (useful for faulty pages or pages with
soft ECC failures)
The process is as follows:
The function wanting to migrate pages must first build a list of pages to be
migrated or evicted and take them off the lru lists via isolate_lru_page().
isolate_lru_page determines that a page is freeable based on the LRU bit set.
Then the actual migration or swapout can happen by calling migrate_pages().
migrate_pages does its best to migrate or swapout the pages and does multiple
passes over the list. Some pages may only be swappable if they are not dirty.
migrate_pages may start writing out dirty pages in the initial passes over
the pages. However, migrate_pages may not be able to migrate or evict all
pages for a variety of reasons.
The remaining pages may be returned to the LRU lists using putback_lru_pages().
Changelog V4->V5:
- Use the lru caches to return pages to the LRU
Changelog V3->V4:
- Restructure code so that applying patches to support full migration does
require minimal changes. Rename swapout_pages() to migrate_pages().
Changelog V2->V3:
- Extract common code from shrink_list() and swapout_pages()
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Michael Kerrisk" <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add PF_SWAPWRITE to control a processes permission to write to swap.
- Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
and pdflush
- Set PF_SWAPWRITE flag for kswapd and pdflush
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the start of the `swap migration' patch series.
Swap migration allows the moving of the physical location of pages between
nodes in a numa system while the process is running. This means that the
virtual addresses that the process sees do not change. However, the system
rearranges the physical location of those pages.
The main intent of page migration patches here is to reduce the latency of
memory access by moving pages near to the processor where the process
accessing that memory is running.
The patchset allows a process to manually relocate the node on which its
pages are located through the MF_MOVE and MF_MOVE_ALL options while
setting a new memory policy.
The pages of process can also be relocated from another process using the
sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages
function call takes two sets of nodes and moves pages of a process that are
located on the from nodes to the destination nodes.
Manual migration is very useful if for example the scheduler has relocated a
process to a processor on a distant node. A batch scheduler or an
administrator can detect the situation and move the pages of the process
nearer to the new processor.
sys_migrate_pages() could be used on non-numa machines as well, to force all
of a particualr process's pages out to swap, if someone thinks that's useful.
Larger installations usually partition the system using cpusets into sections
of nodes. Paul has equipped cpusets with the ability to move pages when a
task is moved to another cpuset. This allows automatic control over locality
of a process. If a task is moved to a new cpuset then also all its pages are
moved with it so that the performance of the process does not sink
dramatically (as is the case today).
Swap migration works by simply evicting the page. The pages must be faulted
back in. The pages are then typically reallocated by the system near the node
where the process is executing.
For swap migration the destination of the move is controlled by the allocation
policy. Cpusets set the allocation policy before calling sys_migrate_pages()
in order to move the pages as intended.
No allocation policy changes are performed for sys_migrate_pages(). This
means that the pages may not faulted in to the specified nodes if no
allocation policy was set by other means. The pages will just end up near the
node where the fault occurred.
There's another patch series in the pipeline which implements "direct
migration".
The direct migration patchset extends the migration functionality to avoid
going through swap. The destination node of the relation is controllable
during the actual moving of pages. The crutch of using the allocation policy
to relocate is not necessary and the pages are moved directly to the target.
Its also faster since swap is not used.
And sys_migrate_pages() can then move pages directly to the specified node.
Implement functions to isolate pages from the LRU and put them back later.
This patch:
An earlier implementation was provided by Hirokazu Takahashi
<taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the
memory hotplug project.
From: Magnus
This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap
migration.
Signed-off-by: Magnus Damm <magnus.damm@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
swap migration's isolate_lru_page() currently uses an IPI to notify other
processors that the lru caches need to be drained if the page cannot be
found on the LRU. The IPI interrupt may interrupt a processor that is just
processing lru requests and cause a race condition.
This patch introduces a new function run_on_each_cpu() that uses the
keventd() to run the LRU draining on each processor. Processors disable
preemption when dealing the LRU caches (these are per processor) and thus
executing LRU draining from another process is safe.
Thanks to Lee Schermerhorn <lee.schermerhorn@hp.com> for finding this race
condition.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As recently there has been lot of traffic on the right values for batch and
high water marks for per_cpu_pagelists. This patch makes these two
variables configurable through /proc interface.
A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry
controls the fraction of pages at most in each zone that are allocated for
each per cpu page list. The min value for this is 8. It means that we
don't allow more than 1/8th of pages in each zone to be allocated in any
single per_cpu_pagelist.
The batch value of each per cpu pagelist is also updated as a result. It
is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to
discard as much pagecache and/or reclaimable slab objects as it can. THis
operation requires root permissions.
It won't drop dirty data, so the user should run `sync' first.
Caveats:
a) Holds inode_lock for exorbitant amounts of time.
b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.
This is a debugging feature: useful for getting consistent results between
filesystem benchmarks. We could possibly put it under a config option, but
it's less than 300 bytes.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__alloc_percpu and alloc_percpu both take an 'align' argument which is
completely ignored. snmp6_mib_init() in net/ipv6/af_inet6.c attempts to use
it, but it will be ignored. Therefore, remove the 'align' argument and fixup
the lone caller.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix compilation with CONFIG_MEMORY_HOTPLUG=y and gcc41.
Also remove unneeded declations, add a public function.
drivers/base/memory.c:53: error: static declaration of 'register_memory_notifier' follows non-static declaration
include/linux/memory.h:85: error: previous declaration of 'register_memory_notifier' was here
drivers/base/memory.c:58: error: static declaration of 'unregister_memory_notifier' follows non-static declaration
include/linux/memory.h:86: error: previous declaration of 'unregister_memory_notifier' was here
drivers/base/memory.c:68: error: static declaration of 'register_memory' follows non-static declaration
include/linux/memory.h:73: error: previous declaration of 'register_memory' was here
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds oprofile support for the 7450 and all its multitudinous
derivatives.
* Added 7450 (and derivatives) support for oprofile
* Changed e500 cputable to have oprofile model and cpu_type fields
* Added support for classic 32-bit performance monitor interrupt
* Cleaned up common powerpc oprofile code to be as common as possible
* Cleaned up oprofile_impl.h to reflect 32 bit classic code
* Added 32-bit MMCRx bitfield definitions and SPR numbers
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes pci_address_to_pio() to return an unsigned long (to be safe)
and fixes a bug in the implementation that caused it to return a bogus
IO port number
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On ppc64, we independently define VMALLOCBASE and VMALLOC_START to be
the same thing: the start of the vmalloc() area at 0xd000000000000000.
VMALLOC_START is used much more widely, including in generic code, so
this patch gets rid of the extraneous VMALLOCBASE.
This does require moving the definitions of region IDs from page_64.h
to pgtable.h, but they don't clearly belong in the former rather than
the latter, anyway. While we're moving them, clean up the definitions
of the REGION_IDs:
- Abolish REGION_SIZE, it was only used once, to define
REGION_MASK anyway
- Define the specific region ids in terms of the REGION_ID()
macro.
- Define KERNEL_REGION_ID in terms of PAGE_OFFSET rather than
KERNELBASE. It amounts to the same thing, but conceptually this is
about the region of the linear mapping (which starts at PAGE_OFFSET)
rather than of the kernel text itself (which is at KERNELBASE).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds some very basic support for the new machines, including the
Quad G5 (tested), and other new dual core based machines and iMac G5
iSight (untested). This is still experimental ! There is no thermal
control yet, there is no proper handing of MSIs, etc.. but it
boots, I have all 4 cores up on my machine. Compared to the previous
version of this patch, this one adds DART IOMMU support for the U4
chipset and thus should work fine on setups with more than 2Gb of RAM.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There is code in the RPAPHP directory that is identical to this routine;
I'll be removing that code in an upcoming patch, but this patch is needed
to expose the function to make it callable.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cleanup the MPIC IO-APIC workarounds, make them a bit more generic,
smaller and faster.
Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pre-parsed addrs/n_addrs fields in struct device_node are finally
gone. Remove the dodgy heuristics that did that parsing at boot and
remove the fields themselves since we now have a good replacement with
the new OF parsing code. This patch also fixes a bunch of drivers to use
the new code instead, so that at least pmac32, pseries, iseries and g5
defconfigs build.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Milton and I were looking at the cputable code and it looks like we can
set spurious bits on 64bit.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We used to print a NUMA cpu summary at boot before the hotplug cpu code
was added. This has been useful for catching machine configuration as
well as firmware bugs in the past.
This patch restores that functionality. An example of the output is:
Node 0 CPUs: 0-7
Node 1 CPUs: 8-15
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch adds support for the TQ Components TQM85xx modules. Currently the
modules TQM8540/8541/8555/8560 are supported.
Signed-off-by: Stefan Roese <sr@denx.de>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch reduces lock complexity of SPU scheduler, particularly
for involuntary preemptive switches. As a result the new code
does a better job of mapping the highest priority tasks to SPUs.
Lock complexity is reduced by using the system default workqueue
to perform involuntary saves. In this way we avoid nasty lock
ordering problems that the previous code had. A "minimum timeslice"
for SPU contexts is also introduced. The intent here is to avoid
thrashing.
While the new scheduler does a better job at prioritization it
still does nothing for fairness.
From: Mark Nutter <mnutter@us.ibm.com>
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch makes it easier to preempt an SPU context by
having the scheduler hold ctx->state_sema for much shorter
periods of time.
As part of this restructuring, the control logic for the "run"
operation is moved from arch/ppc64/kernel/spu_base.c to
fs/spufs/file.c. Of course the base retains "bottom half"
handlers for class{0,1} irqs. The new run loop will re-acquire
an SPU if preempted.
From: Mark Nutter <mnutter@us.ibm.com>
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Implementing the machine_crash_shutdown which will be called by
crash_kexec (called in case of a panic, sysrq etc.). Disable the
interrupts, shootdown cpus using debugger IPI and collect regs
for all CPUs.
elfcorehdr= specifies the location of elf core header stored by
the crashed kernel. This command line option will be passed by
the kexec-tools to capture kernel.
savemaxmem= specifies the actual memory size that the first kernel
has and this value will be used for dumping in the capture kernel.
This command line option will be passed by the kexec-tools to
capture kernel.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a few places where we need to fix things up for the kernel to work
if it's linked at 32MB:
- platforms/powermac/smp.c
To start secondary cpus on pmac we patch the reset vector, which is fine.
Except if we're above 32MB we don't have enough bits for an absolute branch,
it needs to relative.
- kernel/head_64.s
- A few branches in the cpu hold code need to load the full target address
and do a bctr.
- after_prom_start needs to load PHYSICAL_START as the dest address, not 0.
- The exception prolog needs to load the low word of the target adddress,
not just the low halfword.
- Fixup handling of the initial stab address.
- kernel/setup_64.c
smp_release_cpus() needs to write 1 to the spinloop flag near 0, not 32 MB.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Regardless of where the kernel's linked we always get interrupts at low
addresses. This patch creates a trampoline in the first 3 pages of memory,
where interrupts land, and patches those addresses to jump into the real
kernel code at PHYSICAL_START.
We also need to reserve the trampoline code and a bit more in prom.c
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The fwnmi vectors can be anywhere < 32 MB, so we need to use a trampoline
for them. The kdump kernel will register the trampoline addresses, which will
then jump up to the real code above 32 MB.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch adds a Kconfig variable, CONFIG_CRASH_DUMP, which configures the
built kernel for use as a Kdump kernel.
Currently "all" this involves is changing the value of KERNELBASE to 32 MB.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This places dynamically added memory within the appropriate
numa node. A new routine hot_add_scn_to_nid() replicates most of
the memory scanning code in parse_numa_properties().
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch separates usage of KERNELBASE and PAGE_OFFSET. I haven't
looked at any of the PPC32 code, if we ever want to support Kdump on
PPC we'll have to do another audit, ditto for iSeries.
This patch makes PAGE_OFFSET the constant, it'll always be 0xC * 1
gazillion for 64-bit.
To get a physical address from a virtual one you subtract PAGE_OFFSET,
_not_ KERNELBASE.
KERNELBASE is the virtual address of the start of the kernel, it's
often the same as PAGE_OFFSET, but _might not be_.
If you want to know something's offset from the start of the kernel
you should subtract KERNELBASE.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a bunch of code that compares an address with KERNELBASE to see if
it's a "kernel address", ie. >= KERNELBASE. The proper test is actually to
compare with PAGE_OFFSET, since we're going to change KERNELBASE soon.
So replace all of them with an is_kernel_addr() macro that does that.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently machine_crash_shutdown() gets a struct pt_regs, but doesn't pass it
through to the ppc_md function, it should.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This updates the OF address parsers to return the IO flags
indicating the type of address obtained. It also adds a PCI
call for converting physical addresses that hit IO space into
into IO tokens, and add routines that return the translated
addresses into struct resource
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The udbg low level io layer has an issue with udbg_getc() returning a
char (unsigned on ppc) instead of an int, thus the -1 if you had no
available input device could end up turned into 0xff, filling your
display with bogus characters. This fixes it, along with adding a little
blob to xmon to do a delay before exiting when getting an EOF and fixing
the detection of ADB keyboards in udbg_adb.c
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
23-rpaphp-migrate.patch (parts)
This patch moves some pci device add & remove code from the PCI
hotplug directory to the arch/powerpc/kernel directory, and cleans
it up a tad. The primary reason for this is that the code performs
some fairly generic operations that are shared with the PCI error
recovery code (living in the arch/powerpc/kernel directory).
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
22-rpaphp-eliminate-dupe-code.patch (parts)
The RPAPHP code contains two routines that appear to be gratuitous
copies of very similar pci code. In particular,
rpaphp_claim_resource ~~ pci_claim_resource
rpadlpar_claim_one_bus == pcibios_claim_one_bus
This makes pcibios_claim_one_bus from arch/powerpc/kernel/pci_64.c
available to the RPAPHP code.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
20-rpaphp-eeh-cleanup.patch
This patch move some code from the rpaphp directory, to the powerpc
directory, where it should have been all along (Among other things, I
need it in the powerpc directory for the PCI error recovery.)
Please note that patch affects TWO maintainers: Paul, after applying
the powerpc part, please ask that GregKH appli the PCI part. It is safe
to have the powerpc part go in first. It would be bad to have the
PCI part go in first.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch removes several unnecessary fields from the paca:
- next_jiffy_update_tb was simply unused. Remove trivially.
- The exdsi exception save area was not used. There were plans to use
it, but they never seem to have gone anywhere. If they ever do, we
can put it back. Remove from the paca, and from asm-offsets.c
- The default_decr field was used from asm, but was only ever assigned
the value of tb_ticks_per_jiffy. Just access tb_ticks_per_jiffy from
asm directly instead.
Built and booted on POWER5 LPAR and iSeries RS64.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On iSeries, the paca contains, amongst other things an ItLpRegSave
structure used by the hypervisor to save registers. The hypervisor
locates this area through a pointer at the beginning of the paca, so
the structure itself can be located elsewhere. This patch moves the
reg_save area out into its own array. This reduces the amount of
iSeries specific gunk which is visible to general powerpc code via
paca.h
Built and booted on POWER5 LPAR and iSeries RS64.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
ARCH=powerpc couldn't boot from BootX as it uses a "different" way of
getting in the kernel. This patch adds the necessary trampolines,
creating a flattened device-tree from the tree passed from MacOS, and
initializing the btext engine early for really-early debugging.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch unifies udbg for both ppc32 and ppc64 when building the
merged achitecture. xmon now has a single "back end". The powermac udbg
stuff gets enriched with some ADB capabilities and btext output. In
addition, the early_init callback is now called on ppc32 as well,
approx. in the same order as ppc64 regarding device-tree manipulations.
The init sequences of ppc32 and ppc64 are getting closer, I'll unify
them in a later patch.
For now, you can force udbg to the scc using "sccdbg" or to btext using
"btextdbg" on powermacs. I'll implement a cleaner way of forcing udbg
output to something else than the autodetected OF output device in a
later patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This moves the discovery of legacy serial ports to a separate file,
makes it common to ppc32 and ppc64, and reworks it to use the new OF
address translators to get to the ports early. This new version can also
detect some PCI serial cards using legacy chips and will probably match
those discovered port with the default console choice.
Only ppc64 gets udbg still yet, unifying udbg isn't finished yet.
It also adds some speed-probing code to udbg so that the default console
can come up at the same speed it was set to by the firmware.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Parsing addresses extracted from Open Firmware isn't a simple matter. We
have various bits of code that try to do it in various place, including
some heuristics in prom.c that pre-parse addresses at boot and fill
device-nodes "addrs", but those are dodgy at best and I want to
deprecate them. So this patch introduces a new set of routines that
should be capable of parsing most types of addresses and translating
them into CPU physical addresses. It currently works for things on PCI
busses and ISA busses and should work on "standard" busses like the root
bus or the MacIO bus that don't put funky flags in addresses. If you
have other bus types that do use funky flags, you'll have to add new bus
type translators, which is fairly easy.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
include/asm-ppc/bseip.h is a duplicate of arch/ppc/platforms/bseip.h
and is not referenced anywhere, so get rid of it. Pointed out by
Marcelo Tosatti.
Signed-off-by: Paul Mackerras <paulus@samba.org>
A previous patch ended up not increasing __NR_syscalls to account
for the new SPU syscalls (probably my fault).
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a scheduler for SPUs to make it possible to use
more logical SPUs than physical ones are present in the
system.
Currently, there is no support for preempting a running
SPU thread, they have to leave the SPU by either triggering
an event on the SPU that causes it to return to the
owning thread or by sending a signal to it.
This patch also adds operations that enable accessing an SPU
in either runnable or saved state. We use an RW semaphore
to protect the state of the SPU from changing underneath
us, while we are holding it readable. In order to change
the state, it is acquired writeable and a context save
or restore is executed before downgrading the semaphore
to read-only.
From: Mark Nutter <mnutter@us.ibm.com>,
Uli Weigand <Ulrich.Weigand@de.ibm.com>
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the code needed to perform a context switch from
spufs, following the recommended 76-step sequence.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add some infrastructure for saving and restoring the context of an
SPE. This patch creates a new structure that can hold the whole
state of a physical SPE in memory. It also contains code that
avoids races during the context switch and the binary code that
is loaded to the SPU in order to access its registers.
The actual PPE- and SPE-side context switch code are two separate
patches.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch adds the necessary core bus support used by device drivers
that sit on the IBM GX bus on modern pSeries machines like the Galaxy
infiniband for example. It provide transparent DMA ops (the low level
driver works with virtual addresses directly) along with a simple bus
layer using the Open Firmware matching routines.
Signed-off-by: Heiko J Schick <schickhj@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This cleanup patch speeds up the null syscall path on ppc64 by about 3%,
and brings the ppc32 and ppc64 code slightly closer together.
The ppc64 code was checking current_thread_info()->flags twice in the
syscall exit path; once for TIF_SYSCALL_T_OR_A before disabling
interrupts, and then again for TIF_SIGPENDING|TIF_NEED_RESCHED etc after
disabling interrupts. Now we do the same as ppc32 -- check the flags
only once in the fast path, and re-enable interrupts if necessary in the
ptrace case.
The patch abolishes the 'syscall_noerror' member of struct thread_info
and replaces it with a TIF_NOERROR bit in the flags, which is handled in
the slow path. This shortens the syscall entry code, which no longer
needs to clear syscall_noerror.
The patch adds a TIF_SAVE_NVGPRS flag which causes the syscall exit slow
path to save the non-volatile GPRs into a signal frame. This removes the
need for the assembly wrappers around sys_sigsuspend(),
sys_rt_sigsuspend(), et al which existed solely to save those registers
in advance. It also means I don't have to add new wrappers for ppoll()
and pselect(), which is what I was supposed to be doing when I got
distracted into this...
Finally, it unifies the ppc64 and ppc32 methods of handling syscall exit
directly into a signal handler (as required by sigsuspend et al) by
introducing a TIF_RESTOREALL flag which causes _all_ the registers to be
reloaded from the pt_regs by taking the ret_from_exception path, instead
of the normal syscall exit path which stomps on the callee-saved GPRs.
It appears to pass an LTP test run on ppc64, and passes basic testing on
ppc32 too. Brief tests of ptrace functionality with strace and gdb also
appear OK. I wouldn't send it to Linus for 2.6.15 just yet though :)
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Moved 83xx and QUICC Engine interrupt handling code into arch/powerpc
as a precursor of getting 83xx sub-arch building in arch/powerpc.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch merges, to some extent, the PPC32 and PPC64 kexec implementations.
We adopt the PPC32 approach of having ppc_md callbacks for the kexec functions.
The current PPC64 implementation becomes the "default" implementation for PPC64
which platforms can select if they need no special treatment.
I've added these default callbacks to pseries/maple/cell/powermac, this means
iSeries no longer supports kexec - but it never worked anyway.
I've renamed PPC32's machine_kexec_simple to default_machine_kexec, inline with
PPC64. Judging by the comments it might be better named machine_kexec_non_of,
or something, but at the moment it's the only implementation for PPC32 so it's
the "default".
Kexec requires machine_shutdown(), which is in machine_kexec.c on PPC32, but we
already have in setup-common.c on powerpc. All this does is call
ppc_md.nvram_sync, which only powermac implements, so instead make
machine_shutdown a ppc_md member and have it call core99_nvram_sync directly
on powermac.
I've also stuck relocate_kernel.S into misc_32.S for powerpc.
Built for ARCH=ppc, and 32 & 64 bit ARCH=powerpc, with KEXEC=y/n. Booted on
P5 LPAR and successfully kexec'ed.
Should apply on top of 493f25ef40.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch removes the EXPORT_SYMBOL'ed but completely unused variable
ucSystemType and removes the unneeded EXPORT_SYMBOL(_prep_type).
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Tom Rini <trini@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since there's no longer any external user of ip_fragment() we can make
it static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle NAT of decapsulated IPsec packets by reconstructing the struct flowi
of the original packet from the conntrack information for IPsec policy
checks.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NAT changes the key used for the xfrm lookup it needs to be done
again. If a new policy is returned in POST_ROUTING the packet needs
to be passed to xfrm4_output_one manually after all hooks were called
because POST_ROUTING is called with fixed okfn (ip_finish_output).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_me_harder doesn't use the port numbers of the xfrm lookup and
uses ip_route_input for non-local addresses which doesn't do a xfrm
lookup, ip6_route_me_harder doesn't do a xfrm lookup at all.
Use xfrm_decode_session and do the lookup manually, make sure both
only do the lookup if the packet hasn't been transformed already.
Makeing sure the lookup only happens once needs a new field in the
IP6CB, which exceeds the size of skb->cb. The size of skb->cb is
increased to 48b. Apparently the IPv6 mobile extensions need some
more room anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reset IPSKB_XFRM_TUNNEL_SIZE flags in ipip and ip_gre hard_start_xmit
function before the packet reenters IP. This is neccessary so the
encapsulated packets are checked not to be oversized in xfrm4_output.c
again. Reset all flags in sit when a packet changes its address family.
Also remove some obsolete IPSKB flags.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the innermost transform uses transport mode the decapsulated packet
is not visible to netfilter. Pass the packet through the PRE_ROUTING and
LOCAL_IN hooks again before handing it to upper layer protocols to make
netfilter-visibility symetrical to the output path.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move nextheader offset to the IP6CB to make it possible to pass a
packet to ip6_input_finish multiple times and have it skip already
parsed headers. As a nice side effect this gets rid of the manual
hopopts skipping in ip6_input_finish.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call netfilter hooks before IPsec transforms. Packets visit the
FORWARD/LOCAL_OUT and POST_ROUTING hook before the first encapsulation
and the LOCAL_OUT and POST_ROUTING hook before each following tunnel mode
transform.
Patch from Herbert Xu <herbert@gondor.apana.org.au>:
Move the loop from dst_output into xfrm4_output/xfrm6_output since they're
the only ones who need to it. xfrm{4,6}_output_one() processes the first SA
all subsequent transport mode SAs and is called in a loop that calls the
netfilter hooks between each two calls.
In order to avoid the tail call issue, I've added the inline function
nf_hook which is nf_hook_slow plus the empty list check.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the ARM AMBA bus is used on MIPS as well as ARM, we need
to make the bus available for other architectures to use. Move
the AMBA include files from include/asm-arm/hardware/ to
include/linux/amba/
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Patch from Andre McCurdy
Replaces generic swab32 routine with a more ARM friendly version.
Reduces kernel text size by approx 1200 bytes when compiled with
3.4.4 and approx 2400 bytes with 4.0.2
Probably some performance benefit as well.
Signed-off-by: Andre McCurdy <armccurdy@yahoo.co.uk>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
It should return an unsigned value, and fix sk_filter() as well.
Signed-off-by: Kris Katterjohn <kjak@ispwest.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when kbuild passes KBUILD_MODNAME with "" do not __stringify it when
used. Remove __stringnify for all users.
This also fixes the output of:
$ ls -l /sys/module/
drwxr-xr-x 4 root root 0 2006-01-05 14:24 pcmcia
drwxr-xr-x 4 root root 0 2006-01-05 14:24 pcmcia_core
drwxr-xr-x 3 root root 0 2006-01-05 14:24 "processor"
drwxr-xr-x 3 root root 0 2006-01-05 14:24 "psmouse"
The quoting of the module names will be gone again.
Thanks to GregKH + Kay Sievers for reproting this.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Also update the tokenlen calculations to accomodate g_token_size().
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If someone changes the uid/gid mapping in userland, then we do eventually
want those changes to be propagated to the kernel. Currently the kernel
assumes that it may cache entries forever.
Add an expiration time + garbage collector for idmap entries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server decides to close the RPC socket, we currently don't actually
respond until either another RPC call is scheduled, or until xprt_autoclose()
gets called by the socket expiry timer (which may be up to 5 minutes
later).
This patch ensures that xprt_autoclose() is called much sooner if the
server closes the socket.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
At some point, transport endpoint addresses will no longer be IPv4. To hide
the structure of the rpc_xprt's address field from ULPs and port mappers,
add an API for setting the port number during an RPC bind operation.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We'd like to hide fields in rpc_xprt and rpc_clnt from upper layer protocols.
Start by creating an API to force RPC rebind, replacing logic that simply
sets cl_port to zero.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP. NFSv2/3 and NFSv4 mounting should be carefully checked.
Probably need to rig a server where certain services aren't running, or
that returns an error for some typical operation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add RPC client transport switch support for replacing buffer management
on a per-transport basis.
In the current IPv4 socket transport implementation, RPC buffers are
allocated as needed for each RPC message that is sent. Some transport
implementations may choose to use pre-allocated buffers for encoding,
sending, receiving, and unmarshalling RPC messages, however. For
transports capable of direct data placement, the buffers can be carved
out of a pre-registered area of memory rather than from a slab cache.
Test-plan:
Millions of fsx operations. Performance characterization with "sio" and
"iozone". Use oprofile and other tools to look for significant regression
in CPU utilization.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server receives an NLM cancel call and finds no waiting lock to
cancel, then chances are the lock has already been applied, and the client
just hadn't yet processed the NLM granted callback before it sent the
cancel.
The Open Group text, for example, perimts a server to return either success
(LCK_GRANTED) or failure (LCK_DENIED) in this case. But returning an error
seems more helpful; the client may be able to use it to recognize that a
race has occurred and to recover from the race.
So, modify the relevant functions to return an error in this case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch removes ths unused function xdr_decode_string().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Charles Lever <Charles.Lever@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.
Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.
Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Minor cleanup: inlined bit ops in nfs_page.h can be simpler.
Test plan:
Write-intensive workload against a server that requires COMMITs.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.
The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.
Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS needs to be able to distinguish between single-page ->writepage() calls and
multipage ->writepages() calls.
For the single-page writepage calls NFS can kick off the I/O within the
context of ->writepage().
For multipage ->writepages calls, nfs_writepage() will leave the I/O pending
and nfs_writepages() will kick off the I/O when it all has been queued up
within NFS.
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds suspend patch to libata, and ata_piix in particular. For
most low level drivers, they should just need to add the 4 hooks to
work. As I can only test ata_piix, I didn't enable it for more
though.
Suspend support is the single most important feature on a notebook, and
most new notebooks have sata drives. It's quite embarrassing that we
_still_ do not support this. Right now, it's perfectly possible to
suspend the drive in mid-transfer.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Also export current (average) speed and status in sysfs.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Store this total in superblock (As appropriate), and make it available to
userspace via sysfs.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
.. because they aren't used outside md.c
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md sometimes call put_page on NULL pointers (treating it like kfree). This is
not safe, so define and use a 'safe_put_page' which checks for NULL.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).
These personalities have fairly artificial 'numbers'. The numbers
are use to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements are particular
personality.
Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.
The current 'raid5' modules support two level (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.
With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.
This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- replace open-coded hash chain with hlist macros
- Fix hash-table size at one page - it is already quite generous, so there
will never be a need to use multiple pages, so no need for __get_free_pages
No functional change.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add in correct read-error handling for resync and read-only situations.
When read-only, we don't over-write, so we need to mark the failed drive in
the r10_bio so we don't re-try it. During resync, we always read all blocks,
so if there is a read error, we simply over-write it with the good block that
we found (assuming we found one).
Note that the recovery case still isn't handled in an interesting way. There
is nothing useful to do for the 2-copies case. If there are 3 or more copies,
then we could try reading from one of the non-missing copies, but this is a
bit complicated and very rarely would be used, so I'm leaving it for now.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Largely just a cross-port from raid1.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is this "FIXME" comment with a typo in it!! that been annoying me for
days, so I just had to remove it.
conf->disks[i].rdev should only be accessed if
- we know we hold a reference or
- the mddev->reconfig_sem is down or
- we have a rcu_readlock
handle_stripe was referencing rdev in three places without any of these. For
the first two, get an rcu_readlock. For the last, the same access
(md_sync_acct call) is made a little later after the rdev has been claimed
under and rcu_readlock, if R5_Syncio is set. So just use that access...
However R5_Syncio isn't really needed as the 'syncing' variable contains the
same information. So use that instead.
Issues, comment, and fix are identical in raid5 and raid6.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On a read-error we suspend the array, then synchronously read the block from
other arrays until we find one where we can read it. Then we try writing the
good data back everywhere and make sure it works. If any write or subsequent
read fails, only then do we fail the device out of the array.
To be able to suspend the array, we need to also keep track of how many
requests are queued for handling by raid1d.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
raid6 currently does not check the P/Q syndromes when doing a resync, it just
calculates the correct value and writes it. Doing the check can reduce writes
(often to 0) for a resync, and it is needed to properly implement the
echo check > sync_action
operation.
This patch implements the appropriate checks and tidies up some related code.
It also allows raid6 user-requested resync to bypass the intent bitmap.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
See patch to md.txt for more details
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
raid10 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.
This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
raid1 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.
This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add ioctl DM_SKIP_LOCKFS_FLAG for userspace to request that lock_fs is
bypassed when suspending a device.
There's no change to the behaviour of existing code that doesn't know about
the new flag.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Both vfs_getattr and i_op->fsync return error statuses which nfsd was
largely ignoring. This as noticed when exporting directories using fuse.
This patch cleans up most of the offences, which involves moving the call
to vfs_getattr out of the xdr encoding routines (where it is too late to
report an error) into the main NFS procedure handling routines.
There is still a called to vfs_gettattr (related to the ACL code) where the
status is ignored, and called to nfsd_sync_dir don't check return status
either.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Split the checkpoint list of the transaction into two lists. In the first
list we keep the buffers that need to be submitted for IO. In the second
list are kept buffers that were already submitted and we just have to wait
for the IO to complete. This should simplify a handling of checkpoint
lists a bit and can eventually be also a performance gain.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
"extern inline" doesn't make much sense.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add missing "struct" keyword preventing compilation with DEBUG_PARPORT
defined. Also add some "const".
Signed-off-by: Marko Kohtala <marko.kohtala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Did not move the parport interface properly into IEEE1284_PH_REV_IDLE phase at
end of data due to comparing bytes with nibbles. Internal phase
IEEE1284_PH_HBUSY_DNA became unused, so remove it.
Signed-off-by: Marko Kohtala <marko.kohtala@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the maximum size of write data configurable by the filesystem. The
previous fixed 4096 limit only worked on architectures where the page size is
less or equal to this. This change make writing work on other architectures
too, and also lets the filesystem receive bigger write requests in direct_io
mode.
Normal writes which go through the page cache are still limited to a page
sized chunk per request.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the way a too large request is handled. Until now in this case the
device read returned -EINVAL and the operation returned -EIO.
Make it more flexibible by not returning -EINVAL from the read, but restarting
it instead.
Also remove the fixed limit on setxattr data and let the filesystem provide as
large a read buffer as it needs to handle the extended attribute data.
The symbolic link length is already checked by VFS to be less than PATH_MAX,
so the extra check against FUSE_SYMLINK_MAX is not needed.
The check in fuse_create_open() against FUSE_NAME_MAX is not needed, since the
dentry has already been looked up, and hence the name already checked.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add 'frsize' member to the statfs reply.
I'm not sure if sending f_fsid will ever be needed, but just in case leave
some space at the end of the structure, so less compatibility mess would be
required.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change interface version to 7.4.
Following changes will need backward compatibility support, so store the minor
version returned by userspace.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Removed some kmalloc's with __GFP_ZERO and replace it with memset()
because it didn't work properly.
- Fixed returned message frame in i2o_cfg_passthru() which caused raidutils
to display wrong error message in case a disk was missing.
- Fixed size of printk() in i2o_scsi.c.
- Fixed get_device() and put_device() in probing of the I2O controller.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Removed wrong I2O device class, which was only needed to add sysfs attributes.
Signed-off-by: Markus Lidel <Markus.Lidel@shadowconnect.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changed the I2O API to create I2O messages first in kernel memory and then
transfer it at once over the PCI bus instead of sending each quad-word over
the PCI bus.
Signed-off-by: Markus Lidel <Markus.Lidel@shadowconnect.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sanitize some s390 Kconfig options. We have ARCH_S390, ARCH_S390X,
ARCH_S390_31, 64BIT, S390_SUPPORT and COMPAT. Replace these 6 options by
S390, 64BIT and COMPAT.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
New feature V=V qdio pass-through.
QDIO and HiperSockets processing in z/VM V=V guest environments (as well as
V=R with z/VM running in LPAR mode) requires shadowing of all QDIO
architecture queue elements. Especially the shadowing of SBALs and SLSBs
structures in the hypervisor, and the need to issue SIGA SYNC operations to
observe state changes, eventually causes significant CPU processing overhead
in the hypervisor.
The QDIO pass-through support for V=V guests avoids the shadowing of SBALs and
SLSBs. This significantly reduces the hypervisor overhead for QDIO based I/O.
Signed-off-by: Frank Pavlic <pavlic@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extract the s390_root_dev_* functions from the common I/O layer as they are
also used by non-ccw device drivers.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert __access_ok to an inline C function and change __get_user primitive to
avoid uaccess compiler warnings.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Moved definition of CMS volume label to vtoc.h and modify partitions/ibm.c to
use this volume label definition instead of anonymous array.
Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins <hugh@veritas.com>
Fix the broken atomic_cmpxchg primitive. Add atomic_sub_and_test,
atomic64_sub_return, atomic64_sub_and_test, atomic64_cmpxchg,
atomic64_add_unless and atomic64_inc_not_zero. Replace old style
atomic_compare_and_swap by atomic_cmpxchg. Shorten the whole header by
defining most primitives with the two inline functions atomic_add_return and
atomic_sub_return.
In addition this patch contains the s390 related fixes of Hugh's "mm: fill
arch atomic64 gaps" patch.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
m68k, m68knommu and h8300 define this, but it's not actually used
anywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mach_enable_irq/mach_disable_irq are never actually set, so let's remove
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify _port2addr*() routines in arch/m32r/kernel/io_*.c to use
NONCACHE_OFFSET instead of hard-coding of a constant address.
This modification is also required to support an M3A-ZA36 FPGA eva board in
case an MMU-less synthesizable m32r core is used.
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is for updating m32r's MMU-less support.
Some legacy MMU-less m32r chips cannot return from a trap handler to the
right-hand side 16-bit halfword code of a 32-bit instrucion code pair, because
a "trap" instruction specification was expanded in M32R-II ISA.
This modification forces "trap" instructions to be placed in word alignment
location with a parallel "nop" code.
Signed-off-by: Kazuhiro Inaoka <inaoka@linux-m32r.org>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is for supporting a new target platform, Renesas M32104UT
evaluation board.
The M32104UT is an eval board based on an uT-Engine specification. This board
has an MMU-less M32R family processor, M32104.
http://www-wa0.personal-media.co.jp/pmc/archive/te/te_m32104_e.pdf
This board is one of the most popular M32R platform, so we have ported
Linux/M32R to it.
Signed-off-by: Naoto Sugai <Sugai.Naoto@ak.MitsubishiElectric.co.jp>
Signed-off-by: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>