Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d1953c8888 removed the use of
4level-fixup.h for 32-bit systems under arch/powerpc. However, I
missed a few things activated on some configurations, resulting in
some warnings (at least with STRICT_MM_TYPECHECKS enabled) and build
errors in some circumstances. This fixes it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The SLUB allocator relies on struct page fields first_page and slab,
overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
be used for the lowest level of pagetable pages. This was obstructing
SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert
its pte level to use normal gfp pages (whereas pmd, pud and 64k-page pgd
want partpages, so continue to use kmem_caches for pmd, pud and pgd).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the ability for a kernel compiled with 4K page size
to have special slices containing 64K pages and hash the right type
of hash PTEs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The basic issue is to be able to do what hugetlbfs does but with
different page sizes for some other special filesystems; more
specifically, my need is:
- Huge pages
- SPE local store mappings using 64K pages on a 4K base page size
kernel on Cell
- Some special 4K segments in 64K-page kernels for mapping a dodgy
type of powerpc-specific infiniband hardware that requires 4K MMU
mappings for various reasons I won't explain here.
The main issues are:
- To maintain/keep track of the page size per "segment" (as we can
only have one page size per segment on powerpc, which are 256MB
divisions of the address space).
- To make sure special mappings stay within their allotted
"segments" (including MAP_FIXED crap)
- To make sure everybody else doesn't mmap/brk/grow_stack into a
"segment" that is used for a special mapping
Some of the necessary mechanisms to handle that were present in the
hugetlbfs code, but mostly in ways not suitable for anything else.
The patch relies on some changes to the generic get_unmapped_area()
that just got merged. It still hijacks hugetlb callbacks here or
there as the generic code hasn't been entirely cleaned up yet but
that shouldn't be a problem.
So what is a slice ? Well, I re-used the mechanism used formerly by our
hugetlbfs implementation which divides the address space in
"meta-segments" which I called "slices". The division is done using
256MB slices below 4G, and 1T slices above. Thus the address space is
divided currently into 16 "low" slices and 16 "high" slices. (Special
case: high slice 0 is the area between 4G and 1T).
Doing so simplifies significantly the tracking of segments and avoids
having to keep track of all the 256MB segments in the address space.
While I used the "concepts" of hugetlbfs, I mostly re-implemented
everything in a more generic way and "ported" hugetlbfs to it.
Slices can have an associated page size, which is encoded in the mmu
context and used by the SLB miss handler to set the segment sizes. The
hash code currently doesn't care, it has a specific check for hugepages,
though I might add a mechanism to provide per-slice hash mapping
functions in the future.
The slice code provide a pair of "generic" get_unmapped_area() (bottomup
and topdown) functions that should work with any slice size. There is
some trickiness here so I would appreciate people to have a look at the
implementation of these and let me know if I got something wrong.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code for demoting segments to 4K had some issues, like for example,
when using _PAGE_4K_PFN flag, the first CPU to hit it would do the
demotion, but other CPUs hitting the same page wouldn't properly flush
their SLBs if mmu_ci_restriction isn't set. There are also potential
issues with hash_preload not handling _PAGE_4K_PFN. All of these are
non issues on current hardware but might bite us in the future.
This patch thus fixes it by:
- Taking the test comparing the mm and current CPU context page
sizes to decide to flush SLBs out of the mmu_ci_restrictions test
since that can also be triggered by _PAGE_4K_PFN pages
- Due to the above being done all the time, demote_segment_4k
doesn't need update the context and flush the SLB
- demote_segment_4k can be static and doesn't need an EXPORT_SYMBOL
- Making hash_preload ignore anything that has either _PAGE_4K_PFN
or _PAGE_NO_CACHE set, thus avoiding duplication of the complicated
logic in hash_page() (and possibly making hash_preload a little bit
faster for the normal case).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On some powerpc architectures (notably 64-bit powermac) there is a memory
hole, for example on powermacs between 2G and 4G. Since we use the flat
memory model regardless, these pages must be marked as nosave (for suspend
to disk.)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the die notifier handling to common code. Previous
various architectures had exactly the same code for it. Note that the new
code is compiled unconditionally, this should be understood as an appel to
the other architecture maintainer to implement support for it aswell (aka
sprinkling a notify_die or two in the proper place)
arm had a notifiy_die that did something totally different, I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used at. avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use SLAB_PANIC and delete duplicated panic().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Ian Molton <spyro@f2s.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For 32-bit systems, powerpc still relies on the 4level-fixup.h hack,
to pretend that the generic pagetable handling stuff is 3-levels
rather than 4. This patch removes this, instead using the newer
pgtable-nopmd.h to handle the elision of both the pud and pmd
pagetable levels (ppc32 pagetables are actually 2 levels).
This removes a little extraneous code, and makes it more easily
compared to the 64-bit pagetable code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
reserve_mem() and stabs_alloc() are both called only from other __init
routines, so can be marked __init.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current get_unmapped_area code calls the f_ops->get_unmapped_area or the
arch one (via the mm) only when MAP_FIXED is not passed. That makes it
impossible for archs to impose proper constraints on regions of the virtual
address space. To work around that, get_unmapped_area() then calls some
hugetlbfs specific hacks.
This cause several problems, among others:
- It makes it impossible for a driver or filesystem to do the same thing
that hugetlbfs does (for example, to allow a driver to use larger page sizes
to map external hardware) if that requires applying a constraint on the
addresses (constraining that mapping in certain regions and other mappings
out of those regions).
- Some archs like arm, mips, sparc, sparc64, sh and sh64 already want
MAP_FIXED to be passed down in order to deal with aliasing issues. The code
is there to handle it... but is never called.
This series of patches moves the logic to handle MAP_FIXED down to the various
arch/driver get_unmapped_area() implementations, and then changes the generic
code to always call them. The hugetlbfs hacks then disappear from the generic
code.
Since I need to do some special 64K pages mappings for SPEs on cell, I need to
work around the first problem at least. I have further patches thus
implementing a "slices" layer that handles multiple page sizes through slices
of the address space for use by hugetlbfs, the SPE code, and possibly others,
but it requires that serie of patches first/
There is still a potential (but not practical) issue due to the fact that
filesystems/drivers implemeting g_u_a will effectively bypass all arch checks.
This is not an issue in practice as the only filesystems/drivers using that
hook are doing so for arch specific purposes in the first place.
There is also a problem with mremap that will completely bypass all arch
checks. I'll try to address that separately, I'm not 100% certain yet how,
possibly by making it not work when the vma has a file whose f_ops has a
get_unmapped_area callback, and by making it use is_hugepage_only_range()
before expanding into a new area.
Also, I want to turn is_hugepage_only_range() into a more generic
is_normal_page_range() as that's really what it will end up meaning when used
in stack grow, brk grow and mremap.
None of the above "issues" however are introduced by this patch, they are
already there, so I think the patch can go ini for 2.6.22.
This patch:
Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3
implementations of it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not necessary to tell the slab allocators to align to a cacheline
if an explicit alignment was already specified. It is rather confusing
to specify multiple alignments.
Make sure that the call sites only use one form of alignment.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch was recently posted to lkml and acked by Pekka.
The flag SLAB_MUST_HWCACHE_ALIGN is
1. Never checked by SLAB at all.
2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB
3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.
The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.
The flag is confusing, inconsistent and has no purpose.
Remove it.
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a couple of kexec problems related to 64K page
support in the kernel. kexec issues a tlbie for each pte. The
parameters for the tlbie are the page size and the virtual address.
Support was missing for the computation of these two parameters
for 64K pages. This adds that support.
Signed-off-by: Luke Browning <lukebrowning@us.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Call the kprobes pagefault handler directly instead of going through
the complex notifier chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch takes the definitions for the PPC44x MMU (a software loaded
TLB) from asm-ppc/mmu.h, cleans them up of things no longer necessary
in arch/powerpc and puts them in a new asm-powerpc/mmu_44x.h file. It
also substantially simplifies arch/powerpc/mm/44x_mmu.c and makes a
couple of small fixes necessary for the 44x MMU code to build and work
properly in arch/powerpc.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
arch/powerpc/mm/mem.c exports page_is_ram, which is not used anywhere
that could be modular.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
32-bit powerpc systems define a macro, PHYS_FMT, giving a printf
format string fragment for displaying physical addresses, since most
32-bit powerpc platforms use 32-bit physical addresses but a few use
64-bit physical addresses.
This macro is used in exactly one place, a rare error message, where
we can solve the problem more simply by just unconditionally casting
the address up to 64-bit quantity before formatting it.
This patch does so, meaning that as we bring MMU definitions from
asm-ppc over to asm-powerpc, cleaning them up in the process, we don't
need to implement this ugly macro (which additionally has a very bad
name for something global).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
BenH's commit a741e67969 in powerpc.git,
although (AFAICT) only intended to affect ppc64, also has side-effects
which break 44x. I think 40x, 8xx and Freescale Book E are also
affected, though I haven't tested them.
The problem lies in unconditionally removing flush_tlb_pending() from
the versions of flush_tlb_mm(), flush_tlb_range() and
flush_tlb_kernel_range() used on ppc64 - which are also used the
embedded platforms mentioned above.
The patch below cleans up the convoluted #ifdef logic in tlbflush.h,
in the process restoring the necessary flushes for the software TLB
platforms. There are three sets of definitions for the flushing
hooks: the software TLB versions (revised to avoid using names which
appear to related to TLB batching), the 32-bit hash based versions
(external functions) amd the 64-bit hash based versions (which
implement batching).
It also moves the declaration of update_mmu_cache() to always be in
tlbflush.h (previously it was in tlbflush.h except for PPC64, where it
was in pgtable.h).
Booted on Ebony (440GP) and compiled for 64-bit and 32-bit
multiplatform.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Here's an implementation of DEBUG_PAGEALLOC for 64 bits powerpc.
It applies on top of the 32 bits patch.
Unlike Anton's previous attempt, I'm not using updatepp. I'm removing
the hash entries from the bolted mapping (using a map in RAM of all the
slots). Expensive but it doesn't really matter, does it ? :-)
Memory hot-added doesn't benefit from this unless it's added at an
address that is below end_of_DRAM() as calculated at boot time.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/Kconfig.debug | 2
arch/powerpc/mm/hash_utils_64.c | 84 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 82 insertions(+), 4 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Here's an implementation of DEBUG_PAGEALLOC for ppc32. It disables BAT
mapping and is only tested with Hash table based processor though it
shouldn't be too hard to adapt it to others.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/Kconfig.debug | 9 ++++++
arch/powerpc/mm/init_32.c | 4 +++
arch/powerpc/mm/pgtable_32.c | 52 +++++++++++++++++++++++++++++++++++++++
arch/powerpc/mm/ppc_mmu_32.c | 4 ++-
include/asm-powerpc/cacheflush.h | 6 ++++
5 files changed, 74 insertions(+), 1 deletion(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
On hash table based 32 bits powerpc's, the hash management code runs with
a big spinlock. It's thus important that it never causes itself a hash
fault. That code is generally safe (it does memory accesses in real mode
among other things) with the exception of the actual access to the code
itself. That is, the kernel text needs to be accessible without taking
a hash miss exceptions.
This is currently guaranteed by having a BAT register mapping part of the
linear mapping permanently, which includes the kernel text. But this is
not true if using the "nobats" kernel command line option (which can be
useful for debugging) and will not be true when using DEBUG_PAGEALLOC
implemented in a subsequent patch.
This patch fixes this by pre-faulting in the hash table pages that hit
the kernel text, and making sure we never evict such a page under hash
pressure.
Signed-off-by: Benjamin Herrenchmidt <benh@kernel.crashing.org>
arch/powerpc/mm/hash_low_32.S | 22 ++++++++++++++++++++--
arch/powerpc/mm/mem.c | 3 ---
arch/powerpc/mm/mmu_decl.h | 4 ++++
arch/powerpc/mm/pgtable_32.c | 11 +++++++----
4 files changed, 31 insertions(+), 9 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
The 32 bits map_page() function is used internally by the mm code
for early mmu mappings and for ioremap. It should never be called
for an address that already has a valid PTE or hash entry, so we
add a BUG_ON for that and remove the useless flush_HPTE call.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/mm/pgtable_32.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current tlb flush code on powerpc 64 bits has a subtle race since we
lost the page table lock due to the possible faulting in of new PTEs
after a previous one has been removed but before the corresponding hash
entry has been evicted, which can leads to all sort of fatal problems.
This patch reworks the batch code completely. It doesn't use the mmu_gather
stuff anymore. Instead, we use the lazy mmu hooks that were added by the
paravirt code. They have the nice property that the enter/leave lazy mmu
mode pair is always fully contained by the PTE lock for a given range
of PTEs. Thus we can guarantee that all batches are flushed on a given
CPU before it drops that lock.
We also generalize batching for any PTE update that require a flush.
Batching is now enabled on a CPU by arch_enter_lazy_mmu_mode() and
disabled by arch_leave_lazy_mmu_mode(). The code epects that this is
always contained within a PTE lock section so no preemption can happen
and no PTE insertion in that range from another CPU. When batching
is enabled on a CPU, every PTE updates that need a hash flush will
use the batch for that flush.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some drivers have resources that they want to be able to map into
userspace that are 4k in size. On a kernel configured with 64k pages
we currently end up mapping the 4k we want plus another 60k of
physical address space, which could contain anything. This can
introduce security problems, for example in the case of an infiniband
adaptor where the other 60k could contain registers that some other
program is using for its communications.
This patch adds a new function, remap_4k_pfn, which drivers can use to
map a single 4k page to userspace regardless of whether the kernel is
using a 4k or a 64k page size. Like remap_pfn_range, it would
typically be called in a driver's mmap function. It only maps a
single 4k page, which on a 64k page kernel appears replicated 16 times
throughout a 64k page. On a 4k page kernel it reduces to a call to
remap_pfn_range.
The way this works on a 64k kernel is that a new bit, _PAGE_4K_PFN,
gets set on the linux PTE. This alters the way that __hash_page_4K
computes the real address to put in the HPTE. The RPN field of the
linux PTE becomes the 4k RPN directly rather than being interpreted as
a 64k RPN. Since the RPN field is 32 bits, this means that physical
addresses being mapped with remap_4k_pfn have to be below 2^44,
i.e. 0x100000000000.
The patch also factors out the code in arch/powerpc/mm/hash_utils_64.c
that deals with demoting a process to use 4k pages into one function
that gets called in the various different places where we need to do
that. There were some discrepancies between exactly what was done in
the various places, such as a call to spu_flush_all_slbs in one case
but not in others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is more consistent and gets us closer to the Sparc code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is more consistent and gets us closer to the Sparc code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The SPU code doesn't properly invalidate SPUs SLBs when necessary,
for example when changing a segment size from the hugetlbfs code. In
addition, it saves and restores the SLB content on context switches
which makes it harder to properly handle those invalidations.
This patch removes the saving & restoring for now, something more
efficient might be found later on. It also adds a spu_flush_all_slbs(mm)
that can be used by the core mm code to flush the SLBs of all SPEs that
are running a given mm at the time of the flush.
In order to do that, it adds a spinlock to the list of all SPEs and move
some bits & pieces from spufs to spu_base.c
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
At present calling lmb_reserve() (and hence lmb_add_region()) twice
for exactly the same memory region will cause strange behaviour.
This makes life difficult when booting from a flat device tree with
memory reserve map. Which regions are automatically reserved by the
kernel has changed over time, so it's quite possible a newer kernel
could attempt to auto-reserve a region which is also explicitly listed
in the device tree's reserve map, leading to trouble.
This patch avoids the problem by making lmb_reserve() ignore a call to
reserve a previously reserved region. It also removes a now redundant
test designed to avoid one specific case of the problem noted above.
At present, this patch deals only with duplicate reservations of an
identical region. Attempting to reserve two different, but
overlapping regions will still cause problems. I might post another
patch later dealing with this case, but I'm avoiding it now since it
is substantially more complicated to deal with, less likely to occur
and more likely to indicate a genuine bug elsewhere if it does occur.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
Documentation/kernel-docs.txt update.
arch/cris: typo in KERN_INFO
Storage class should be before const qualifier
kernel/printk.c: comment fix
update I/O sched Kconfig help texts - CFQ is now default, not AS.
Remove duplicate listing of Cris arch from README
kbuild: more doc. cleanups
doc: make doc. for maxcpus= more visible
drivers/net/eexpress.c: remove duplicate comment
add a help text for BLK_DEV_GENERIC
correct a dead URL in the IP_MULTICAST help text
fix the BAYCOM_SER_HDX help text
fix SCSI_SCAN_ASYNC help text
trivial documentation patch for platform.txt
Fix typos concerning hierarchy
Fix comment typo "spin_lock_irqrestore".
Fix misspellings of "agressive".
drivers/scsi/a100u2w.c: trivial typo patch
Correct trivial typo in log2.h.
Remove useless FIND_FIRST_BIT() macro from cardbus.c.
...
The code for bolting hash entries for ioremap done before proper
mm initialization has a grown a bug when using 64K pages on a
machine where non-cacheable mappings are demoted to 4K HW pages.
The wrong page size index is being passed to the hash table mapping
functions causing a crash at boot on some pSeries machines using
bare metal linux. This fixes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The recent vDSO consolidation patches broke powerpc due to a mistake
in the definition of MAXPAGES constants. This fixes it by moving to
a dynamically allocated array of pages instead as I don't like much
hard coded size limits. Also move the vdso initialisation to an initcall
since it doesn't really need to be done -that- early.
Applogies for not catching the breakage earlier, Roland _did_ CC me on
his patches a while ago, I got busy with other things and forgot to test
them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When building an 85xx kernel we get:
CC arch/powerpc/mm/pgtable_32.o
arch/powerpc/mm/pgtable_32.c: In function 'io_block_mapping':
arch/powerpc/mm/pgtable_32.c:330: error: expected identifier before '(' token
arch/powerpc/mm/pgtable_32.c:330: error: expected statement before ')' token
The is_power_of_2(x) fixup patch left an extra ')' on the is_power_of_4 macro.
There is a similiar issue on the arch/ppc side.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
arch/powerpc/mm/mem.c states that page_is_ram is called by the code that
implements /dev/mem, which isn't true. Remove the comment.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add the inline function "is_power_of_2()" to log2.h, where the value
zero is *not* considered to be a power of two.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is just a straight port of the same done in arch/ppc
by Marcelo Tosatti. One used to be
[PATCH] ppc32 8xx: update_mmu_cache() needs unconditional tlbie,
commit eb07d964b4
In a nutshell, the board is nearly stuck without this, yet without any
visible failure - being just very slow.
Signed-off-by: Vitaly Bordug <vbordug@ru.mvista.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch changes handling return value of ppc_md.hpte_insert() into
the same way as __hash_page_*().
Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc specific version of hugetlb_get_unmapped_area() makes some
unwarranted assumptions about what checks have been made to its
parameters by its callers. This will lead to a BUG_ON() if a 32-bit
process attempts to make a hugepage mapping which extends above
TASK_SIZE (4GB).
I'm not sure if these assumptions came about because they were valid
with earlier versions of the get_unmapped_area() path, or if it was
always broken. Nonetheless this patch fixes the logic, and removes
the crash.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Run this:
#!/bin/sh
for f in $(grep -Erl "\([^\)]*\) *k[cmz]alloc" *) ; do
echo "De-casting $f..."
perl -pi -e "s/ ?= ?\([^\)]*\) *(k[cmz]alloc) *\(/ = \1\(/" $f
done
And then go through and reinstate those cases where code is casting pointers
to non-pointers.
And then drop a few hunks which conflicted with outstanding work.
Cc: Russell King <rmk@arm.linux.org.uk>, Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Paul Fulghum <paulkf@microgate.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Karsten Keil <kkeil@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For PAPR partitions with large amounts of memory, the firmware has an
alternative, more compact representation for the information about the
memory in the partition and its NUMA associativity information. This
adds the code to the kernel to parse this alternative representation.
The other part of this patch is telling the firmware that we can
handle the alternative representation. There is however a subtlety
here, because the firmware will invoke a reboot if the memory
representation we request is different from the representation that
firmware is currently using. This is because firmware can't change
the representation on the fly. Further, some firmware versions used
on POWER5+ machines have a bug where this reboot leaves the machine
with an altered value of load-base, which will prevent any kernel
booting until it is reset to the normal value (0x4000). Because of
this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
do not request the new representation on POWER5+ and earlier machines.
We do request the new representation on POWER6, which uses the
ibm,client-architecture-support call.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following up with the work on shared page table done by Dave McCracken. This
set of patch target shared page table for hugetlb memory only.
The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments. In the normal
page case, the amount of memory saved from process' page table is quite
significant. For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.
With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio. One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency. These two effects contribute to higher
application performance.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove CPU_FTR_16M_PAGE from the cupfeatures mask at runtime on iSeries.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
A few code paths need to check whether or not they are running
on the PS3's LV1 hypervisor before making hcalls. This introduces
a new firmware feature bit for this, FW_FEATURE_PS3_LV1.
Now when both PS3 and IBM_CELL_BLADE are enabled, but not PSERIES,
FW_FEATURE_PS3_LV1 and FW_FEATURE_LPAR get enabled at compile time,
which is a bug. The same problem can also happen for (PPC_ISERIES &&
!PPC_PSERIES && PPC_SOMETHING_ELSE). In order to solve this, I
introduce a new CONFIG_PPC_NATIVE option that is set when at least
one platform is selected that can run without a hypervisor and then
turns the firmware feature check into a run-time option.
The new cell oprofile support that was recently merged does not
work on hypervisor based platforms like the PS3, therefore make
it depend on PPC_CELL_NATIVE instead of PPC_CELL. This may change
if we get oprofile support for PS3.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
powerpc: Merge 32 and 64 bits asm-powerpc/io.h
The rework on io.h done for the new hookable accessors made it easier,
so I just finished the work and merged 32 and 64 bits io.h for arch/powerpc.
arch/ppc still uses the old version in asm-ppc, there is just too much gunk
in there that I really can't be bothered trying to cleanup.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In order to suppose platforms with devices above 4Gb on 32 bits platforms
with a >32 bits physical address space, we used to have a special ioremap64
along with a fixup routine fixup_bigphys_addr.
This shouldn't be necessary anymore as struct resource now supports 64 bits
addresses even on 32 bits archs. This patch enables that option when
CONFIG_PHYS_64BIT is set and removes ioremap64 and fixup_bigphys_addr.
This is a preliminary work for the upcoming merge of 32 and 64 bits io.h
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch reworks the way iSeries hooks on PCI IO operations (both MMIO
and PIO) and provides a generic way for other platforms to do so (we
have need to do that for various other platforms).
While reworking the IO ops, I ended up doing some spring cleaning in
io.h and eeh.h which I might want to split into 2 or 3 patches (among
others, eeh.h had a lot of useless stuff in it).
A side effect is that EEH for PIO should work now (it used to pass IO
ports down to the eeh address check functions which is bogus).
Also, new are MMIO "repeat" ops, which other archs like ARM already had,
and that we have too now: readsb, readsw, readsl, writesb, writesw,
writesl.
In the long run, I might also make EEH use the hooks instead
of wrapping at the toplevel, which would make things even cleaner and
relegate EEH completely in platforms/iseries, but we have to measure the
performance impact there (though it's really only on MMIO reads)
Since I also need to hook on ioremap, I shuffled the functions a bit
there. I introduced ioremap_flags() to use by drivers who want to pass
explicit flags to ioremap (and it can be hooked). The old __ioremap() is
still there as a low level and cannot be hooked, thus drivers who use it
should migrate unless they know they want the low level version.
The patch "arch provides generic iomap missing accessors" (should be
number 4 in this series) is a pre-requisite to provide full iomap
API support with this patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(David:)
If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.
But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on.
(Hugh:)
prepare_hugepage_range() should check file offset alignment when it checks
virtual address and length, to stop MAP_FIXED with a bad huge offset from
unmapping before it fails further down. PowerPC should apply the same
prepare_hugepage_range alignment checks as ia64 and all the others do.
Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
is the check for too small a mapping); but even so, move up setting of
VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
when unwinding from error will go the non-huge way, which may cause bad
behaviour on architectures (powerpc and ia64) which segregate their huge
mappings into a separate region of the address space.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bad_page_fault() prints a message telling the user what type of bad
fault we took. The first line of this message is currently implemented
as two separate printks. This has the unfortunate effect that if
several cpus simultaneously take a bad fault, the first and second parts
of the printk get jumbled up, which looks dodge and is hard to read.
So do a single one-line printk for each fault type.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Checking source for other get_paca()->field preemption dangers found that
open_high_hpage_areas does a structure copy into its paca while preemption
is enabled: unsafe however gcc accomplishes it. Just remove that copy:
it's done safely afterwards by on_each_cpu, as in open_low_hpage_areas.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change the powerpc hpte_insert routines now called through ppc_md to
static scope.
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Arch-independent zone-sizing is using indices instead of symbolic names to
offset within an array related to zones (max_zone_pfns). The unintended
impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the
the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
The following patch properly initialises the max_zone_pfns[] array and uses
symbolic names instead of indices in each architecture using
arch-independent zone-sizing. Two users have successfully booted their
powerpcs with it (one an ibook G4). It has also been boot tested on x86,
x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2.
Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
first fix. Additional credit to Johannes Berg and Andreas Schwab for
reporting the problem and testing on powerpc.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().
Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.
Eric's original description:
There are a lot of places in the kernel where we test for init
because we give it special properties. Most significantly init
must not die. This results in code all over the kernel test
->pid == 1.
Introduce is_init to capture this case.
With multiple pid spaces for all of the cases affected we are
looking for only the first process on the system, not some other
process that has pid == 1.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make PROT_WRITE imply PROT_READ for a number of architectures which don't
support write only in hardware.
While looking at this, I noticed that some architectures which do not
support write only mappings already take the exact same approach. For
example, in arch/alpha/mm/fault.c:
"
if (cause < 0) {
if (!(vma->vm_flags & VM_EXEC))
goto bad_area;
} else if (!cause) {
/* Allow reads even for write-only mappings */
if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
goto bad_area;
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
}
"
Thus, this patch brings other architectures which do not support write only
mappings in-line and consistent with the rest. I've verified the patch on
ia64, x86_64 and x86.
Additional discussion:
Several architectures, including x86, can not support write-only mappings.
The pte for x86 reserves a single bit for protection and its two states are
read only or read/write. Thus, write only is not supported in h/w.
Currently, if i 'mmap' a page write-only, the first read attempt on that page
creates a page fault and will SEGV. That check is enforced in
arch/blah/mm/fault.c. However, if i first write that page it will fault in
and the pte will be set to read/write. Thus, any subsequent reads to the page
will succeed. It is this inconsistency in behavior that this patch is
attempting to address. Furthermore, if the page is swapped out, and then
brought back the first read will also cause a SEGV. Thus, any arbitrary read
on a page can potentially result in a SEGV.
According to the SuSv3 spec, "if the application requests only PROT_WRITE, the
implementation may also allow read access." Also as mentioned, some
archtectures, such as alpha, shown above already take the approach that i am
suggesting.
The counter-argument to this raised by Arjan, is that the kernel is enforcing
the write only mapping the best it can given the h/w limitations. This is
true, however Alan Cox, and myself would argue that the inconsitency in
behavior, that is applications can sometimes work/sometimes fails is highly
undesireable. If you read through the thread, i think people, came to an
agreement on the last patch i posted, as nobody has objected to it...
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Ian Molton <spyro@f2s.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On Tue, 2006-08-15 at 08:22 -0700, Dave Hansen wrote:
> kernel BUG in cache_free_debugcheck at mm/slab.c:2748!
Alright, this one is only triggered when slab debugging is enabled. The
slabs are assumed to be aligned on a HUGEPTE_TABLE_SIZE boundary. The free
path makes use of this assumption and uses the lowest nibble to pass around
an index into an array of kmem_cache pointers. With slab debugging turned
on, the slab is still aligned, but the "working" object pointer is not.
This would break the assumption above that a full nibble is available for
the PGF_CACHENUM_MASK.
The following patch reduces PGF_CACHENUM_MASK to cover only the two least
significant bits, which is enough to cover the current number of 4 pgtable
cache types. Then use this constant to mask out the appropriate part of
the huge pte pointer.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a shadow buffer for the SLBs and regsiters it with PHYP.
Only the bolted SLB entries (top 3) are shadowed.
The SLB shadow buffer tells the hypervisor what the kernel needs to
have in the SLB for the kernel to be able to function. The hypervisor
can use this information to speed up partition context switches.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.
powerpc core changes.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The compiler doesn't understand that BUG() never returns, so complains that
psize isn't set. Just set it to the normal value, which seems to produce nice
code and keeps gcc happy.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
[POWERPC] Use little-endian bit from firmware ibm,pa-features property
[POWERPC] Make sure smp_processor_id works very early in boot
[POWERPC] U4 DART improvements
[POWERPC] todc: add support for Time-Of-Day-Clock
[POWERPC] Make lparcfg.c work when both iseries and pseries are selected
[POWERPC] Fix idr locking in init_new_context
[POWERPC] mpc7448hpc2 (taiga) board config file
[POWERPC] Add tsi108 pci and platform device data register function
[POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
[POWERPC] Correct the MAX_CONTEXT definition
powerpc: minor cleanups for mpc86xx
[POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
[POWERPC] Simplify the code defining the 64-bit CPU features
[POWERPC] powerpc: kconfig warning fix
[POWERPC] Consolidate some of kernel/misc*.S
[POWERPC] Remove unused function call_with_mmu_off
[POWERPC] update asm-powerpc/time.h
[POWERPC] Clean up it_lp_queue.h
[POWERPC] Skip the "copy down" of the kernel if it is already at zero.
[POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
...
We always need to serialize accesses to mmu_context_idr.
I hit this bug when testing with a small number of mmu contexts.
Signed-off-by: Sonny Rao <sonny@burdell.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
With the ppc_md htab pointers setup earlier, we can use ppc_md.hpte_insert
in htab_bolt_mapping(), rather than deciding which version to call by hand.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Initialise the ppc_md htab callbacks earlier, in the probe routines. This
allows us to call htab_finish_init() from htab_initialize(), and makes it
private to hash_utils_64.c. Move htab_finish_init() and make_bl() above
htab_initialize() to avoid forward declarations.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.
__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the name of old add_memory() to arch_add_memory. And use node id to
get pgdat for the node at NODE_DATA().
Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup.
I suppose it may be redundant. But, I'm not well known about powerpc.
So, I keep it. (But, __meminit is better at least.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Overloading of page fault notification with the notify_die() has performance
issues(since the only interested components for page fault is kprobes and/or
kdb) and hence this patch introduces the new notifier call chain exclusively
for page fault notifications their by avoiding notifying unnecessary
components in the do_page_fault() code path.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
[POWERPC] re-enable OProfile for iSeries, using timer interrupt
[POWERPC] support ibm,extended-*-frequency properties
[POWERPC] Extra sanity check in EEH code
[POWERPC] Dont look for class-code in pci children
[POWERPC] Fix mdelay badness on shared processor partitions
[POWERPC] disable floating point exceptions for init
[POWERPC] Unify ppc syscall tables
[POWERPC] mpic: add support for serial mode interrupts
[POWERPC] pseries: Print PCI slot location code on failure
[POWERPC] spufs: one more fix for 64k pages
[POWERPC] spufs: fail spu_create with invalid flags
[POWERPC] spufs: clear class2 interrupt status before wakeup
[POWERPC] spufs: fix Makefile for "make clean"
[POWERPC] spufs: remove stop_code from struct spu
[POWERPC] spufs: fix spu irq affinity setting
[POWERPC] spufs: further abstract priv1 register access
[POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
[POWERPC] spufs: dont try to access SPE channel 1 count
[POWERPC] spufs: use kzalloc in create_spu
[POWERPC] spufs: fix initial state of wbox file
...
Manually resolved conflicts in:
drivers/net/phy/Makefile
include/asm-powerpc/spu.h
Clear the high BATS during load_up_mmu if FTR_HAS_HIGH_BATS.
Allow just a bit more time for secondary CPUs to phone home.
Signed-off-by: Wei Zhang <Wei.Zhang@freescale.com>
Signed-off-by: Haiying Wang <Haiying.Wang@freescale.com>
Signed-off-by: Jon Loeliger <jdl@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The page size encoding passed to tlbie is incorrect for new-style
large pages. This fixes it. This doesn't affect anything on older
machines because mmu_psize_defs[psize].penc (the page size encoding)
is 0 for 4k and 16M pages (the two are distinguished by a separate "is
a large page" bit).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove some stale POWER3/POWER4/970 on 32bit kernel support.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some POWER5+ machines can do 64k hardware pages for normal memory but
not for cache-inhibited pages. This patch lets us use 64k hardware
pages for most user processes on such machines (assuming the kernel
has been configured with CONFIG_PPC_64K_PAGES=y). User processes
start out using 64k pages and get switched to 4k pages if they use any
non-cacheable mappings.
With this, we use 64k pages for the vmalloc region and 4k pages for
the imalloc region. If anything creates a non-cacheable mapping in
the vmalloc region, the vmalloc region will get switched to 4k pages.
I don't know of any driver other than the DRM that would do this,
though, and these machines don't have AGP.
When a region gets switched from 64k pages to 4k pages, we do not have
to clear out all the 64k HPTEs from the hash table immediately. We
use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
was hashed in as a 64k page or a set of 4k pages. If hash_page is
trying to insert a 4k page for a Linux PTE and it sees that it has
already been inserted as a 64k page, it first invalidates the 64k HPTE
before inserting the 4k HPTE. The hash invalidation routines also use
the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
set of 4k HPTEs to remove. With those two changes, we can tolerate a
mix of 4k and 64k HPTEs in the hash table, and they will all get
removed when the address space is torn down.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pgdir field in the paca was a leftover from the dynamic VSIDs
patch, and is not used in the current kernel code. This removes it.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a vdso_base element to the mm_context_t for 32-bit compiles
(both for ARCH=powerpc and ARCH=ppc). This fixes the compile errors
that have been reported in arch/powerpc/kernel/signal_32.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Our MMU hash management code would not set the "C" bit (changed bit) in
the hardware PTE when updating a RO PTE into a RW PTE. That would cause
the hardware to possibly to a write back to the hash table to set it on
the first store access, which in addition to being a performance issue,
might also hit a bug when running with native hash management (non-HV)
as our code is specifically optimized for the case where no write back
happens.
Thus there is a very small therocial window were a hash PTE can become
corrupted if that HPTE has just been upgraded to read write, a store
access happens on it, and that races with another processor evicting
that same slot. Since eviction (caused by an almost full hash) is
extremely rare, the bug is very unlikely to happen fortunately.
This fixes by allowing the updating of the protection bits in the native
hash handling to also set (but not clear) the "C" bit, and, in order to
also improve performances in the general case, by always setting that
bit on newly inserted hash PTE so that writeback really never happens.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We currently do mem= handling in three seperate places. And as benh pointed out
I wrote two of them. Now that we parse command line parameters earlier we can
clean this mess up.
Moving the parsing out of prom_init means the device tree might be allocated
above the memory limit. If that happens we'd have to move it. As it happens
we already have logic to do that for kdump, so just genericise it.
This also means we might have reserved regions above the memory limit, if we
do the bootmem allocator will blow up, so we have to modify
lmb_enforce_memory_limit() to truncate the reserves as well.
Tested on P5 LPAR, iSeries, F50, 44p. Tested moving device tree on P5 and
44p and F50.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change of_node_to_nid() to traverse the device tree, looking for a numa id.
Cell uses this to assign ids to SPUs, which are children of the CPU node.
Existing users of of_node_to_nid() are altered to use of_node_to_nid_single(),
which doesn't do the traversal.
Export an attach_sysdev_to_node() function, allowing system devices (eg.
SPUs) to link themselves into the numa topology in sysfs.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At present, ARCH=powerpc kernels can waste considerable space in
pagetables when making large hugepage mappings. Hugepage PTEs go in
PMD pages, but each PMD page maps 256M and so contains only 16
hugepage PTEs (128 bytes of data), but takes up a 1024 byte
allocation. With CONFIG_PPC_64K_PAGES enabled (64k base page size),
the situation is worse. Now hugepage PTEs are at the PTE page level
(also mapping 256M), so we store 16 hugepage PTEs in a 64k allocation.
The PowerPC MMU already means that any 256M region is either all
hugepage, or all normal pages. Thus, with some care, we can use a
different allocation for the hugepage PTE tables and only allocate the
128 bytes necessary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Quieten some of the debug ram config output. we already print out available
memory at KERN_INFO level.
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix 44x and BookE page fault handler to correctly lock PTE before
trying to pte_update() it, otherwise this PTE might be swapped out
after pte_present() check but before pte_uptdate() call, resulting in
corrupted PTE. This can happen with enabled preemption and low memory
condition.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This removes statically assigned platform numbers and reworks the
powerpc platform probe code to use a better mechanism. With this,
board support files can simply declare a new machine type with a
macro, and implement a probe() function that uses the flattened
device-tree to detect if they apply for a given machine.
We now have a machine_is() macro that replaces the comparisons of
_machine with the various PLATFORM_* constants. This commit also
changes various drivers to use the new macro instead of looking at
_machine.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We were printing node ids in hex in one spot. Lets be consistent and
always print them in decimal.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The return statement is to prevent `warning: 'nid' might be used uninitialized
in this function'.
Cc: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jens Axboe <axboe@suse.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
arch/powerpc/mm/mem.c: In function `add_memory':
arch/powerpc/mm/mem.c:128: warning: assignment makes integer from pointer without a cast
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Quite a long time back, prepare_hugepage_range() replaced
is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
verify if an address range is suitable for a hugepage mapping.
is_aligned_hugepage_range() stuck around, but only to implement
prepare_hugepage_range() on archs which didn't implement their own.
Most archs (everything except ia64 and powerpc) used the same
implementation of is_aligned_hugepage_range(). On powerpc, which
implements its own prepare_hugepage_range(), the custom version was never
used.
In addition, "is_aligned_hugepage_range()" was a bad name, because it
suggests it returns true iff the given range is a good hugepage range,
whereas in fact it returns 0-or-error (so the sense is reversed).
This patch cleans up by abolishing is_aligned_hugepage_range(). Instead
prepare_hugepage_range() is defined directly. Most archs use the default
version, which simply checks the given region is aligned to the size of a
hugepage. ia64 and powerpc define custom versions. The ia64 one simply
checks that the range is in the correct address space region in addition to
being suitably aligned. The powerpc version (just as previously) checks
for suitable addresses, and if necessary performs low-level MMU frobbing to
set up new areas for use by hugepages.
No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().
This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A couple of places set_page_count(page, 1) that don't need to.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In mm_init_ppc64() we calculate the location of the "IO hole", but then
no one ever looks at the value. So don't bother.
That's actually all mm_init_ppc64() does, so get rid of it too.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
It has been decreed that platform numbers are evil, so as a step in that
direction, replace platform_is_lpar() with a FW_FEATURE_LPAR bit.
Currently FW_FEATURE_LPAR really means i/pSeries LPAR, in the future we might
have to clean that up if we need to be more specific about what LPAR actually
means. But that's another patch ...
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
htab_bolt_mapping() takes a vstart and pstart parameter, but all but one of
its callers actually pass it vstart and vstart. Luckily before it passes
paddr (calculated from paddr) to the hpte_insert routines it calls
virt_to_abs() (aka. __pa()) on the address, so there isn't actually a bug.
map_io_page() however does pass pstart properly, so currently it's broken
AFAICT because we're calling __pa(paddr) which will get us something very
large. Presumably no one's calling map_io_page() in the right context.
Anyway, change htab_bolt_mapping() callers to properly pass pstart, and then
use it properly in htab_bolt_mapping(), ie. don't call __pa() on it again.
Booted on p5 LPAR, iSeries and Power3.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We can plug the boot cpu into its node independently of whether numa
topology is detected. And numa_setup_cpu does the right thing for all
cases now, so remove special-casing for non-numa from the cpu hotplug
callback.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc numa code unconditionally onlines all nodes from 0 to the
highest node id found, regardless of whether cpus or memory are
present in the nodes. This wastes 8K per node and complicates some
cpu and memory hotplug situations, such as adding a resource that
doesn't map to one of the nodes discovered at boot.
Set nodes online as resources are scanned. Fall back to node 0 only
when we're sure this isn't a NUMA machine.
Instead of defaulting to node 0 for cases of hot-adding a resource
which doesn't belong to any initialized node, assign it to the first
online node.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Code to handle Power4's invalid node id (0xffff) is duplicated for cpu
and memory. Better to handle this case in one place --
of_node_to_nid. Overall behavior should be unchanged.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since we effectively treat the domain ids given to us by firmare as
logical node ids, make this explicit (basically s/numa_domain/nid/).
No functional changes, only variable and function names are modified.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
map_cpu_to_node does not need to be inline, it is never called in a
hot path.
map_cpu_to_node, numa_setup_cpu, and find_cpu_node can be marked
__cpuinit, as they are never used after boot if CONFIG_HOTPLUG_CPU=n.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add debug statement for map_cpu_to_node; it's useful for cpu hotplug.
Clarify debug statement about not finding the numa reference points
property.
Don't print a meaningless associativity depth (-1) on non-numa systems.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At boot, the numa code is assigning boot_cpuid to node 0
unconditionally. Basically, numa_setup_cpu is being stupid about it,
but this is the minimal fix -- just call numa_setup_cpu(boot_cpuid)
later, after all nodes have been set online.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
My patch (d7a5b2ffa1) to always panic if
lmb_alloc() fails is broken because it checks alloc < 0, but should be
checking alloc == 0.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
remove warnings when building a 64bit kernel.
smp_call_function triggers also with 32bit kernel.
WARNING: vmlinux: duplicate symbol 'smp_call_function' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:164:EXPORT_SYMBOL(smp_call_function);
arch/powerpc/kernel/smp.c:300:EXPORT_SYMBOL(smp_call_function);
WARNING: vmlinux: duplicate symbol 'ioremap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:113:EXPORT_SYMBOL(ioremap);
arch/powerpc/mm/pgtable_64.c:321:EXPORT_SYMBOL(ioremap);
WARNING: vmlinux: duplicate symbol '__ioremap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:117:EXPORT_SYMBOL(__ioremap);
arch/powerpc/mm/pgtable_64.c:322:EXPORT_SYMBOL(__ioremap);
WARNING: vmlinux: duplicate symbol 'iounmap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:118:EXPORT_SYMBOL(iounmap);
arch/powerpc/mm/pgtable_64.c:323:EXPORT_SYMBOL(iounmap);
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Before the merge I updated create_pte_mapping() to work for iSeries, by
calling iSeries_hpte_bolt_or_insert. (4c55130b2a)
Later we changed iSeries_hpte_insert to cope with the bolting case, and called
that instead from create_pte_mapping() (which was renamed to htab_bolt_mapping)
(3c726f8dee).
Unfortunately that change introduced a subtle bug, where we pass an absolute
address to iSeries_hpte_insert() where it expects a physical address. This
leads to us calling phys_to_abs() twice on the physical address, which is
seriously bogus.
This only causes a problem if the absolute address from the first translation
can be looked up again in the chunk_map, which depends on the size and layout
of memory. I've seen it fail on one box, but not others.
The minimal fix is to pass the physical address to iSeries_hpte_insert(). For
2.6.17 we should make phys_to_abs() BUG if we try to double-translate an
address.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
native_hpte_clear has a spinlock recursion problem with the native_tlbie_lock
being called twice, once in native_hpte_clear() and once within tlbie().
Fix the problem by changing the call to tlbie() in native_hpte_clear() to
__tlbie(). It still supports only 4k pages for now.
Signed-off-by: R Sharada <sharada@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
For kexec we need to know the size of the MMU hash table.
Currently we calculate the size once in the htab code, and then twice more in
the kexec code, once using htab_hash_mask and once using ppc64_pft_size.
On some machines the ppc64_pft_size calculation is broken because
ppc64_pft_size is not set.
So we need to fix the second calculation, but better still we should just
calculate the size once and use it everywhere else.
Tested on Power5 LPAR, Power4 non-LPAR and Power3.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch removes all self references and fixes references to files
in the now defunct arch/ppc64 tree. I think this accomplises
everything wanted, though there might be a few references I missed.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
LMB_ALLOC_ANYWHERE doesn't need to be part of the API, it's only used in
lmb.c - so move it out of the header file.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently most callers of lmb_alloc() don't check if it worked or not, if it
ever does weird bad things will probably happen. The few callers who do check
just panic or BUG_ON.
So make lmb_alloc() panic internally, to catch bugs at the source. The few
callers who did check the result no longer need to.
The only caller that did anything interesting with the return result was
careful_allocation(). For it we create __lmb_alloc_base() which _doesn't_ panic
automatically, a little messy, but passable.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code to mark a page as icache dirty (so that it will later be
icache-dcache flushed when we try to execute from it) is duplicated in
three places: flush_dcache_page() does this marking and nothing else,
but clear_user_page() and copy_user_page() duplicate it, since those
functions make the page icache dirty themselves.
This patch makes those other functions call flush_dcache_page()
instead, so the logic's all in one place. This will make life less
confusing if we ever need to tweak the details of the the lazy icache
flush mechanism.
arch/powerpc/mm/mem.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
To prevent problems later in boot, make sure we don't create zero-size lmb
regions.
I've checked all the callers, and at the moment no one should ever hit this.
All callers use a constant size, or they check the computed size before they
call us.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
240-ioremap-null-ptr-test.patch
Under highly unusual circumstances, a buggy driver will ask a null ptr
to be ioremapped, an operation that curently succeeds but leads to
later trouble. Instead, refuse to remap the null pointer.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from e71d9e598533c1889e7162f5f4647e5d378c102c commit)
In add_memory() we should be using __va() to get a virtual address.
Spotted by Mike Kravetz.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When taking a DABR exception we were reporting the PC. It makes more
sense to report the address that caused the exception, and the gdb guys
would like it that way.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The system will oops if an attempt is made to add memory to an
empty node/zone. This patch prevents adding memory to an empty
node. The code to dynamically add a node/zone is non-trivial.
This patch is temporary and will be removed when the ability
to dynamically add a node/zone is complete.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Building the arch/powerpc tree currently gives me
two warnings with gcc-4.0:
arch/powerpc/mm/imalloc.c: In function '__im_get_area':
arch/powerpc/mm/imalloc.c:225: warning: 'tmp' may be used uninitialized in this function
arch/powerpc/mm/hugetlbpage.c: In function 'hugetlb_get_unmapped_area':
arch/powerpc/mm/hugetlbpage.c:608: warning: unused variable 'vma'
both fixes are trivial.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On ppc64, we independently define VMALLOCBASE and VMALLOC_START to be
the same thing: the start of the vmalloc() area at 0xd000000000000000.
VMALLOC_START is used much more widely, including in generic code, so
this patch gets rid of the extraneous VMALLOCBASE.
This does require moving the definitions of region IDs from page_64.h
to pgtable.h, but they don't clearly belong in the former rather than
the latter, anyway. While we're moving them, clean up the definitions
of the REGION_IDs:
- Abolish REGION_SIZE, it was only used once, to define
REGION_MASK anyway
- Define the specific region ids in terms of the REGION_ID()
macro.
- Define KERNEL_REGION_ID in terms of PAGE_OFFSET rather than
KERNELBASE. It amounts to the same thing, but conceptually this is
about the region of the linear mapping (which starts at PAGE_OFFSET)
rather than of the kernel text itself (which is at KERNELBASE).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pre-parsed addrs/n_addrs fields in struct device_node are finally
gone. Remove the dodgy heuristics that did that parsing at boot and
remove the fields themselves since we now have a good replacement with
the new OF parsing code. This patch also fixes a bunch of drivers to use
the new code instead, so that at least pmac32, pseries, iseries and g5
defconfigs build.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We used to print a NUMA cpu summary at boot before the hotplug cpu code
was added. This has been useful for catching machine configuration as
well as firmware bugs in the past.
This patch restores that functionality. An example of the output is:
Node 0 CPUs: 0-7
Node 1 CPUs: 8-15
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Milton has proposed that we should support a "linux,usable-memory" property
on memory nodes which describes, in preference to "reg", the regions of memory
Linux should use.
This facility is required for kdump to inform the second kernel which memory
it should use.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This places dynamically added memory within the appropriate
numa node. A new routine hot_add_scn_to_nid() replicates most of
the memory scanning code in parse_numa_properties().
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>