For PAPR partitions with large amounts of memory, the firmware has an
alternative, more compact representation for the information about the
memory in the partition and its NUMA associativity information. This
adds the code to the kernel to parse this alternative representation.
The other part of this patch is telling the firmware that we can
handle the alternative representation. There is however a subtlety
here, because the firmware will invoke a reboot if the memory
representation we request is different from the representation that
firmware is currently using. This is because firmware can't change
the representation on the fly. Further, some firmware versions used
on POWER5+ machines have a bug where this reboot leaves the machine
with an altered value of load-base, which will prevent any kernel
booting until it is reset to the normal value (0x4000). Because of
this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
do not request the new representation on POWER5+ and earlier machines.
We do request the new representation on POWER6, which uses the
ibm,client-architecture-support call.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following up with the work on shared page table done by Dave McCracken. This
set of patch target shared page table for hugetlb memory only.
The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments. In the normal
page case, the amount of memory saved from process' page table is quite
significant. For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.
With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio. One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency. These two effects contribute to higher
application performance.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove CPU_FTR_16M_PAGE from the cupfeatures mask at runtime on iSeries.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
A few code paths need to check whether or not they are running
on the PS3's LV1 hypervisor before making hcalls. This introduces
a new firmware feature bit for this, FW_FEATURE_PS3_LV1.
Now when both PS3 and IBM_CELL_BLADE are enabled, but not PSERIES,
FW_FEATURE_PS3_LV1 and FW_FEATURE_LPAR get enabled at compile time,
which is a bug. The same problem can also happen for (PPC_ISERIES &&
!PPC_PSERIES && PPC_SOMETHING_ELSE). In order to solve this, I
introduce a new CONFIG_PPC_NATIVE option that is set when at least
one platform is selected that can run without a hypervisor and then
turns the firmware feature check into a run-time option.
The new cell oprofile support that was recently merged does not
work on hypervisor based platforms like the PS3, therefore make
it depend on PPC_CELL_NATIVE instead of PPC_CELL. This may change
if we get oprofile support for PS3.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
powerpc: Merge 32 and 64 bits asm-powerpc/io.h
The rework on io.h done for the new hookable accessors made it easier,
so I just finished the work and merged 32 and 64 bits io.h for arch/powerpc.
arch/ppc still uses the old version in asm-ppc, there is just too much gunk
in there that I really can't be bothered trying to cleanup.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In order to suppose platforms with devices above 4Gb on 32 bits platforms
with a >32 bits physical address space, we used to have a special ioremap64
along with a fixup routine fixup_bigphys_addr.
This shouldn't be necessary anymore as struct resource now supports 64 bits
addresses even on 32 bits archs. This patch enables that option when
CONFIG_PHYS_64BIT is set and removes ioremap64 and fixup_bigphys_addr.
This is a preliminary work for the upcoming merge of 32 and 64 bits io.h
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch reworks the way iSeries hooks on PCI IO operations (both MMIO
and PIO) and provides a generic way for other platforms to do so (we
have need to do that for various other platforms).
While reworking the IO ops, I ended up doing some spring cleaning in
io.h and eeh.h which I might want to split into 2 or 3 patches (among
others, eeh.h had a lot of useless stuff in it).
A side effect is that EEH for PIO should work now (it used to pass IO
ports down to the eeh address check functions which is bogus).
Also, new are MMIO "repeat" ops, which other archs like ARM already had,
and that we have too now: readsb, readsw, readsl, writesb, writesw,
writesl.
In the long run, I might also make EEH use the hooks instead
of wrapping at the toplevel, which would make things even cleaner and
relegate EEH completely in platforms/iseries, but we have to measure the
performance impact there (though it's really only on MMIO reads)
Since I also need to hook on ioremap, I shuffled the functions a bit
there. I introduced ioremap_flags() to use by drivers who want to pass
explicit flags to ioremap (and it can be hooked). The old __ioremap() is
still there as a low level and cannot be hooked, thus drivers who use it
should migrate unless they know they want the low level version.
The patch "arch provides generic iomap missing accessors" (should be
number 4 in this series) is a pre-requisite to provide full iomap
API support with this patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(David:)
If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.
But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on.
(Hugh:)
prepare_hugepage_range() should check file offset alignment when it checks
virtual address and length, to stop MAP_FIXED with a bad huge offset from
unmapping before it fails further down. PowerPC should apply the same
prepare_hugepage_range alignment checks as ia64 and all the others do.
Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
is the check for too small a mapping); but even so, move up setting of
VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
when unwinding from error will go the non-huge way, which may cause bad
behaviour on architectures (powerpc and ia64) which segregate their huge
mappings into a separate region of the address space.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bad_page_fault() prints a message telling the user what type of bad
fault we took. The first line of this message is currently implemented
as two separate printks. This has the unfortunate effect that if
several cpus simultaneously take a bad fault, the first and second parts
of the printk get jumbled up, which looks dodge and is hard to read.
So do a single one-line printk for each fault type.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Checking source for other get_paca()->field preemption dangers found that
open_high_hpage_areas does a structure copy into its paca while preemption
is enabled: unsafe however gcc accomplishes it. Just remove that copy:
it's done safely afterwards by on_each_cpu, as in open_low_hpage_areas.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change the powerpc hpte_insert routines now called through ppc_md to
static scope.
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Arch-independent zone-sizing is using indices instead of symbolic names to
offset within an array related to zones (max_zone_pfns). The unintended
impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the
the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
The following patch properly initialises the max_zone_pfns[] array and uses
symbolic names instead of indices in each architecture using
arch-independent zone-sizing. Two users have successfully booted their
powerpcs with it (one an ibook G4). It has also been boot tested on x86,
x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2.
Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
first fix. Additional credit to Johannes Berg and Andreas Schwab for
reporting the problem and testing on powerpc.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().
Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.
Eric's original description:
There are a lot of places in the kernel where we test for init
because we give it special properties. Most significantly init
must not die. This results in code all over the kernel test
->pid == 1.
Introduce is_init to capture this case.
With multiple pid spaces for all of the cases affected we are
looking for only the first process on the system, not some other
process that has pid == 1.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make PROT_WRITE imply PROT_READ for a number of architectures which don't
support write only in hardware.
While looking at this, I noticed that some architectures which do not
support write only mappings already take the exact same approach. For
example, in arch/alpha/mm/fault.c:
"
if (cause < 0) {
if (!(vma->vm_flags & VM_EXEC))
goto bad_area;
} else if (!cause) {
/* Allow reads even for write-only mappings */
if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
goto bad_area;
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
}
"
Thus, this patch brings other architectures which do not support write only
mappings in-line and consistent with the rest. I've verified the patch on
ia64, x86_64 and x86.
Additional discussion:
Several architectures, including x86, can not support write-only mappings.
The pte for x86 reserves a single bit for protection and its two states are
read only or read/write. Thus, write only is not supported in h/w.
Currently, if i 'mmap' a page write-only, the first read attempt on that page
creates a page fault and will SEGV. That check is enforced in
arch/blah/mm/fault.c. However, if i first write that page it will fault in
and the pte will be set to read/write. Thus, any subsequent reads to the page
will succeed. It is this inconsistency in behavior that this patch is
attempting to address. Furthermore, if the page is swapped out, and then
brought back the first read will also cause a SEGV. Thus, any arbitrary read
on a page can potentially result in a SEGV.
According to the SuSv3 spec, "if the application requests only PROT_WRITE, the
implementation may also allow read access." Also as mentioned, some
archtectures, such as alpha, shown above already take the approach that i am
suggesting.
The counter-argument to this raised by Arjan, is that the kernel is enforcing
the write only mapping the best it can given the h/w limitations. This is
true, however Alan Cox, and myself would argue that the inconsitency in
behavior, that is applications can sometimes work/sometimes fails is highly
undesireable. If you read through the thread, i think people, came to an
agreement on the last patch i posted, as nobody has objected to it...
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Ian Molton <spyro@f2s.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On Tue, 2006-08-15 at 08:22 -0700, Dave Hansen wrote:
> kernel BUG in cache_free_debugcheck at mm/slab.c:2748!
Alright, this one is only triggered when slab debugging is enabled. The
slabs are assumed to be aligned on a HUGEPTE_TABLE_SIZE boundary. The free
path makes use of this assumption and uses the lowest nibble to pass around
an index into an array of kmem_cache pointers. With slab debugging turned
on, the slab is still aligned, but the "working" object pointer is not.
This would break the assumption above that a full nibble is available for
the PGF_CACHENUM_MASK.
The following patch reduces PGF_CACHENUM_MASK to cover only the two least
significant bits, which is enough to cover the current number of 4 pgtable
cache types. Then use this constant to mask out the appropriate part of
the huge pte pointer.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a shadow buffer for the SLBs and regsiters it with PHYP.
Only the bolted SLB entries (top 3) are shadowed.
The SLB shadow buffer tells the hypervisor what the kernel needs to
have in the SLB for the kernel to be able to function. The hypervisor
can use this information to speed up partition context switches.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.
powerpc core changes.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The compiler doesn't understand that BUG() never returns, so complains that
psize isn't set. Just set it to the normal value, which seems to produce nice
code and keeps gcc happy.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
[POWERPC] Use little-endian bit from firmware ibm,pa-features property
[POWERPC] Make sure smp_processor_id works very early in boot
[POWERPC] U4 DART improvements
[POWERPC] todc: add support for Time-Of-Day-Clock
[POWERPC] Make lparcfg.c work when both iseries and pseries are selected
[POWERPC] Fix idr locking in init_new_context
[POWERPC] mpc7448hpc2 (taiga) board config file
[POWERPC] Add tsi108 pci and platform device data register function
[POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
[POWERPC] Correct the MAX_CONTEXT definition
powerpc: minor cleanups for mpc86xx
[POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
[POWERPC] Simplify the code defining the 64-bit CPU features
[POWERPC] powerpc: kconfig warning fix
[POWERPC] Consolidate some of kernel/misc*.S
[POWERPC] Remove unused function call_with_mmu_off
[POWERPC] update asm-powerpc/time.h
[POWERPC] Clean up it_lp_queue.h
[POWERPC] Skip the "copy down" of the kernel if it is already at zero.
[POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
...
We always need to serialize accesses to mmu_context_idr.
I hit this bug when testing with a small number of mmu contexts.
Signed-off-by: Sonny Rao <sonny@burdell.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
With the ppc_md htab pointers setup earlier, we can use ppc_md.hpte_insert
in htab_bolt_mapping(), rather than deciding which version to call by hand.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Initialise the ppc_md htab callbacks earlier, in the probe routines. This
allows us to call htab_finish_init() from htab_initialize(), and makes it
private to hash_utils_64.c. Move htab_finish_init() and make_bl() above
htab_initialize() to avoid forward declarations.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.
__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the name of old add_memory() to arch_add_memory. And use node id to
get pgdat for the node at NODE_DATA().
Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup.
I suppose it may be redundant. But, I'm not well known about powerpc.
So, I keep it. (But, __meminit is better at least.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Overloading of page fault notification with the notify_die() has performance
issues(since the only interested components for page fault is kprobes and/or
kdb) and hence this patch introduces the new notifier call chain exclusively
for page fault notifications their by avoiding notifying unnecessary
components in the do_page_fault() code path.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
[POWERPC] re-enable OProfile for iSeries, using timer interrupt
[POWERPC] support ibm,extended-*-frequency properties
[POWERPC] Extra sanity check in EEH code
[POWERPC] Dont look for class-code in pci children
[POWERPC] Fix mdelay badness on shared processor partitions
[POWERPC] disable floating point exceptions for init
[POWERPC] Unify ppc syscall tables
[POWERPC] mpic: add support for serial mode interrupts
[POWERPC] pseries: Print PCI slot location code on failure
[POWERPC] spufs: one more fix for 64k pages
[POWERPC] spufs: fail spu_create with invalid flags
[POWERPC] spufs: clear class2 interrupt status before wakeup
[POWERPC] spufs: fix Makefile for "make clean"
[POWERPC] spufs: remove stop_code from struct spu
[POWERPC] spufs: fix spu irq affinity setting
[POWERPC] spufs: further abstract priv1 register access
[POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
[POWERPC] spufs: dont try to access SPE channel 1 count
[POWERPC] spufs: use kzalloc in create_spu
[POWERPC] spufs: fix initial state of wbox file
...
Manually resolved conflicts in:
drivers/net/phy/Makefile
include/asm-powerpc/spu.h
Clear the high BATS during load_up_mmu if FTR_HAS_HIGH_BATS.
Allow just a bit more time for secondary CPUs to phone home.
Signed-off-by: Wei Zhang <Wei.Zhang@freescale.com>
Signed-off-by: Haiying Wang <Haiying.Wang@freescale.com>
Signed-off-by: Jon Loeliger <jdl@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The page size encoding passed to tlbie is incorrect for new-style
large pages. This fixes it. This doesn't affect anything on older
machines because mmu_psize_defs[psize].penc (the page size encoding)
is 0 for 4k and 16M pages (the two are distinguished by a separate "is
a large page" bit).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove some stale POWER3/POWER4/970 on 32bit kernel support.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some POWER5+ machines can do 64k hardware pages for normal memory but
not for cache-inhibited pages. This patch lets us use 64k hardware
pages for most user processes on such machines (assuming the kernel
has been configured with CONFIG_PPC_64K_PAGES=y). User processes
start out using 64k pages and get switched to 4k pages if they use any
non-cacheable mappings.
With this, we use 64k pages for the vmalloc region and 4k pages for
the imalloc region. If anything creates a non-cacheable mapping in
the vmalloc region, the vmalloc region will get switched to 4k pages.
I don't know of any driver other than the DRM that would do this,
though, and these machines don't have AGP.
When a region gets switched from 64k pages to 4k pages, we do not have
to clear out all the 64k HPTEs from the hash table immediately. We
use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
was hashed in as a 64k page or a set of 4k pages. If hash_page is
trying to insert a 4k page for a Linux PTE and it sees that it has
already been inserted as a 64k page, it first invalidates the 64k HPTE
before inserting the 4k HPTE. The hash invalidation routines also use
the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
set of 4k HPTEs to remove. With those two changes, we can tolerate a
mix of 4k and 64k HPTEs in the hash table, and they will all get
removed when the address space is torn down.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pgdir field in the paca was a leftover from the dynamic VSIDs
patch, and is not used in the current kernel code. This removes it.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a vdso_base element to the mm_context_t for 32-bit compiles
(both for ARCH=powerpc and ARCH=ppc). This fixes the compile errors
that have been reported in arch/powerpc/kernel/signal_32.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Our MMU hash management code would not set the "C" bit (changed bit) in
the hardware PTE when updating a RO PTE into a RW PTE. That would cause
the hardware to possibly to a write back to the hash table to set it on
the first store access, which in addition to being a performance issue,
might also hit a bug when running with native hash management (non-HV)
as our code is specifically optimized for the case where no write back
happens.
Thus there is a very small therocial window were a hash PTE can become
corrupted if that HPTE has just been upgraded to read write, a store
access happens on it, and that races with another processor evicting
that same slot. Since eviction (caused by an almost full hash) is
extremely rare, the bug is very unlikely to happen fortunately.
This fixes by allowing the updating of the protection bits in the native
hash handling to also set (but not clear) the "C" bit, and, in order to
also improve performances in the general case, by always setting that
bit on newly inserted hash PTE so that writeback really never happens.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We currently do mem= handling in three seperate places. And as benh pointed out
I wrote two of them. Now that we parse command line parameters earlier we can
clean this mess up.
Moving the parsing out of prom_init means the device tree might be allocated
above the memory limit. If that happens we'd have to move it. As it happens
we already have logic to do that for kdump, so just genericise it.
This also means we might have reserved regions above the memory limit, if we
do the bootmem allocator will blow up, so we have to modify
lmb_enforce_memory_limit() to truncate the reserves as well.
Tested on P5 LPAR, iSeries, F50, 44p. Tested moving device tree on P5 and
44p and F50.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>