When allocating a new pud, unconditionally populate the pgd (why did
we bother to create a new pud if we weren't going to populate it?).
This will only happen if the pgd slot was empty, since any existing
pud will be reused.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
I don't know if this will fix this bug, but it's definitely a bugfix.
It was trashing random pages by overwriting them with pagetables...
Don't trash a large pmd's data when mapping physical memory.
This is a bugfix for "x86_64: adjust mapping of physical pagetables
to work with Xen".
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We will need to set a pte on l3_user_pgt. Extract set_pte_vaddr_pud()
from set_pte_vaddr(), that will accept the l3 page table as parameter.
This change should be a no-op for existing code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If PSE is not available, then fall back to 4k page mappings for the
vmemmap area.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This makes a few of changes to the construction of the initial
pagetables to work better with paravirt_ops/Xen. The main areas
are:
1. Support non-PSE mapping of memory, since Xen doesn't currently
allow 2M pages to be mapped in guests.
2. Make sure that the ioremap alias of all pages are dropped before
attaching the new page to the pagetable. This avoids having
writable aliases of pagetable pages.
3. Preserve existing pagetable entries, rather than overwriting. Its
possible that a fair amount of pagetable has already been constructed,
so reuse what's already in place rather than ignoring and overwriting it.
The algorithm relies on the invariant that any page which is part of
the kernel pagetable is itself mapped in the linear memory area. This
way, it can avoid using ioremap on a pagetable page.
The invariant holds because it maps memory from low to high addresses,
and also allocates memory from low to high. Each allocated page can
map at least 2M of address space, so the mapped area will always
progress much faster than the allocated area. It relies on the early
boot code mapping enough pages to get started.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
that range is from find_e820_area, so don't try to use end_pfn to see
if out of boundary...use table_top instead to avoid possible strange
result while cross the boundary...
also change early_printk to printk, because init_memory_mapping is after
early param parsing, and console=uart8250 already working at that time.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The 32-bit early_ioremap will work equally well for 64-bit, so just use it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the _populate() functions to attach new pages to a pagetable, to
make sure the right paravirt_ops calls get called.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
for 32bit
we already had early_res support, so don't need to track min_low_pfn.
keep it to 0 always.
also use init_bootmem_node instead of init_bootmem, so don't touch
min_low_pfn.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1. add reserve_bootmem_generic for 32bit
2. change len to unsigned long
3. make early_res_to_bootmem to use it
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.
This way we can see how many large pages are really used for it and how
many are split.
Useful for general insight into the kernel.
v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
32bit doesn't need it because it doesn't support hotadd for lowmem.
Fix some typos
v3: Rename dpages_cnt
Add CONFIG ifdef for count update as requested by tglx
Expand description
v4: Fix stupid bugs added in v3
Move update_page_count to pageattr.c
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In both cases, I went with the 32-bit behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
OGAWA Hirofumi and Fede have reported rare pmd_ERROR messages:
mm/memory.c:127: bad pmd ffff810000207xxx(9090909090909090).
Initialization's cleanup_highmap was leaving alignment filler
behind in the pmd for MODULES_VADDR: when vmalloc's guard page
would occupy a new page table, it's not allocated, and then
module unload's vfree hits the bad 9090 pmd entry left over.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
free_initrd_mem needs a prototype, which is in linux/initrd.h
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use CONFIG_MEMTEST only. if it is set, will have memtest=0 (disabled)
need to have memtest=4 in command line to test more patterns.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
All architectures use an effectively identical definition of online_page(), so
just make it common code. x86-64, ia64, powerpc and sh are actually
identical; x86-32 is slightly different.
x86-32's differences arise because it puts its hotplug pages in the highmem
zone. We can handle this in the generic code by inspecting the page to see if
its in highmem, and update the totalhigh_pages count appropriately. This
leaves init_32.c:free_new_highpage with a single caller, so I folded it into
add_one_highpage_init.
I also removed an incorrect comment referring to the NUMA case; any NUMA
details have already been dealt with by the time online_page() is called.
[akpm@linux-foundation.org: fix indenting]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On big systems with lots of memory, don't print out too much during
bootup, and make it easy to find if it is continuous.
on 256G 8 sockets system will get
[ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
[ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
[ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
[ffffe20038700000-ffffe200387fffff] potential offnode page_structs
[ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
[ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
[ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
[ffffe20054700000-ffffe200547fffff] potential offnode page_structs
[ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
[ffffe20070700000-ffffe200707fffff] potential offnode page_structs
[ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
[ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
[ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
[ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
[ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
[ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
[ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
[ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
[ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
[ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
[ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
[ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
instead of a very long print out...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
"mm: make reserve_bootmem can crossed the nodes" provides new
reserve_bootmem(), let reserve_bootmem_generic() use that.
reserve_bootmem_generic() is used to reserve initramdisk, so this way
we can make sure even when bootloader or kexec load ranges cross the
node memory boundaries, reserve_bootmem still works.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86-pat:
generic: add ioremap_wc() interface wrapper
/dev/mem: make promisc the default
pat: cleanups
x86: PAT use reserve free memtype in mmap of /dev/mem
x86: PAT phys_mem_access_prot_allowed for dev/mem mmap
x86: PAT avoid aliasing in /dev/mem read/write
devmem: add range_is_allowed() check to mmap of /dev/mem
x86: introduce /dev/mem restrictions with a config option
This patch introduces a restriction on /dev/mem: Only non-memory can be
read or written unless the newly introduced config option is set.
The X server needs access to /dev/mem for the PCI space, but it doesn't need
access to memory; both the file permissions and SELinux permissions of /dev/mem
just make X effectively super-super powerful. With the exception of the
BIOS area, there's just no valid app that uses /dev/mem on actual memory.
Other popular users of /dev/mem are rootkits and the like.
(note: mmap access of memory via /dev/mem was already not allowed since
a really long time)
People who want to use /dev/mem for kernel debugging can enable the config
option.
The restrictions of this patch have been in the Fedora and RHEL kernels for
at least 4 years without any problems.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move dma_ops structure definition to pci-dma.c, where it
belongs.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in max_pfn_mapped.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: andreas.herrmann3@amd.com
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do simple memtest after init_memory_mapping
use find_e820_area_size to find all ram range that is not reserved.
and do some simple bits test to find some bad ram.
if find some bad ram, use reserve_early to exclude that range.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The AMD Fam10h CPUs support new Gigabyte page table entry for
mapping 1GB at a time. Use this for the kernel direct mapping.
Only done for 64bit because i386 does not support GB page tables.
This only applies to the data portion of the direct mapping; the
kernel text mapping stays with 2MB pages because the AMD Fam10h
microarchitecture does not support GB ITLBs and AMD recommends
against using GB mappings for code.
Can be disabled with disable_gbpages on the kernel command line
[ tglx@linutronix.de: simplify enable code ]
[ Yinghai Lu <yinghai.lu@sun.com>: boot fix on 256 GB RAM ]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
These new controls toggle experimental support for a new CPU feature,
the straightforward extension of largepages from the pmd level to the
pud level, which allows 1GB (kernel) TLBs instead of 2MB TLBs.
Turn it off by default, as this code has not been tested well enough yet.
Use the CONFIG_DIRECT_GBPAGES=y .config option or gbpages on the
boot line can be used to enable it. If enabled in the .config then
nogbpages boot option disables it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
recently the 64-bit allyesconfig bzImage kernel started spontaneously
rebooting during early bootup.
after a few fun hours spent with early init debugging, it turns out
that we've got this rather annoying limit on the size of the kernel
image:
#define KERNEL_TEXT_SIZE (40*1024*1024)
which limit my vmlinux just happened to pass:
text data bss dec hex filename
29703744 4222751 8646224 42572719 2899baf vmlinux
40 MB is 42572719 bytes, so my vmlinux was just 1.5% above this limit :-/
So it happily crashed right in head_64.S, which - as we all know - is
the most debuggable code in the whole architecture ;-)
So increase the limit to allow an up to 128MB kernel image to be mapped.
(should anyone be that crazy or lazy)
We have a full 4K of pagetable (level2_kernel_pgt) allocated for these
mappings already, so there's no RAM overhead and the limit was rather
pointless and arbitrary.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The early boot code maps KERNEL_TEXT_SIZE (currently 40MB) starting
from __START_KERNEL_map. The kernel itself only needs _text to _end
mapped in the high alias. On relocatible kernels the ASM setup code
adjusts the compile time created high mappings to the relocation. This
creates invalid pmd entries for negative offsets:
0xffffffff80000000 -> pmd entry: ffffffffff2001e3
It points outside of the physical address space and is marked present.
This starts at the virtual address __START_KERNEL_map and goes up to
the point where the first valid physical address (0x0) is mapped.
Zap the mappings before _text and after _end right away in early
boot. This removes also the invalid entries.
Furthermore it simplifies the range check for high aliases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>