This avoids a lockdep warning from:
if (DEBUG_LOCKS_WARN_ON(unlikely(!early_boot_irqs_enabled)))
return;
in trace_hardirqs_on_caller();
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make io_mapping_create_wc and io_mapping_free go through PAT to make sure
that there are no memory type aliases.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
io_mapping_create_wc should take a resource_size_t parameter in place of
unsigned long. With unsigned long, there will be no way to map greater than 4GB
address in i386/32 bit.
On x86, greater than 4GB addresses cannot be mapped on i386 without PAE. Return
error for such a case.
Patch also adds a structure for io_mapping, that saves the base, size and
type on HAVE_ATOMIC_IOMAP archs, that can be used to verify the offset on
io_mapping_map calls.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a function to check and keep identity maps in sync, when changing
any memory type. One of the follow on patches will also use this
routine.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- print test pattern instead of pattern number,
- show pattern as stored in memory,
- use proper priority flags,
- consistent use of u64 throughout the code
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix unexpected behaviour when pattern number is out of range
Current implementation provides 4 patterns for memtest. The code doesn't
check whether the memtest parameter value exceeds the maximum pattern number.
Instead the memtest code pretends to test with non-existing patterns, e.g.
when booting with memtest=10 I've observed the following
...
early_memtest: pattern num 10
0000001000 - 0000006000 pattern 0
...
0000001000 - 0000006000 pattern 1
...
0000001000 - 0000006000 pattern 2
...
0000001000 - 0000006000 pattern 3
...
0000001000 - 0000006000 pattern 4
...
0000001000 - 0000006000 pattern 5
...
0000001000 - 0000006000 pattern 6
...
0000001000 - 0000006000 pattern 7
...
0000001000 - 0000006000 pattern 8
...
0000001000 - 0000006000 pattern 9
...
But in fact Linux didn't test anything for patterns > 4 as the default
case in memtest() is to leave the function.
I suggest to use the memtest parameter as the number of tests to be
performed and to re-iterate over all existing patterns.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make more types of copies non-temporal
This change makes the following simple fix:
30d697f: x86: fix performance regression in write() syscall
A bit more sophisticated: we check the 'total' number of bytes
written to decide whether to copy in a cached or a non-temporal
way.
This will for example cause the tail (modulo 4096 bytes) chunk
of a large write() to be non-temporal too - not just the page-sized
chunks.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, enable future change
Add a 'total bytes copied' parameter to __copy_from_user_*nocache(),
and update all the callsites.
The parameter is not used yet - architecture code can use it to
more intelligently decide whether the copy should be cached or
non-temporal.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: defconfig change
Enable MCE in the 64-bit defconfig.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The ones which go only into struct genapic are de-inlined
by compiler anyway, so remove the inline specifier from them.
Afterwards, remove summit_setup_portio_remap completely as it
is unused.
Remove inline also from summit_cpu_mask_to_apicid, since it's
not worth it (it is used in struct genapic too).
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use BAD_APICID instead of 0xFF constants in summit_cpu_mask_to_apicid.
Also remove bogus comments about what we actually return.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This was changed to a physmap_t giving a clashing symbol redefinition,
but actually using a physmap_t consumes rather a lot of space on x86,
so stick with a private copy renamed with a voyager_ prefix and made
static. Nothing outside of the Voyager code uses it, anyway.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
native_usergs_sysret64 is described as
extern void native_usergs_sysret64(void)
so lets add ENDPROC here
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
NEXT_PAGE already has 'balign' so no
need to keep this redundant one.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While the introduction of __copy_from_user_nocache (see commit:
0812a579c9) may have been an improvement
for sufficiently large writes, there is evidence to show that it is
deterimental for small writes. Unixbench's fstime test gives the
following results for 256 byte writes with MAX_BLOCK of 2000:
2.6.29-rc6 ( 5 samples, each in KB/sec ):
283750, 295200, 294500, 293000, 293300
2.6.29-rc6 + this patch (5 samples, each in KB/sec):
313050, 3106750, 293350, 306300, 307900
2.6.18
395700, 342000, 399100, 366050, 359850
See w_test() in src/fstime.c in unixbench version 4.1.0. Basically, the above test
consists of counting how much we can write in this manner:
alarm(10);
while (!sigalarm) {
for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
write(f, buf, 256);
}
lseek(f, 0L, 0);
}
Note, there are other components to the write syscall regression
that are not addressed here.
Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Cleanup
Checkin be44d2aabc eliminates the use of
a 16-bit stack for espfix. However, at least one instruction remained
that only operated on the low 16 bits of %esp.
This is not a bug per se because the kernel stack is always an aligned
4K or 8K block. Therefore it cannot cross 64K boundaries; this code,
in fact, relies strictly on that fact.
However, it's a lot cleaner (and, for that matter, smaller) to operate
on the entire 32-bit register.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
CC: Zachary Amsden <zach@vmware.com>
CC: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: fix early crash on LinuxBIOS systems
Kevin O'Connor reported that Coreboot aka LinuxBIOS tries to put
mptable somewhere very high, well above max_low_pfn (below which
BIOSes generally put the mptable), causing a panic.
The BIOS will probably be changed to be compatible with older
Linus versions, but nevertheless the MP-spec does not forbid
an MP-table in arbitrary system RAM, so make sure it all
works even if the table is in an unexpected place.
Check physptr with max_low_pfn * PAGE_SIZE.
Reported-by: Kevin O'Connor <kevin@koconnor.net>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Stefan Reinauer <stepan@coresystems.de>
Cc: coreboot@coreboot.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make x86_quirks support more transparent. The highlevel
methods are now named:
extern void x86_quirk_pre_intr_init(void);
extern void x86_quirk_intr_init(void);
extern void x86_quirk_trap_init(void);
extern void x86_quirk_pre_time_init(void);
extern void x86_quirk_time_init(void);
This makes it clear that if some platform extension has to
do something here that it is considered ... weird, and is
discouraged.
Also remove arch_hooks.h and move it into setup.h (and other
header files where appropriate).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
Remove:
- pre_setup_arch_hook()
- mca_nmi_hook()
If needed they can be added back via an x86_quirk handler.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove unused/broken code
The Voyager subarch last built successfully on the v2.6.26 kernel
and has been stale since then and does not build on the v2.6.27,
v2.6.28 and v2.6.29-rc5 kernels.
No actual users beyond the maintainer reported this breakage.
Patches were sent and most of the fixes were accepted but the
discussion around how to do a few remaining issues cleanly
fizzled out with no resolution and the code remained broken.
In the v2.6.30 x86 tree development cycle 32-bit subarch support
has been reworked and removed - and the Voyager code, beyond the
build problems already known, needs serious and significant
changes and probably a rewrite to support it.
CONFIG_X86_VOYAGER has been marked BROKEN then. The maintainer has
been notified but no patches have been sent so far to fix it.
While all other subarchs have been converted to the new scheme,
voyager is still broken. We'd prefer to receive patches which
clean up the current situation in a constructive way, but even in
case of removal there is no obstacle to add that support back
after the issues have been sorted out in a mutually acceptable
fashion.
So remove this inactive code for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the CONFIG_X86_EXTENDED_PLATFORM help text to display the
32bit/64bit extended platform list. This is as suggested by Ingo.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: shai@scalex86.org
Cc: "Benzi Galili (Benzi@ScaleMP.com)" <benzi@scalemp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the sysdev_suspend/resume from the callee to the callers, with
no real change in semantics, so that we can rework the disabling of
interrupts during suspend/hibernation.
This is based on an earlier patch from Linus.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now nobody cares, but the suspend/resume code will eventually want
to suspend device interrupts without suspending the timer, and will
depend on this flag to know.
The modern x86 timer infrastructure uses the local APIC timers and never
shows up as a device interrupt at all, so it isn't affected and doesn't
need any of this.
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If BIOS hands over the control to OS in legacy xapic mode, select
legacy xapic related ops in the early apic probe and shift to x2apic
ops later in the boot sequence, only after enabling x2apic mode.
If BIOS hands over the control in x2apic mode, select x2apic related
ops in the early apic probe.
This fixes the early boot panic, where we were selecting x2apic ops,
while the cpu is still in legacy xapic mode.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this sparse warning:
arch/x86/mm/numa_32.c:197:24: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
arch/x86/kernel/machine_kexec_32.c:124:22: warning: Using plain integer as NULL pointer
arch/x86/kernel/traps.c:950:24: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As acpi_enter_sleep_state can fail, take this into account in
do_suspend_lowlevel and don't return to the do_suspend_lowlevel's
caller. This would break (currently) fpu status and preempt count.
Technically, this means use `call' instead of `jmp' and `jmp' to
the `resume_point' after the `call' (i.e. if
acpi_enter_sleep_state returns=fails). `resume_point' will handle
the restore of fpu and preempt count gracefully.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
- remove %ds re-set, it's already set in wakeup_long64
- remove double labels and alignment (ENTRY already adds both)
- use meaningful resume point labelname
- skip alignment while jumping from wakeup_long64 to the resume point
- remove .size, .type and unused labels
[v2]
- added ENDPROCs
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Bug fix on UP
Checkin 6ec68bff3c:
x86, mce: reinitialize per cpu features on resume
introduced a call to mce_cpu_features() in the resume path, in order
for the MCE machinery to get properly reinitialized after a resume.
However, this function (and its successors) was flagged __cpuinit,
which becomes __init on UP configurations (on SMP suspend/resume
requires CPU hotplug and so this would not be seen.)
Remove the offending __cpuinit annotations for mce_cpu_features() and
its successor functions.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: extend prefetch handling on 64-bit
Currently there's an extra is_prefetch() check done in do_sigbus(),
which we only do on 32 bits.
This is a last-ditch check before we terminate a task, so it's worth
giving prefetch instructions another chance - should none of our
existing quirks have caught a prefetch instruction related spurious
fault.
The only risk is if a prefetch causes a real sigbus, in that case
we'll not OOM but try another fault. But this code has been on
32-bit for a long time, so it should be fine in practice.
So do this on 64-bit too - and thus remove one more #ifdef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Removal of an #ifdef in fault_in_kernel_space(), by making
use of the new TASK_SIZE_MAX symbol which is now available
on 32-bit too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
define on 32-bit too. (mapped to TASK_SIZE)
This allows 32-bit code to make use of the (former-) TASK_SIZE64
symbol as well, in a clean way.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
do_page_fault() has this ugly #ifdef in its prototype:
#ifdef CONFIG_X86_64
asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
Replace it with 'dotraplinkage' which maps to exactly the above
construct: nothing on 32-bit and asmlinkage on 64-bit.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add oops-recursion check to 32-bit
Unify the oops state-machine, to the 64-bit version. It is
slightly more careful in that it does a recursion check
in oops_begin(), and is thus more likely to show the relevant
oops.
It also means that 32-bit will print one more line at the
end of pagefault triggered oopses:
printk(KERN_EMERG "CR2: %016lx\n", address);
Which is generally good information to be seen in partial-dump
digital-camera jpegs ;-)
The downside is the somewhat more complex critical path. Both
variants have been tested well meanwhile by kernel developers
crashing their boxes so i dont think this is a practical worry.
This removes 3 ugly #ifdefs from no_context() and makes the
function a lot nicer read.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refine/extend page fault related oops printing on 64-bit
- honor the pause_on_oops logic on 64-bit too
- print out NX fault warnings on 64-bit as well
- factor out the NX fault message to make it git-greppable and readable
Note that this means that we do the PF_INSTR check on 32-bit non-PAE
as well where it should not occur ... normally. Cannot hurt.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from notify_page_fault(). The function still
compiles to nothing in the !CONFIG_KPROBES case.
Introduce kprobes_built_in() and kprobe_fault_handler() helpers
to allow this - they returns 0 if !CONFIG_KPROBES.
No code changed:
text data bss dec hex filename
4618 32 24 4674 1242 fault.o.before
4618 32 24 4674 1242 fault.o.after
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from kmmio_fault() - we can do this by
providing default implementations for is_kmmio_active()
and kmmio_handler(). The compiler optimizes it all away
in the !CONFIG_MMIOTRACE case.
Also, while at it, clean up mmiotrace.h a bit:
- standard header guards
- standard vertical spaces for structure definitions
No code changed (both with mmiotrace on and off in the config):
text data bss dec hex filename
2947 12 12 2971 b9b fault.o.before
2947 12 12 2971 b9b fault.o.after
Cc: Pekka Paalanen <pq@iki.fi>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve page fault handling robustness
The 'PF_RSVD' flag (bit 3) of the page-fault error_code is a
relatively recent addition to x86 CPUs, so the 32-bit do_fault()
implementation never had it. This flag gets set when the CPU
detects nonzero values in any reserved bits of the page directory
entries.
Extend the existing 64-bit check for PF_RSVD in do_page_fault()
to 32-bit too. If we detect such a fault then we print a more
informative oops and the pagetables.
This unifies the code some more, removes an ugly #ifdef and improves
the 32-bit page fault code robustness a bit. It slightly increases
the 32-bit kernel text size.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Instead of an ugly, open-coded, #ifdef-ed vm86 related legacy check
in do_page_fault(), put it into the check_v8086_mode() helper
function and merge it with an existing #ifdef.
Also, simplify the code flow a tiny bit in the helper.
No code changed:
arch/x86/mm/fault.o:
text data bss dec hex filename
2711 12 12 2735 aaf fault.o.before
2711 12 12 2735 aaf fault.o.after
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: no functionality changed
Factor out the opcode checker into a helper inline.
The code got a tiny bit smaller:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4618 32 24 4674 1242 fault.o.after
And it got cleaner / easier to review as well.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, no code changed
Clean up various small details, which can be correctness checked
automatically:
- tidy up the include file section
- eliminate unnecessary includes
- introduce show_signal_msg() to clean up code flow
- standardize the code flow
- standardize comments and other style details
- more cleanups, pointed out by checkpatch
No code changed on either 32-bit nor 64-bit:
arch/x86/mm/fault.o:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4632 32 24 4688 1250 fault.o.after
the md5 changed due to a change in a single instruction:
2e8a8241e7f0d69706776a5a26c90bc0 fault.o.before.asm
c5c3d36e725586eb74f0e10692f0193e fault.o.after.asm
Because a __LINE__ reference in a WARN_ONCE() has changed.
On 32-bit a few stack offsets changed - no code size difference
nor any functionality difference.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
clean up vmi_read_cycles to use max()
Reported-b: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: future-proof the split_large_page() function
Linus noticed that split_large_page() is not safe wrt. the
PAT bit: it is bit 12 on the 1GB and 2MB page table level
(_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
table level (_PAGE_BIT_PAT).
Currently it is not a problem because we never set
_PAGE_BIT_PAT_LARGE on any of the large-page mappings - but
should this happen in the future the split_large_page() would
silently lift bit 12 into the lowlevel 4K pte and would start
corrupting the physical page frame offset. Not fun.
So add a debug warning, to make sure if something ever sets
the PAT bit then this function gets updated too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to prevent hard lockup on bad PMD permissions
If the PMD does not have the correct permissions for a page access,
but the PTE does, the spurious fault handler will mistake the fault
as a lazy TLB transaction. This will result in an infinite loop of:
fault -> spurious_fault check (pass) -> return to code -> fault
This patch adds a check and a warn on if the PTE passes the permissions
but the PMD does not.
[ Updated: Ingo Molnar suggested using WARN_ONCE with some text ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt found a bug in where in his modified kernel
ftrace was unable to modify the kernel text, due to the PMD
itself having been marked read-only as well in
split_large_page().
The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.
The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the
pagetable entry is a set of all ptes and hence cannot
'clone' any single protection attribute - the ptes can be any
mixture of protections.
With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.
Also update the comment.
This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.
[ Huang Ying also experienced problems in this area when writing
the EFI code, but the real bug in split_large_page() was not
realized back then. ]
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix time warps under vmware
Similar to the check for TSC going backwards in the TSC clocksource,
we also need this check for VMI clocksource.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
x86, mce: use force_sig_info to kill process in machine check
x86, mce: reinitialize per cpu features on resume
x86, rcu: fix strange load average and ksoftirqd behavior
Impact: clenaup
Linker script will put startup_32 at predefined
address so using startup_32 will not bloat the
code size.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clenaup
Linker script will put startup_32 at predefined
address so using ENTRY will not bloat the code
size.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We are in setup stage so we use GLOBAL
instead of ENTRY and do not increase code
size.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
There was an attempt to bring build-time checking for
missed ENTRY_X86/END_X86 and KPROBE... pairs. Using
them will add messy in code. Get just rid of them.
This commit could be easily restored if the need appear
in future.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the code is time critical and this entry is called
from other places we use ENTRY to have it globally defined
and especially aligned.
Contrary we have some snippets which are size
critical. So we use plane ".globl name; name:"
directive. Introduce GLOBAL macro for this.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:
BUG_ON(page_zone(start_page) != page_zone(end_page));
Once I knew this is what was happening, I added some annotations:
if (unlikely(page_zone(start_page) != page_zone(end_page))) {
printk(KERN_ERR "move_freepages: Bogus zones: "
"start_page[%p] end_page[%p] zone[%p]\n",
start_page, end_page, zone);
printk(KERN_ERR "move_freepages: "
"start_zone[%p] end_zone[%p]\n",
page_zone(start_page), page_zone(end_page));
printk(KERN_ERR "move_freepages: "
"start_pfn[0x%lx] end_pfn[0x%lx]\n",
page_to_pfn(start_page), page_to_pfn(end_page));
printk(KERN_ERR "move_freepages: "
"start_nid[%d] end_nid[%d]\n",
page_to_nid(start_page), page_to_nid(end_page));
...
And here's what I got:
move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
move_freepages: start_nid[1] end_nid[0]
My memory layout on this box is:
[ 0.000000] Zone PFN ranges:
[ 0.000000] Normal 0x00000000 -> 0x0081ff5d
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[8] active PFN ranges
[ 0.000000] 0: 0x00000000 -> 0x00020000
[ 0.000000] 1: 0x00800000 -> 0x0081f7ff
[ 0.000000] 1: 0x0081f800 -> 0x0081fe50
[ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
[ 0.000000] 1: 0x0081feda -> 0x0081fedb
[ 0.000000] 1: 0x0081fedd -> 0x0081fee5
[ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
[ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d
So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.
This patch:
Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
I think it makes fix-for-memmap-init not easy.
This patch moves all declaration to include/linux/mm.h
After this,
if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use static definition in include/linux/mm.h
else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use generic definition in mm/page_alloc.c
else
-> per-arch back end function will be called.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: Bugfix
The ifdef for the apic clear on shutdown for the 64bit intel thermal
vector was incorrect and never triggered. Fix that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix (with tolerant == 3)
do_exit cannot be called directly from the exception handler because
it can sleep and the exception handler runs on the exception stack.
Use force_sig() instead.
Based on a earlier patch by Ying Huang who debugged the problem.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
This fixes a long standing bug in the machine check code. On resume the
boot CPU wouldn't get its vendor specific state like thermal handling
reinitialized. This means the boot cpu wouldn't ever get any thermal
events reported again.
Call the respective initialization functions on resume
v2: Remove ancient init because they don't have a resume device anyways.
Pointed out by Thomas Gleixner.
v3: Now fix the Subject too to reflect v2 change
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
doc: mmiotrace.txt, buffer size control change
trace: mmiotrace to the tracer menu in Kconfig
mmiotrace: count events lost due to not recording
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: fix preemption bug
x86, olpc: fix model detection without OFW
x86, hpet: fix for LS21 + HPET = boot hang
x86: CPA avoid repeated lazy mmu flush
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86, ptrace, mm: fix double-free on race
Impact: build fix, cleanup
A couple of arch setup callbacks were mistakenly in apic_32.c, breaking
the build.
Also simplify the code a bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Cleanup; fix inappropriate macro use
ISA addresses on x86 are mapped 1:1 with the physical address space.
Since the ISA address space is only 24 bits (32 for VLB or LPC) it
will always fit in an unsigned int, and at least in the aha1542 driver
using a wider type would cause an undesirable promotion. Hence
explicitly cast the ISA bus addresses to unsigned int.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Move the 32-bit extended-arch APIC drivers to arch/x86/kernel/apic/
too, and rename apic_64.c to probe_64.c.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/ is getting a bit crowded, and the APIC
drivers are scattered into various different files.
Move them to arch/x86/kernel/apic/*, and also remove
the 'gen' prefix from those which had it.
Also move APIC related functionality: the IO-APIC driver,
the NMI and the IPI code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Now that all APIC code is consolidated there's nothing 'gen' about
apics anymore - so rename 'struct genapic' to 'struct apic'.
This shortens the code and is nicer to read as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- misc other cleanups that change the md5 signature
- consolidate global variables
- remove unnecessary __numaq_mps_oem_check() wrapper
- make numaq_mps_oem_check static
- update copyrights
- misc other cleanups pointed out by checkpatch
Signed-off-by: Ingo Molnar <mingo@elte.hu>