This reverts commit c33d4568ac.
Andrew Clayton and Hugh Dickins report that it's broken for them and
causes strange page table and slab corruption, and spontaneous reboots.
Let's get it right next time.
Cc: Andrew Clayton <andrew@rootshell.co.uk>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
EM64T CPUs have somewhat weird error reporting for non canonical RIPs in
SYSRET.
We can't handle any exceptions there because the exception handler would
end up running on the user stack which is unsafe.
To avoid problems any code that might end up with a user touched pt_regs
should return using int_ret_from_syscall. int_ret_from_syscall ends up
using IRET, which allows safe exceptions.
Cc: Ernie Petrides <petrides@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While testing kexec and kdump we hit problems where the new kernel would
freeze or instantly reboot. The easiest way to trigger it was to kexec a
kernel compiled for CONFIG_M586 on an athlon cpu. Compiling for CONFIG_MK7
instead would work fine.
The patch fixes a few problems with the kexec inline asm.
Signed-off-by: Chris Mason <mason@suse.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kbuild system takes advantage of an incorrect behavior in GNU make.
Once this behavior is fixed, all files in the kernel rebuild every time,
even if nothing has changed. This patch ensures kbuild works with both
the incorrect and correct behaviors of GNU make.
For more details on the incorrect behavior, see:
http://lists.gnu.org/archive/html/bug-make/2006-03/msg00003.html
Changes in this patch:
- Keep all targets that are to be marked .PHONY in a variable, PHONY.
- Add .PHONY: $(PHONY) to mark them properly.
- Remove any $(PHONY) files from the $? list when determining whether
targets are up-to-date or not.
Signed-off-by: Paul Smith <psmith@gnu.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
This reverts commit 13a229abc2.
Quoth Andi:
"After some consideration and feedback from various people it turns
out this wasn't that good an idea. It has some problems and needs
more work. Since it was only an optimization anyways it's best to
just back it out again for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit 9ec4b1f356 made kprobes not compile
without module support, so just make that clear in the Kconfig file.
Also, since it's marked EXPERIMENTAL, make that dependency explicit too.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The commit e2c0388866 added
setup_additional_cpus to setup.c but this is only defined if
CONFIG_HOTPLUG_CPU is set. This patch changes the #ifdef to reflect that.
Signed-off-by: Brian Magnuson <magnuson@rcn.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The previous experiment for using apicmaintimer on ATI systems didn't
work out very well. In particular laptops with C2/C3 support often
don't let it tick during idle, which makes it useless. There were also
some other bugs that made the apicmaintimer often not used at all.
I tried some other experiments - running timer over RTC and some other
things but they didn't really work well neither.
I rechecked the specs now and it turns out this simple change is
actually enough to avoid the double ticks on the ATI systems. We just
turn off IRQ 0 in the 8254 and only route it directly using the IO-APIC.
I tested it on a few ATI systems and it worked there. In fact it worked
on all chipsets (NVidia, Intel, AMD, ATI) I tried it on.
According to the ACPI spec routing should always work through the
IO-APIC so I think it's the correct thing to do anyways (and most of the
old gunk in check_timer should be thrown away for x86-64).
But for 2.6.16 it's best to do a fairly minimal change:
- Use the known to be working everywhere-but-ATI IRQ0 both over 8254
and IO-APIC setup everywhere
- Except on ATI disable IRQ0 in the 8254
- Remove the code to select apicmaintimer on ATI chipsets
- Add some boot options to allow to override this (just paranoia)
In 2.6.17 I hope to switch the default over to this for everybody.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SMP time selection originally ran after all CPUs were brought up because
it needed to know the number of CPUs to decide if it needs an MP safe
timer or not.
This is not needed anymore because we know present CPUs early.
This fixes a couple of problems:
- apicmaintimer didn't always work because it relied on state that was
set up time_init_gtod too late.
- The output for the used timer in early kernel log was misleading
because time_init_gtod could actually change it later. Now always
print the final timer choice
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It didn't set up the CPU possible map early enough, so the
option didn't actually work.
Noticed by Heiko Carstens
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
Old check for the IO-APIC watchdog during the timer check was wrong -
it obviously should only drop into this if the IO-APIC watchdog is used.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes x86-64 use the common X86_PM_TIMER Kconfig entry in drivers/acpi
And since PM timer is needed for correct timing on a lot of systems
now (e.g. AMD dual cores) and we often get bug reports from people
who forgot to set it make it depend on CONFIG_EMBEDDED. x86-64 had
this change before and it's a good thing.
I also fixed the description slightly to make this more clear.
Cc: len.brown@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Big Unisys systems have multiple clusters too, but they have an
synchronized TSC.
I'm using the SMBIOS to check for vendor == IBM.
Cc: Chris McDermott <lcm@us.ibm.com>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In previous versions of pci-gart.c, no_iommu was used to determine if IOMMU was
disabled in the GART DMA mapping functions. This changed in 2.6.16 and now
gart_xxx() functions are only called if gart is enabled. Therefore, uses of
no_iommu in the GART code are no longer necessary and can be removed.
Also, it removes double deceleration of no_iommu and force_iommu in pci.h and
proto.h, by removing the deceleration in pci.h.
Lastly, end_pfn off by one error.
Tested (along with patch 1/2) on dual opteron with gart enabled, iommu=soft,
and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit 9c869edac5 moved the i386 topology.c
file. That change broke x86-64 compiles, as it uses the same file.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Undo setting of CONFIG_DEBUG_INFO in the previous defconfig update. It
will make every build much slower and need more disk space and isn't a good
default.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't print KERN_INFO in the middle of a printk line.
printk(KERN_INFO "OEM ID: %s ",str);
is just above this. This is already fixed up in i386 copy.
Signed-off-by: Martin J. Bligh <mbligh@google.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously the numa hash code would be confused by holes in the node space
and stop early. This is the first part of the fix for the non boot issue
with empty nodes on Opterons.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Code was refusing good SRATs because about 12K got lost somewhere.
Allow less than 1MB of difference before rejecting it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
But do it after everything else to risk less from recursive
crashes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many laptops have problems with ticking the local APIC timer in C2/C3.
The code added earlier to use it by default on ATI didn't really work
for them. Don't enable it when the system supports C2/C3.
This doesn't fix the problem fully, but at least it's not worse than before.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This caused a sigreturn with bad argument on a preemptible kernel
to complain with
Debug: sleeping function called from invalid context at /home/lsrc/quilt/linux/include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1
Call Trace: {__might_sleep+190} {profile_task_exit+21}
{__do_exit+34} {do_wait+0}
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Along with that, also suppress the memory touching altogether when the
watchdog is not running, to eliminate needless crosstalk. Plus ad a call
to it to make things consistent (one could also consider removing the call
in enable_timer_nmi_watchdog()).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The early initialization of cpu_to_node code as it is now only updates the
cpu_to_node array, and does not update cpu_pda()->nodemember. This will
cause numa_node_id() to return 0 on systems where CPU 0 is not on Node 0.
This leads to a kernel panic in slab.c.
I've tested the patch below on a 16 processor x86_64 ES7000-600 server, and
no longer see the panic I saw with the original 2.6.16-rc3.
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't touch the non DMA members in the sg list in dma_map_sg in the IOMMU
Some drivers (in particular ST) ran into problems because they reused the sg
lists after passing them to pci_map_sg(). The merging procedure in the K8
GART IOMMU corrupted the state. This patch changes it to only touch the dma*
entries during merging, but not the other fields. Approach suggested by Dave
Miller.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We found a problem with x86_64 kernels with preemption enabled, where
having multiple tasks doing ptrace singlesteps around the same time will
cause the system to 'oops'. The problem seems that a task can get
preempted out of the do_debug() processing while it is running on the
DEBUG_STACK stack. If another task on that same cpu then enters do_debug()
and uses the same per-cpu DEBUG_STACK stack, the previous preempted tasks's
stack contents can be corrupted, and the system will oops when the
preempted task is context switched back in again.
The typical oops looks like the following:
Unable to handle kernel paging request at ffffffffffffffae RIP: <ffffffff805452a1>{thread_return+34}
PGD 103027 PUD 102429067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in:
Pid: 3786, comm: ssdd Not tainted 2.6.15.2 #1
RIP: 0010:[<ffffffff805452a1>] <ffffffff805452a1>{thread_return+34}
RSP: 0018:ffffffff80824058 EFLAGS: 000136c2
RAX: ffff81017e12cea0 RBX: 0000000000000000 RCX: 00000000c0000100
RDX: 0000000000000000 RSI: ffff8100f7856e20 RDI: ffff81017e12cea0
RBP: 0000000000000046 R08: ffff8100f68a6000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff81017e12cea0 R12: ffff81000c2d53e8
R13: ffff81017f5b3be8 R14: ffff81000c0036e0 R15: 000001056cbfc899
FS: 00002aaaaaad9b00(0000) GS:ffffffff80883800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffffffffae CR3: 00000000f6fcf000 CR4: 00000000000006e0
Process ssdd (pid: 3786, threadinfo ffff8100f68a6000, task ffff8100f7856e20)
Stack: ffffffff808240d8 ffffffff8012a84a ffff8100055f6c00 0000000000000020
0000000000000001 ffff81000c0036e0 ffffffff808240b8 0000000000000000
0000000000000000 0000000000000000
Call Trace: <#DB>
<ffffffff8012a84a>{try_to_wake_up+985}
<ffffffff8012c0d3>{kick_process+87}
<ffffffff8013b262>{signal_wake_up+48}
<ffffffff8013b5ce>{specific_send_sig_info+179}
<ffffffff80546abc>{_spin_unlock_irqrestore+27}
<ffffffff8013b67c>{force_sig_info+159}
<ffffffff801103a0>{do_debug+289} <ffffffff80110278>{sync_regs+103}
<ffffffff8010ed9a>{paranoid_userspace+35}
Unable to handle kernel paging request at 00007fffffb7d000 RIP: <ffffffff8010f2e4>{show_trace+465}
PGD f6f25067 PUD f6fcc067 PMD f6957067 PTE 0
Oops: 0000 [2] PREEMPT SMP
This patch disables preemptions for the task upon entry to do_debug(), before
interrupts are reenabled, and then disables preemption before exiting
do_debug(), after disabling interrupts. I've noticed that the task can be
preempted either at the end of an interrupt, or on the call to
force_sig_info() on the spin_unlock_irqrestore() processing. It might be
better to attempt to code a fix in entry.S around the code that calls
do_debug().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
The IBM Summit 3 chipset doesn't implement the HPET timer replacement
option. Since the current Linux code relies on it use a mixed mode with
both PIT for the interrupt and HPET counters for the time keeping. That
was already implemented, but didn't work properly because it was still
using the last interrupt offset in HPET. This resulted in x460 not
booting. Fix this up by using the free running HPET counter.
Shouldn't affect any other machine because they either use full HPET mode
or no HPET at all.
TBD needs a similar 32bit fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The *at patches introduced fstatat and, due to inusfficient research, I
used the newfstat functions generally as the guideline. The result is that
on 32-bit platforms we don't have all the information needed to implement
fstatat64.
This patch modifies the code to pass up 64-bit information if
__ARCH_WANT_STAT64 is defined. I renamed the syscall entry point to make
this clear. Other archs will continue to use the existing code. On x86-64
the compat code is implemented using a new sys32_ function. this is what
is done for the other stat syscalls as well.
This patch might break some other archs (those which define
__ARCH_WANT_STAT64 and which already wired up the syscall). Yet others
might need changes to accomodate the compatibility mode. I really don't
want to do that work because all this stat handling is a mess (more so in
glibc, but the kernel is also affected). It should be done by the arch
maintainers. I'll provide some stand-alone test shortly. Those who are
eager could compile glibc and run 'make check' (no installation needed).
The patch below has been tested on x86 and x86-64.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits in
the node's cpumask when a cpu goes down or cpu bring up is cancelled. This
is buggy since there are pieces of common code where the cpumask is checked
in the cpu down code path to decide on things (like in the slab down path).
PPC does the right thing, but x86_64 and ia64 don't (This was the reason
Sonny hit upon a slab bug during cpu offline on ppc and could not reproduce
on other arches). This patch fixes it for x86_64. I won't attempt ia64 as
I cannot test it.
Credit for spotting this should go to Alok.
(akpm: this was applied, then reverted. But it's OK now because we now use
for_each_cpu() in the right places).
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 10f4dc8b27.
Quoth Andi Kleen:
"Kiran decided that it makes the problem worse than it was before.
Fixing it fully requires more work which is too much for 2.6.16. So
please revert that commit for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains a printk reorder to remove the current problem of
displaying "PCI-DMA: Disabling IOMMU." and then "PCI-DMA: using GART
IOMMU" 20 lines later in dmesg.
It also constains a printk reorder in swiotlb to state swiotlb
enablement prior to describing the location of the bounce buffers, and a
printk reorder to state gart enablement prior to describing the
aperature.
Also constains a whitespace cleanup in arch/x86_64/kernel/setup.c
Tested (along with patch 2/2) on dual opteron with gart enabled,
iommu=soft, and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hack for 2.6.16. In 2.6.17 all code that uses NR_CPUs should
be audited and changed to only touch possible CPUs.
Don't mark the reference per cpu data init data (so it stays
around after boot) and point all impossible CPUs to it. This way
they reference some valid - although shared memory. Usually
this is only initialization like INIT_LIST_HEADs and there
won't be races because these CPUs never run. Still somewhat hackish.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's bad juju to touch the APIC when it hasn't been enabled.
I also moved ack_bad_irq for x86-64 out of line following i386.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Checking of the validity of pointers should be consistently done before
dereferencing the pointer.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conditionalize two unwind directives to match other similarly
conditional code.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Cc: Jim Houston <jim.houston@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some broken motherboards (at least one NForce3 based AMD64 laptop)
the PIT timer runs at a incorrect frequency. This patch adds a new
option "apicpmtimer" that allows to use the APIC timer and calibrate it
using the PMTimer. It requires the earlier patch that allows to run the
main timer from the APIC.
Specifying apicpmtimer implies apicmaintimer.
The option defaults to off for now.
I tested it on a few systems and the resulting APIC timer frequencies
were usually a bit off, but always <1%, which should be tolerable.
TBD figure out heuristic to enable this automatically on the affected
systems TBD perhaps do it on all NForce3s or using DMI?
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kprobes cannot deal with the funny calling conventions when it
runs on a different stack when it returns. If someone wants
to instrument context switch they can add a probe to schedule()
instead.
Cc: jkenisto@us.ibm.com, prasanna@in.ibm.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Align the start of the per-cpu section to the configured number of bytes in a
cache line. This stops a BUG_ON() from triggering in load_module() when
DEFINE_PER_CPU() is used in a module and the section isn't cacheline-aligned.
Rusty also found this and sent a patch in a while ago
(http://lkml.org/lkml/2004/10/19/17), I don't know what came of that.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[ AK: I redid Kevin's fix to be simpler, but the idea and original
analysis of the problem is from Kevin]
This avoid allocation failures on some SATA systems like Nvidia CK8
when the IOMMU gets fragmented. Modern SATA devices have quite large queues
(128 entries) and the FS with ext2/3 is good enough now that it often
passes whole 128 page sg lists down to the driver. These require
512K of continuous free space in the IOMMU aperture to map when merged.
When the IOMMU is fragmented this could lead to spurious IO errors
due to failing mappings.
Short term fix is to just try to map the SG list again unmerged
page by page - this way fragmentation doesn't matter anymore.
The code for that was already there, but it just wasn't enabled for the
merge case.
According to Kevin at least the Nvidia device doesn't seem to benefit
from merging much anyways, so the only slowdown is from trying
to do an unnecessary merge attempt.
Kevin plans to implement better fragmentation avoidance in the future,
but that wouldn't be 2.6.16 material.
TBD: should add some statistic counters to count how often that really
happens.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I broke this earlier when moving the patch from i386 to x86-64.
Need to return the virtual address here, not the physical address.
This fixes some boot time crashes on x86-64.
Cc: gregkh@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Check if the processor/memory affinity entries are long enough
according to the ACPI 3.0 spec.
- Ignore memory affinity entries that define a zero length region.
All based on BIOS issues found in the field @)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
attached patch is 2 more cases i found via running the reference_init.pl
script. These were easy to spot just knowing the file names. There is
one another about init/main.c that i cant exactly zero in. (partly
because i dont know how to interpret the data thats spewed out of the tool).
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It has been enabled by default for some time now and is cheap enough
so it doesn't matter anyways.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits
in the node's cpumask when a cpu goes down or cpu bring up is cancelled.
This is buggy since there are pieces of common code where the cpumask is
checked in the cpu down code path to decide on things (like in the slab
down path). PPC does the right thing, but x86_64 and ia64 don't (This
was the reason Sonny hit upon a slab bug during cpu offline on ppc and
could not reproduce on other arches). This patch fixes it for x86_64.
I won't attempt ia64 as I cannot test it.
Credit for spotting this should go to Alok.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They cause quite bad performance regressions on Netburst
This is temporary until we can get new optimized functions
for these CPUs.
This undoes changes that were done in 2.6.15 and in 2.6.16-rc1,
essentially bringing the code back to 2.6.14 level. Only change
is I renamed the X86_FEATURE_K8_C flag to X86_FEATURE_REP_GOOD
and fixed the check for the flag and also fixed some comments.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This avoids BUG_ONs in the low level allocator when an illegal
GFP mask is added.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At resume time, TSC's value or something similar might be changed a lot
against suspend time. This could make system gets a very big lost ticks.
See http://bugzilla.kernel.org/show_bug.cgi?id=5825
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They all have problems with IRQ 0 routing, so just use the APIC on them.
Can be overwritten with "noapicmaintimer"
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another piece from the no-idle-tick patch.
This can be enabled with the "apicmaintimer" option.
This is mainly useful when the PIT/HPET interrupt is unreliable.
Note there are some systems that are known to stop the APIC
timer in C3. For those it will never work, but this case
should be automatically detected.
It also only works with PM timer right now. When HPET is used
the way the main timer handler computes the delay doesn't work.
It should be a bit more efficient because there is one less
regular interrupt to process on the boot processor.
Requires earlier bugfix from Venkatesh
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A kprobe executes IRET early and that could cause NMI recursion
and stack corruption.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This assembly version is measurably faster than the generic version in
lib/iomap_copy.c.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to use the compat function here.
Pointer out by Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle more bogus MCFG entries
Some Asus P4 boards seem to have broken MCFG tables with
only a single entry for busses 0-0. Special case these
and assume they mean all busses can be accessed.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix a typo/mis-merge in one of the previous patches.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the ability to disability packet split at compile time and use the legacy receive path on PCI express hardware. Made this a CONFIG option and modified the Kconfig, to reflect the new option.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: John Ronciak <john.ronciak@intel.com>
Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Add x86-64 specific memory hot-add functions, Kconfig options,
and runtime kernel page table update functions to make
hot-add usable on x86-64 machines. Also, fixup the nefarious
conditional locking and exports pointed out by Andi.
Tested on Intel and IBM x86-64 memory hot-add capable systems.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another try at this.
For 32bit follow the 32bit implementation from Ingo -
mappings are growing down from the end of stack now
and vary randomly by 1GB.
Randomized mappings for 64bit just vary the normal mmap break
by 1TB. I didn't bother implementing full flex mmap for 64bit
because it shouldn't be needed there.
Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... as they are no longer needed. Since there were hard-coded numbers in the
file, the patch also adds a mechanism to avoid these (otherwise potential
future changes would again and again require adjusting these numbers).
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The comparison of the initrd start address against "&_end" is
unnecessary and incorrect. Make it match the x86 code that just
compares the passed-in arguments.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For not fully explained reasons it broke mem=... on several setups.
Also minor cleanup.
Cc: axboe@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To avoid mistakes.
I got a few reports where people got broken timing because they didn't
have the PMTIMER fallback.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This unbreaks recursive kprobes which didn't work anymore
due to an earlier patch which converted the debug entry point
to use an IST.
This also allows nesting of the debug entry point too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o This fix was posted for i386 long back. Posting it for x86_64.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110380103229830&w=2
o This patch fixes the problem of secondary cpus boot up. This situation
is faced when kernel is built for default locations like 16MB and
onwards. In this configuration, only primary cpu (BP) comes and
secondary cpus don't boot.
o Problem occurs because in trampoline code, lgdt is not able to load the
GDT as it happens to be situated beyond 16MB. This is due to the fact
that cpu is still in real mode and default operand size is 16bit.
o This patch uses lgdtl instead of lgdt to force operand size to 32
instead of 16.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... reducing the amount of changes Xen has to do.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The explicit and implicit calls to setup_early_printk() were passing
inconsistent arguments.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously they would be only allocated before the kernel text at
1MB. This limited the maximum supported memory to 128GB.
Now allow the e820 allocator to put them everywhere. Try
to put them beyond any DMA zones to avoid filling them up.
This should free some GFP_DMA memory compared to earlier kernels.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
hard_smp_processor_id would return the local APIC id instead
of the Linux processor id. On big systems they are often
not identical. safe_smp_processor_id is just a wrapper
around it that does the necessary conversions.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove support for obsolete hardware and cleanup.
- Remove checks for non integrated APICs
- Replace apic_write_around with apic_write.
- Remove apic_read_around
- Remove APIC version reads used by old workarounds
- Remove old workaround for Simics
- Fix indentation
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When building in a separate objtree, file names produced by BUG() & Co. can
get fairly long; printing only the first 50 characters may thus result in
(almost) no useful information. The following change makes it so that rather
the last 50 characters of the filename get printed.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Especially under Xen, where the console cannot be adjusted to more than 25
lines, it is fairly important that the information displayed during a panic
is as compact as possible. Below adjustments work towards that.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Due to a broken condition, the body of the loop that is intended to wait for
the Update-In-Progress bit to get set and then cleared again was never
entered; in fact, the entire loop was optimized out by the compiler. Here is
a change to fix the condition (and to also move the initialization of locals
out of the spin lock protected region).
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was only needed for APM
Pointed out by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
X86_FEATURE_K8_C was a synthetic Linux CPUID flag that was used for some
code optimizations in Opteron C stepping or later. But support for pre C
stepping optimizations has been removed, so this isn't needed anymore.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Saves about ~18K .text in defconfig
There would be more optimization potential, but that's for later.
Suggestion originally from Bill Irwin.
Fix from Andy Whitcroft.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They used to be used by the reboot code, but not anymore.
Noticed by Jan Beulich
Cc: JBeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Currently, during kexec reboot, IOAPIC is re-programmed back to virtual
wire mode if there was an i8259 connected to it. This enables getting
timer interrupts in second kernel in legacy mode.
o After putting into virtual wire mode, IOAPIC delivers the i8259 interrupts
to CPU0. This works well for kexec but not for kdump as we might crash
on a different CPU and second kernel will not see timer interrupts.
o This patch modifies the redirection table entry to deliver the timer
interrupts to the cpu we are rebooting (instead of hardcoding to zero).
This ensures that second kernel receives timer interrupts even on a
non-boot cpu.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce vSMP arch to the kernel.
This patch:
1. Adds CONFIG_X86_VSMP
2. Adds machine specific macros for local_irq_disabled, local_irq_enabled
and irqs_disabled
3. Writes to the vSMP CTL device to indicate kernel compiled with CONFIG_VSMP
Signed-off-by: Ravikiran Thirumalai <kiran@scalemp.com>
Signed-off-by: Shai Fultheim <shai@scalemp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently we attempt to restore virtual wire mode on reboot, which only
works if we can figure out where the i8259 is connected. This is very
useful when we are kexec another kernel and likely helpful to an peculiar
BIOS that make assumptions about how the system is setup.
Since the acpi MADT table does not provide the location where the i8259 is
connected we have to look at the hardware to figure it out.
Most systems have the i8259 connected the local apic of the cpu so won't be
affected but people running Opteron and some serverworks chipsets should be
able to use kexec now.
In addition this patch removes the hard coded assumption that the io_apic
that delivers isa interrups is always known to the kernel as io_apic 0.
There does not appear to be anything to guarantee that assumption is true.
And From: Vivek Goyal <vgoyal@in.ibm.com>
A minor fix to the patch which remembers the location of where i8259 is
connected. Now counter i has been replaced by apic. counter i is having
some junk value which was leading to non-detection of i8259 connected to
IOAPIC.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Setting RF (resume flag) allows a debugger to resume execution after a code
breakpoint without tripping the breakpoint again. It is reset by the CPU
after executing one instruction.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was set as an NMI, but the NMI bit always forces an interrupt
to end up at vector 2. So it was never used. Remove.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix
CC arch/x86_64/kernel/nmi.o
linux/arch/x86_64/kernel/nmi.c: In function ???check_nmi_watchdog???:
linux/arch/x86_64/kernel/nmi.c:155: warning: statement with no effect
on Uniprocessor builds.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch uses a static PDA array early at boot and reallocates processor PDA
with node local memory when kmalloc is ready, just before pda_init.
The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
that cpu is called (to set the static per-cpu areas offset table etc)
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch enables early intialization of cpu_to_node.
apicid_to_node is built by reading the SRAT table, from acpi_numa_init with
ACPI_NUMA and k8_scan_nodes with K8_NUMA.
x86_cpu_to_apicid is built by parsing the ACPI MADT table, from acpi_boot_init.
We combine these two tables and setup cpu_to_node.
Early intialization helps the static per_cpu_areas in getting pages from
correct node.
Change since last release:
Do not initialize early init_cpu_to_node for faking node cases.
Patch tested on TYAN dual core 4P board with K8 only, ACPI_NUMA.
Tested on EM64T NUMA. Also tested with numa=off, numa=fake, and running
a kernel compiled with NUMA on a regular EM64 2 way SMP.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The real vsyscall .text addresses are not mapped when the alternative()
replacement runs early, so use some black magic to access them using
the direct mapping.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They already do this in hardware and the Linux algorithm
actually adds errors.
Cc: mingo@elte.hu
Cc: rohit.seth@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Apic id is in most significant 8 bits of APIC_ID register. Current code
is trying to write apic id to least significant 8 bits. This patch fixes
it.
o This fix enables booting uni kdump capture kernel on a cpu with non-zero
apic id.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove exports that are already exported from the object's source file.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These functions are inlines and shouldn't be exported.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Remove optimization for old B stepping Opteron
- Make the fast path for copies with a multiple of eight length faster.
- Minor instruction rearrangement to hopefully avoid a pipeline
stall or two.
- Add comment about errata to consider.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AK: I hacked Muli's original patch a lot and there were a lot
of changes - all bugs are probably to blame on me now.
There were also some changes in the fall back behaviour
for swiotlb - in particular it doesn't try to use GFP_DMA
now anymore. Also all DMA mapping operations use the
same core dma_alloc_coherent code with proper fallbacks now.
And various other changes and cleanups.
Known problems: iommu=force swiotlb=force together breaks
needs more testing.
This patch cleans up x86_64's DMA mapping dispatching code. Right now
we have three possible IOMMU types: AGP GART, swiotlb and nommu, and
in the future we will also have Xen's x86_64 swiotlb and other HW
IOMMUs for x86_64. In order to support all of them cleanly, this
patch:
- introduces a struct dma_mapping_ops with function pointers for each
of the DMA mapping operations of gart (AMD HW IOMMU), swiotlb
(software IOMMU) and nommu (no IOMMU).
- gets rid of:
if (swiotlb)
return swiotlb_xxx();
- PCI_DMA_BUS_IS_PHYS is now checked against the dma_ops being set
This makes swiotlb faster by avoiding double copying in some cases.
Signed-Off-By: Muli Ben-Yehuda <mulix@mulix.org>
Signed-Off-By: Jon D. Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Broken BIOS on Iwill 8way systems reports these and it causes the bootmem
allocator to crash. Add a sanity check if all the PXMs in the
SRAT table cover all memory as reported by e820. If the sanity
check fails the SRAT is rejected and the code will fall back
to discover the NUMA topology using the K8 northbridge registers
when applicable.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a new notifier chain that is called with IDLE_START
when a CPU goes idle and IDLE_END when it goes out of idle.
The context can be idle thread or interrupt context.
Since we cannot rely on MONITOR/MWAIT existing the idle
end check currently has to be done in all interrupt
handlers.
They were originally inspired by the similar s390 implementation.
They have a variety of applications:
- They will be needed for CONFIG_NO_IDLE_HZ
- They can be used for oprofile to fix up the missing time
in idle when performance counters don't tick.
- They can be used for better C state management in ACPI
- They could be used for microstate accounting.
This is just infrastructure so far, no users.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix off by one when checking if the machine has enougn memory to need IOMMU
This caused the IOMMUs to be needlessly enabled for mem=4G
Based on a patch from Jon Mason
Signed-off-by: jdmason@us.ibm.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Whenever we see that a CPU is capable of C3 (during ACPI cstate init), we
disable local APIC timer and switch to using a broadcast from external timer
interrupt (IRQ 0).
Patch below adds the code for x86_64.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the finer control of local APIC timer. We cannot provide a sub-jiffy
control like this when we use broadcast from external timer in place of
local APIC. Instead of removing this only on systems that may end up using
broadcast from external timer (due to C3), I am going the
"I'm feeling lucky" way to remove this fully. Basically, I am not sure about
usefulness of this code today. Few other architectures also don't seem to
support this today.
If you are using profiling and fine grained control and don't like this going
away in normal case, yell at me right now.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I would like to throw out a suggestion for a possible change in the way that
the debug register traps are handled in do_debug() when the trap occurs
in kernel-mode.
In the x86_64 version of do_debug(), the code will skip around sending
a SIGTRAP to the current task if the trap occurred while in kernel mode.
On the i386-side of things, if the access happens to occur in kernel mode
(say during a read(2) of user's buffer that matches the address of a
debug register trap), then the do_debug() routine for i386 will go ahead
and call send_sigtrap() and send the SIGTRAP signal. The send_sigtrap()
code will also set the info.si_addr to NULL in this case (even though I
don't understand why, since the SIGTRAP siginfo processing doesn't use
the si_addr field...).
So I would like to suggest that the x86_64 do_debug() routine also
follow this type of behavior and have it go ahead and send the
SIGTRAP signal to the current task, even if the debug register trap
happens to have occurred in kernel mode. I have taken a stab at
a patch for this change below. (It includes the i386-ish change
for setting si_addr to NULL when the trap occurred in kernel mode.)
It seems like a useful feature to be able to 'watch' a user location that
might also be modified in the kernel via a system service call, and have the
debugger report that information back to the user, rather than to just
silently ignore the trap.
Additionally, I realize that users that pull in a kernel debugger such as
KGDB into their kernel might want to remove this change below when they add
in KGDB support. However, they could alternatively look at the current
task's thread.debugreg[] values to see if the trap occurred due to KGDB
or instead because of a user-space debugger trap, and still honor the
user SIGTRAP processing (instead of the KGDB breakpoint processing)
if the trap matches up with the thread.debugreg[] registers.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Much better to deal with these than with the magic numbers.
And remove the comment describing the bits - kernel source
is no replacement for an architecture manual.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't need to do the vmalloc check for the module range because its
PML4 is shared with the kernel text.
Also removed an unnecessary TLB flush.
Pointed out by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is on the same lines as Zachary Amsden's i386 GDT page alignemnt
patch in -mm, but for x86_64.
Patch to align and pad x86_64 GDT on page boundries.
[AK: some minor cleanups and fixed incorrect TLS initialization
in CPU init.]
Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This might help on distributions that use a 32bit biarch compiler.
First pass -m64 by default.
Secondly add some more .code32s because at least the Ubuntu biarch
32bit as called by gcc doesn't seem to handle -m64 -m32 as generated
by the Makefile without such assistance.
And finally make sure the linker script can be preprocessed
with a 32bit cpp.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attempt to avoid overflow in __delay caused varying precision
on different CPUs depending on differences in the CPU speed.
We should be able to do this multiplication with out overflowing
provided the
cpu is running at less than about 128 GHz. xloops < 20000 * 0x10c6.
loops_per_jiffy * HZ <= cpu_clock_speed. So if the cpu clock speed
< 2^64/(20000 * 0x10c6) = 2^64/ 51E6CC0 < 2^64/2^27 = 2^37 = 128G we
will not overflow the calculation.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we don't know the node a PCI bus is connected to return -1.
This matches the generic code.
Noticed by Ravikiran G Thirumalai <kiran@scalex86.org>
Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A lot of Opteron BIOS just pass 10 in all SLIT entries (10 is the
normalized unit). This is actually worse than the default heuristic
because it leads to pci_distance not knowing the difference between
local and remote nodes anymore. This messes up some NUMA
heuristics in generic code.
In this case it's better to fall back to the default heuristic
which just does nodea == nodeb ? 10 : 20.
This patch does some basic sanity checking on the SLIT and only accepts
the SLIT when it passes.
Invariants enforced are:
- Node to itself shall be 10
- Any other distance shouldn't be 10
- Distances smaller than 10 are illegal
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some people need it now on 64bit so reuse the i386 code for
x86-64. This will be also useful for future bug workarounds.
It is a bit simplified there because there is no need
to do it very early on x86-64. This means it doesn't need
early ioremap et.al. We run it as a core initcall right now.
I hope it's not needed for early setup.
I added a general CONFIG_DMI symbol in case IA64 or someone
else wants to reuse the code later too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was a backup file that somehow made it into the official
tree. Never used for anything. Remove.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The introduction of call_softirq switching to the interrupt stack several
releases earlier resulted in a problem with the code in show_trace, which
assumes that it can pick the previous stack pointer from the end of the
interrupt stack.
Cc: Andi Kleen <ak@muc.de>
Cc: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Be more careful with TF handling to fix some copy protection codes in wine
patch originally for i386 by Linus, then ported to x86_64 by Andi Kleen
see: [PATCH] x86_64: Some fixes for single step handling
commit: be61bff789
But it was never applied to the ia32 emulation code which breaks some
copy-protection schemes under wine when running on x86_64.
Signed-off-by: Peter Beutner <p.beutner@gmx.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
So why are we calling smp_send_stop from machine_halt?
We don't.
Looking more closely at the bug report the problem here
is that halt -p is called which triggers not a halt but
an attempt to power off.
machine_power_off calls machine_shutdown which calls smp_send_stop.
If pm_power_off is set we should never make it out machine_power_off
to the call of do_exit. So pm_power_off must not be set in this case.
When pm_power_off is not set we expect machine_power_off to devolve
into machine_halt.
So how do we fix this?
Playing too much with smp_send_stop is dangerous because it
must also be safe to be called from panic.
It looks like the obviously correct fix is to only call
machine_shutdown when pm_power_off is defined. Doing
that will make Andi's assumption about not scheduling
true and generally simplify what must be supported.
This turns machine_power_off into a noop like machine_halt
when pm_power_off is not defined.
If the expected behavior is that sys_reboot(LINUX_REBOOT_CMD_POWER_OFF)
becomes sys_reboot(LINUX_REBOOT_CMD_HALT) if pm_power_off is NULL
this is not quite a comprehensive fix as we pass a different parameter
to the reboot notifier and we set system_state to a different value
before calling device_shutdown().
Unfortunately any fix more comprehensive I can think of is not
obviously correct. The core problem is that there is no architecture
independent way to detect if machine_power will become a noop, without
calling it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I noticed that some lowlevel send_IPI_mask helpers had a hotplug/preempt
race whereupon the cpu_online_map was read before disabling preemption;
...
cpumask_t mask = cpu_online_map;
int cpu = get_cpu();
cpu_clear(cpu, mask);
...
But then i realised that there is no need for these lowlevel functions to
be going through all this trouble when all the callers are already made
hotplug/preempt safe.
Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is one CPU here whose MCE bank count is 6. This patch increases
x86_64's MCE bank count.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following is probably a good idea given that the atomic_set() isn't
a barrier here either.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This
- switches the INT3 handler to run on an IST stack (to cope with
breakpoints set by a kernel debugger on places where the kernel's
%gs base hasn't been set up, yet); the IST stack used is shared with
the INT1 handler's
[AK: this also allows setting a kprobe on the interrupt/exception entry
points]
- allows nesting of INT1/INT3 handlers so that one can, with a kernel
debugger, debug (at least) the user-mode portions of the INT1/INT3
handling; the nesting isn't actively enabled here since a kernel-
debugger-free kernel doesn't need it
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Print bits for RDTSCP, SVM, CR8-LEGACY.
Also now print power flags on i386 like x86-64 always did.
This will add a new line in the 386 cpuinfo, but that shouldn't
be an issue - did that in the past too and I haven't heard
of any breakage.
I shrunk some of the fields in the i386 cpuinfo_x86 to chars
to make up for the new int "x86_power" field. Overall it's
smaller than before.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define it for i386 too.
This is a synthetic flag that signifies that the CPU's TSC runs
at a constant P state invariant frequency.
Fix up the logic on x86-64/i386 to set it on all known CPUs.
Use the AMD defined bit to set it on future AMD CPUs.
Cc: venkatesh.pallipadi@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Was only used by the floppy driver to work around some ancient
hardware bug that should never occur on any 64bit system.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most users don't need it so no need to waste memory.
This means an user has to specify the appropiate number of
hotplug CPUs on the command line with additional_cpus=...
or fix their BIOS to follow the convention in
Documentation/x86-64/cpu-hotplug-spec
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adjust page fault protection error check before considering it to be
a vmalloc synchronization candidate.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure no iret can fault without attached recovery code.
Cannot happen in the normal case, but might be useful
with kernel debuggers
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since a double fault always implies that kernel data structures are
corrupt, this fault should neither be handed to user mode handling,
nor should the handler allow resuming the faulting code stream (since
architecturally this isn't a fault, but an abort).
Note that this slightly depends on the previously submitted patch
adjusting the prototype of notify_die() (a compiler warning will result
without that other patch).
AK: Removed obsolete CONFIG_CHECKING code, added comments
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adjusts things so that handlers of the die() notifier will have
sufficient information about the trap currently being handled. It also
adjusts the notify_die() prototype to (again) match that of i386.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Other than apparently commonly assumed, the bound instruction does not
require the corresponding IDT entry to have DPL 3.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As a follow-up to the introduction of CONFIG_UNWIND_INFO, this
separates the generation of frame unwind information for x86-64 from
that of full debug information.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on the documentation recently posted by Richard Brunner.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Frame unwind information was still incorrect for ia32_ptregs_common
(sorry, my fault), and could be improved for some of the other entry
points.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
arch: Use <linux/capability.h> where capable() is used.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a window where a probe gets removed right after the probe is hit
on some different cpu. In this case probe handlers can't find a matching
probe instance related to break address. In this case we need to read the
original instruction at break address to see if that is not a break/int3
instruction and recover safely.
Previous code had a bug where we were not checking for the above race in
case of reentrant probes and the below patch fixes this race.
Tested on IA64, Powerpc, x86_64.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch (against 2.6.15-rc5-mm3) fixes a kprobes build break
due to changes introduced in the kprobe locking in 2.6.15-rc5-mm3. In
addition, the patch reverts back the open-coding of kprobe_mutex.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently arch_remove_kprobes() is only implemented/required for x86_64 and
powerpc. All other architecture like IA64, i386 and sparc64 implementes a
dummy function which is being called from arch independent kprobes.c file.
This patch removes the dummy functions and replaces it with
#define arch_remove_kprobe(p, s) do { } while(0)
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on some feedback from Oleg Nesterov, I have made few changes to
previously posted patch.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since Kprobes runtime exception handlers is now lock free as this code path is
now using RCU to walk through the list, there is no need for the
register/unregister{_kprobe} to use spin_{lock/unlock}_isr{save/restore}. The
serialization during registration/unregistration is now possible using just a
mutex.
In the above process, this patch also fixes a minor memory leak for x86_64 and
powerpc.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These days ioctl32.h is only used for communication of fs/compat.c and
fs/compat_ioctl.c and doesn't contain anything of interest to drivers.
Remove inclusion in various drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that all these entries in the arch ioctl32.c files are gone [1], we can
build fs/compat_ioctl.c as a normal object and kill tons of cruft. We need a
special do_ioctl32_pointer handler for s390 so the compat_ptr call is done.
This is not needed but harmless on all other architectures. Also remove some
superflous includes in fs/compat_ioctl.c
Tested on ppc64.
[1] parisc still had it's PPP handler left, which is not fully correct
for ppp and besides that ppp uses the generic SIOCPRIV ioctl so it'd
kick in for all netdevice users. We can introduce a proper handler
in one of the next patch series by adding a compat_ioctl method to
struct net_device but for now let's just kill it - parisc doesn't
compile in mainline anyway and I don't want this to block this
patchset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements generic handling of RTC_IRQP_READ32, RTC_IRQP_SET32,
RTC_EPOCH_READ32 and RTC_EPOCH_SET32 in fs/compat_ioctl.c. It's based on the
x86_64 code which needed a little massaging to be endian-clean.
parisc used COMPAT_IOCTL or generic w_long handlers for these whichce is wrong
and can't work because the ioctls encode sizeof(unsigned long) in their ioctl
number. parisc also duplicated COMPAT_IOCTL entries for other rtc ioctls
which I remove in this patch, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthew Wilcox <matthew@wil.cx>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The comment in compat.c is wrong, every architecture provides a
get_compat_sigevent() for the IPC compat code already.
This basically moves the x86_64 version to common code and removes all the
others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I have heard some complaints about people not finding CONFIG_CRASH_DUMP
option and also some objections about its dependency on CONFIG_EMBEDDED.
The following patch ends that dependency. I thought of hiding it under
CONFIG_KEXEC, but CONFIG_PHYSICAL_START could also be used for some reasons
other than kexec/kdump and hence left it visible. I will also update the
documentation accordingly.
o Following patch removes the config dependency of CONFIG_PHYSICAL_START
on CONFIG_EMBEDDED. The reason being CONFIG_CRASH_DUMP option for
kdump needs CONFIG_PHYSICAL_START which makes CONFIG_CRASH_DUMP depend
on CONFIG_EMBEDDED. It is not always obvious for kdump users to choose
CONFIG_EMBEDDED.
o It also shifts the palce where this option appears, to make it closer
to kexec and kdump options.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Moving the crash_dump.c file to arch dependent part as kmap_atomic_pfn is
specific to i386 and highmem may not exist in other archs.
- Use ioremap for x86_64 to map the previous kernel memory.
- In copy_oldmem_page(), we now directly copy to the user/kernel buffer and
avoid the unneccesary copy to a kmalloc'd page.
Signed-off-by: Rachita Kothiyal <rachita@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Saving the cpu registers of all cpus before booting in to the crash
kernel.
- crash_setup_regs will save the registers of the cpu on which panic has
occured. One of the concerns ppc64 folks raised is that after capturing the
register states, one should not pop the current call frame and push new one.
Hence it has been inlined. More call frames later get pushed on to stack
(machine_crash_shutdown() and machine_kexec()), but one will not want to
backtrace those.
- Not very sure about the CFI annotations. With this patch I am getting
decent backtrace with gdb. Assuming, compiler has generated enough
debugging information for crash_kexec(). Coding crash_setup_regs() in pure
assembly makes it tricky because then it can not be inlined and we don't
want to return back after capturing register states we don't want to pop
this call frame.
- Saving the non-panicing cpus registers will be done in the NMI handler
while shooting down them in machine_crash_shutdown.
- Introducing CRASH_DUMP option in Kconfig for x86_64.
Signed-off-by: Murali M Chakravarthy <muralim@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
)
From: Vivek Goyal <vgoyal@in.ibm.com>
- Implementing the machine_crash_shutdown for x86_64 which will be called by
crash_kexec (called in case of a panic, sysrq etc.). Here we do things
similar to i386. Disable the interrupts, shootdown the cpus and shutdown
LAPIC and IOAPIC.
Changes in this version:
- As the Eric's APIC initialization patches are reverted back, reintroducing
LAPIC and IOAPIC shutdown.
- Added some comments on CPU hotplug, modified code as suggested by Andi
kleen.
Signed-off-by: Murali M Chakravarthy <muralim@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- elfcorehdr= specifies the location of elf core header stored by the
crashed kernel. This command line option will be passed by the kexec-tools
to capture kernel.
Changes in this version :
- Added more comments in kernel-parameters.txt and in code.
Signed-off-by: Murali M Chakravarthy <muralim@in.ibm.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
)
From: Vivek Goyal <vgoyal@in.ibm.com>
- This patch introduces the memmap option for x86_64 similar to i386.
- memmap=exactmap enables setting of an exact E820 memory map, as specified
by the user.
Changes in this version:
- Used e820_end_of_ram() to find the max_pfn as suggested by Andi kleen.
- removed PFN_UP & PFN_DOWN macros
- Printing the user defined map also.
Signed-off-by: Murali M Chakravarthy <muralim@in.ibm.com>
Signed-off-by: Hariprasad Nellitheertha <nharipra@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- In case of system crash, current state of cpu registers is saved in memory
in elf note format. So far memory for storing elf notes was being allocated
statically for NR_CPUS.
- This patch introduces dynamic allocation of memory for storing elf notes.
It uses alloc_percpu() interface. This should lead to better memory usage.
- Introduced based on Andi Kleen's and Eric W. Biederman's suggestions.
- This patch also moves memory allocation for elf notes from architecture
dependent portion to architecture independent portion. Now crash_notes is
architecture independent. The whole idea is that size of memory to be
allocated per cpu (MAX_NOTE_BYTES) can be architecture dependent and
allocation of this memory can be architecture independent.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As the Crypto API now allows multiple implementations to be registered
for the same algorithm, we no longer have to play tricks with Kconfig
to select the right AES implementation.
This patch sets the driver name and priority for all the AES
implementations and removes the Kconfig conditions on the C implementation
for AES.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
A lot of crypto code needs to read/write a 32-bit/64-bit words in a
specific gender. Many of them open code them by reading/writing one
byte at a time. This patch converts all the applicable usages over
to use the standard byte order macros.
This is based on a previous patch by Denis Vlasenko.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This removes the dependency from vmlinux to install, thus avoiding the
current situation where "make install" has a nasty tendency to leave
root-turds in the working directory.
It also updates x86-64 to be in sync with i386.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Configurable 16-bit UID and friends support
This allows turning off the legacy 16 bit UID interfaces on embedded platforms.
text data bss dec hex filename
3330172 529036 190556 4049764 3dcb64 vmlinux-baseline
3328268 529040 190556 4047864 3dc3f8 vmlinux
From: Adrian Bunk <bunk@stusta.de>
UID16 was accidentially disabled for !EMBEDDED.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes it possible for boot code to use screen_info without dragging in
all of tty.h.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The ptrace_get_task_struct() helper that I added as part of the ptrace
consolidation is useful in variety of places that currently opencode it.
Switch them to the common helpers.
Add a ptrace_traceme() helper that needs to be explicitly called, and simplify
the ptrace_get_task_struct() interface. We don't need the request argument
now, and we return the task_struct directly, using ERR_PTR() for error
returns. It's a bit more code in the callers, but we have two sane routines
that do one thing well now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch moves the rtc_interrupt() prototype to rtc.h and removes the
prototypes from C files.
It also renames static rtc_interrupt() functions in
arch/arm/mach-integrator/time.c and arch/sh64/kernel/time.c to avoid compile
problems.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Paul Gortmaker <p_gortmaker@yahoo.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
____cacheline_maxaligned_in_smp is currently used to align critical structures
and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
L1_CACHE_SHIFT_MAX useless.
However, we have been using ____cacheline_maxaligned_in_smp to align
structures on the internode cacheline size. As per Andi's suggestion,
following patch kills ____cacheline_maxaligned_in_smp and introduces
INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
Arches needing L3/Internode cacheline alignment can define
INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces
____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp
With this patch, L1_CACHE_SHIFT_MAX can be killed
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_migrate_pages implementation using swap based page migration
This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
The intent of sys_migrate is to migrate memory of a process. A process may
have migrated to another node. Memory was allocated optimally for the prior
context. sys_migrate_pages allows to shift the memory to the new node.
sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory. Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed. However, a user may decide
to manually control the migration.
This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends. The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory. sys_migrate_pages does not modify policies in contrast to Ray's
implementation.
The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset). When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets. The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.
Patch supports ia64, i386 and x86_64.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many arches make shared objects for VDSOs. Generally exclude them.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
With physical CPU hotplug, the CPU is hot removed and it should not receive
any interrupts. Disabling interrupt is much safer. This basically is what we
do in ia64 & x86.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark some key kernel datastructures readonly. This patch was previously
posted on Jun 28th but was back then not merged because nothing was enforcing
rodata anyway.. well that changed now :)
Patch by Christoph Lameter <christoph@lameter.com> and Dave Jones
<davej@redhat.com>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86-64 specific parts to make the .rodata section read only
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bug fix required for the .rodata work on x86-64:
when change_page_attr() and friends need to break up a 2Mb page into 4Kb
pages, it always set the NX bit on the PMD, which causes the cpu to consider
the entire 2Mb region to be NX regardless of the actual PTE perms. This is
fine in general, with one big exception: the 2Mb page that covers the last
part of the kernel .text! The fix is to not invent a new permission for the
new PMD entry, but to just inherit the existing one minus the PSE bit.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, we do not pass the correct start_pfn to e820_hole_size, to
calculate holes. Following patch fixes that.
The bug results in incorrect number of node_present_pages for each pgdat
and causes ugly output in /sys and probably VM inbalances.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Sighed-off-by: Shair Fultheim <shai@scalex86.org>
Sighed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the #define for ACPI_LEVEL_SENSITIVE instead of assuming
non-zero, because ACPICA 20051021 changes its value to zero.
Also, use uniform variable names:
edge_level -> triggering
active_high_low -> polarity
Signed-off-by: Len Brown <len.brown@intel.com>
Now needs to include the type 1 functions ("direct") too.
Reported by Pavel Roskin <proski@gnu.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As reported by Keith Mannthey, there are problems in populate_memnodemap()
The bug was that the compute_hash_shift() was returning 31, with incorrect
initialization of memnodemap[]
To correct the bug, we must use (1UL << shift) instead of (1 << shift) to
avoid an integer overflow, and we must check that shift < 64 to avoid an
infinite loop.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On systems that do not support the HPET legacy functions (basically the IBM
x460, but there could be others), in time_init() we accidentally fall into a
PM timer conditional and set the vxtime_hz value to the PM timer's frequency.
We then use this value with the HPET for timekeeping.
This patch (which mimics the behavior in time_init_gtod) corrects the
collision.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a register set is passed in don't try to fix up the pointer.
Noticed by Al Viro
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They report all busses as MMCONFIG capable, but it never works for the
internal devices in the CPU's builtin northbridge.
It just probes all func 0 devices on bus 0 (the internal northbridge is
currently always on bus 0) and if they are not accessible using MCFG they are
put into a special fallback bitmap.
On systems where it isn't we assume the BIOS vendor supplied correct MCFG.
Requires the earlier patch for mmconfig type1 fallback
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When there is no entry for a bus in MCFG fall back to type1. This is
especially important on K8 systems where always some devices can't be accessed
using mmconfig (in particular the builtin northbridge doesn't support it for
its own devices)
Cc: <gregkh@suse.de>
Cc: <jgarzik@pobox.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's illegal because it can sleep.
Use a two step lookup scheme instead. First look up the vm_struct, then
change the direct mapping, then finally unmap it. That's ok because nobody
can change the particular virtual address range as long as the vm_struct is
still in the global list.
Also added some LinuxDoc documentation to iounmap.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Disabling LAPIC timer isn't sufficient. In some situations, such as we
enabled NMI watchdog, there is still unexpected interrupt (such as NMI)
invoked in offline CPU. This also avoids offline CPU receives spurious
interrupt and anything similar.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: "Seth, Rohit" <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Otherwise TSC->HPET fallback could see incorrect state and crash later.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When multiple probes are registered at the same address and if due to some
recursion (probe getting triggered within a probe handler), we skip calling
pre_handlers and just increment nmissed field.
The below patch make sure it walks the list for multiple probes case.
Without the below patch we get incorrect results of nmissed count for
multiple probe case.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Completed a major overhaul of the Resource Manager code -
specifically, optimizations in the area of the AML/internal
resource conversion code. The code has been optimized to
simplify and eliminate duplicated code, CPU stack use has
been decreased by optimizing function parameters and local
variables, and naming conventions across the manager have
been standardized for clarity and ease of maintenance (this
includes function, parameter, variable, and struct/typedef
names.)
All Resource Manager dispatch and information tables have
been moved to a single location for clarity and ease of
maintenance. One new file was created, named "rsinfo.c".
The ACPI return macros (return_ACPI_STATUS, etc.) have
been modified to guarantee that the argument is
not evaluated twice, making them less prone to macro
side-effects. However, since there exists the possibility
of additional stack use if a particular compiler cannot
optimize them (such as in the debug generation case),
the original macros are optionally available. Note that
some invocations of the return_VALUE macro may now cause
size mismatch warnings; the return_UINT8 and return_UINT32
macros are provided to eliminate these. (From Randy Dunlap)
Implemented a new mechanism to enable debug tracing for
individual control methods. A new external interface,
acpi_debug_trace(), is provided to enable this mechanism. The
intent is to allow the host OS to easily enable and disable
tracing for problematic control methods. This interface
can be easily exposed to a user or debugger interface if
desired. See the file psxface.c for details.
acpi_ut_callocate() will now return a valid pointer if a
length of zero is specified - a length of one is used
and a warning is issued. This matches the behavior of
acpi_ut_allocate().
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
What is the value shown in "cpu MHz" of /proc/cpuinfo when CPUs are capable of
changing frequency?
Today the answer is: It depends.
On i386:
SMP kernel - It is always the boot frequency
UP kernel - Scales with the frequency change and shows that was last set.
On x86_64:
There is one single variable cpu_khz that gets written by all the CPUs. So,
the frequency set by last CPU will be seen on /proc/cpuinfo of all the
CPUs in the system. What you see also depends on whether you have constant_tsc
capable CPU or not.
On ia64:
It is always boot time frequency of a particular CPU that gets displayed.
The patch below changes this to:
Show the last known frequency of the particular CPU, when cpufreq is present. If
cpu doesnot support changing of frequency through cpufreq, then boot frequency
will be shown. The patch affects i386, x86_64 and ia64 architectures.
Signed-off-by: Venkatesh Pallipadi<venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Linux invokes the AML _PDC method (Processor Driver Capabilities)
to tell the BIOS what features it can handle. While the ACPI
spec says nothing about the OS invoking _PDC multiple times,
doing so with changing bits seems to hopelessly confuse the BIOS
on multiple platforms up to and including crashing the system.
Factor out the _PDC invocation so Linux invokes it only once.
http://bugzilla.kernel.org/show_bug.cgi?id=5483
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Fix a bug in kprobes that can cause an Oops or even a crash when a return
probe is installed on one of the following functions: sys_execve,
do_execve, load_*_binary, flush_old_exec, or flush_thread. The fix is to
remove the call to kprobe_flush_task() in flush_thread(). This fix has
been tested on all architectures for which the return-probes feature has
been implemented (i386, x86_64, ppc64, ia64). Please apply.
BACKGROUND
Up to now, we have called kprobe_flush_task() under two situations: when a
task exits, and when it execs. Flushing kretprobe_instances on exit is
correct because (a) do_exit() doesn't return, and (b) one or more
return-probed functions may be active when a task calls do_exit(). Neither
is the case for sys_execve() and its callees.
Initially, the mistaken call to kprobe_flush_task() on exec was harmless
because we put the "real" return address of each active probed function
back in the stack, just to be safe, when we recycled its
kretprobe_instance. When support for ppc64 and ia64 was added, this safety
measure couldn't be employed, and was eventually dropped even for i386 and
x86_64. sys_execve() and its callees were informally blacklisted for
return probes until this fix was developed.
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix up booting with sparse mem enabled. Otherwise it would just
cause an early PANIC at boot.
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is needed for large multinode IBM systems which have a sparse
APIC space in clustered mode, fully covering the available 8 bits.
The previous kernels would limit the local APIC number to 127,
which caused it to reject some of the CPUs at boot.
I increased the maximum and shrunk the apic_version array a bit
to make up for that (the version is only 8 bit, so don't need
an full int to store)
Cc: Chris McDermott <lcm@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CONFIG_CHECKING covered some debugging code used in the early times
of the port. But it wasn't even SMP safe for quite some time
and the bugs it checked for seem to be gone.
This patch removes all the code to verify GS at kernel entry. There
haven't been any new bugs in this area for a long time.
Previously it also covered the sysctl for the page fault tracing.
That didn't make much sense because that code was unconditionally
compiled in. I made that a boot option now because it is typically
only useful at boot.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current x86_64 NUMA memory code is inconsequent when it comes to node
memory ranges. The exact behaviour varies depending on which config option
that is used.
setup_node_bootmem() has start and end as arguments and these are used to
calculate the size of the node like this: (end - start). This is all fine
if end is pointing to the first non-available byte. The problem is that the
current x86_64 code sometimes treats it as the last present byte and sometimes
as the first non-available byte. The result is that some configurations might
lose a page at the end of the range.
This patch tries to fix CONFIG_ACPI_NUMA, CONFIG_K8_NUMA and CONFIG_NUMA_EMU
so they all treat the end variable as the first non-available byte. This is
the same way as the single node code.
The patch is boot tested on dual x86_64 hardware with the above configurations,
but maybe the removed code is needed as some workaround?
Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The logging for boot errors was turned off because it was broken
on some AMD systems. But give Intel EM64T systems a chance because they are
supposed to be correct there.
The advantage is that there is a chance to actually log uncorrected
machine checks after the reset.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On x86_64 arches, there is no way to choose ACPI_NUMA without having to choose
K8_NUMA. CONFIG_K8_NUMA is not needed for Intel EM64T NUMA boxes. It also
looks odd if you have to select ACPI_NUMA from the power management menu.
This patch fixes those oddities. Patch does the following:
1. Makes NUMA a config option like other arches
2. Makes topology detection options like K8_NUMA dependent on NUMA
3. Choosing ACPI NUMA detection can be done from the standard
"Processor type and features" menu
AK: I fixed up the dependencies and changed the help texts a bit
on top of Kiran's patch.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Keeping this function does not makes sense because it's a copied (and
buggy) copy of sys_time. The only difference is that now.tv_sec (which is
a time_t, i.e. a 64-bit long) is copied (and truncated) into a int
(32-bit).
The prototype is the same (they both take a long __user *), so let's drop
this and redirect it to sys_time (and make sure it exists by defining
__ARCH_WANT_SYS_TIME).
Only disadvantage is that the sys_stime definition is also compiled (may be
fixed if needed by adding a separate __ARCH_WANT_SYS_STIME macro, and
defining it for all arch's defining __ARCH_WANT_SYS_TIME except x86_64).
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
B stepping were the first shipping Opterons. memcpy/memset/copy_page/
clear_page had special optimized version for them. These are really
old and in the minority now and the difference to the generic versions
(using rep microcode) is not that big anyways. So just remove them.
TODO: figure out optimized versions for Intel Netburst based EM64T
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Old code could retry for 10 seconds worst time. Only try it
for one second now.
Suggested by Yinghai Lu
Cc: Yinghai.Lu@amd.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fields obtained through cpuid vector 0x1(ebx[16:23]) and
vector 0x4(eax[14:25], eax[26:31]) indicate the maximum values and might not
always be the same as what is available and what OS sees. So make sure
"siblings" and "cpu cores" values in /proc/cpuinfo reflect the values as seen
by OS instead of what cpuid instruction says. This will also fix the buggy BIOS
cases (for example where cpuid on a single core cpu says there are "2" siblings,
even when HT is disabled in the BIOS.
http://bugzilla.kernel.org/show_bug.cgi?id=4359)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When they were disabled before (e.g. after a panic) it's better
to keep them off, otherwise followon panics can happen from timer
interrupt handlers etc.
Drawback is that pageup in the console won't work anymore though.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They report 40bit, but only have 36bits of physical address space.
This caused problems with setting up the correct masks for MTRR.
CPUID workaround for steppings 0F33h(supporting x86) and 0F34h(supporting x86
and EM64T). Detail info can be found at:
http://download.intel.com/design/Xeon/specupdt/30240216.pdfhttp://download.intel.com/design/Pentium4/specupdt/30235221.pdf
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Compute the highest possible value for memnode_shift, in order to reduce
footprint of memnodemap[] to the minimum, thus making all users
(phys_to_nid(), kfree()), more cache friendly.
Before the patch :
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 23 for the hash shift. Max adder is 3ffffffff
After the patch :
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 33 for the hash shift.
In this case, only 2 bytes of memnodemap[] are used, instead of 2048
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows to run 64bit signal handlers in 64bit processes that run small
code snippets in compat mode.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With a NR_CPUS==128 kernel with CPU hotplug enabled we would waste 4MB
on per CPU data of all possible CPUs. The reason was that HOTPLUG
always set up possible map to NR_CPUS cpus and then we need to allocate
that much (each per CPU data is roughly ~32k now)
The underlying problem is that ACPI didn't tell us how many hotplug CPUs
the platform supports. So the old code just assumed all, which would
lead to this memory wastage.
This implements some new heuristics:
- If the BIOS specified disabled CPUs in the ACPI/mptables assume they
can be enabled later (this is bending the ACPI specification a bit,
but seems like a obvious extension)
- The user can overwrite it with a new additionals_cpus=NUM option
- Otherwise use half of the available CPUs or 2, whatever is more.
Cc: ashok.raj@intel.com
Cc: len.brown@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Minor victory on the continuous quest against all stray extern.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adding __initdata_* to asm-generic/sections.h
Replaces a lot of open coded externs in arch/x86_64/*
I had to change __bss_end to __bss_stop to match the other architectures.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should zap the low mappings, as soon as possible, so that we can catch
kernel bugs more effectively. Previously early boot had NULL mapped
and didn't trap on NULL references.
This patch introduces boot_level4_pgt, which will always have low identity
addresses mapped. Druing boot, all the processors will use this as their
level4 pgt. On BP, we will switch to init_level4_pgt as soon as we enter C
code and zap the low mappings as soon as we are done with the usage of
identity low mapped addresses. On AP's we will zap the low mappings as
soon as we jump to C code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Not go from the CPU number to an mapping array.
Mode number is often used now in fast paths.
This also adds a generic numa_node_id to all the topology includes
Suggested by Eric Dumazet
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix
arch/x86_64/kernel/aperture.c: In function #iommu_hole_init#:
arch/x86_64/kernel/aperture.c:199: warning: #aper_order# may be used uninitialized in this function
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
According to cpuid instruction in IA32 SDM-Vol2, when computing cpu model,
we need to consider extended model ID for family 0x6 also.
AK: Also added fixes/simplifcation from Petr Vandrovec
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove duplicate __cpuinit in smp.c. Already defined in init.h which is
already included.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's a patch that builds on Natalie Protasevich's IRQ compression
patch and tries to work for MPS boots as well as ACPI. It is meant for
a 4-node IBM x460 NUMA box, which was dying because it had interrupt
pins with GSI numbers > NR_IRQS and thus overflowed irq_desc.
The problem is that this system has 270 GSIs (which are 1:1 mapped with
I/O APIC RTEs) and an 8-node box would have 540. This is much bigger
than NR_IRQS (224 for both i386 and x86_64). Also, there aren't enough
vectors to go around. There are about 190 usable vectors, not counting
the reserved ones and the unused vectors at 0x20 to 0x2F. So, my patch
attempts to compress the GSI range and share vectors by sharing IRQs.
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
MC4_MISC - DRAM Errors Threshold Register realized under AMD K8 Rev F.
This register is used to count correctable and uncorrectable ECC errors that occur during DRAM read operations.
The user may interface through sysfs files in order to change the threshold configuration.
bank%d/error_count - reads current error count, write to clear.
bank%d/interrupt_enable - set/clear interrupt enable.
bank%d/threshold_limit - read/write the threshold limit.
APIC vector 0xF9 in hw_irq.h.
5 software defined bank ids in mce.h.
new apic.c function to setup threshold apic lvt.
defaults to interrupt off, count enabled, and threshold limit max.
sysfs interface created on /sys/devices/system/threshold.
AK: added some ifdefs to make it compile on UP
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The VM needs to know about lost memory in zones to accurately
balance dirty pages. This patch accounts mem_map in there too,
which fixes a constant errror of a few percent. Also some
other misc mappings and the kernel text itself are accounted
too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.
As a bit of historical background: when the x86-64 port
was originally designed we had some discussion if we should
use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or
both. Both was ruled out at this point because it was in early
2.4 when VM is still quite shakey and had bad troubles even
dealing with one DMA zone. We settled on the 16MB DMA zone mainly
because we worried about older soundcards and the floppy.
But this has always caused problems since then because
device drivers had trouble getting enough DMA able memory. These days
the VM works much better and the wide use of NUMA has proven
it can deal with many zones successfully.
So this patch adds both zones.
This helps drivers who need a lot of memory below 4GB because
their hardware is not accessing more (graphic drivers - proprietary
and free ones, video frame buffer drivers, sound drivers etc.).
Previously they could only use IOMMU+16MB GFP_DMA, which
was not enough memory.
Another common problem is that hardware who has full memory
addressing for >4GB misses it for some control structures in memory
(like transmit rings or other metadata). They tended to allocate memory
in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent,
but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory
(even on AMD systems the IOMMU tends to be quite small) especially if you have
many devices. With the new zone pci_alloc_consistent can just put
this stuff into memory below 4GB which works better.
One argument was still if the zone should be 4GB or 2GB. The main
motivation for 2GB would be an unnamed not so unpopular hardware
raid controller (mostly found in older machines from a particular four letter
company) who has a strange 2GB restriction in firmware. But
that one works ok with swiotlb/IOMMU anyways, so it doesn't really
need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
it seems to be the most common restriction.
The new zone is so far added only for x86-64.
For other architectures who don't set up this
new zone nothing changes. Architectures can set a compatibility
define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32
as GFP_DMA. Otherwise it's a nop because on 32bit architectures
it's normally not needed because GFP_NORMAL (=0) is DMA able
enough.
One problem is still that GFP_DMA means different things on different
architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA
(trusting it to be 4GB) #elif __x86_64__ (use other hacks like
the swiotlb because 16MB is not enough) ... . This was quite
ugly and is now obsolete.
These should be now converted to use GFP_DMA32 unconditionally. I haven't done
this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent
which will use GFP_DMA32 transparently.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
attached patch renames one instance of
/sys/devices/system/timer
to
/sys/devices/system/timer_pit
to avoid a name clash with another instance created in time.c.
Acked-by: Andi Kleen <ak@muc.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce
confusion, and make their semantics rigid. Improves efficiency of
resched_task and some cpu_idle routines.
* In resched_task:
- TIF_NEED_RESCHED is only cleared with the task's runqueue lock held,
and as we hold it during resched_task, then there is no need for an
atomic test and set there. The only other time this should be set is
when the task's quantum expires, in the timer interrupt - this is
protected against because the rq lock is irq-safe.
- If TIF_NEED_RESCHED is set, then we don't need to do anything. It
won't get unset until the task get's schedule()d off.
- If we are running on the same CPU as the task we resched, then set
TIF_NEED_RESCHED and no further action is required.
- If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set
after TIF_NEED_RESCHED has been set, then we need to send an IPI.
Using these rules, we are able to remove the test and set operation in
resched_task, and make clear the previously vague semantics of
POLLING_NRFLAG.
* In idle routines:
- Enter cpu_idle with preempt disabled. When the need_resched() condition
becomes true, explicitly call schedule(). This makes things a bit clearer
(IMO), but haven't updated all architectures yet.
- Many do a test and clear of TIF_NEED_RESCHED for some reason. According
to the resched_task rules, this isn't needed (and actually breaks the
assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock
held). So remove that. Generally one less locked memory op when switching
to the idle thread.
- Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner
most polling idle loops. The above resched_task semantics allow it to be
set until before the last time need_resched() is checked before going into
a halt requiring interrupt wakeup.
Many idle routines simply never enter such a halt, and so POLLING_NRFLAG
can be always left set, completely eliminating resched IPIs when rescheduling
the idle task.
POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Run idle threads with preempt disabled.
Also corrected a bugs in arm26's cpu_idle (make it actually call schedule()).
How did it ever work before?
Might fix the CPU hotplugging hang which Nigel Cunningham noted.
We think the bug hits if the idle thread is preempted after checking
need_resched() and before going to sleep, then the CPU offlined.
After calling stop_machine_run, the CPU eventually returns from preemption and
into the idle thread and goes to sleep. The CPU will continue executing
previous idle and have no chance to call play_dead.
By disabling preemption until we are ready to explicitly schedule, this bug is
fixed and the idle threads generally become more robust.
From: alexs <ashepard@u.washington.edu>
PPC build fix
From: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
MIPS build fix
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
EXPORT_SYMBOL's for phys_proc_id and cpu_core_id were added this year but
never used.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reorganize the preempt_disable/enable calls to eliminate the extra preempt
depth. Changes based on Paul McKenney's review suggestions for the kprobes
RCU changeset.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes to the arch kprobes infrastructure to take advantage of the locking
changes introduced by usage of RCU for synchronization. All handlers are now
run without any locks held, so they have to be re-entrant or provide their own
synchronization.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 changes to track kprobe execution on a per-cpu basis. We now track the
kprobe state machine independently on each cpu using a arch specific kprobe
control block.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following set of patches are aimed at improving kprobes scalability. We
currently serialize kprobe registration, unregistration and handler execution
using a single spinlock - kprobe_lock.
With these changes, kprobe handlers can run without any locks held. It also
allows for simultaneous kprobe handler executions on different processors as
we now track kprobe execution on a per processor basis. It is now necessary
that the handlers be re-entrant since handlers can run concurrently on
multiple processors.
All changes have been tested on i386, ia64, ppc64 and x86_64, while sparc64
has been compile tested only.
The patches can be viewed as 3 logical chunks:
patch 1: Reorder preempt_(dis/en)able calls
patches 2-7: Introduce per_cpu data areas to track kprobe execution
patches 8-9: Use RCU to synchronize kprobe (un)registration and handler
execution.
Thanks to Maneesh Soni, James Keniston and Anil Keshavamurthy for their
review and suggestions. Thanks again to Anil, Hien Nguyen and Kevin Stafford
for testing the patches.
This patch:
Reorder preempt_disable/enable() calls in arch kprobes files in preparation to
introduce locking changes. No functional changes introduced by this patch.
Signed-off-by: Ananth N Mavinakayahanalli <ananth@in.ibm.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sys_ptrace boilerplate code (everything outside the big switch
statement for the arch-specific requests) is shared by most architectures.
This patch moves it to kernel/ptrace.c and leaves the arch-specific code as
arch_ptrace.
Some architectures have a too different ptrace so we have to exclude them.
They continue to keep their implementations. For sh64 I had to add a
sh64_ptrace wrapper because it does some initialization on the first call.
For um I removed an ifdefed SUBARCH_PTRACE_SPECIAL block, but
SUBARCH_PTRACE_SPECIAL isn't defined anywhere in the tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-By: David Howells <dhowells@redhat.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton suggested to move kprobes from kernel hacking menu, since
kernel hacking menu is in-appropriate for the Kprobes. This patch moves
Kprobes and Oprofile under instrumentation menu.
(akpm: it's not a natural fit, but things like djprobes and the s390 guys'
statistics library need a home)
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: John Levon <levon@movementarian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes the x86-64 find_[first|next]_zero_bit() function for the
end-of-range case. It didn't test for a zero size, and the "rep scas"
would do entirely the wrong thing.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reads from an HPET register require a round trip to the south bridge and are
almost as slow as PCI reads. By caching the last value we've written to the
comparator register, we can eliminate all HPET reads from the fast path in the
emulated RTC interrupt handler.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure that the RTC timer is in non-periodic mode; some stupid BIOS might
have initialized it to periodic mode.
Furthermore, don't set the SETVAL bit in the config register. This wouldn't
have any effect unless the timer was in period mode (which it isn't), and then
the actual timer frequency would be half that of the desired one because
incrementing the comparator in the interrupt handler would be done after the
hardware has already incremented it itself.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the emulated RTC interrupt is no longer needed, we better disable it;
otherwise, we get a spurious interrupt whenever the timer has rolled over and
reaches the same comparator value.
Having a superfluous interrupt every five minutes doesn't hurt much, but it's
bad style anyway. ;-)
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Acked-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define jiffies_64 in kernel/timer.c rather than having 24 duplicated
defines in each architecture.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This ioctl doesn't exist for native i386.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Every user of init_timer() also needs to initialize ->function and ->data
fields. This patch adds a simple setup_timer() helper for that.
The schedule_timeout() is patched as an example of usage.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to
mark pages that should be freed in case of an error during resume.
This allows us to simplify the code and to use swsusp_free() in all of the
swsusp's resume error paths, which makes them actually work.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle 32-bit mtrr ioctls in the mtrr driver instead of the ia32
compatability layer.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If VMX feature is available in the CPU, this patch will make it visible in
the /proc/cpuinfo with the cpuid detection.
Signed-Off-By: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mxcsr_feature_mask_init isn't needed in suspend/resume time (we can use
boot time mask). And actually it's harmful, as it clear task's saved
fxsave in resume. This bug is widely seen by users using zsh.
(akpm: my eyes. Fixed some surrounding whitespace mess)
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I just found out that some precision is unnecessarily lost in the
arch/i386/kernel/timers/timer_tsc.c:set_cyc2ns_scale function. It uses a
cpu_mhz parameter when it could use a cpu_khz. In the specific case of an
Intel P4 running at 3001.171 Mhz, the truncation to 3001 Mhz leads to an
imprecision of 19 microseconds per second : this is very sad for a timer with
nearly nanosecond accuracy.
Fix the x86_64 architecture too.
Cc: george anzinger <george@mvista.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
First step in pushing down the page_table_lock. init_mm.page_table_lock has
been used throughout the architectures (usually for ioremap): not to serialize
kernel address space allocation (that's usually vmlist_lock), but because
pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.
Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
and drop it when allocating a new one, to check lest a racing task already
did. Similarly no page_table_lock in vmalloc's map_vm_area.
Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
user mms, which are converted only by a later patch, for now they have to lock
differently according to whether or not it's init_mm.
If sources get muddled, there's a danger that an arch source taking
init_mm.page_table_lock will be mixed with common source also taking it (or
neither take it). So break the rules and make another change, which should
break the build for such a mismatch: remove the redundant mm arg from
pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).
Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
took page_table_lock for no good reason.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but
that's not so good if an mm_counter_t is a special type. And how is rss
initialized? By set_mm_counter, all over the place. Come on, we just need to
initialize them both at once by set_mm_counter in mm_init (which follows the
memcpy when forking).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
changes to swiotlb.c made in commit 281dd25cdc
since this file has been moved from arch/ia64/lib/swiotlb.c to
lib/swiotlb.c
Signed-off-by: Tony Luck <tony.luck@intel.com>
CPU hotplug fills up the possible map to NR_CPUs, but it did that after
setting up per CPU data. This lead to CPU data not getting allocated
for all possible CPUs, which lead to various side effects.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Noticed by Terence Ripperda
Undo wrong change in global_flush_tlb. We need to flush the caches in all
cases, not just when pages were reverted. This was a bogus optimization
added earlier, but it was wrong.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes the setup of the alignment of the signal frame, so that all
signal handlers are run with a properly aligned stack frame.
The current code "over-aligns" the stack pointer so that the stack frame
is effectively always mis-aligned by 4 bytes. But what we really want
is that on function entry ((sp + 4) & 15) == 0, which matches what would
happen if the stack were aligned before a "call" instruction.
Signed-off-by: Markus F.X.J. Oberhumer <markus@oberhumer.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch makes swsusp avoid the possible temporary corruption
of page translation tables during resume on x86-64. This is achieved by
creating a copy of the relevant page tables that will not be modified by
swsusp and can be safely used by it on resume.
The problem is that during resume on x86-64 swsusp may temporarily
corrupt the page tables used for the direct mapping of RAM. If that
happens, a page fault occurs and cannot be handled properly, which leads
to the solid hang of the affected system. This leads to the loss of the
system's state from before suspend and may result in the loss of data or
the corruption of filesystems, so it is a serious issue. Also, it
appears to happen quite often (for me, as often as 50% of the time).
The problem is related to the fact that (at least) one of the PMD
entries used in the direct memory mapping (starting at PAGE_OFFSET)
points to a page table the physical address of which is much greater
than the physical address of the PMD entry itself. Moreover,
unfortunately, the physical address of the page table before suspend
(i.e. the one stored in the suspend image) happens to be different to
the physical address of the corresponding page table used during resume
(i.e. the one that is valid right before swsusp_arch_resume() in
arch/x86_64/kernel/suspend_asm.S is executed). Thus while the image is
restored, the "offending" PMD entry gets overwritten, so it does not
point to the right physical address any more (i.e. there's no page
table at the address pointed to by it, because it points to the address
the page table has been at during suspend). Consequently, if the PMD
entry is used later on, and it _is_ used in the process of copying the
image pages, a page fault occurs, but it cannot be handled in the normal
way and the system hangs.
In principle we can call create_resume_mapping() from
swsusp_arch_resume() (ie. from suspend_asm.S), but then the memory
allocations in create_resume_mapping(), resume_pud_mapping(), and
resume_pmd_mapping() must be made carefully so that we use _only_
NosaveFree pages in them (the other pages are overwritten by the loop in
swsusp_arch_resume()). Additionally, we are in atomic context at that
time, so we cannot use GFP_KERNEL. Moreover, if one of the allocations
fails, we should free all of the allocated pages, so we need to trace
them somehow.
All of this is done in the appended patch, except that the functions
populating the page tables are located in arch/x86_64/kernel/suspend.c
rather than in init.c. It may be done in a more elegan way in the
future, with the help of some swsusp patches that are in the works now.
[AK: move some externs into headers, renamed a function]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Drop global bit from early low mappings
Suggested by Linus, originally also proposed by Suresh.
This fixes a race condition with early start of udev, originally
tracked down by Suresh B. Siddha. The problem was that switching
to the user space VM would not clear the global low mappings
for the beginning of memory, which lead to memory corruption.
Drop the global bits.
The kernel mapping stays global because it should stay constant.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2.6.14-rc2 does not assign cpus to proper nodeids on our em64t numa boxen.
Our boxes use acpi srat for parsing the numa information.
srat_detect_node() used phys_proc_id[] to get to the cpu's local apic id,
but phys_proc_id[] represents the cpu<->initial_apic_id mapping. The
following patch fixes this problem. Now apicid_to_node[] is properly
indexed with the local apic id.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The tests Alok carried out on Petr's box confirmed that cpu_to_node[BP] is
not setup early enough by numa_init_array due to the x86_64 changes in
2.6.14-rc*, and unfortunately set wrongly by the work around code in
numa_init_array(). cpu_to_node[0] gets set with 1 early and later gets set
properly to 0 during identify_cpu() when all cpus are brought up, but
confusing the numa slab in the process.
Here is a quick fix for this. The right fix obviously is to have
cpu_to_node[bsp] setup early for numa_init_array(). The following patch
will fix the problem now, and the code can stay on even when
cpu_to_node{BP] gets fixed early correctly.
Thanks to Petr for access to his box.
Signed off by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the BP node_to_cpumask. 2.6.14-rc* broke the boot cpu bit as the
cpu_to_node(0) is now not setup early enough for numa_init_array.
cpu_to_node[] is setup much later at srat_detect_node on acpi srat based
em64t machines. This seems like a problem on amd machines too, Tested on
em64t though. /sys/devices/system/node/node0/cpumap shows up sanely after
this patch.
Signed off by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The up()/down() orders are incorrect in arch/x86_64/kprobes.c file.
kprobe_mutext is used to protect the free kprobe instruction slot list.
arch_prepare_kprobe applies for a slot from the free list, and
arch_remove_kprobe returns a slot to the free list. The incorrect up()/down()
orders to operate on kprobe_mutex fail to protect the free list. If 2 threads
try to get/return kprobe instruction slot at the same time, the free slot list
might be broken, or a free slot might be applied by 2 threads.
Signed-off-by: Zhang Yanmin <Yanmin.zhang@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attempt to fixup the lockless mce log buffer introduced an infinite loop
when trying to find a free entry.
And:
Using rcu_dereference() to load mcelog.next doesn't seem to be sufficient
enough to ensure that mcelog.next is loaded each time around the loop in
mce_log(). Instead, use an explicit rmb() to ensure that the compiler gets it
right.
AK: turned the smp_wmbs into true wmbs to make sure they are not
reordered by the compiler on UP.
Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I checked with AMD and they requested to only disable it for family 15.
Also disable it for i386 too. And some style fixes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The swiotlb implementation is shared by both IA-64 and EM64T. However,
the source itself lives under arch/ia64. This patch moves swiotlb.c
from arch/ia64/lib to lib/ and fixes-up the appropriate Makefile and
Kconfig files. No actual changes are made to swiotlb.c.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This should resolve the issue seen in bugme bug #5105, where it is assumed
that dualcore x86_64 systems have synced TSCs. This is not the case, and
alternate timesources should be used instead.
For more details, see:
http://bugzilla.kernel.org/show_bug.cgi?id=5105
Andi's earlier concerns that the TSCs should be synced on dualcore systems
have been resolved by confirmation from AMD folks that they can be
unsynced.
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They seem to have been due to AMD errata 63/122; the fix is to disable
TLB flush filtering in SMP configurations.
Confirmed to fix the problem by Andrew Walrond <andrew@walrond.org>
[ Let's see if we'll have a better fix eventually, this is the Q&D
"let's get this fixed and out there" version ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Several implementations were essentialy a common piece of C code using
the cmpxchg() macro. Put the implementation in one spot that everyone
can share, and convert sparc64 over to using this.
Alpha is the lone arch-specific implementation, which codes up a
special fast path for the common case in order to avoid GP reloading
which a pure C version would require.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 66759a01ad introduced the fix for
time ticking too fast on some boards by disabling one of the doubly
connected timer pins on ATI boards.
However, it ends up being _much_ too broad a brush, and that just makes
some other ATI boards not work at all since they now have no timer
source.
So disable the automatic ATI southbridge detection, and just rely on
people who see this problem disabling it by hand with the option
"disable_timer_pin_1" on the kernel command line.
Maybe somebody can figure out the proper tests at a later date.
Acked-by: Peter Osterlund <petero2@telia.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pavel Emelianov and Kirill Korotaev observe that fs and arch users of
security_vm_enough_memory tend to forget to vm_unacct_memory when a
failure occurs further down (typically in setup_arg_pages variants).
These are all users of insert_vm_struct, and that reservation will only
be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some
cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't.
So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into
Committed_AS each time they're run. But don't add VM_ACCOUNT to them,
it's inappropriate to reserve against the very unlikely case that gdb
be used to COW a vDSO page - we ought to do something about that in
do_wp_page, but there are yet other inconsistencies to be resolved.
The safe and economical way to fix this is to let insert_vm_struct do
the security_vm_enough_memory check when it finds VM_ACCOUNT is set.
And the MIPS irix_brk has been calling security_vm_enough_memory before
calling do_brk which repeats it, doubly accounting and so also leaking.
Remove that, and all the fs and arch calls to security_vm_enough_memory:
give it a less misleading name later on.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Like previously done for i386, get the x86_64 watchdog tick calculation
into a state where it can also be used on CPUs with frequencies beyond
4GHz.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the add_taint() interface for setting tainted bit flags instead of
doing it manually.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Original patch from Bertro Simul
This is probably still not quite correct, but seems to be
the best solution so far.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As mentioned before, the size of the bug frame can be further reduced while
continuing to use instructions to encode the information.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... and with that all instances in arch/x86_64 are gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the same patch that went into i386 just before 2.6.13
came out. I still can't build 64-bit user apps, so I tested
with program (see below) in 32-bit mode on 64-bit kernel:
Before:
$ fpsig
handler: nr = 8, si = 0x0804bc90, vuc = 0x0804bd10
handler: altstack is at 0x0804b000, ebp = 0x0804bc7c
handler: si_signo = 8, si_errno = 0, si_code = 0 [unknown]
handler: fpu cwd = 0xb40, fpu swd = 0xbaa0
handler: i387 unmasked precision exception, rounded up
After:
$ fpsig
handler: nr = 8, si = 0x0804bc90, vuc = 0x0804bd10
handler: altstack is at 0x0804b000, ebp = 0x0804bc7c
handler: si_signo = 8, si_errno = 0, si_code = 6 [inexact result]
handler: fpu cwd = 0xb40, fpu swd = 0xbaa0
handler: i387 unmasked precision exception, rounded up
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The x86_64 nmi code is missing a newline in one of its messages.
I added a space before the CPU id for readability and killed the trailing
space on the previous line as well.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rather than blindly re-enabling interrupts in oops_end(), save their state
in oope_begin() and then restore that state.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The only difference was the inline assembly, so move that into
asm/msr.h and merge with the i386 version.
This adds some missing sysfs support code to x86-64.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Being the foundation for reliable stack unwinding, this fixes CFI unwind
annotations in many low-level x86_64 routines, plus a config option
(available to all architectures, and also present in the previously sent
patch adding such annotations to i386 code) to enable them separatly
rather than only along with adding full debug information.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Report PXMs instead of nodes
- Report the correct PXM, not always the one of node 1.
- Only warn for the case of a PXM overlapping by itself
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick points out it never worked because PageReserved was
set and it might cause problems later on. Also HOTPLUG_CPU
is much more common now so let's care not too much
about the !hotplug case.
Cc: nickpiggin@yahoo.com.au
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It only offers extremly dubious security advantages and
is not worth the overhead in this critical path.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The global bit was not set in the first 2MB page, instead
it had a bit in the free AVL section which is useless.
Fixed thus.
Noticed by Eric Biederman
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 idle=poll might be a little less responsive than it should: unlike
mwait_idle, and unlike i386, its poll_idle left TIF_POLLING_NRFLAG set.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds console and earlyprintk support for a host file
on AMD's SimNow simulator.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of using a global spinlock to protect the state
of the remote TLB flush use a lock and state for each sending CPU.
To tell the receiver where to look for the state use 8 different
call vectors. Each CPU uses a specific vector to trigger flushes on other
CPUs. Depending on the received vector the target CPUs look into
the right per cpu variable for the flush data.
When the system has more than 8 CPUs they are hashed to the 8 available
vectors. The limited global vector space forces us to this right now.
In future when interrupts are split into per CPU domains this could be
fixed, at the cost of needing more IPIs in flat mode.
Also some minor cleanup in the smp flush code and remove some outdated
debug code.
Requires patch to move cpu_possible_map setup earlier.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If we use 64bit kernel on ia64/x86_64/s390 architecture, and we run
32bit binary on 32bit compatibility mode, sendfile system call seems be
not set offset argument.
This is because sendfile's return value is not zero but the code regards
the result by return value is zero or not.
This problem will be affect to ia64/x86_64/s390 and not affect to other
architecture does not affect other architecture (mips/parisc/ppc64/sparc64).
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Include build number in oops output
Helps me to match oopses to correct kernel.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The resume code uses CPU hotplug now so at resume time
we only ever see one CPU.
Pointed out by Yu Luming.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The FLATMEM people added it, but there doesn't seem a good reason
because end_pfn is identical.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It could be wrong for kexec or other cases. Read it from
the CPU instead.
Signed-off-by: Murali <muralim@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One machine is constantly throwing NMI watchdog timeouts in mce_log
This was one attempt to fix it.
(AK: this doesn't actually fix the bug I'm seeing unfortunately, probably
drop. I don't like it that the reader can spin forever now waiting
for a writer)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Originally from Stuart Hayes.
When setting up the APIC for the Uniprocessor kernel don't
assume the CPU has an APIC ID of zero.
This fixes boot with the UP kernel on Dell PowerEdge 6800/6850 4way systems.
Cc: Stuart.Hayes@dell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In particular on systems where the local APIC space and node space
is very different from the Linux CPU number space.
Previously the older NUMA setup code directly parsing the K8
northbridge registers had some issues on 8 socket or dual core
systems. This patch fixes them.
This is mainly done by fixing some confusion between Linux
CPU numbers and local APIC ids. We now pass the local APIC IDs
to later code, which avoids mismatches.
Also add some heuristics to detect cases where the Hypertransport
nodeids and the local APIC IDs don't match, but are shifted
by a constant offset.
This is still all quite hackish, hopefully BIOS writers fill
in correct SRATs instead.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do that later when the CPU boots. SRAT just stores the APIC<->Node
mapping node. This fixes problems on systems where the order
of SRAT entries does not match the MADT.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We used to disable them to work around a bug, but that
is not needed anymore. Keeping them enabled avoids the NMI
watchdog triggering in some cases.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handles case where BIOS gives CPUs very large APIC numbers correctly.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was just needed for the Numasaurus, which fortunately
doesn't support x86-64 CPUs.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No x86-64 chipset has this bug
Generated code doesn't change because it was always disabled.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that Greg implemented MCFG/_SEG support this shouldn't be needed
anymore
Cc: gregkh@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the new macros for x86_64 too.
Note that the current scripts includes different definitions; more exactly,
it only contains part of the DWARF2 sections and the .comment one from
Stabs. Shouldn't be a problem, anyway.
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
get_cpu_vendor() no longer has any users in other files.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
apply power", which was due to the CardBus IOMEM register region being
allocated at an address that was actually inside the RAM window that had
been reserved for video frame-buffers in an UMA setup.
The BIOS _should_ have marked that region reserved in the e820 memory
descriptor tables, but did not.
It is fixed by rounding up the default starting address of PCI memory
allocations, so that we leave a bigger gap after the final known memory
location. The amount of rounding depends on how big the unused memory
gap is that we can allocate IOMEM from.
Based on example code by Linus.
Acked-by: Greg KH <greg@kroah.com>
Acked-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a lost fput in 32bit tiocgdev ioctl on x86-64
[ chrisw: Updated to use fget_light/fput_light ]
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Maxim Giryaev <gem@sw.ru>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
enforce_max_cpus nukes out cpu_present_map and cpu_possible_map making it
impossible to add new cpus in the system. Since it doesnt provide any
additional value apart this call and reference is removed.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The use of non-shortcut version of routines breaking CPU hotplug. The option
to select this via cmdline also is deleted with the physflat patch, hence
directly placing this code under CONFIG_HOTPLUG_CPU.
We dont want to use broadcast mode IPI's when hotplug is enabled. This causes
bad effects in send IPI to a cpu that is offline which can trip when the cpu
is in the process of being kicked alive.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sanitized and fixed floppy dependencies: split the messy dependencies for
BLK_DEV_FD by introducing a new symbol (ARCH_MAY_HAVE_PC_FDC), making
BLK_DEV_FD depend on that one and taking declarations of ARCH_MAY_HAVE_PC_FDC
to arch/*/Kconfig. While we are at it, fixed several obvious cases when
BLK_DEV_FD should have been excluded (architectures lacking asm/floppy.h
are *not* going to have floppy.c compile, let alone work).
If you can come up with better name for that ("this architecture might
have working PC-compatible floppy disk controller"), you are more than
welcome - just s/ARCH_MAY_HAVE_PC_FDC/your_prefered_name/g in the patch
below...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a race condition where in system used to hang or sometime
crash within minutes when kprobes are inserted on ISR routine and a task
routine.
The fix has been stress tested on i386, ia64, pp64 and on x86_64. To
reproduce the problem insert kprobes on schedule() and do_IRQ() functions
and you should see hang or system crash.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a bug in kprobes's handling of a corner case on i386 and
x86_64. On an SMP system, if one CPU unregisters a kprobe just after
another CPU hits that probepoint, kprobe_handler() on the latter CPU sees
that the kprobe has been unregistered, and attempts to let the CPU continue
as if the probepoint hadn't been hit. The bug is that on i386 and x86_64,
we were neglecting to set the IP back to the beginning of the probed
instruction. This could cause an oops or crash.
This bug doesn't exist on ppc64 and ia64, where a breakpoint instruction
leaves the IP pointing to the beginning of the instruction. I don't know
about sparc64. (Dave, could you please advise?)
This fix has been tested on i386 and x86_64 SMP systems. To reproduce the
problem, set one CPU to work registering and unregistering a kprobe
repeatedly, and another CPU pounding the probepoint in a tight loop.
Acked-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the x86_64 architecture specific changes to prevent the
possible race conditions.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
64 bit architectures all implement their own compatibility sys_open(),
when in fact the difference is simply not forcing the O_LARGEFILE
flag. So use the a common function instead.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch cleans up a commonly repeated set of changes to the NTP state
variables by adding two helper inline functions:
ntp_clear(): Clears the ntp state variables
ntp_synced(): Returns 1 if the system is synced with a time server.
This was compile tested for alpha, arm, i386, x86-64, ppc64, s390, sparc,
sparc64.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark variables which are usually accessed for reads with __readmostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Frank Sorenson <frank@tuxrocks.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Uses of RCU for dynamically changeable NMI handlers need to use the new
rcu_dereference() and rcu_assign_pointer() facilities. This change makes
it clear that these uses are safe from a memory-barrier viewpoint, but the
main purpose is to document exactly what operations are being protected by
RCU. This has been tested on x86 and x86-64, which are the only
architectures affected by this change.
Signed-off-by: <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP.
When enabled then per-CPU watchdog threads are started, which try to run
once per second. If they get delayed for more than 10 seconds then a
callback from the timer interrupt detects this condition and prints out a
warning message and a stack dump (once per lockup incident). The feature
is otherwise non-intrusive, it doesnt try to unlock the box in any way, it
only gets the debug info out, automatically, and on all CPUs affected by
the lockup.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-Off-By: Matthias Urlichs <smurf@smurf.noris.de>
Signed-off-by: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows a valid iommu placed immediately after memory to work, to be
recognized as after the last byte of memory and not overlapping it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Need to ensure we dont get prempted when we clear ourself from mask when using
clustered mode genapic code.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Up to date I've been using the GS value to determine the processor number
in dumps from show_regs, however this can be cumbersome to do if you don't
have the vmlinux to verify with the address of cpu_pda, how about the
following? I considered using hard_smp_processor_id for robustness but we
already dereference current so we're already relying on MSR_GS_BASE being
sane.
Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When handling writes to /proc/irq, current code is re-programming rte
entries directly. This is not recommended and could potentially cause
chipset's to lockup, or cause missing interrupts.
CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the
interrupt is pending. The same needs to be done for /proc/irq handling as well.
Otherwise user space irq balancers are really not doing the right thing.
- Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for
lack of a generic name.
- added move_irq out of IRQ_BALANCE, and added this same to X86_64
- Added new proc handler for write, so we can do deferred write at irq
handling time.
- Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead
it now shows only active cpu masks, or exactly what was set.
- Provided a common move_irq implementation, instead of duplicating
when using generic irq framework.
Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off.
Tested UP builds as well.
MSI testing: tbd: I have cards, need to look for a x-over cable, although I
did test an earlier version of this patch. Will test in a couple days.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Zwane Mwaikambo <zwane@holomorphy.com>
Grudgingly-acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix remaining bits of u32 vs. pm_message confusion. Should not break
anything.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reset the ISA DMA controller into a known state after a suspend. Primary
concern was reenabling the cascading DMA channel (4).
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch moves the common code in x86 and x86-64's semaphore.c into a
single file in lib/semaphore-sleepers.c. The arch specific asm stubs are
left in the arch tree (in semaphore.c for i386 and in the asm for x86-64).
There should be no changes in code/functionality with this patch.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It has been reported that the way Linux handles NODEFER for signals is
not consistent with the way other Unix boxes handle it. I've written a
program to test the behavior of how this flag affects signals and had
several reports from people who ran this on various Unix boxes,
confirming that Linux seems to be unique on the way this is handled.
The way NODEFER affects signals on other Unix boxes is as follows:
1) If NODEFER is set, other signals in sa_mask are still blocked.
2) If NODEFER is set and the signal is in sa_mask, then the signal is
still blocked. (Note: this is the behavior of all tested but Linux _and_
NetBSD 2.0 *).
The way NODEFER affects signals on Linux:
1) If NODEFER is set, other signals are _not_ blocked regardless of
sa_mask (Even NetBSD doesn't do this).
2) If NODEFER is set and the signal is in sa_mask, then the signal being
handled is not blocked.
The patch converts signal handling in all current Linux architectures to
the way most Unix boxes work.
Unix boxes that were tested: DU4, AIX 5.2, Irix 6.5, NetBSD 2.0, SFU
3.5 on WinXP, AIX 5.3, Mac OSX, and of course Linux 2.6.13-rcX.
* NetBSD was the only other Unix to behave like Linux on point #2. The
main concern was brought up by point #1 which even NetBSD isn't like
Linux. So with this patch, we leave NetBSD as the lonely one that
behaves differently here with #2.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some nodes can have large holes on x86-64.
This fixes problems with the VM allowing too many dirty pages because it
overestimates the number of available RAM in a node. In extreme cases you
can end up with all RAM filled with dirty pages which can lead to deadlocks
and other nasty behaviour.
This patch just tells the VM about the known holes from e820. Reserved
(like the kernel text or mem_map) is still not taken into account, but that
should be only a few percent error now.
Small detail is that the flat setup uses the NUMA free_area_init_node() now
too because it offers more flexibility.
(akpm: lotsa thanks to Martin for working this problem out)
Cc: Martin Bligh <mbligh@mbligh.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Delete the ability to build an ACPI kernel that does
not include PCI support. When such a machine is created
and it requires a tuned kernel, send a patch.
http://bugzilla.kernel.org/show_bug.cgi?id=1364
Signed-off-by: Len Brown <len.brown@intel.com>
I mistakedly disabled fusion support in an earlier update. Fusion
is commonly used on many x86-64 systems, so this was a problem.
This patch fixes that.
Signed-off-by: And Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code to detect IO links on Opteron would not check
if the node had actually memory. This could lead to pci_bus_to_node
returning an invalid node, which might cause crashes later
when dma_alloc_coherent passes it to page_alloc_node().
The bug has been there forever but for some reason
it is causing now crashes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Plug a race in TSC synchronization
We need to do tsc_sync_wait() before the CPU is set online to prevent
multiple CPUs from doing it in parallel - which won't work because TSC
sync has global unprotected state.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oops. I knew I didn't have the physical versus logical cpu identifiers right
when I generated that patch. It's not nearly as bad as I feared at the time
though.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
modprobe aes does not work on x86_64. i386 has a similar line, this could
be the right fix. Would be nice to have in 2.6.13 final.
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't log machine check events left over from boot. Too many BIOSes leave
bogus events in there.
This unfortunately also makes it impossible to log events that caused a
reboot. For people with non broken BIOS there is mce=bootlog
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the sparse mem changes and the kexec changes
were merged into setup.c they came in, in the wrong order.
This patch changes the order so we don't run sparse_init
which uses the bootmem allocator until we all of the
reserve_bootmem calls has been made.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The IA32 ptrace emulation currently returns the wrong registers for fs/gs;
it's returning what x86_64 calls gs_base. We need regs.gsindex in order
for GDB to correctly locate the TLS area. Without this patch, the 32-bit
GDB testsuite bombs on a 64-bit kernel. With it, results look about like
I'd expect, although there are still a handful of kernel-related failures
(vsyscall related?).
Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 had hardcoded the VM_ numbers so it broke down when the numbers
were changed.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch adds boundary check for the MAX_GSI_NUM. Same as the update for
i386, the patch addresses a problem with ACPI SCI IRQ. The patch corrects
the code such that SCI IRQ is skipped and duplicate entry is avoided. The
VIA chipset uses 4-bit IRQ register for internal interrupt routing, and
therefore cannot handle IRQ numbers assigned to its devices. The patch
corrects this problem by allowing PCI IRQs below 16.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I was observing reproducible crashes on the "movw %bx,(%rsi)" instruction
below while a process in a recvfrom() system call was copying packet data
to user space. The patch below fixes the exception table and causes the
crash to no longer reproduce. Please apply.
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sync_tsc was using smp_call_function to ask the boot processor to report
it's tsc value. smp_call_function performs an IPI_send_allbutself which is
a broadcast ipi. There is a window during processor startup during which
the target cpu has started and before it has initialized it's interrupt
vectors so it can properly process an interrupt. Receveing an interrupt
during that window will triple fault the cpu and do other nasty things.
Why cli does not protect us from that is beyond me.
The simple fix is to match ia64 and provide a smp_call_function_single.
Which avoids the broadcast and is more efficient.
This certainly fixes the problem of getting stuck on boot which was
very easy to trigger on my SMP Hyperthreaded Xeon, and I think
it fixes it for the right reasons.
Minor changes by AK
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the standard hardware page table manipulation macros.
This is possible now that linux works with all 4 levels
of the page tables.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In an uncensored copy of code from i386 to x86_64 I wound up
with inline assembly with the wrong constraints. Use input
constraints instead of output constraints.
So I know the assembler will do the right thing specify the size
of the operand lidtq and lgdtq instead of just lidt and lgdt.
Make load_segments use an input constraint, and delete the macro fun.
Without having to reload %cs like I do on i386 this code is noticeably
simpler.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While booting with SMT disabled in bios, when using acpi srat to setup
cpu_to_node[], sparse apic_ids create problems.
Without this patch, intel x86_64 boxes with hyperthreading disabled in the
bios (and which rely on srat for numa setup) endup having incorrect values in
cpu_to_node[] arrays, causing sched domains to be built incorrectly etc.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This avoids some potential stack overflows with very deep softirq callchains.
i386 does this too.
TOADD CFI annotation
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Save a byte here and there. Ultimatively useless, but these things always
catch my eyes when reading the code so just fix them for now.
Also I got at least one patch fixing of them already, which gives a good
excuse.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Icecream preprocesses c sources locally, and sends the result off to a remote
host for compiling. It does not recognize includes at assembler level. The
fix is to put the assemberincludes an a separate .s file, which will always be
assembled locally.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use physical mode instead of logical mode to address more CPUs. This is also
used in the CPU hotplug case to avoid a race.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Will be obsolete with physflat.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Not used anymore since quite some time. Just uses -m32 instead.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch will create machinecheck sysdev directories per CPU. All of the
cpus still share the same ctl banks. When compiled with CONFIG_HOTPLUG_CPU,
it will also bring up/down sysdev directories as cpus go up/down. I have
tested the patch along with CONFIG_HOTPLUG_CPU option on in 2.6.13-rc1 kernel.
Minor changes by AK: remove useless unload function
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Keith Manning
Print a boot message for hotplug memory zones
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Minor cleanup.
Move things into their include files, remove obsolete includes, fix
indentation, remove obsolete special cases etc.
I also added the per cpu section to asm-generic/sections.h and fixed
init/main.c to use it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No need to print kernel addresses there and clarify what the APIC-ID is.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Does not change any semantics because numa_add_cpu checks for CPU 0 anyways.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Various code needs this information now before the actual SMP bootup. Instead
of computing it on the fly while booting the other CPUs set it up now while
initial MPtable/MADT parsing.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the x86_64 cpu hotplug changes went in it added a check in
default_do_nmi() which kills NMI delivery on any CPU but the BSP.
The NMI watchdog is brought up quite some time before the online bit is set
in num_online_cpus so this won't work very well. The nmi watchdogs on cpus
that are not BSP will never be reprogrammed and no NMIs.
Why was this check added? How does an offlined cpu receive an NMI?
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Cc: <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
turn many #if $undefined_string into #ifdef $undefined_string to fix some
warnings after -Wno-def was added to global CFLAGS
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixes boot up lockups on some machines where CPU apic ids don't start with
0
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
i386 machine_power_off was disabling the local apic
and all of it's users wanted to be on the boot cpu.
So call machine_shutdown which places us on the boot
cpu and disables the apics. This keeps us in sync
and reduces the number of cases we need to worry about in
the power management code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is not safe to call set_cpus_allowed() in interrupt
context and disabling the apics is complicated code.
So unconditionally skip machine_shutdown in machine_emergency_reboot
on x86_64.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We only want to shutdown the apics if reboot_force
is not specified. Be we are doing this both
in machine_shutdown which is called unconditionally
and if (!reboot_force). So simply call machine_shutdown
if (!reboot_force). It looks like something
went weird with merging some of the kexec patches for
x86_64, and caused this.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
machine_restart, machine_halt and machine_power_off are machine
specific hooks deep into the reboot logic, that modules
have no business messing with. Usually code should be calling
kernel_restart, kernel_halt, kernel_power_off, or
emergency_restart. So don't export machine_restart,
machine_halt, and machine_power_off so we can catch buggy users.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add inotify syscall entries to x86-64.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add missing fsnotify_open() hook to sys32_open().
Add fsnotify_open() hook to sys32_open() on x86-64.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A malicious 32bit app can have an elf section at 0xffffe000. During
exec of this app, we will have a memory leak as insert_vm_struct() is
not checking for return value in syscall32_setup_pages() and thus not
freeing the vma allocated for the vsyscall page.
Check the return value and free the vma incase of failure.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the second time this has happened: inserting a new section requires
that we adjust the arithmetic which is used to calculate the vsyscall page's
offset.
Cc: Christoph Lameter <christoph@lameter.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create a new top-level menu named "Networking" thus moving
net related options and protocol selection way from the drivers
menu and up on the top-level where they belong.
To implement this all architectures has to source "net/Kconfig" before
drivers/*/Kconfig in their Kconfig file. This change has been
implemented for all architectures.
Device drivers for ordinary NIC's are still to be found
in the Device Drivers section, but Bluetooth, IrDA and ax25
are located with their corresponding menu entries under the new
networking menu item.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new section called ".data.read_mostly" for data items that are read
frequently and rarely written to like cpumaps etc.
If these maps are placed in the .data section then these frequenly read
items may end up in cachelines with data is is frequently updated. In that
case all processors in an SMP system must needlessly reload the cachelines
again and again containing elements of those frequently used variables.
The ability to share these cachelines will allow each cpu in an SMP system
to keep local copies of those shared cachelines thereby optimizing
performance.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Christoph Lameter <christoph@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There has been some discuss about solving the SMP MTRR suspend/resume
breakage, but I didn't find a patch for it. This is an intent for it. The
basic idea is moving mtrr initializing into cpu_identify for all APs (so it
works for cpu hotplug). For BP, restore_processor_state is responsible for
restoring MTRR.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implementation:
===============
The encrypt/decrypt code is based on an x86 implementation I did a while
ago which I never published. This unpublished implementation does
include an assembler based key schedule and precomputed tables. For
simplicity and best acceptance, however, I took Gladman's in-kernel code
for table generation and key schedule for the kernel port of my
assembler code and modified this code to produce the key schedule as
required by my assembler implementation. File locations and Kconfig are
kept similar to the i586 AES assembler implementation.
It may seem a little bit strange to use 32 bit I/O and registers in the
assembler implementation but this gives the best code size. My
implementation takes one instruction more per round compared to
Gladman's x86 assembler but it doesn't require any stack for local
variables or saved registers and it is less serialized than Gladman's
code.
Note that all comparisons to Gladman's code were done after my code was
implemented. I did only use FIPS PUB 197 for the implementation so my
implementation is independent work.
If anybody has a better assembler solution for x86_64 I'll be pleased to
have my code replaced with the better solution.
Testing:
========
The implementation passes the in-kernel crypto testing module and I'm
running it without any problems on my laptop where it is mainly used for
dm-crypt.
Microbenchmark:
===============
The microbenchmark was done in userspace with similar compile flags as
used during kernel compile.
Encrypt/decrypt is about 35% faster than the generic C implementation.
As the generic C as well as my assembler implementation are both table
I don't really expect that there is much room for further
improvements though I'll be glad to be corrected here.
The key schedule is about 5% slower than the generic C implementation.
This is due to the fact that some more work has to be done in the key
schedule routine to fit the schedule to the assembler implementation.
Code Size:
==========
Encrypt and decrypt are together about 2.1 Kbytes smaller than the
generic C implementation which is important with regard to L1 cache
usage. The key schedule routine is about 100 bytes larger than the
generic C implementation.
Data Size:
==========
There's no difference in data size requirements between the assembler
implementation and the generic C implementation.
License:
========
Gladmans's code is dual BSD/GPL whereas my assembler code is GPLv2 only
(I'm not going to change the license for my code). So I had to change
the module license for the x86_64 aes module from 'Dual BSD/GPL' to
'GPL' to reflect the most restrictive license within the module.
Signed-off-by: Andreas Steinmetz <ast@domdv.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following renames arch_init, a kprobes function for performing any
architecture specific initialization, to arch_init_kprobes in order to
cleanup the namespace.
Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes
build from the last return probe patch.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that we have access to the whole MCFG table, let's properly use it
for all pci device accesses (as that's what it is there for, some boxes
don't put all the busses into one entry.)
If, for some reason, the table is incorrect, we fallback to the "old
style" of mmconfig accesses, namely, we just assume the first entry in
the table is the one for us, and blindly use it.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch is the first step in properly handling the MCFG PCI table.
It defines the structures properly, and saves off the table so that the
pci mmconfig code can access it. It moves the parsing of the table a
little later in the boot process, but still before the information is
needed.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The following patch contains the x86_64 specific changes for the new
return probe design. Changes include:
* Removing the architecture specific functions for querying a return probe
instance off a stack address
* Complete rework onf arch_prepare_kretprobe() and trampoline_probe_handler()
* Removing trampoline_post_handler()
* Adding arch_init() so that now we handle registering the return probe
trampoline instead of kernel/kprobes.c doing it
NOTE:
Note that with this new design, the dependency on calculating a pointer to
the task off the stack pointer no longer exist (resolving the problem of
interruption stacks as pointed out in the original feedback to this port.)
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that PPC64 has no-execute support, here is a second try to fix the
single step out of line during kprobe execution. Kprobes on x86_64 already
solved this problem by allocating an executable page and using it as the
scratch area for stepping out of line. Reuse that.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I believe at least for seccomp it's worth to turn off the tsc, not just for
HT but for the L2 cache too. So it's up to you, either you turn it off
completely (which isn't very nice IMHO) or I recommend to apply this below
patch.
This has been tested successfully on x86-64 against current cogito
repository (i686 compiles so I didn't bother testing ;). People selling
the cpu through cpushare may appreciate this bit for a peace of mind.
There's no way to get any timing info anymore with this applied
(gettimeofday is forbidden of course). The seccomp environment is
completely deterministic so it can't be allowed to get timing info, it has
to be deterministic so in the future I can enable a computing mode that
does a parallel computing for each task with server side transparent
checkpointing and verification that the output is the same from all the 2/3
seller computers for each task, without the buyer even noticing (for now
the verification is left to the buyer client side and there's no
checkpointing, since that would require more kernel changes to track the
dirty bits but it'll be easy to extend once the basic mode is finished).
Eliminating a cold-cache read of the cr4 global variable will save one
cacheline during the tlb flush while making the code per-cpu-safe at the
same time. Thanks to Mikael Pettersson for noticing the tlb flush wasn't
per-cpu-safe.
The global tlb flush can run from irq (IPI calling do_flush_tlb_all) but
it'll be transparent to the switch_to code since the IPI won't make any
change to the cr4 contents from the point of view of the interrupted code
and since it's now all per-cpu stuff, it will not race. So no need to
disable irqs in switch_to slow path.
Signed-off-by: Andrea Arcangeli <andrea@cpushare.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1. Establish a simple API for process freezing defined in linux/include/sched.h:
frozen(process) Check for frozen process
freezing(process) Check if a process is being frozen
freeze(process) Tell a process to freeze (go to refrigerator)
thaw_process(process) Restart process
frozen_process(process) Process is frozen now
2. Remove all references to PF_FREEZE and PF_FROZEN from all
kernel sources except sched.h
3. Fix numerous locations where try_to_freeze is manually done by a driver
4. Remove the argument that is no longer necessary from two function calls.
5. Some whitespace cleanup
6. Clear potential race in refrigerator (provides an open window of PF_FREEZE
cleared before setting PF_FROZEN, recalc_sigpending does not check
PF_FROZEN).
This patch does not address the problem of freeze_processes() violating the rule
that a task may only modify its own flags by setting PF_FREEZE. This is not clean
in an SMP environment. freeze(process) is therefore not SMP safe!
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove some of the unnecessary differences between arch/i386 and
arch/x86_64. This patch fixes more whitespace issues, some miscellaneous
typos, a wrong URL and a factually incorrect statement about the current
boot sector code.
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Put function prototypes for memset() and memcpy() ahead of where
there are used, to kill sparse warnings:
arch/x86_64/boot/compressed/../../../../lib/inflate.c:317:3: warning: undefined identifier 'memset'
arch/x86_64/boot/compressed/../../../../lib/inflate.c:601:11: warning: undefined identifier 'memcpy'
arch/x86_64/boot/compressed/misc.c:151:2: warning: undefined identifier 'memcpy'
arch/x86_64/boot/compressed/../../../../lib/inflate.c:317:3: warning: call with no type!
arch/x86_64/boot/compressed/../../../../lib/inflate.c:601:17: warning: call with no type!
arch/x86_64/boot/compressed/misc.c:151:9: warning: call with no type!
Signed-off-by: randy_dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Following patch provides purely cosmetic changes and corrects CodingStyle
guide lines related certain issues like below in kexec related files
o braces for one line "if" statements, "for" loops,
o more than 80 column wide lines,
o No space after "while", "for" and "switch" key words
o Changes:
o take-2: Removed the extra tab before "case" key words.
o take-3: Put operator at the end of line and space before "*/"
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Makes kexec_crashdump() take a pt_regs * as an argument. This allows to
get exact register state at the point of the crash. If we come from direct
panic assertion NULL will be passed and the current registers saved before
crashdump.
This hooks into two places:
die(): check the conditions under which we will panic when calling
do_exit and go there directly with the pt_regs that caused the fatal
fault.
die_nmi(): If we receive an NMI lockup while in the kernel use the
pt_regs and go directly to crash_kexec(). We're probably nested up badly
at this point so this might be the only chance to escape with proper
information.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Following patch exports kexec global variable "crash_notes" to user space
through sysfs as kernel attribute in /sys/kernel.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the x86_64 implementation of the crashkernel option. It reserves
a window of memory very early in the bootup process, so we never use
it for anything but the kernel to switch to when the running
kernel panics.
In addition to reserving this memory a resource structure is registered
so looking at /proc/iomem it is clear what happened to that memory.
ISSUES:
Is it possible to implement this in a architecture generic way?
What should be done with architectures that always use an iommu and
thus don't report their RAM memory resources in /proc/iomem?
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the x86_64 implementation of machine kexec. 32bit compatibility
support has been implemented, and machine_kexec has been enhanced to not care
about the changing internal kernel paget table structures.
From: Alexander Nyberg <alexn@dsv.su.se>
build fix
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Factor out the apic and smp shutdown code from machine_restart so it can be
called by in the kexec reboot path as well.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For one kernel to report a crash another kernel has created we need
to have 2 kernels loaded simultaneously in memory. To accomplish this
the two kernels need to built to run at different physical addresses.
This patch adds the CONFIG_PHYSICAL_START option to the x86_64 kernel
so we can do just that. You need to know what you are doing and
the ramifications are before changing this value, and most users
won't care so I have made it depend on CONFIG_EMBEDDED
bzImage kernels will work and run at a different address when compiled
with this option but they will still load at 1MB. If you need a kernel
loaded at a different address as well you need to boot a vmlinux.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The vmlinux on x86_64 does not report the correct physical address of
the kernel. Instead in the physical address field it currently
reports the virtual address of the kernel.
This is patch is a bug fix that corrects vmlinux to report the
proper physical addresses.
This is potentially a help for crash dump analysis tools.
This definitiely allows bootloaders that load vmlinux as a standard
ELF executable. Bootloaders directly loading vmlinux become of
practical importance when we consider the kexec on panic case.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When coming out of apic mode attempt to set the appropriate
apic back into virtual wire mode. This improves on previous versions
of this patch by by never setting bot the local apic and the ioapic
into veritual wire mode.
This code looks at data from the mptable to see if an ioapic has
an ExtInt input to make this decision. A future improvement
is to figure out which apic or ioapic was in virtual wire mode
at boot time and to remember it. That is potentially a more accurate
method, of selecting which apic to place in virutal wire mode.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Eric W. Biederman <ebiederm@xmission.com
The following patch simply adds a shutdown method to the x86_64 i8259 code.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Eric W. Biederman <ebiederm@xmission.com>
It is ok to reserve resources > 4G on x86_64 struct resource is 64bit now :)
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch consolidates the CONFIG_PREEMPT and CONFIG_PREEMPT_BKL
preemption options into kernel/Kconfig.preempt. This, besides reducing
source-code, also enables more centralized tweaking of preemption related
options.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2.6.12-rc6-mm1 has a few remaining synchronize_kernel()s, some (but not
all) in comments. This patch changes these synchronize_kernel() calls (and
comments) to synchronize_rcu() or synchronize_sched() as follows:
- arch/x86_64/kernel/mce.c mce_read(): change to synchronize_sched() to
handle races with machine-check exceptions (synchronize_rcu() would not cut
it given RCU implementations intended for hardcore realtime use.
- drivers/input/serio/i8042.c i8042_stop(): change to synchronize_sched() to
handle races with i8042_interrupt() interrupt handler. Again,
synchronize_rcu() would not cut it given RCU implementations intended for
hardcore realtime use.
- include/*/kdebug.h comments: change to synchronize_sched() to handle races
with NMIs. As before, synchronize_rcu() would not cut it...
- include/linux/list.h comment: change to synchronize_rcu(), since this
comment is for list_del_rcu().
- security/keys/key.c unregister_key_type(): change to synchronize_rcu(),
since this is interacting with RCU read side.
- security/keys/process_keys.c install_session_keyring(): change to
synchronize_rcu(), since this is interacting with RCU read side.
Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes register saving so that each register is only saved once,
and adds missing saving of %cr8 on x86-64. Some reordering so that
save/restore is more logical/safer (segment registers should be restored
after gdt).
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sleep code uses wrong version of lgdt, that does the wrong thing when
gdt is beyond 16MB or so.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides an option to switch broadcast or use mask version for
sending IPI's. If CONFIG_HOTPLUG_CPU is defined, we choose not to use
broadcast shortcuts by default, otherwise we choose broadcast mode as default.
both cases, one can change this via startup cmd line option, to choose
no-broadcast mode.
no_ipi_broadcast=1
This is provided on request from Andi Kleen, since he doesnt agree with
replacing IPI shortcuts as a solution for CPU hotplug. Without removing
broadcast IPI's, it would mean lots of new code for __cpu_up() path, which
would acheive the same results.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Broadcast IPI's provide un-expected behaviour for cpu hotplug. CPU's in
offline state also end up receiving the IPI. Once the cpus become online they
receive these stale IPI's which are bad and introduce unexpected behaviour.
This is easily avoided by not sending a broadcast and addressing just the
CPU's in online map. Doing prelim cycle counts it appears there is no big
overhead and numbers seem around 0x3000-0x3900 on an average on x86 and x86_64
systems with CPUS running 3G, both for broadcast and mask version of the
API's.
The shortcuts are useful only for flat mode (where the perf shows no
degradation), and in cluster mode, its unicast anyway. Its simpler to just
not use broadcast anymore.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is a minor cleanup to the cpu sibling/core map. It is required
that this setup happens on a per-cpu bringup time.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Experimental CPU hotplug patch for x86_64
-----------------------------------------
This supports logical CPU online and offline.
- Test with maxcpus=1, and then kick other cpu's off to test if init code
is all cleaned up. CONFIG_SCHED_SMT works as well.
- idle threads are forked on demand from keventd threads for clean startup
TBD:
1. Not tested on a real NUMA machine (tested with numa=fake=2)
2. Handle ACPI pieces for physical hotplug support.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Shaohua.li<shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds __cpuinit and __cpuinitdata sections that need to exist past
boot to support cpu hotplug.
Caveat: This is done *only* for EM64T CPU Hotplug support, on request from
Andi Kleen. Much of the generic hotplug code in kernel, and none of the other
archs that support CPU hotplug today, i386, ia64, ppc64, s390 and parisc dont
mark sections with __cpuinit, but only mark them as __devinit, and
__devinitdata.
If someone is motivated to change generic code, we need to make sure all
existing hotplug code does not break, on other arch's that dont use __cpuinit,
and __cpudevinit.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch includes x86_64 architecture specific changes to support temporary
disarming on reentrancy of probes.
Signed-of-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The architecture independent code of the current kprobes implementation is
arming and disarming kprobes at registration time. The problem is that the
code is assuming that arming and disarming is a just done by a simple write
of some magic value to an address. This is problematic for ia64 where our
instructions look more like structures, and we can not insert break points
by just doing something like:
*p->addr = BREAKPOINT_INSTRUCTION;
The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
functions:
* void arch_arm_kprobe(struct kprobe *p)
* void arch_disarm_kprobe(struct kprobe *p)
and then adds the new functions for each of the architectures that already
implement kprobes (spar64/ppc64/i386/x86_64).
I thought arch_[dis]arm_kprobe was the most descriptive of what was really
happening, but each of the architectures already had a disarm_kprobe()
function that was really a "disarm and do some other clean-up items as
needed when you stumble across a recursive kprobe." So... I took the
liberty of changing the code that was calling disarm_kprobe() to call
arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
with the recursive kprobe case.
So far this patch as been tested on i386, x86_64, and ppc64, but still
needs to be tested in sparc64.
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch adds the x86_64 architecture specific implementation
for function return probes.
Function return probes is a mechanism built on top of kprobes that allows
a caller to register a handler to be called when a given function exits.
For example, to instrument the return path of sys_mkdir:
static int sys_mkdir_exit(struct kretprobe_instance *i, struct pt_regs *regs)
{
printk("sys_mkdir exited\n");
return 0;
}
static struct kretprobe return_probe = {
.handler = sys_mkdir_exit,
};
<inside setup function>
return_probe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("sys_mkdir");
if (register_kretprobe(&return_probe)) {
printk(KERN_DEBUG "Unable to register return probe!\n");
/* do error path */
}
<inside cleanup function>
unregister_kretprobe(&return_probe);
The way this works is that:
* At system initialization time, kernel/kprobes.c installs a kprobe
on a function called kretprobe_trampoline() that is implemented in
the arch/x86_64/kernel/kprobes.c (More on this later)
* When a return probe is registered using register_kretprobe(),
kernel/kprobes.c will install a kprobe on the first instruction of the
targeted function with the pre handler set to arch_prepare_kretprobe()
which is implemented in arch/x86_64/kernel/kprobes.c.
* arch_prepare_kretprobe() will prepare a kretprobe instance that stores:
- nodes for hanging this instance in an empty or free list
- a pointer to the return probe
- the original return address
- a pointer to the stack address
With all this stowed away, arch_prepare_kretprobe() then sets the return
address for the targeted function to a special trampoline function called
kretprobe_trampoline() implemented in arch/x86_64/kernel/kprobes.c
* The kprobe completes as normal, with control passing back to the target
function that executes as normal, and eventually returns to our trampoline
function.
* Since a kprobe was installed on kretprobe_trampoline() during system
initialization, control passes back to kprobes via the architecture
specific function trampoline_probe_handler() which will lookup the
instance in an hlist maintained by kernel/kprobes.c, and then call
the handler function.
* When trampoline_probe_handler() is done, the kprobes infrastructure
single steps the original instruction (in this case just a top), and
then calls trampoline_post_handler(). trampoline_post_handler() then
looks up the instance again, puts the instance back on the free list,
and then makes a long jump back to the original return instruction.
So to recap, to instrument the exit path of a function this implementation
will cause four interruptions:
- A breakpoint at the very beginning of the function allowing us to
switch out the return address
- A single step interruption to execute the original instruction that
we replaced with the break instruction (normal kprobe flow)
- A breakpoint in the trampoline function where our instrumented function
returned to
- A single step interruption to execute the original instruction that
we replaced with the break instruction (normal kprobe flow)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make use of the user_mode macro where it's possible. This is useful for Xen
because it will need only to redefine only the macro to a hypervisor call.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add 2 macros to set and get debugreg on x86_64. This is useful for Xen
because it will need only to redefine each macro to a hypervisor call.
Signed-off-by: Vincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I suggest to change the way IRQs are handed out to PCI devices.
Currently, each I/O APIC pin gets associated with an IRQ, no matter if the
pin is used or not. It is expected that each pin can potentually be
engaged by a device inserted into the corresponding PCI slot. However,
this imposes severe limitation on systems that have designs that employ
many I/O APICs, only utilizing couple lines of each, such as P64H2 chipset.
It is used in ES7000, and currently, there is no way to boot the system
with more that 9 I/O APICs.
The simple change below allows to boot a system with say 64 (or more) I/O
APICs, each providing 1 slot, which otherwise impossible because of the IRQ
gaps created for unused lines on each I/O APIC. It does not resolve the
problem with number of devices that exceeds number of possible IRQs, but
eases up a tension for IRQs on any large system with potentually large
number of devices.
I only implemented this for the ACPI boot, since if the system is this big
and using newer chipsets it is probably (better be!) an ACPI based system
:). The change is completely "mechanical" and does not alter any internal
structures or interrupt model/implementation. The patch works for both
i386 and x86_64 archs. It works with MSIs just fine, and should not
intervene with implementations like shared vectors, when they get worked
out and incorporated.
To illustrate, below is the interrupt distribution for 2-cell ES7000 with
20 I/O APICs, and an Ethernet card in the last slot, which should be eth1
and which was not configured because its IRQ exceeded allowable number (it
actially turned out huge - 480!):
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 65716 30012 30007 30002 30009 30010 30010 30010 IO-APIC-edge timer
4: 373 0 725 280 0 0 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 3 0 0 0 0 0 0 IO-APIC-edge ide0
16: 108 13 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
18: 0 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb3
19: 15 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
96: 4240 397 18 0 0 0 0 0 IO-APIC-level aic7xxx
97: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx
192: 847 0 0 0 0 0 0 0 IO-APIC-level eth0
NMI: 0 0 0 0 0 0 0 0
LOC: 273423 274528 272829 274228 274092 273761 273827 273694
ERR: 7
MIS: 0
Even though the system doesn't have that many devices, some don't get
enabled only because of IRQ numbering model.
This is the IRQ picture after the patch was applied:
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 44169 10004 10004 10001 10004 10003 10004 6135 IO-APIC-edge timer
4: 345 0 0 0 0 244 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 0 3 0 0 0 0 0 IO-APIC-edge ide0
17: 4425 0 9 0 0 0 0 0 IO-APIC-level aic7xxx
18: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx, uhci_hcd:usb3
21: 231 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
22: 26 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
24: 348 0 0 0 0 0 0 0 IO-APIC-level eth0
25: 6 192 0 0 0 0 0 0 IO-APIC-level eth1
NMI: 0 0 0 0 0 0 0 0
LOC: 107981 107636 108899 108698 108489 108326 108331 108254
ERR: 7
MIS: 0
Not only we see the card in the last I/O APIC, but we are not even close to
using up available IRQs, since we didn't waste any.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the x86_64 version of the signal fix I just posted for i386.
This problem was first noticed on PPC and has already been fixed there.
But the exact same issue applies to other platforms in the same way. The
signal blocking for sa_mask and the handled signal takes place after the
handler setup. When the stack is bogus, the handler setup forces a
SIGSEGV. But then this will be blocked, and returning to user mode will
fault again and iterate. This patch fixes the problem by checking whether
signal handler setup failed, and not doing the signal-blocking if so. This
copies what was done in the ppc code. I think all architectures' signal
handler setup code follows this pattern and needs the change.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently the x86-64 HPET code assumes the entire HPET implementation from
the spec is present. This breaks on boxes that do not implement the
optional legacy timer replacement functionality portion of the spec.
This patch fixes this issue, allowing x86-64 systems that cannot use the
HPET for the timer interrupt and RTC to still use the HPET as a time
source. I've tested this patch on a system systems without HPET, with HPET
but without legacy timer replacement, as well as HPET with legacy timer
replacement.
This version adds a minor check to cap the HPET counter value in
gettimeoffset_hpet to avoid possible time inconsistencies. Please ignore
the A2 version I sent to you earlier.
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make the timer frequency selectable. The timer interrupt may cause bus
and memory contention in large NUMA systems since the interrupt occurs
on each processor HZ times per second.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow early printk code to take advantage of the full size of the screen, not
just the first 25 lines.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define pcibus_to_node to be able to figure out which NUMA node contains a
given PCI device. This defines pcibus_to_node(bus) in
include/linux/topology.h and adjusts the macros for i386 and x86_64 that
already provided a way to determine the cpumask of a pci device.
x86_64 was changed to not build an array of cpumasks anymore. Instead an
array of nodes is build which can be used to generate the cpumask via
node_to_cpumask.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Issue:
Current tsc based delay_calibration can result in significant errors in
loops_per_jiffy count when the platform events like SMIs
(System Management Interrupts that are non-maskable) are present. This could
lead to potential kernel panic(). This issue is becoming more visible with 2.6
kernel (as default HZ is 1000) and on platforms with higher SMI handling
latencies. During the boot time, SMIs are mostly used by BIOS (for things
like legacy keyboard emulation).
Description:
The psuedocode for current delay calibration with tsc based delay looks like
(0) Estimate a value for loops_per_jiffy
(1) While (loops_per_jiffy estimate is accurate enough)
(2) wait for jiffy transition (jiffy1)
(3) Note down current tsc (tsc1)
(4) loop until tsc becomes tsc1 + loops_per_jiffy
(5) check whether jiffy changed since jiffy1 or not and refine
loops_per_jiffy estimate
Consider the following cases
Case 1:
If SMIs happen between (2) and (3) above, we can end up with a
loops_per_jiffy value that is too low. This results in shorted delays and
kernel can panic () during boot (Mostly at IOAPIC timer initialization
timer_irq_works() as we don't have enough timer interrupts in a specified
interval).
Case 2:
If SMIs happen between (3) and (4) above, then we can end up with a
loops_per_jiffy value that is too high. And with current i386 code, too
high lpj value (greater than 17M) can result in a overflow in
delay.c:__const_udelay() again resulting in shorter delay and panic().
Solution:
The patch below makes the calibration routine aware of asynchronous events
like SMIs. We increase the delay calibration time and also identify any
significant errors (greater than 12.5%) in the calibration and notify it to
user.
Patch below changes both i386 and x86-64 architectures to use this
new and improved calibrate_delay_direct() routine.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attached patch causes the various arch specific install.sh scripts to
look for ${CROSS_COMPILE}installkernel rather than just installkernel (in
both /sbin/ and ~/bin/ where the script already did this). This allows you
to have e.g. arm-linux-installkernel as a handy way to install on your
cross target. It also prevents the script picking up on the host
/sbin/installkernel which causes the script to fall through and do the
install itself (which is what I actually use myself, with $INSTALL_PATH
set).
I don't believe it causes back-compatibility problems since calling the
host installkernel was never likely to work or be what you wanted when
cross compiling anyway. If $CROSS_COMPILE isn't set then nothing changes.
I only use ARM and i386 myself but I figured it couldn't hurt to do the
whole lot. I've cc'd those who I hope are the arch maintainers for files
that I've touched.
Signed-off-by: Ian Campbell <icampbell@arcom.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds in the necessary support for sparsemem such that x86-64
kernels may use sparsemem as an alternative to discontigmem for NUMA
kernels. Note that this does no preclude one from continuing to build NUMA
kernels using discontigmem, but merely allows the option to build NUMA
kernels with sparsemem.
Interestingly, the use of sparsemem in lieu of discontigmem in NUMA kernels
results in reduced text size for otherwise equivalent kernels as shown in
the example builds below:
text data bss dec hex filename
2371036 765884 1237108 4374028 42be0c vmlinux.discontig
2366549 776484 1302772 4445805 43d66d vmlinux.sparse
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In order to use the alternative sparsemem implmentation for NUMA kernels,
we need to reorganize the config options. This patch effectively abstracts
out the CONFIG_DISCONTIGMEM options to CONFIG_NUMA in most cases. Thus,
the discontigmem implementation may be employed as always, but the
sparsemem implementation may be used alternatively.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the requisite arch specific Kconfig options to enable the use of the
sparsemem implementation for NUMA kernels on x86-64.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch pulls out all remaining direct references to contig_page_data
from arch/x86-64, thus saving an ifdef in one case.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For all architectures, this just means that you'll see a "Memory Model"
choice in your architecture menu. For those that implement DISCONTIGMEM,
you may eventually want to make your ARCH_DISCONTIGMEM_ENABLE a "def_bool
y" and make your users select DISCONTIGMEM right out of the new choice
menu. The only disadvantage might be if you have some specific things that
you need in your help option to explain something about DISCONTIGMEM.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.
The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.
The problem is twofold:
1) the free_area_cache is used to continue a search for memory where
the last search ended. Before the change new areas were always
searched from the base address on.
So now new small areas are cluttering holes of all sizes
throughout the whole mmap-able region whereas before small holes
tended to close holes near the base leaving holes far from the base
large and available for larger requests.
2) the free_area_cache also is set to the location of the last
munmap-ed area so in scenarios where we allocate e.g. five regions of
1K each, then free regions 4 2 3 in this order the next request for 1K
will be placed in the position of the old region 3, whereas before we
appended it to the still active region 1, placing it at the location
of the old region 2. Before we had 1 free region of 2K, now we only
get two free regions of 1K -> fragmentation.
The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache. If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.
The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.
Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.
Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.
Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.
The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.
Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.
In the new code, there are two externally visible symbols:
- smp_processor_id(): debug variant.
- raw_smp_processor_id(): nondebug variant. Replaces all existing
uses of _smp_processor_id() and __smp_processor_id(). Defined
by every SMP architecture in include/asm-*/smp.h.
There is one new internal symbol, dependent on DEBUG_PREEMPT:
- debug_smp_processor_id(): internal debug variant, mapped to
smp_processor_id().
Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file. All related comments got updated and/or
clarified.
I have build/boot tested the following 8 .config combinations on x86:
{SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
architectures are untested, but should work just fine.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Appended patch will setup compatibility mode TASK_SIZE properly. This will
fix atleast three known bugs that can be encountered while running
compatibility mode apps.
a) A malicious 32bit app can have an elf section at 0xffffe000. During
exec of this app, we will have a memory leak as insert_vm_struct() is
not checking for return value in syscall32_setup_pages() and thus not
freeing the vma allocated for the vsyscall page. And instead of exec
failing (as it has addresses > TASK_SIZE), we were allowing it to
succeed previously.
b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area
may return addresses beyond 32bits, ultimately causing corruption
because of wrap-around and resulting in SEGFAULT, instead of returning
ENOMEM.
c) 32bit app doing this below mmap will now fail.
mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE,
MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0);
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Martin Bligh determined that this patch is causing his test box to not boot.
Revert.
Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Even after the previous fix you can still set CONFIG_ACPI_BOOT
indirectly even without CONFIG_ACPI by choosing CONFIG_PCI and
CONFIG_PCI_MMCONFIG.
That doesn't build very well either.
This makes PCI_MMCONFIG depend on ACPI, fixing that hole.
[ I guess in theory Kconfig could follow the whole chain of dependencies
for things that get selected, but that sounds insanely complicated, so
we'll just fix up these things by hand. --Linus ]
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This works around the too fast timer seen on some ATI boards.
I don't feel confident enough about it yet to enable it by default, but give
users the option.
Patch and debugging from Christopher Allen Wing <wingc@engin.umich.edu>, with
minor tweaks (renamed the option and documented it)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The test case at
http://cvs.sourceforge.net/viewcvs.py/posixtest/posixtestsuite/conforman
ce/interfaces/clock_nanosleep/1-5.c fails if it runs as a 32bit process on
x86_86 machines.
The root cause is the sub 32bit process fails to restart the syscall after it
is interrupted by a signal.
The syscall number of sys_restart_syscall in table sys_call_table is
__NR_restart_syscall (219) while it's __NR_ia32_restart_syscall
(0) in ia32_sys_call_table. When regs->rax==(unsigned
long)-ERESTART_RESTARTBLOCK, function do_signal doesn't distinguish if
the process is 64bit or 32bit, and always sets restart syscall number
as __NR_restart_syscall (219).
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Caused oopses again. Also fix potential mismatch in checking if
change_page_attr was needed.
To do it without races I needed to change mm/vmalloc.c to export a
__remove_vm_area that does not take vmlist lock.
Noticed by Terence Ripperda and based on a patch of his.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was a "off by one quad word" error in there. I don't think it is
exploitable because it will only store into a unused area, but better to plug
it.
Found and fixed by John Blackwood
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Remove duplicated ifdef
- Make core_id match what Intel uses
- Initialize phys_proc_id correctly for non DC case
- Handle non power of two core numbers.
Fixes for both i386 and x86-64
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the assumption that LAPIC entries contain the BSP as its
first entry. This is a slight improvement to the temporary fix submitted by
Suresh Siddha.
- Removes assumption that LAPIC entries contain BSP first.
- Builds x86_acpiid_to_apicid[] and bios_cpu_apicid[] properly with BSP as
first entry.
- Made maxcpus=1 boot on these systems. Since the parsing earlier in
arch/x86_64/kernel/mpparse.c stopped after maxcpus entries, other entries
were not processed, this causes kernel not to boot on these systems.
TBD: x86_acpiid_to_apicid and bios_cpu_apicid[] seem to be exactly the
same. This could be removed, but might need more work to cleanup.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Collected NMI watchdog fixes.
- Fix call of check_nmi_watchdog
- Remove earlier move of check_nmi_watchdog to later. It does not fix the
race it was supposed to fix fully.
- Remove unused P6 definitions
- Add support for performance counter based watchdog on P4 systems.
This allows to run it only once per second, which saves some CPU time.
Previously it would run at 1000Hz, which was too much.
Code ported from i386
Make this the default on Intel systems.
- Use check_nmi_watchdog with local APIC based nmi
- Fix race in touch_nmi_watchdog
- Fix bug that caused incorrect performance counters to be programmed in a
few cases on K8.
- Remove useless check for local APIC
- Use local_t and per_cpu variables for per CPU data.
- Keep other CPUs busy during check_nmi_watchdog to make sure they really
tick when in lapic mode.
- Only check CPUs that are actually online.
- Various other fixes.
- Fix fallback path when MSRs are unimplemented
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Originally from Matt Tolentino
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use bitmap_zero instead of bitmap_empty to initialise cpu mask This makes it
actually run reliable instead of relying on stack state.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PTEs can point to ioremap mappings too, and these are often outside
mem_map. The NUMA hash page lookup functions cannot handle out of bounds
accesses properly.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allowed user programs to set a non canonical segment base, which would cause
oopses in the kernel later.
Credit-to: Alexander Nyberg <alexn@dsv.su.se>
For identifying and reporting this bug.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This works around an AMD Erratum.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are unfortunately more and more multi processor Opteron systems which
don't have HPET timer support in the southbridge. This covers in particular
Nvidia and VIA chipsets. They also don't guarantee that the TSCs are
synchronized between CPUs; and especially with MP powernow the systems are
nearly unusable because the time gets very inconsistent between CPUs.
The timer code for x86-64 was originally written under the assumption that we
could fall back to the HPET timer on such systems. But this doesn't work
there.
Another alternative is to use the ACPI PM timer as primary time source. This
patch does that. The kernel only uses PM timer when there is no other choice
because it has some disadvantages.
Ported over from i386. It should be faster than the i386 version because I
dropped the "read three times" workaround, but is still considerable slower
than HPET and also does not work together with vsyscalls which have to be
disabled.
Cc: <mark.langsdorf@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is unnecessary on modern Intel or AMD systems, and that is all we support
on x86-64
Also causes problems on various systems
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is not very useful to the user and more an kernel internal implementation
detail. So hide it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The new TSC sync algorithm recently submitted did not work too well.
The result was that some MP machines where the TSC came up of the BIOS very
unsynchronized and that did not have HPET support were nearly unusable because
the time would jump forwards and backwards between CPUs.
After a lot of research ;-) and some more prototypes I ended up with just
using the one from IA64 which looks best. It has some internal self tuning
that should adapt to changing interconnect latencies. It holds up in my tests
so far.
I believe it was originally written by David Mosberger, I just ported it over
to x86-64. See the inline comment for a description.
This cleans up the code because it uses smp_call_function for syncing instead
of having custom hooks in SMP bootup.
Please note that the cycle numbers it outputs are too optimistic because they
do not take into account the latency of WRMSR and RDTSC, which can be hundreds
of cycles. It seems to be able to sync a dual Opteron to 200-300 cycles,
which is probably good enough.
There is a timing window during AP bootup where interrupts can see
inconsistent time before the TSC is synced. It is hard to avoid unfortunately
because we can only do the TSC sync after some setup, and we need to enable
interrupts before that. I just ignored it for now.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It could be in a memory hole not mapped in mem_map and that causes the hash
lookup to go off to nirvana.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Last round hopefully of cpu_core_id changes hopefully fow now:
- Always initialize cpu_core_id for all CPUs, even when no dual core setup
is detected. This prevents funny /proc/cpuinfo output
- Do the same with phys_proc_id[] even when no HyperThreading - dito.
- Use the CPU APIC-ID from CPUID 1 instead of the linux virtual CPU number
to identify the core for AMD dual core setups.
Patch for i386/x86-64.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Cleans up the system exit call slightly and synchronizes with my tree again.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NR_CPUs can be quite big these days. kmalloc the per CPU array instead of
putting it onto the stack
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When I do a "diff -Nur arch/i386 arch/x86_64" to see what is different between these two
architectures, I see some differences due to whitespace issues only. The attached patch removes
some of the noise by fixing up the following files:
- arch/i386/boot/bootsect.S
- arch/i386/boot/video.S
- arch/x86_64/boot/bootsect.S
Signed-off-by: Daniel Dickman <didickman@yahoo.com>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Kprobes could not handle the insertion of a probe on the ret/lret
instruction and used to oops after single stepping since kprobes was
modifying eip/rip incorrectly. Adjustment of eip/rip is not required after
single stepping in case of ret/lret instruction, because eip/rip points to
the correct location after execution of the ret/lret instruction. This
patch fixes the above problem.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In include/asm-x86_64/string.h there are such comments:
/* Use C out of line version for memcmp */
#define memcmp __builtin_memcmp
int memcmp(const void * cs,const void * ct,size_t count);
This would mean that if the compiler does not decide to use __builtin_memcmp,
it emits a call to memcmp to be satisfied by the C out-of-line version in
lib/string.c. What happens is that after preprocessing, in lib/string.i you
may find the definition of "__builtin_strcmp".
Actually, by accident, in the object you will find the definition of strcmp
and such (maybe a trick intended to redirect calls to __builtin_memcmp to the
default memcmp when the definition is not expanded); however, this particular
case is not a documented feature as far as I can see.
Also, the EXPORT_SYMBOL does not work, so it's duplicated in the arch.
I simply added some #undef to lib/string.c and removed the (now duplicated)
exports in x86-64 and UML/x86_64 subarchs (the second ones are introduced by
another patch I just posted for -mm).
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These are some trivial fixes for the x86-64 subarch module support. The only
potential problem is that I have to modify arch/x86_64/kernel/module.c, to
avoid copying the whole of it.
I can't use it verbatim because it depends on a special vmalloc-like area for
modules, which for now (maybe that's to fix, I guess not) UML/x86-64 has not.
I went the easy way and reused the i386 vmalloc()-based allocator.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A bunch of drivers use ISA DMA helpers or their equivalents for
platforms that have ISA with different DMA controller (a lot of ARM
boxen). Currently there is no way to put such dependency in Kconfig -
CONFIG_ISA is not it (e.g. it is not set on platforms that have no ISA
slots, but have on-board devices that pretend to be ISA ones).
New symbol added - ISA_DMA_API. Set when we have functional
enable_dma()/set_dma_mode()/etc. set of helpers. Next patches in the
series will add missing dependencies for drivers that need them.
I'm very carefully staying the hell out of the recurring flamefest on
what exactly CONFIG_ISA would mean in ideal world - added symbol has a
well-defined meaning and for now I really want to treat it as completely
independent from the mess around CONFIG_ISA.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another large rollup of various patches from Adrian which make things static
where they were needlessly exported.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This strcpy can run off the end of saved_command_line, and we don't need it any more anyway.
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Brings sanitize_e820_map() in x86-64 in sync with that of i386.
x86_64 version was missing the changes from this patch.
http://linux.bkbits.net:8080/linux-2.6/cset@3e5e4083Y3HevldZl5KCy94V4DcZww?nav=index.html|src/|src/arch|src/arch/i386|src/arch/i386/kernel|related/arch/i386/kernel/setup.c
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The specifications that talk about E820 map doesn't have an upper limit on
the number of e820 entries. But, today's kernel has a hard limit of 32.
With increase in memory size, we are seeing the number of E820 entries
reaching close to 32. Patch below bumps the number upto 128.
The patch changes the location of EDDBUF in zero-page (as it comes after E820).
As, EDDBUF is not used by boot loaders, this patch should not have any effect
on bootloader-setup code interface.
Patch covers both i386 and x86-64.
Tested on:
* grub booting bzImage
* lilo booting bzImage with EDID info enabled
* pxeboot of bzImage
Side-effect:
bss increases by ~ 2K and init.data increases by ~7.5K
on all systems, due to increase in size of static arrays.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
http://bugme.osdl.org/show_bug.cgi?id=4426
vendor_id : AuthenticAMD
cpu family : 6
model : 10
model name : AMD Athlon(tm) XP
stepping : 0
cpu MHz : 2204.807
<snipped>
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips : 4358.14
We're marking bit 0 of extended function 0x80000001 cpuid as PNI support on
AMD processors, when it actually denotes x87 FPU present. Patch for i386
and x86_64 below.
Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The recent support for K8 multicore was misported from x86-64 to i386, due
to an unnecessary inconsistency between the CPUID code. Sure, there is are
no x86-64 VIA chips yet, but it should happen eventually.
This patch fixes the i386 bug as well as makes x86-64 match i386 in the
handing of the CPUID array.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A bug against an xSeries system showed up recently noting that the
check_nmi_watchdog() test was failing.
I have been investigating it and discovered in both i386 and x86_64 the
recent change to the routine to use the cpu_callin_map has uncovered a
problem. Prior to that change, on an SMP box, the test was trivally
passing because all cpu's were found to not yet be online, but now with the
callin_map they are discovered, it goes on to test the counter and they
have not yet begun to increment, so it announces a CPU is stuck and bails
out.
On all the systems I have access to test, the announcement of failure is
also bougs... by the time you can login and check /proc/interrupts, the
NMI count is happily incrementing on all CPUs. Its just that the test is
being done too early.
I have tried moving the call to the test around a bit, and it was always
too early. I finally hit on this proposed solution, it delays the routine
via a late_initcall(), seems like the right solution to me.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The new i386/x86_64 assemblers no longer accept instructions for moving
between a segment register and a 32bit memory location, i.e.,
movl (%eax),%ds
movl %ds,(%eax)
To generate instructions for moving between a segment register and a
16bit memory location without the 16bit operand size prefix, 0x66,
mov (%eax),%ds
mov %ds,(%eax)
should be used. It will work with both new and old assemblers. The
assembler starting from 2.16.90.0.1 will also support
movw (%eax),%ds
movw %ds,(%eax)
without the 0x66 prefix. I am enclosing patches for 2.4 and 2.6 kernels
here. The resulting kernel binaries should be unchanged as before, with
old and new assemblers, if gcc never generates memory access for
unsigned gsindex;
asm volatile("movl %%gs,%0" : "=g" (gsindex));
If gcc does generate memory access for the code above, the upper bits
in gsindex are undefined and the new assembler doesn't allow it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We were calling ptrace_notify() after auditing the syscall and arguments,
but the debugger could have _changed_ them before the syscall was actually
invoked. Reorder the calls to fix that.
While we're touching ever call to audit_syscall_entry(), we also make it
take an extra argument: the architecture of the syscall which was made,
because some architectures allow more than one type of syscall.
Also add an explicit success/failure flag to audit_syscall_exit(), for
the benefit of architectures which return that in a condition register
rather than only returning a single register.
Change type of syscall return value to 'long' not 'int'.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
The addition of the PT_NOTE didn't take in the x86_64 version of the i386
vDSO, because I forgot the linker script bit in that copy.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
->pretcode in struct rt_sigframe is a userland pointer (and already
treated as such by code using that field).
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The labels after the last put_user patch were misplaced so
exceptions on the real mov instructions would not be handled.
Noted by Brian Gerst <bgerst@didntduck.org>
The new out of line put_user() assembly on x86_64 changes %rcx without
telling GCC about it causing things like:
http://bugme.osdl.org/show_bug.cgi?id=4515
See to it that %rcx is not changed (made it consistent with get_user()).
Signed-off-by: Alexander Nyberg <alexn@telia.com>
Signed-off-by: ak@suse.de
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I thought I'm done with fixing u32 vs. pm_message_t ... unfortunately that
turned out not to be the case... Here are fixes x86-64.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- broken sibling_map setup in x86_64
- grouping all the core and HT related cpuinfo fields.
We are reasonably sure that adding new cpuinfo fields after "siblings" field,
will not cause any app failure. Thats because today's /proc/cpuinfo
format is completely different on x86, x86_64 and we haven't heard of any
x86 app breakage because of this issue. Grouping these fields will
result in more or less common format on all architectures (ia64, x86 and
x86_64) and will cause less confusion.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This will allow hotplug CPU in the future and in general cleans up a lot of
crufty code. It also should plug some races that the old hackish way
introduces. Remove one old race workaround in NMI watchdog setup that is not
needed anymore.
I removed the old total sum of bogomips reporting code. The brag value of
BogoMips has been greatly devalued in the last years on the open market.
Real CPU hotplug will need some more work, but the infrastructure for it is
there now.
One drawback: the new TSC sync algorithm is less accurate than before. The
old way of zeroing TSCs is too intrusive to do later. Instead the TSC of the
BP is duplicated now, which is less accurate.
akpm:
- sync_tsc_bp_init seems to have the sense of `init' inverted.
- SPIN_LOCK_UNLOCKED is deprecated - use DEFINE_SPINLOCK.
Cc: <rusty@rustcorp.com.au>
Cc: <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was confusingly named.
Signed-off-by: Andi Kleen <ak@suse.de>
DESC
x86_64: Switch SMP bootup over to new CPU hotplug state machine
EDESC
From: "Andi Kleen" <ak@suse.de>
This will allow hotplug CPU in the future and in general cleans up a lot of
crufty code. It also should plug some races that the old hackish way
introduces. Remove one old race workaround in NMI watchdog setup that is not
needed anymore.
I removed the old total sum of bogomips reporting code. The brag value of
BogoMips has been greatly devalued in the last years on the open market.
Real CPU hotplug will need some more work, but the infrastructure for it is
there now.
One drawback: the new TSC sync algorithm is less accurate than before. The
old way of zeroing TSCs is too intrusive to do later. Instead the TSC of the
BP is duplicated now, which is less accurate.
Cc: <rusty@rustcorp.com.au>
Cc: <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Exceptions and hardware interrupts can, to a certain degree, nest, so when
attempting to follow the sequence of stacks used in order to dump their
contents this has to be accounted for. Also, IST stacks have their tops
stored in the TSS, so there's no need to add the stack size to get to their
ends.
Minor changes from AK.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up the code greatly. Now uses the infrastructure from the Intel dual
core patch Should fix a final bug noticed by Tyan of not detecting the nodes
correctly in some corner cases.
Patch for x86-64 and i386
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Appended patch adds the support for Intel dual-core detection and displaying
the core related information in /proc/cpuinfo.
It adds two new fields "core id" and "cpu cores" to x86 /proc/cpuinfo and the
"core id" field for x86_64("cpu cores" field is already present in x86_64).
Number of processor cores in a die is detected using cpuid(4) and this is
documented in IA-32 Intel Architecture Software Developer's Manual (vol 2a)
(http://developer.intel.com/design/pentium4/manuals/index_new.htm#sdm_vol2a)
This patch also adds cpu_core_map similar to cpu_sibling_map.
Slightly hacked by AK.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Calling a notifier three times in the debug handler does not make much sense,
because a debugger can figure out the various conditions by itself. Remove
the additional calls to DIE_DEBUG and DIE_DEBUGSTEP completely.
This matches what i386 does now.
This also makes sure interrupts are always still disabled when calling a
debugger, which prevents:
BUG: using smp_processor_id() in preemptible [00000001] code: tpopf/1470
caller is post_kprobe_handler+0x9/0x70
Call Trace:<ffffffff8024f10f>{smp_processor_id+191} <ffffffff80120e69>{post_kpro
be_handler+9}
<ffffffff80120f7a>{kprobe_exceptions_notify+58}
<ffffffff80144fc0>{notifier_call_chain+32} <ffffffff80110daf>{do_debug+335}
<ffffffff8010f513>{debug+127} <EOE>
on preemptible debug kernels with kprobes when single stepping in user space.
This was probably a bug even on non preempt kernels, this function was
supposed to be running with interrupts off according to a comment there.
Note to third part debugger maintainers: please double check your debugger can
still single step.
Cc: <prasanna@in.ibm.com>
Cc: <jbeulich@novell.com>
Cc: <kaos@sgi.com>
Cc: <jim.houston@ccur.com>
Cc: <jfv@bluesong.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This might save memory on some Opteron systems without AGP bridge.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Look for gaps in the e820 memory map to put PCI resources in.
This hopefully fixes problems with the PCI code assigning 32bit BARs MMIO
resources which are >32bit.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They are rumoured to be much more reliable than the RIP in the stack frame on
P4s.
This is a borderline case because the code is very simple. Please note there
are no plans to add support for all the MCE register MSRs.
Cc: <venkatesh.pallipadi@intel.com>
Cc: <racing.guo@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The NMI watchdog code did this incorrectly
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There were no reports about the previous warning for FPU exceptions in the
kernel, so make it a die() now.
Also improve the error messages slightly.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On Intel Noconas the TSC ticks with a constant frequency. Don't scale the
factor used by udelay when cpufreq changes the frequency.
This generalizes an earlier patch by Intel for this.
Cc: <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Could lead to a lost reschedule event when the process already rescheduled on
exception exit, and needs it again while still being in the kernel. Unlikely
case though.
Also remove one redundant cli in another entry.S path.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes various issues in the return path for "paranoid"
handlers (= running on a private exception stack that act like NMIs).
Generalize previous hack to switch back to process stack for
scheduling/signal handling purposes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This removes some unnecessary code in the assembly files.
Matches i386 behaviour.
In addition don't clear the work check mask after work has been done.
This fixes some theoretical signal/other event losses.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ported from i386/Linus
Fix another TF corner case. Need to do the special TF handling for all
signals to make debuggers happy
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ported from i386/Linus
Still won't handle other TF changing instructions like IRET or LAHF.
Prefix handling must be double checked...
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ported from i386/Linus
Be more careful with TF handling to fix some copy protection codes in Wine
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ported from i386 (originally from Linus)
clean up ptrace single-stepping, make PT_DTRACE exact.
(This makes the naming of "DTRACE" purely historical, since
on x86 it now means "single step in progress").
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use a real VMA to map the 32bit vsyscall page
This interacts better with Hugh's upcomming VMA walk optimization
Also removes some ugly special cases.
Code roughly modelled after the ppc64 vdso version from Ben Herrenschmidt.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I had strange NMI watchdog timeouts running sysrq-T across 9600-baud serial.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 genapic mechanism should be aware of machines that use physical APIC
mode regardless of how many clusters/processors are detected.
ACPI 3.0 FADT makes this determination very simple by providing a feature
flag "force_apic_physical_destination_mode" to state whether the machine
unconditionally uses physical APIC mode.
Unisys' next generation x86_64 ES7000 will need to utilize this FADT
feature flag in order to boot the x86_64 kernel in the correct APIC mode.
This patch has been tested on both x86_64 commodity and ES7000 boxes.
Signed-off-by: Jason Davis <jason.davis@unisys.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Port over a i386 kludge from rusty to x86-64
I don't think it is a full solution, but the upcomming smp bootup rewrite
will solve it.
This fixes BUGs at bootup on bigger x86-64 systems.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Only display physical id/siblings when there are siblings or dual core.
In 2.6.11 I accidentially broke it and it was always displaying these
fields But for compatibility to all these /proc parsers around it is better
to do it in the old way again.
Noticed by Suresh Siddha
Cc: <Suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use the i386 PT_NOTE segment in x86_64 as well.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!