Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for
each arch. Its definition is sometimes configurable. Indeed, ia64 defines 5
NODES_SHIFT values in the current git tree. But it looks a bit messy.
SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has
been changeable by config. Suitable node's number may be changed in the
future even if it is other architecture. So, I wrote configurable node's
number.
This patch set defines just default value for each arch which needs multi
nodes except ia64. But, it is easy to change to configurable if necessary.
On ia64 the number of nodes can be already configured in generic ia64 and SN2
config. But, NODES_SHIFT is defined for DIG64 and HP'S machine too. So, I
changed it so that all platforms can be configured via CONFIG_NODES_SHIFT. It
would be simpler.
See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Or rather compute it based on the table length automatically.
This also has the intended side effect of not warning for new system calls
anymore.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix CONFIG_REORDER.
The value of cflags-y was assined to CFLAGS before cflags-y was assigned
the value used for CONFIG_REORDER.
Use cflags-y for all CFLAGS options in the Makefile to avoid this
happening again.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In linux-2.6.16, we have noticed a problem where the gs base value
returned from an arch_prtcl(ARCH_GET_GS, ...) call will be incorrect if:
- the current/calling task has NOT set its own gs base yet to a
non-zero value,
- some other task that ran on the same processor previously set their
own gs base to a non-zero value.
In this situation, the ARCH_GET_GS code will read and return the
MSR_KERNEL_GS_BASE msr register.
However, since the __switch_to() code does NOT load/zero the
MSR_KERNEL_GS_BASE register when the task that is switched IN has a zero
next->gs value, the caller of arch_prctl(ARCH_GET_GS, ...) will get back
the value of some previous tasks's gs base value instead of 0.
Change the arch_prctl() ARCH_GET_GS code to only read and return
the MSR_KERNEL_GS_BASE msr register if the 'gs' register of the calling
task is non-zero.
Side note: Since in addition to using arch_prctl(ARCH_SET_GS, ...),
a task can also setup a gs base value by using modify_ldt() and write
an index value into 'gs' from user space, the patch below reads
'gs' instead of using thread.gs, since in the modify_ldt() case,
the thread.gs value will be 0, and incorrect value would be returned
(the task->thread.gs value).
When the user has not set its own gs base value and the 'gs'
register is zero, then the MSR_KERNEL_GS_BASE register will not be
read and a value of zero will be returned by reading and returning
'task->thread.gs'.
The first patch shown below is an attempt at implementing this
approach.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).
If HZ changes, this drift can become even more pronounced.
HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.
Vojtech comments:
"It's not entirely correct (it assumes the HPET ticks totally
exactly), but it's significantly better than assuming the PIT error
there."
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mostly to get better handling when a extended config space
access has to fallback to Type1.
Cc: gregkh@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously only the first bus would be checked against Type 1.
Why 16? Checking all would need too much memory and we
can assume that systems with more than 16 busses have better than
average quality BIOS.
This is an additional defense against bad MCFG tables.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Intel EM64T CPUs handle uncanonical return addresses differently
from AMD CPUs.
The exception is reported in the SYSRET, not the next instruction.
This leads to the kernel exception handler running on the user stack
with the wrong GS because the kernel didn't expect exceptions
on this instruction.
This version of the patch has the teething problems that plagued an earlier
version fixed.
This is CVE-2006-0744
Thanks to Ernie Petrides and Asit B. Mallick for analysis and initial
patches.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Machine checks can stall the machine for a long time and
it's not good to trigger the nmi watchdog during that.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Needed for other checks later in ACPI.
Pointed out by Len Brown
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces a user for the e820_all_mapped function:
There have been several machines that don't have a working MMCONFIG,
often because of a buggy MCFG table in the ACPI bios. This patch adds a
simple sanity check that detects a whole bunch of these cases, and when
it detects it, linux now boots rather than crash-and-burns.
The accuracy of this detection can in principle be improved if there was
a "is this entire range in e820 with THIS attribute", but no such
function exist and the complexity needed for this is not really worth
it; this simple check already catches most cases anyway.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a e820_all_mapped() function which checks if the entire range
<start,end> is mapped with type.
This is done by moving the local start variable to the end of each
known-good region; if at the end of the function the start address is
still before end, there must be a part that's not of the correct type;
otherwise it's a good region.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename e820_mapped to e820_any_mapped since it tests if any part of the
range is mapped according to the type.
Later steps will introduce e820_all_mapped which will check if the
entire range is mapped with the type. Both have their merit.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The node setup code would try to allocate the node metadata in the node
itself, but that fails if there is no memory in there.
This can happen with memory hotplug when the hotplug area defines an so
far empty node.
Now use bootmem to try to allocate the mem_map in other nodes.
And if it fails don't panic, but just ignore the node.
To make this work I added a new __alloc_bootmem_nopanic function that
does what its name implies.
TBD should try to use nearby nodes here. Currently we just use any.
It's hard to do it better because bootmem doesn't have proper fallback
lists yet.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Keith Mannthey, Andi Kleen
Implement memory hotadd without sparsemem. The memory in the SRAT
hotadd area is just preserved instead and can be activated later.
There are a few restrictions:
- Only one continuous hotadd area allowed per node
The main problem is dealing with the many buggy SRAT tables
that are out there. The strategy here is to reject anything
suspicious.
Originally from Keith Mannthey, with several hacks and changes by AK
and also contributions from Andrew Morton
[ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.
Rebuilding zonelist is necessary when the system has just memory <
4G at boot, and hot add memory > 4G. because x86_64 has DMA32,
ZONE_NORAML is not included into zonelist at boot time if system
doesn't have memory >4G at boot.
[AK: should just force the higher zones at boot time when SRAT tells us]
2) zone and node's spanned_pages and present_pages are not incremented.
They should be.
For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have
possible 1T +memory. (Microsoft requires "write all possible memory
in SRAT") When we reserve memmap for possible 1T memory, Linux will
not work well in +minimum 4G configuraion ;)
[AK: needs limiting to 5-10% of max memory]
]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Memory hotadd doesn't need SPARSEMEM, but can be handled by just preallocating
mem_maps. This only needs some untangling of ifdefs to enable the necessary
code even without SPARSEMEM.
Originally from Keith Mannthey, hacked by AK.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Just call IRET always, no need for any special cases.
Needed for the next bug fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Otherwise, illegal configurations like X86_VOYAGER=y, PCI=y are
possible.
This patch also fixes the options select'ing ACPI to also select PCI.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
The only user of get_wchan is the proc fs - and proc can't be built modular.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).
And __setup() is used in obsolete_checksetup().
start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()
If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.
If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].
Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.
This patch fixes a wrong usage of it, however fixes obvious one only.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark unwind info for signal trampolines using the new S augmentation flag
introduced in: http://gcc.gnu.org/PR26208.
GCC 4.2 (or patched earlier GCC) will be able to special case unwinding
through frames right above signal trampolines. As the augmentations start
with z flag and S is at the very end of the augmentation string, older GCCs
will just skip the S flag as unknown (that's why an augmentation flag was
chosen over say a new CFA opcode).
Signed-off-by: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c sleep_notifier_list
drivers/macintosh/via-pmu68k.c sleep_notifier_list
drivers/macintosh/windfarm_core.c wf_client_list
drivers/usb/core/notify.c usb_notifier_list
drivers/video/fbmem.c fb_notifier_list
kernel/cpu.c cpu_chain
kernel/module.c module_notify_list
kernel/profile.c munmap_notifier
kernel/profile.c task_exit_notifier
kernel/sys.c reboot_notifier_list
net/core/dev.c netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64: add the futex_atomic_cmpxchg_inuser() assembly implementation, and
wire up the new syscalls.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new sched domain for representing multi-core with shared caches
between cores. Consider a dual package system, each package containing two
cores and with last level cache shared between cores with in a package. If
there are two runnable processes, with this appended patch those two
processes will be scheduled on different packages.
On such systems, with this patch we have observed 8% perf improvement with
specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2
users).
This new domain will come into play only on multi-core systems with shared
caches. On other systems, this sched domain will be removed by domain
degeneration code. This new domain can be also used for implementing power
savings policy (see OLS 2005 CMP kernel scheduler paper for more details..
I will post another patch for power savings policy soon)
Most of the arch/* file changes are for cpu_coregroup_map() implementation.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide proper kprobes fault handling, if a user-specified pre/post handlers
tries to access user address space, through copy_from_user(), get_user() etc.
The user-specified fault handler gets called only if the fault occurs while
executing user-specified handlers. In such a case user-specified handler is
allowed to fix it first, later if the user-specifed fault handler does not fix
it, we try to fix it by calling fix_exception().
The user-specified handler will not be called if the fault happens when single
stepping the original instruction, instead we reset the current probe and
allow the system page fault handler to fix it up.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently kprobe handler traps only happen in kernel space, so function
kprobe_exceptions_notify should skip traps which happen in user space.
This patch modifies this, and it is based on 2.6.16-rc4.
Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: <hiramatu@sdl.hitachi.co.jp>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When kretprobe probes the schedule() function, if the probed process exits
then schedule() will never return, so some kretprobe instances will never
be recycled.
In this patch the parent process will recycle retprobe instances of the
probed function and there will be no memory leak of kretprobe instances.
Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Create compat_sys_adjtimex and use it an all appropriate places.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We had a copy of the compatibility version of struct timex in each 64 bit
architecture. This patch just creates a global one and replaces all the
usages of the old ones.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a "make isoimage" to i386 and x86-64, which allows the automatic
creation of a bootable CD image. It also adds an option FDINITRD= to
include an initrd of the user's choice in generated floppy- or CD boot
images. Finally, some minor cleanups of the image generation code.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
tcsh is not happy with the -9999 error code.
Suggested by Ernie Petrides
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No need to restrict to power of two here.
TBD needs more double checking
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pfn_to_page() and others need to access both memnode_shift and the very
first bytes of memnodemap[]. If we force memnode_shift to be just before the
memnodemap array, we can reduce the memory footprint to one cache line
instead of two for most setups. This patch introduce a 'memnode' structure
where shift and map[] are carefully placed.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
register_die_notifier is exported twice, once in traps.c and once in
x8664_ksyms.c. This results in a warning on build.
Signed-off-by: Kevin Winchester <kwin@ns.sympatico.ca>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
arch/x86_64/kernel/aperture.c: The search for the AGP bridge has been
extended to search for all the 256 buses instead of the first 32. This
is required since on a some systems, the bridge may be located on a bus
much farther than the first 32. By searching all 256 buses, we guarantee
that the search succeeds on such systems.
arch/x86_64/kernel/pci-gart.c: The search for the Northbridge is not
limited to just bus 0 anymore. This is required because on certain
systems, we may not find one on bus 0.
Signed-off-by: Navin Boppuri <navin.boppuri@newisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Have the GART_IOMMU help text specify that this is the hardware IOMMU in
amd64 processors. This will be significant if/when other IOMMUs are
added to the x86-64 architecture. :-)
Also, note that the previous help text stated that IOMMU was needed for
>3GB memory instead of >4GB. This is fixed in the newer version.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was a failed experiment - all benchmarks done with it on both AMD
and Intel showed it was a loss. That was probably because the store
buffers of the CPUs for write combining traffic weren't large enough.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
free_bootmem_node expects a physical address to be passed in, but
__alloc_bootmem_node returns a virtual one. That address needs to be
translated to physical.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o check_timer() routine fails while second kernel is booting after a crash
on an opetron box. Problem happens because timer vector (0x31) seems to be
locked.
o After a system crash, it is not safe to service interrupts any more, hence
interrupts are disabled. This leads to pending interrupts at LAPIC. LAPIC
sends these interrupts to the CPU during early boot of second kernel. Other
pending interrupts are discarded saying unexpected trap but timer interrupt
is serviced and CPU does not issue an LAPIC EOI because it think this
interrupt came from i8259 and sends ack to 8259. This leads to vector 0x31
locking as LAPIC does not clear respective ISR and keeps on waiting for
EOI.
o This patch issues extra EOI for the pending interrupts who have ISR set.
o Though today only timer seems to be the special case because in early
boot it thinks interrupts are coming from i8259 and uses
mask_and_ack_8259A() as ack handler and does not issue LAPIC EOI. But
probably doing it in generic manner for all vectors makes sense.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
cpu_vm_mask is of type cpumask_t, so use the proper bitops.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes problems with very large nodes (over 128GB) filling up all of
the first 4GB with their mem_map and not leaving enough space for the
swiotlb.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This means i386 processes compiled with a recent compiler will get non
executable heap by default now. This is the same default as a 32bit PAE
kernel would use on a NX enabled CPU.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Because 256 causes overflows in some code that stores them in 8 bit
fields and the x86 APIC architecture cannot handle more than 255
anyways.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When x86_64 timer init messages were changed to use apic verbosity
levels, two messages were missed and one got the wrong level. This
causes the last word of a suppressed message to print on a line by
itself. Fix that so either the entire message prints or none of it
does.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch puts the infrastructure in place to allow for a reordering of
functions based inside the vmlinux. The general idea is that it is possible
to put all "common" functions into the first 2Mb of the code, so that they
are covered by one TLB entry. This as opposed to the current situation where
a typical vmlinux covers about 3.5Mb (on x86-64) and thus 2 TLB entries.
This is done by enabling the -ffunction-sections flag in gcc, which puts
each function in its own ELF section, so that the linker can then order them
in a way defined by the linker script.
As per previous discussions, Linus said he wanted a "static" list for this,
eg a list provided by the kernel tarbal, so that most people have the same
ordering at least. A script is provided to create this list based on
readprofile(1) output. The included list is provisional, and entirely biased
on my own testbox and me running a few kernel compiles and some other
things.
I think that to get to a better list we need to invite people to submit
their own profiles, and somehow add those all up and base the final list on
that. I'm willing to do that effort if this is ends up being the prefered
approach. Such an effort probably needs to be repeated like once a year or
so to adopt to the changing nature of the kernel.
Made it a CONFIG with default n because it increases link times
dramatically.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I tested it on a couple of chipsets and it worked everywhere so it
should be ok as default for now.
So far I haven't done the great purge of the useless old check_timer
code yet though.
Can be overwritten with enable_8254_timer in the worst case
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a fallback logic, so it's better to not use the OOM killer
in the allocations.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ACPIv2 has an official but optional way to get a date >2100. Use it.
But all the platforms I tested didn't seem to support it. But anyways
the x86-64 kernel should be ready for the 22nd century now. Actually i
shouldn't care about this because I will be dead by then @)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch puts the code from head.S in a special .bootstrap.text
section.
I'm working on a patch to reorder the functions in the kernel (I'll post
that later), but for x86-64 at least the kernel bootstrap requires that
the head.S functions are on the very first page/pages of the kernel
text. This is understandable since the bootstrap is complex enough
already and not a problem at all, it just means they aren't allowed to
be reordered. This patch puts these special functions into a separate
section to document this, and to guarantee this in the light of possibly
reordering the rest later.
(So this patch doesn't fix a bug per se, but makes things more robust by
making the order of these functions explicit)
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are more and more cases where we need to know DMI information
early to work around bugs. i386 already had early DMI scanning, but
x86-64 didn't. Implement this now.
This required some cleanup in the i386 code.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As suggested by Andi (and Alan), move the default kernel location
from 1Mb to 2Mb, to align to the start of a TLB entry.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In a micro-benchmark that stresses the pagefault path, the down_read_trylock
on the mmap_sem showed up quite high on the profile. Turns out this lock is
bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
prefetches the lock (for write) as early as possible (and before some other
somewhat expensive operations). With this patch, the down_read_trylock
basically fell out of the top of profile.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
phys_proc_id[] on AMD boxes is right now populated with the initial
apic id, obtained by the cpuid instruction. But, the initial apic id
need not be the local apic id on clustered APIC systems (see comment at
x86_64/kernel/genapic_cluster.c, line 110). On vSMPowered with AMD
CPUs the cpu_to_node will turn out to be incorrect (as apicid_to_node[] is
indexed by the initial apic id rather than the local apic id).
On vSMPowered boxes with Intel CPUs this is working correctly as
phys_proc_id[] is initialized correctly in detect_ht().
This fixes AMD boot path according to specification, to use the correct
routines for local apic id and socket ids. We use
hard_smp_processor_id() to read the local apic id, and phys_pkg_id() to
determine socket id for phys_proc_id[]
Patch tested on Tyan multicore boxes as well as vSMPowered boxes.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- adjust limits of GDT/IDT pseudo-descriptors (some were off by one)
- move empty_zero_page into .bss.page_aligned
- move cpu_gdt_table into .data.page_aligned
- move idt_table into .bss
- align inital_code and init_rsp
- eliminate pointless (re-)declaration of idt_table in traps.c
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It needs num_physpages, so initialize it early. It's later overwritten
again.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Attached is a small code style cleanup patch that resulted from my
skimming through the arch/x86_64/kernel/traps.c code to figure out what
went haywire.
Signed-off-by: Roberto Nibali <ratz@drugphish.ch>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We've always had the problem that arguments only did a prefix match,
which resulted e.g. in noapic and noapictimer getting confused.
Fix the early argument parsing code to always check that arguments are
whole words (except for those that take additional arguments of course)
I factored out the checking code for that while also makes the code
easier to maintain.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While the modular aspect of the respective i386 patch doesn't apply to
x86-64 (as the top level page directory entry is shared between modules
and the base kernel), handlers registered with register_die_notifier()
are still under similar constraints for touching ioremap()ed or
vmalloc()ed memory. The likelihood of this problem becoming visible is
of course significantly lower, as the assigned virtual addresses would
have to cross a 2**39 byte boundary. This is because the callback gets
invoked
(a) in the page fault path before the top level page table propagation
gets carried out (hence a fault to propagate the top level page table
entry/entries mapping to module's code/data would nest infinitly) and
(b) in the NMI path, where nested faults must absolutely not happen,
since otherwise the IRET from the nested fault re-enables NMIs,
potentially resulting in nested NMI occurences.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code waits for the GART to clear the TLB flush bit. Use cpu_relax
in this time to allow hypervisors to yield the CPU in this time.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PM timer path through main_timer_handler doesn't need
the delay variable because it figures it out in a different way.
Don't try to read it from the PIT. With stopped PIT timer
it is even useless.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Minor cleanup to lend better for physical CPU hotplug.
Earlier way of using num_processors as index doesnt
fit if CPUs come and go. This makes the code little bit better
to read, and helps physical hotplug use the same functions as boot.
Reserving CPU0 for BSP is too late to be done in smp_prepare_boot_cpu().
Since logical assignments from MADT is already done via
setup_arch()->acpi_boot_init()->parse lapic
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Touching of the floating point state in a kernel debugger must be
NMI-safe, specifically math_state_restore() must be able to deal with
being called out of an NMI context. In order to do that reliably, the
context switch code must take care to not leave a window open where
the current task's TS_USEDFPU flag and CR0.TS could get out of sync.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For consistency and to have only a single place of definition, replace
set_debug() uses with set_debugreg(), and eliminate the definition of
thj former.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While AMD formally permits multi-byte execution breakpoints, Intel
disallows 8-byte as much as 2- or 4-byte ones.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix one place where the previous change of cpu_pda from being an array
to being a macro was not properly carried out.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It conflicts with the struct node in node.h
Actually the x86-64 version was there first, but ..
Suggested by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The upcomming gcc 4.2 got a new option -mtune=generic to tune
code for both common AMD and Intel CPUs. Use this option
when available for generic kernels.
On x86-64 it is used with CONFIG_GENERIC_CPU. On i386 it is
enabled with CONFIG_X86_GENERIC. It won't affect the base
line CPU support in any ways and also not the minimum supported CPU.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Memory >39bits has a different PUD.
Cc: "Tolentino, Matthew E" <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/sam/kbuild: (46 commits)
kbuild: remove obsoleted scripts/reference_* files
kbuild: fix make help & make *pkg
kconfig: fix time ordering of writes to .kconfig.d and include/linux/autoconf.h
Kconfig: remove the CONFIG_CC_ALIGN_* options
kbuild: add -fverbose-asm to i386 Makefile
kbuild: clean-up genksyms
kbuild: Lindent genksyms.c
kbuild: fix genksyms build error
kbuild: in makefile.txt note that Makefile is preferred name for kbuild files
kbuild: replace PHONY with FORCE
kbuild: Fix bug in crc symbol generating of kernel and modules
kbuild: change kbuild to not rely on incorrect GNU make behavior
kbuild: when warning symbols exported twice now tell user this is the problem
kbuild: fix make dir/file.xx when asm symlink is missing
kbuild: in the section mismatch check try harder to find symbols
kbuild: fix section mismatch check for unwind on IA64
kbuild: kill false positives from section mismatch warnings for powerpc
kbuild: kill trailing whitespace in modpost & friends
kbuild: small update of allnoconfig description
kbuild: make namespace.pl CROSS_COMPILE happy
...
Trivial conflict in arch/ppc/boot/Makefile manually fixed up
alarm() calls the kernel with an unsigend int timeout in seconds. The
value is stored in the tv_sec field of a struct timeval to setup the
itimer. The tv_sec field of struct timeval is of type long, which causes
the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.
Before the hrtimer merge (pre 2.6.16) such a negative value was converted
to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's
not clear whether this was intended or just happened to be done by the
timeval_to_jiffies code.
hrtimers expect a timeval in canonical form and treat a negative timeout as
already expired. This breaks the legitimate usage of alarm() with a
timeout value > INT_MAX seconds.
For 32 bit machines it is therefor necessary to limit the internal seconds
value to avoid API breakage. Instead of doing this in all implementations
of sys_alarm the duplicated sys_alarm code is moved into a common function
in itimer.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove all trailing tabs and spaces. No other changes.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
include/linux/platform.h contained nothing that was actually used except
the default_idle() prototype, and is therefore removed by this patch.
This patch does the following with the platform specific default_idle()
functions on different architectures:
- remove the unused function:
- parisc
- sparc64
- make the needlessly global function static:
- arm
- h8300
- m68k
- m68knommu
- s390
- v850
- x86_64
- add a prototype in asm/system.h:
- cris
- i386
- ia64
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Patrick Mochel <mochel@digitalimplant.org>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While working on these patch set, I found several possible cleanup on x86-64
and ia64.
akpm: I stole this from Andi's queue.
Not only does it clean up bitops. It also unrelatedly changes the prototype
of pci_mmcfg_init() and removes its arch_initcall(). It seems that the wrong
two patches got joined together, but this is the one which has been tested.
This patch fixes the current x86_64 build error (the pci_mmcfg_init()
declaration in arch/i386/pci/pci.h disagrees with the definition in
arch/x86_64/pci/mmconfig.c)
This also means that x86_64's pci_mmcfg_init() gets called in the same (new)
manner as x86's: from arch/i386/pci/init.c:pci_access_init(), rather than via
initcall.
The bitops cleanups came along for free.
All this worked OK in -mm testing (since 2.6.16-rc4-mm1) because x86_64 was
tested with both patches applied.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I moved it to a separate function which is safer.
This avoids problems with the linker reordering them and the
less useful PCI config space access methods taking priority
over the better ones.
Fixes some problems with broken MMCONFIG
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When we stop allocating percpu memory for not-possible CPUs we must not touch
the percpu data for not-possible CPUs at all. The correct way of doing this
is to test cpu_possible() or to use for_each_cpu().
This patch is a kernel-wide sweep of all instances of NR_CPUS. I found very
few instances of this bug, if any. But the patch converts lots of open-coded
test to use the preferred helper macros.
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@parisc-linux.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Christian Zankel <chris@zankel.net>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch from Pavel moves userland freeze signals handling into more logical
place. It now hits even with mysqld running.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use boot info to start early_printk() at the current row on VGA console, as
left by the boot loader.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Stas Sergeev <stsp@aknet.ru>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The history is that -mm kernels do not work for me for a few months
already. The things started from crashing somewhere after starting init,
and for the last month - no boot at all, just "Uncompressing... OK,
booting kernel", and silence. Early console didn't work too. With the
latest releases this degraded into an infinite stream of the "Unknown
interrupt or fault" messages. So today my patience ran out and I started
to think how can I collect at least some info for the bug-report. Attached
is the patch that allows to gather some valueable debug info on the problem
by making an early console more useable. I can't properly test the patch,
as the kernel still doesn't boot, so I'll explain it in details in a hope
someone else can justify the intrusive changes.
arch_hooks.h: added prototypes for setup_early_printk() and early_printk().
setup.c: killed wrong setup_early_printk() prototype. Moved
setup_early_printk() a bit earlier, as it was not "early enough" to cover
the bug I was fighting with.
early_printk.c: made it to start printing from the bottom of the screen,
otherwise the messages interfere with the ones of the boot-loader, so you
can't read them.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
Cc: Andi Kleen <ak@muc.de>
Cc: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().
This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use page->lru.next to implement the singly linked list of pages rather than
the struct deferred_page which needs to be allocated and freed for each
page.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sam's tree includes a new check, which found that we're exporting strpbrk()
multiple times.
It seems that the convention is that this is exported from the arch files, so
reove the lib/string.c export.
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This variable is rarely written to. Mark the variable accordingly.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The AES setkey routine writes 64 bytes to the E_KEY area even though
there are only 60 bytes there. It is in fact safe since E_KEY is
immediately follwed by D_KEY which is initialised afterwards. However,
doing this may trigger undefined behaviour and makes Coverity unhappy.
So by combining E_KEY and D_KEY into one array we sidestep this issue
altogether.
This problem was reported by Adrian Bunk.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commit c33d4568ac.
Andrew Clayton and Hugh Dickins report that it's broken for them and
causes strange page table and slab corruption, and spontaneous reboots.
Let's get it right next time.
Cc: Andrew Clayton <andrew@rootshell.co.uk>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
EM64T CPUs have somewhat weird error reporting for non canonical RIPs in
SYSRET.
We can't handle any exceptions there because the exception handler would
end up running on the user stack which is unsafe.
To avoid problems any code that might end up with a user touched pt_regs
should return using int_ret_from_syscall. int_ret_from_syscall ends up
using IRET, which allows safe exceptions.
Cc: Ernie Petrides <petrides@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While testing kexec and kdump we hit problems where the new kernel would
freeze or instantly reboot. The easiest way to trigger it was to kexec a
kernel compiled for CONFIG_M586 on an athlon cpu. Compiling for CONFIG_MK7
instead would work fine.
The patch fixes a few problems with the kexec inline asm.
Signed-off-by: Chris Mason <mason@suse.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The kbuild system takes advantage of an incorrect behavior in GNU make.
Once this behavior is fixed, all files in the kernel rebuild every time,
even if nothing has changed. This patch ensures kbuild works with both
the incorrect and correct behaviors of GNU make.
For more details on the incorrect behavior, see:
http://lists.gnu.org/archive/html/bug-make/2006-03/msg00003.html
Changes in this patch:
- Keep all targets that are to be marked .PHONY in a variable, PHONY.
- Add .PHONY: $(PHONY) to mark them properly.
- Remove any $(PHONY) files from the $? list when determining whether
targets are up-to-date or not.
Signed-off-by: Paul Smith <psmith@gnu.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
This reverts commit 13a229abc2.
Quoth Andi:
"After some consideration and feedback from various people it turns
out this wasn't that good an idea. It has some problems and needs
more work. Since it was only an optimization anyways it's best to
just back it out again for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit 9ec4b1f356 made kprobes not compile
without module support, so just make that clear in the Kconfig file.
Also, since it's marked EXPERIMENTAL, make that dependency explicit too.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The commit e2c0388866 added
setup_additional_cpus to setup.c but this is only defined if
CONFIG_HOTPLUG_CPU is set. This patch changes the #ifdef to reflect that.
Signed-off-by: Brian Magnuson <magnuson@rcn.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The previous experiment for using apicmaintimer on ATI systems didn't
work out very well. In particular laptops with C2/C3 support often
don't let it tick during idle, which makes it useless. There were also
some other bugs that made the apicmaintimer often not used at all.
I tried some other experiments - running timer over RTC and some other
things but they didn't really work well neither.
I rechecked the specs now and it turns out this simple change is
actually enough to avoid the double ticks on the ATI systems. We just
turn off IRQ 0 in the 8254 and only route it directly using the IO-APIC.
I tested it on a few ATI systems and it worked there. In fact it worked
on all chipsets (NVidia, Intel, AMD, ATI) I tried it on.
According to the ACPI spec routing should always work through the
IO-APIC so I think it's the correct thing to do anyways (and most of the
old gunk in check_timer should be thrown away for x86-64).
But for 2.6.16 it's best to do a fairly minimal change:
- Use the known to be working everywhere-but-ATI IRQ0 both over 8254
and IO-APIC setup everywhere
- Except on ATI disable IRQ0 in the 8254
- Remove the code to select apicmaintimer on ATI chipsets
- Add some boot options to allow to override this (just paranoia)
In 2.6.17 I hope to switch the default over to this for everybody.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SMP time selection originally ran after all CPUs were brought up because
it needed to know the number of CPUs to decide if it needs an MP safe
timer or not.
This is not needed anymore because we know present CPUs early.
This fixes a couple of problems:
- apicmaintimer didn't always work because it relied on state that was
set up time_init_gtod too late.
- The output for the used timer in early kernel log was misleading
because time_init_gtod could actually change it later. Now always
print the final timer choice
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It didn't set up the CPU possible map early enough, so the
option didn't actually work.
Noticed by Heiko Carstens
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
Old check for the IO-APIC watchdog during the timer check was wrong -
it obviously should only drop into this if the IO-APIC watchdog is used.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes x86-64 use the common X86_PM_TIMER Kconfig entry in drivers/acpi
And since PM timer is needed for correct timing on a lot of systems
now (e.g. AMD dual cores) and we often get bug reports from people
who forgot to set it make it depend on CONFIG_EMBEDDED. x86-64 had
this change before and it's a good thing.
I also fixed the description slightly to make this more clear.
Cc: len.brown@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Big Unisys systems have multiple clusters too, but they have an
synchronized TSC.
I'm using the SMBIOS to check for vendor == IBM.
Cc: Chris McDermott <lcm@us.ibm.com>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In previous versions of pci-gart.c, no_iommu was used to determine if IOMMU was
disabled in the GART DMA mapping functions. This changed in 2.6.16 and now
gart_xxx() functions are only called if gart is enabled. Therefore, uses of
no_iommu in the GART code are no longer necessary and can be removed.
Also, it removes double deceleration of no_iommu and force_iommu in pci.h and
proto.h, by removing the deceleration in pci.h.
Lastly, end_pfn off by one error.
Tested (along with patch 1/2) on dual opteron with gart enabled, iommu=soft,
and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit 9c869edac5 moved the i386 topology.c
file. That change broke x86-64 compiles, as it uses the same file.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Undo setting of CONFIG_DEBUG_INFO in the previous defconfig update. It
will make every build much slower and need more disk space and isn't a good
default.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't print KERN_INFO in the middle of a printk line.
printk(KERN_INFO "OEM ID: %s ",str);
is just above this. This is already fixed up in i386 copy.
Signed-off-by: Martin J. Bligh <mbligh@google.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously the numa hash code would be confused by holes in the node space
and stop early. This is the first part of the fix for the non boot issue
with empty nodes on Opterons.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Code was refusing good SRATs because about 12K got lost somewhere.
Allow less than 1MB of difference before rejecting it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
But do it after everything else to risk less from recursive
crashes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many laptops have problems with ticking the local APIC timer in C2/C3.
The code added earlier to use it by default on ATI didn't really work
for them. Don't enable it when the system supports C2/C3.
This doesn't fix the problem fully, but at least it's not worse than before.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This caused a sigreturn with bad argument on a preemptible kernel
to complain with
Debug: sleeping function called from invalid context at /home/lsrc/quilt/linux/include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1
Call Trace: {__might_sleep+190} {profile_task_exit+21}
{__do_exit+34} {do_wait+0}
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Along with that, also suppress the memory touching altogether when the
watchdog is not running, to eliminate needless crosstalk. Plus ad a call
to it to make things consistent (one could also consider removing the call
in enable_timer_nmi_watchdog()).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The early initialization of cpu_to_node code as it is now only updates the
cpu_to_node array, and does not update cpu_pda()->nodemember. This will
cause numa_node_id() to return 0 on systems where CPU 0 is not on Node 0.
This leads to a kernel panic in slab.c.
I've tested the patch below on a 16 processor x86_64 ES7000-600 server, and
no longer see the panic I saw with the original 2.6.16-rc3.
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't touch the non DMA members in the sg list in dma_map_sg in the IOMMU
Some drivers (in particular ST) ran into problems because they reused the sg
lists after passing them to pci_map_sg(). The merging procedure in the K8
GART IOMMU corrupted the state. This patch changes it to only touch the dma*
entries during merging, but not the other fields. Approach suggested by Dave
Miller.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We found a problem with x86_64 kernels with preemption enabled, where
having multiple tasks doing ptrace singlesteps around the same time will
cause the system to 'oops'. The problem seems that a task can get
preempted out of the do_debug() processing while it is running on the
DEBUG_STACK stack. If another task on that same cpu then enters do_debug()
and uses the same per-cpu DEBUG_STACK stack, the previous preempted tasks's
stack contents can be corrupted, and the system will oops when the
preempted task is context switched back in again.
The typical oops looks like the following:
Unable to handle kernel paging request at ffffffffffffffae RIP: <ffffffff805452a1>{thread_return+34}
PGD 103027 PUD 102429067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in:
Pid: 3786, comm: ssdd Not tainted 2.6.15.2 #1
RIP: 0010:[<ffffffff805452a1>] <ffffffff805452a1>{thread_return+34}
RSP: 0018:ffffffff80824058 EFLAGS: 000136c2
RAX: ffff81017e12cea0 RBX: 0000000000000000 RCX: 00000000c0000100
RDX: 0000000000000000 RSI: ffff8100f7856e20 RDI: ffff81017e12cea0
RBP: 0000000000000046 R08: ffff8100f68a6000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff81017e12cea0 R12: ffff81000c2d53e8
R13: ffff81017f5b3be8 R14: ffff81000c0036e0 R15: 000001056cbfc899
FS: 00002aaaaaad9b00(0000) GS:ffffffff80883800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffffffffae CR3: 00000000f6fcf000 CR4: 00000000000006e0
Process ssdd (pid: 3786, threadinfo ffff8100f68a6000, task ffff8100f7856e20)
Stack: ffffffff808240d8 ffffffff8012a84a ffff8100055f6c00 0000000000000020
0000000000000001 ffff81000c0036e0 ffffffff808240b8 0000000000000000
0000000000000000 0000000000000000
Call Trace: <#DB>
<ffffffff8012a84a>{try_to_wake_up+985}
<ffffffff8012c0d3>{kick_process+87}
<ffffffff8013b262>{signal_wake_up+48}
<ffffffff8013b5ce>{specific_send_sig_info+179}
<ffffffff80546abc>{_spin_unlock_irqrestore+27}
<ffffffff8013b67c>{force_sig_info+159}
<ffffffff801103a0>{do_debug+289} <ffffffff80110278>{sync_regs+103}
<ffffffff8010ed9a>{paranoid_userspace+35}
Unable to handle kernel paging request at 00007fffffb7d000 RIP: <ffffffff8010f2e4>{show_trace+465}
PGD f6f25067 PUD f6fcc067 PMD f6957067 PTE 0
Oops: 0000 [2] PREEMPT SMP
This patch disables preemptions for the task upon entry to do_debug(), before
interrupts are reenabled, and then disables preemption before exiting
do_debug(), after disabling interrupts. I've noticed that the task can be
preempted either at the end of an interrupt, or on the call to
force_sig_info() on the spin_unlock_irqrestore() processing. It might be
better to attempt to code a fix in entry.S around the code that calls
do_debug().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
The IBM Summit 3 chipset doesn't implement the HPET timer replacement
option. Since the current Linux code relies on it use a mixed mode with
both PIT for the interrupt and HPET counters for the time keeping. That
was already implemented, but didn't work properly because it was still
using the last interrupt offset in HPET. This resulted in x460 not
booting. Fix this up by using the free running HPET counter.
Shouldn't affect any other machine because they either use full HPET mode
or no HPET at all.
TBD needs a similar 32bit fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The *at patches introduced fstatat and, due to inusfficient research, I
used the newfstat functions generally as the guideline. The result is that
on 32-bit platforms we don't have all the information needed to implement
fstatat64.
This patch modifies the code to pass up 64-bit information if
__ARCH_WANT_STAT64 is defined. I renamed the syscall entry point to make
this clear. Other archs will continue to use the existing code. On x86-64
the compat code is implemented using a new sys32_ function. this is what
is done for the other stat syscalls as well.
This patch might break some other archs (those which define
__ARCH_WANT_STAT64 and which already wired up the syscall). Yet others
might need changes to accomodate the compatibility mode. I really don't
want to do that work because all this stat handling is a mess (more so in
glibc, but the kernel is also affected). It should be done by the arch
maintainers. I'll provide some stand-alone test shortly. Those who are
eager could compile glibc and run 'make check' (no installation needed).
The patch below has been tested on x86 and x86-64.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits in
the node's cpumask when a cpu goes down or cpu bring up is cancelled. This
is buggy since there are pieces of common code where the cpumask is checked
in the cpu down code path to decide on things (like in the slab down path).
PPC does the right thing, but x86_64 and ia64 don't (This was the reason
Sonny hit upon a slab bug during cpu offline on ppc and could not reproduce
on other arches). This patch fixes it for x86_64. I won't attempt ia64 as
I cannot test it.
Credit for spotting this should go to Alok.
(akpm: this was applied, then reverted. But it's OK now because we now use
for_each_cpu() in the right places).
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 10f4dc8b27.
Quoth Andi Kleen:
"Kiran decided that it makes the problem worse than it was before.
Fixing it fully requires more work which is too much for 2.6.16. So
please revert that commit for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains a printk reorder to remove the current problem of
displaying "PCI-DMA: Disabling IOMMU." and then "PCI-DMA: using GART
IOMMU" 20 lines later in dmesg.
It also constains a printk reorder in swiotlb to state swiotlb
enablement prior to describing the location of the bounce buffers, and a
printk reorder to state gart enablement prior to describing the
aperature.
Also constains a whitespace cleanup in arch/x86_64/kernel/setup.c
Tested (along with patch 2/2) on dual opteron with gart enabled,
iommu=soft, and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hack for 2.6.16. In 2.6.17 all code that uses NR_CPUs should
be audited and changed to only touch possible CPUs.
Don't mark the reference per cpu data init data (so it stays
around after boot) and point all impossible CPUs to it. This way
they reference some valid - although shared memory. Usually
this is only initialization like INIT_LIST_HEADs and there
won't be races because these CPUs never run. Still somewhat hackish.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's bad juju to touch the APIC when it hasn't been enabled.
I also moved ack_bad_irq for x86-64 out of line following i386.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Checking of the validity of pointers should be consistently done before
dereferencing the pointer.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conditionalize two unwind directives to match other similarly
conditional code.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Cc: Jim Houston <jim.houston@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some broken motherboards (at least one NForce3 based AMD64 laptop)
the PIT timer runs at a incorrect frequency. This patch adds a new
option "apicpmtimer" that allows to use the APIC timer and calibrate it
using the PMTimer. It requires the earlier patch that allows to run the
main timer from the APIC.
Specifying apicpmtimer implies apicmaintimer.
The option defaults to off for now.
I tested it on a few systems and the resulting APIC timer frequencies
were usually a bit off, but always <1%, which should be tolerable.
TBD figure out heuristic to enable this automatically on the affected
systems TBD perhaps do it on all NForce3s or using DMI?
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kprobes cannot deal with the funny calling conventions when it
runs on a different stack when it returns. If someone wants
to instrument context switch they can add a probe to schedule()
instead.
Cc: jkenisto@us.ibm.com, prasanna@in.ibm.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Align the start of the per-cpu section to the configured number of bytes in a
cache line. This stops a BUG_ON() from triggering in load_module() when
DEFINE_PER_CPU() is used in a module and the section isn't cacheline-aligned.
Rusty also found this and sent a patch in a while ago
(http://lkml.org/lkml/2004/10/19/17), I don't know what came of that.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[ AK: I redid Kevin's fix to be simpler, but the idea and original
analysis of the problem is from Kevin]
This avoid allocation failures on some SATA systems like Nvidia CK8
when the IOMMU gets fragmented. Modern SATA devices have quite large queues
(128 entries) and the FS with ext2/3 is good enough now that it often
passes whole 128 page sg lists down to the driver. These require
512K of continuous free space in the IOMMU aperture to map when merged.
When the IOMMU is fragmented this could lead to spurious IO errors
due to failing mappings.
Short term fix is to just try to map the SG list again unmerged
page by page - this way fragmentation doesn't matter anymore.
The code for that was already there, but it just wasn't enabled for the
merge case.
According to Kevin at least the Nvidia device doesn't seem to benefit
from merging much anyways, so the only slowdown is from trying
to do an unnecessary merge attempt.
Kevin plans to implement better fragmentation avoidance in the future,
but that wouldn't be 2.6.16 material.
TBD: should add some statistic counters to count how often that really
happens.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I broke this earlier when moving the patch from i386 to x86-64.
Need to return the virtual address here, not the physical address.
This fixes some boot time crashes on x86-64.
Cc: gregkh@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Check if the processor/memory affinity entries are long enough
according to the ACPI 3.0 spec.
- Ignore memory affinity entries that define a zero length region.
All based on BIOS issues found in the field @)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
attached patch is 2 more cases i found via running the reference_init.pl
script. These were easy to spot just knowing the file names. There is
one another about init/main.c that i cant exactly zero in. (partly
because i dont know how to interpret the data thats spewed out of the tool).
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It has been enabled by default for some time now and is cheap enough
so it doesn't matter anyways.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits
in the node's cpumask when a cpu goes down or cpu bring up is cancelled.
This is buggy since there are pieces of common code where the cpumask is
checked in the cpu down code path to decide on things (like in the slab
down path). PPC does the right thing, but x86_64 and ia64 don't (This
was the reason Sonny hit upon a slab bug during cpu offline on ppc and
could not reproduce on other arches). This patch fixes it for x86_64.
I won't attempt ia64 as I cannot test it.
Credit for spotting this should go to Alok.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They cause quite bad performance regressions on Netburst
This is temporary until we can get new optimized functions
for these CPUs.
This undoes changes that were done in 2.6.15 and in 2.6.16-rc1,
essentially bringing the code back to 2.6.14 level. Only change
is I renamed the X86_FEATURE_K8_C flag to X86_FEATURE_REP_GOOD
and fixed the check for the flag and also fixed some comments.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This avoids BUG_ONs in the low level allocator when an illegal
GFP mask is added.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At resume time, TSC's value or something similar might be changed a lot
against suspend time. This could make system gets a very big lost ticks.
See http://bugzilla.kernel.org/show_bug.cgi?id=5825
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They all have problems with IRQ 0 routing, so just use the APIC on them.
Can be overwritten with "noapicmaintimer"
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another piece from the no-idle-tick patch.
This can be enabled with the "apicmaintimer" option.
This is mainly useful when the PIT/HPET interrupt is unreliable.
Note there are some systems that are known to stop the APIC
timer in C3. For those it will never work, but this case
should be automatically detected.
It also only works with PM timer right now. When HPET is used
the way the main timer handler computes the delay doesn't work.
It should be a bit more efficient because there is one less
regular interrupt to process on the boot processor.
Requires earlier bugfix from Venkatesh
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A kprobe executes IRET early and that could cause NMI recursion
and stack corruption.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This assembly version is measurably faster than the generic version in
lib/iomap_copy.c.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to use the compat function here.
Pointer out by Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle more bogus MCFG entries
Some Asus P4 boards seem to have broken MCFG tables with
only a single entry for busses 0-0. Special case these
and assume they mean all busses can be accessed.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix a typo/mis-merge in one of the previous patches.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the ability to disability packet split at compile time and use the legacy receive path on PCI express hardware. Made this a CONFIG option and modified the Kconfig, to reflect the new option.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: John Ronciak <john.ronciak@intel.com>
Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Add x86-64 specific memory hot-add functions, Kconfig options,
and runtime kernel page table update functions to make
hot-add usable on x86-64 machines. Also, fixup the nefarious
conditional locking and exports pointed out by Andi.
Tested on Intel and IBM x86-64 memory hot-add capable systems.
Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another try at this.
For 32bit follow the 32bit implementation from Ingo -
mappings are growing down from the end of stack now
and vary randomly by 1GB.
Randomized mappings for 64bit just vary the normal mmap break
by 1TB. I didn't bother implementing full flex mmap for 64bit
because it shouldn't be needed there.
Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... as they are no longer needed. Since there were hard-coded numbers in the
file, the patch also adds a mechanism to avoid these (otherwise potential
future changes would again and again require adjusting these numbers).
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The comparison of the initrd start address against "&_end" is
unnecessary and incorrect. Make it match the x86 code that just
compares the passed-in arguments.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For not fully explained reasons it broke mem=... on several setups.
Also minor cleanup.
Cc: axboe@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To avoid mistakes.
I got a few reports where people got broken timing because they didn't
have the PMTIMER fallback.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This unbreaks recursive kprobes which didn't work anymore
due to an earlier patch which converted the debug entry point
to use an IST.
This also allows nesting of the debug entry point too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o This fix was posted for i386 long back. Posting it for x86_64.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110380103229830&w=2
o This patch fixes the problem of secondary cpus boot up. This situation
is faced when kernel is built for default locations like 16MB and
onwards. In this configuration, only primary cpu (BP) comes and
secondary cpus don't boot.
o Problem occurs because in trampoline code, lgdt is not able to load the
GDT as it happens to be situated beyond 16MB. This is due to the fact
that cpu is still in real mode and default operand size is 16bit.
o This patch uses lgdtl instead of lgdt to force operand size to 32
instead of 16.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... reducing the amount of changes Xen has to do.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The explicit and implicit calls to setup_early_printk() were passing
inconsistent arguments.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously they would be only allocated before the kernel text at
1MB. This limited the maximum supported memory to 128GB.
Now allow the e820 allocator to put them everywhere. Try
to put them beyond any DMA zones to avoid filling them up.
This should free some GFP_DMA memory compared to earlier kernels.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
hard_smp_processor_id would return the local APIC id instead
of the Linux processor id. On big systems they are often
not identical. safe_smp_processor_id is just a wrapper
around it that does the necessary conversions.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove support for obsolete hardware and cleanup.
- Remove checks for non integrated APICs
- Replace apic_write_around with apic_write.
- Remove apic_read_around
- Remove APIC version reads used by old workarounds
- Remove old workaround for Simics
- Fix indentation
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When building in a separate objtree, file names produced by BUG() & Co. can
get fairly long; printing only the first 50 characters may thus result in
(almost) no useful information. The following change makes it so that rather
the last 50 characters of the filename get printed.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Especially under Xen, where the console cannot be adjusted to more than 25
lines, it is fairly important that the information displayed during a panic
is as compact as possible. Below adjustments work towards that.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Due to a broken condition, the body of the loop that is intended to wait for
the Update-In-Progress bit to get set and then cleared again was never
entered; in fact, the entire loop was optimized out by the compiler. Here is
a change to fix the condition (and to also move the initialization of locals
out of the spin lock protected region).
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was only needed for APM
Pointed out by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
X86_FEATURE_K8_C was a synthetic Linux CPUID flag that was used for some
code optimizations in Opteron C stepping or later. But support for pre C
stepping optimizations has been removed, so this isn't needed anymore.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Saves about ~18K .text in defconfig
There would be more optimization potential, but that's for later.
Suggestion originally from Bill Irwin.
Fix from Andy Whitcroft.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They used to be used by the reboot code, but not anymore.
Noticed by Jan Beulich
Cc: JBeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Currently, during kexec reboot, IOAPIC is re-programmed back to virtual
wire mode if there was an i8259 connected to it. This enables getting
timer interrupts in second kernel in legacy mode.
o After putting into virtual wire mode, IOAPIC delivers the i8259 interrupts
to CPU0. This works well for kexec but not for kdump as we might crash
on a different CPU and second kernel will not see timer interrupts.
o This patch modifies the redirection table entry to deliver the timer
interrupts to the cpu we are rebooting (instead of hardcoding to zero).
This ensures that second kernel receives timer interrupts even on a
non-boot cpu.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce vSMP arch to the kernel.
This patch:
1. Adds CONFIG_X86_VSMP
2. Adds machine specific macros for local_irq_disabled, local_irq_enabled
and irqs_disabled
3. Writes to the vSMP CTL device to indicate kernel compiled with CONFIG_VSMP
Signed-off-by: Ravikiran Thirumalai <kiran@scalemp.com>
Signed-off-by: Shai Fultheim <shai@scalemp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently we attempt to restore virtual wire mode on reboot, which only
works if we can figure out where the i8259 is connected. This is very
useful when we are kexec another kernel and likely helpful to an peculiar
BIOS that make assumptions about how the system is setup.
Since the acpi MADT table does not provide the location where the i8259 is
connected we have to look at the hardware to figure it out.
Most systems have the i8259 connected the local apic of the cpu so won't be
affected but people running Opteron and some serverworks chipsets should be
able to use kexec now.
In addition this patch removes the hard coded assumption that the io_apic
that delivers isa interrups is always known to the kernel as io_apic 0.
There does not appear to be anything to guarantee that assumption is true.
And From: Vivek Goyal <vgoyal@in.ibm.com>
A minor fix to the patch which remembers the location of where i8259 is
connected. Now counter i has been replaced by apic. counter i is having
some junk value which was leading to non-detection of i8259 connected to
IOAPIC.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Setting RF (resume flag) allows a debugger to resume execution after a code
breakpoint without tripping the breakpoint again. It is reset by the CPU
after executing one instruction.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was set as an NMI, but the NMI bit always forces an interrupt
to end up at vector 2. So it was never used. Remove.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix
CC arch/x86_64/kernel/nmi.o
linux/arch/x86_64/kernel/nmi.c: In function ???check_nmi_watchdog???:
linux/arch/x86_64/kernel/nmi.c:155: warning: statement with no effect
on Uniprocessor builds.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch uses a static PDA array early at boot and reallocates processor PDA
with node local memory when kmalloc is ready, just before pda_init.
The boot_cpu_pda is needed since the cpu_pda is used even before pda_init for
that cpu is called (to set the static per-cpu areas offset table etc)
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch enables early intialization of cpu_to_node.
apicid_to_node is built by reading the SRAT table, from acpi_numa_init with
ACPI_NUMA and k8_scan_nodes with K8_NUMA.
x86_cpu_to_apicid is built by parsing the ACPI MADT table, from acpi_boot_init.
We combine these two tables and setup cpu_to_node.
Early intialization helps the static per_cpu_areas in getting pages from
correct node.
Change since last release:
Do not initialize early init_cpu_to_node for faking node cases.
Patch tested on TYAN dual core 4P board with K8 only, ACPI_NUMA.
Tested on EM64T NUMA. Also tested with numa=off, numa=fake, and running
a kernel compiled with NUMA on a regular EM64 2 way SMP.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The real vsyscall .text addresses are not mapped when the alternative()
replacement runs early, so use some black magic to access them using
the direct mapping.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They already do this in hardware and the Linux algorithm
actually adds errors.
Cc: mingo@elte.hu
Cc: rohit.seth@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Apic id is in most significant 8 bits of APIC_ID register. Current code
is trying to write apic id to least significant 8 bits. This patch fixes
it.
o This fix enables booting uni kdump capture kernel on a cpu with non-zero
apic id.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove exports that are already exported from the object's source file.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These functions are inlines and shouldn't be exported.
Signed-off-by: Brian Gerst <bgerst@didntduck.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Remove optimization for old B stepping Opteron
- Make the fast path for copies with a multiple of eight length faster.
- Minor instruction rearrangement to hopefully avoid a pipeline
stall or two.
- Add comment about errata to consider.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AK: I hacked Muli's original patch a lot and there were a lot
of changes - all bugs are probably to blame on me now.
There were also some changes in the fall back behaviour
for swiotlb - in particular it doesn't try to use GFP_DMA
now anymore. Also all DMA mapping operations use the
same core dma_alloc_coherent code with proper fallbacks now.
And various other changes and cleanups.
Known problems: iommu=force swiotlb=force together breaks
needs more testing.
This patch cleans up x86_64's DMA mapping dispatching code. Right now
we have three possible IOMMU types: AGP GART, swiotlb and nommu, and
in the future we will also have Xen's x86_64 swiotlb and other HW
IOMMUs for x86_64. In order to support all of them cleanly, this
patch:
- introduces a struct dma_mapping_ops with function pointers for each
of the DMA mapping operations of gart (AMD HW IOMMU), swiotlb
(software IOMMU) and nommu (no IOMMU).
- gets rid of:
if (swiotlb)
return swiotlb_xxx();
- PCI_DMA_BUS_IS_PHYS is now checked against the dma_ops being set
This makes swiotlb faster by avoiding double copying in some cases.
Signed-Off-By: Muli Ben-Yehuda <mulix@mulix.org>
Signed-Off-By: Jon D. Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Broken BIOS on Iwill 8way systems reports these and it causes the bootmem
allocator to crash. Add a sanity check if all the PXMs in the
SRAT table cover all memory as reported by e820. If the sanity
check fails the SRAT is rejected and the code will fall back
to discover the NUMA topology using the K8 northbridge registers
when applicable.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This adds a new notifier chain that is called with IDLE_START
when a CPU goes idle and IDLE_END when it goes out of idle.
The context can be idle thread or interrupt context.
Since we cannot rely on MONITOR/MWAIT existing the idle
end check currently has to be done in all interrupt
handlers.
They were originally inspired by the similar s390 implementation.
They have a variety of applications:
- They will be needed for CONFIG_NO_IDLE_HZ
- They can be used for oprofile to fix up the missing time
in idle when performance counters don't tick.
- They can be used for better C state management in ACPI
- They could be used for microstate accounting.
This is just infrastructure so far, no users.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix off by one when checking if the machine has enougn memory to need IOMMU
This caused the IOMMUs to be needlessly enabled for mem=4G
Based on a patch from Jon Mason
Signed-off-by: jdmason@us.ibm.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Whenever we see that a CPU is capable of C3 (during ACPI cstate init), we
disable local APIC timer and switch to using a broadcast from external timer
interrupt (IRQ 0).
Patch below adds the code for x86_64.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the finer control of local APIC timer. We cannot provide a sub-jiffy
control like this when we use broadcast from external timer in place of
local APIC. Instead of removing this only on systems that may end up using
broadcast from external timer (due to C3), I am going the
"I'm feeling lucky" way to remove this fully. Basically, I am not sure about
usefulness of this code today. Few other architectures also don't seem to
support this today.
If you are using profiling and fine grained control and don't like this going
away in normal case, yell at me right now.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I would like to throw out a suggestion for a possible change in the way that
the debug register traps are handled in do_debug() when the trap occurs
in kernel-mode.
In the x86_64 version of do_debug(), the code will skip around sending
a SIGTRAP to the current task if the trap occurred while in kernel mode.
On the i386-side of things, if the access happens to occur in kernel mode
(say during a read(2) of user's buffer that matches the address of a
debug register trap), then the do_debug() routine for i386 will go ahead
and call send_sigtrap() and send the SIGTRAP signal. The send_sigtrap()
code will also set the info.si_addr to NULL in this case (even though I
don't understand why, since the SIGTRAP siginfo processing doesn't use
the si_addr field...).
So I would like to suggest that the x86_64 do_debug() routine also
follow this type of behavior and have it go ahead and send the
SIGTRAP signal to the current task, even if the debug register trap
happens to have occurred in kernel mode. I have taken a stab at
a patch for this change below. (It includes the i386-ish change
for setting si_addr to NULL when the trap occurred in kernel mode.)
It seems like a useful feature to be able to 'watch' a user location that
might also be modified in the kernel via a system service call, and have the
debugger report that information back to the user, rather than to just
silently ignore the trap.
Additionally, I realize that users that pull in a kernel debugger such as
KGDB into their kernel might want to remove this change below when they add
in KGDB support. However, they could alternatively look at the current
task's thread.debugreg[] values to see if the trap occurred due to KGDB
or instead because of a user-space debugger trap, and still honor the
user SIGTRAP processing (instead of the KGDB breakpoint processing)
if the trap matches up with the thread.debugreg[] registers.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Much better to deal with these than with the magic numbers.
And remove the comment describing the bits - kernel source
is no replacement for an architecture manual.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't need to do the vmalloc check for the module range because its
PML4 is shared with the kernel text.
Also removed an unnecessary TLB flush.
Pointed out by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is on the same lines as Zachary Amsden's i386 GDT page alignemnt
patch in -mm, but for x86_64.
Patch to align and pad x86_64 GDT on page boundries.
[AK: some minor cleanups and fixed incorrect TLS initialization
in CPU init.]
Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This might help on distributions that use a 32bit biarch compiler.
First pass -m64 by default.
Secondly add some more .code32s because at least the Ubuntu biarch
32bit as called by gcc doesn't seem to handle -m64 -m32 as generated
by the Makefile without such assistance.
And finally make sure the linker script can be preprocessed
with a 32bit cpp.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The attempt to avoid overflow in __delay caused varying precision
on different CPUs depending on differences in the CPU speed.
We should be able to do this multiplication with out overflowing
provided the
cpu is running at less than about 128 GHz. xloops < 20000 * 0x10c6.
loops_per_jiffy * HZ <= cpu_clock_speed. So if the cpu clock speed
< 2^64/(20000 * 0x10c6) = 2^64/ 51E6CC0 < 2^64/2^27 = 2^37 = 128G we
will not overflow the calculation.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we don't know the node a PCI bus is connected to return -1.
This matches the generic code.
Noticed by Ravikiran G Thirumalai <kiran@scalex86.org>
Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A lot of Opteron BIOS just pass 10 in all SLIT entries (10 is the
normalized unit). This is actually worse than the default heuristic
because it leads to pci_distance not knowing the difference between
local and remote nodes anymore. This messes up some NUMA
heuristics in generic code.
In this case it's better to fall back to the default heuristic
which just does nodea == nodeb ? 10 : 20.
This patch does some basic sanity checking on the SLIT and only accepts
the SLIT when it passes.
Invariants enforced are:
- Node to itself shall be 10
- Any other distance shouldn't be 10
- Distances smaller than 10 are illegal
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some people need it now on 64bit so reuse the i386 code for
x86-64. This will be also useful for future bug workarounds.
It is a bit simplified there because there is no need
to do it very early on x86-64. This means it doesn't need
early ioremap et.al. We run it as a core initcall right now.
I hope it's not needed for early setup.
I added a general CONFIG_DMI symbol in case IA64 or someone
else wants to reuse the code later too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was a backup file that somehow made it into the official
tree. Never used for anything. Remove.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The introduction of call_softirq switching to the interrupt stack several
releases earlier resulted in a problem with the code in show_trace, which
assumes that it can pick the previous stack pointer from the end of the
interrupt stack.
Cc: Andi Kleen <ak@muc.de>
Cc: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Be more careful with TF handling to fix some copy protection codes in wine
patch originally for i386 by Linus, then ported to x86_64 by Andi Kleen
see: [PATCH] x86_64: Some fixes for single step handling
commit: be61bff789
But it was never applied to the ia32 emulation code which breaks some
copy-protection schemes under wine when running on x86_64.
Signed-off-by: Peter Beutner <p.beutner@gmx.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
So why are we calling smp_send_stop from machine_halt?
We don't.
Looking more closely at the bug report the problem here
is that halt -p is called which triggers not a halt but
an attempt to power off.
machine_power_off calls machine_shutdown which calls smp_send_stop.
If pm_power_off is set we should never make it out machine_power_off
to the call of do_exit. So pm_power_off must not be set in this case.
When pm_power_off is not set we expect machine_power_off to devolve
into machine_halt.
So how do we fix this?
Playing too much with smp_send_stop is dangerous because it
must also be safe to be called from panic.
It looks like the obviously correct fix is to only call
machine_shutdown when pm_power_off is defined. Doing
that will make Andi's assumption about not scheduling
true and generally simplify what must be supported.
This turns machine_power_off into a noop like machine_halt
when pm_power_off is not defined.
If the expected behavior is that sys_reboot(LINUX_REBOOT_CMD_POWER_OFF)
becomes sys_reboot(LINUX_REBOOT_CMD_HALT) if pm_power_off is NULL
this is not quite a comprehensive fix as we pass a different parameter
to the reboot notifier and we set system_state to a different value
before calling device_shutdown().
Unfortunately any fix more comprehensive I can think of is not
obviously correct. The core problem is that there is no architecture
independent way to detect if machine_power will become a noop, without
calling it.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I noticed that some lowlevel send_IPI_mask helpers had a hotplug/preempt
race whereupon the cpu_online_map was read before disabling preemption;
...
cpumask_t mask = cpu_online_map;
int cpu = get_cpu();
cpu_clear(cpu, mask);
...
But then i realised that there is no need for these lowlevel functions to
be going through all this trouble when all the callers are already made
hotplug/preempt safe.
Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is one CPU here whose MCE bank count is 6. This patch increases
x86_64's MCE bank count.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following is probably a good idea given that the atomic_set() isn't
a barrier here either.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>