This patch contains the ia64 architecture specific changes to prevent the
possible race conditions.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I've already sent this to the maintainers, and this is now being sent to a
larger community audience. I have fixed a problem with the ia64 version of
build_sched_domains(), but a similar fix still needs to be made to the
generic build_sched_domains() in kernel/sched.c.
The "dynamic sched domains" functionality has recently been merged into
2.6.13-rcN that sees the dynamic declaration of a cpu-exclusive (a.k.a.
"isolated") cpuset and rebuilds the CPU Scheduler sched domains and sched
groups to separate away the CPUs in this cpu-exclusive cpuset from the
remainder of the non-isolated CPUs. This allows the non-isolated CPUs to
completely ignore the isolated CPUs when doing load-balancing.
Unfortunately, build_sched_domains() expects that a sched domain will
include all the CPUs of each node in the domain, i.e., that no node will
belong in both an isolated cpuset and a non-isolated cpuset. Declaring a
cpuset that violates this presumption will produce flawed data structures
and will oops the kernel.
To trigger the problem (on a NUMA system with >1 CPUs per node):
cd /dev/cpuset
mkdir newcpuset
cd newcpuset
echo 0 >cpus
echo 0 >mems
echo 1 >cpu_exclusive
I have fixed this shortcoming for ia64 NUMA (with multiple CPUs per node).
A similar shortcoming exists in the generic build_sched_domains() (in
kernel/sched.c) for NUMA, and that needs to be fixed also. The fix
involves dynamically allocating sched_group_nodes[] and
sched_group_allnodes[] for each invocation of build_sched_domains(), rather
than using global arrays for these structures. Care must be taken to
remember kmalloc() addresses so that arch_destroy_sched_domains() can
properly kfree() the new dynamic structures.
Signed-off-by: John Hawkes <hawkes@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When handling writes to /proc/irq, current code is re-programming rte
entries directly. This is not recommended and could potentially cause
chipset's to lockup, or cause missing interrupts.
CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the
interrupt is pending. The same needs to be done for /proc/irq handling as well.
Otherwise user space irq balancers are really not doing the right thing.
- Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for
lack of a generic name.
- added move_irq out of IRQ_BALANCE, and added this same to X86_64
- Added new proc handler for write, so we can do deferred write at irq
handling time.
- Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead
it now shows only active cpu masks, or exactly what was set.
- Provided a common move_irq implementation, instead of duplicating
when using generic irq framework.
Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off.
Tested UP builds as well.
MSI testing: tbd: I have cards, need to look for a x-over cable, although I
did test an earlier version of this patch. Will test in a couple days.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Acked-by: Zwane Mwaikambo <zwane@holomorphy.com>
Grudgingly-acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The config option 'CONFIG_ACPI_DEALLOCATE_IRQ' is no longer
needed. This patch removes it.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The reenabling of psr.ic should really belong to dtr mapping code block.
It make the fall through code fast since it doesn't need to execute the
predicated-off instruction. Logically make more sense as well since psr.ic
was turned off in .map code block.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Change sn2-specific calls into generic functions. Without this change
the uncached allocator will not work on non-sn2 platforms.
Signed-off-by: Greg Edwards <edwardsg@sgi.com>
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Not all of the PAL VM calls are implemented for the SKI simulator.
Don't just give up if one fails, print information from the calls
that succeed.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
It has been reported that the way Linux handles NODEFER for signals is
not consistent with the way other Unix boxes handle it. I've written a
program to test the behavior of how this flag affects signals and had
several reports from people who ran this on various Unix boxes,
confirming that Linux seems to be unique on the way this is handled.
The way NODEFER affects signals on other Unix boxes is as follows:
1) If NODEFER is set, other signals in sa_mask are still blocked.
2) If NODEFER is set and the signal is in sa_mask, then the signal is
still blocked. (Note: this is the behavior of all tested but Linux _and_
NetBSD 2.0 *).
The way NODEFER affects signals on Linux:
1) If NODEFER is set, other signals are _not_ blocked regardless of
sa_mask (Even NetBSD doesn't do this).
2) If NODEFER is set and the signal is in sa_mask, then the signal being
handled is not blocked.
The patch converts signal handling in all current Linux architectures to
the way most Unix boxes work.
Unix boxes that were tested: DU4, AIX 5.2, Irix 6.5, NetBSD 2.0, SFU
3.5 on WinXP, AIX 5.3, Mac OSX, and of course Linux 2.6.13-rcX.
* NetBSD was the only other Unix to behave like Linux on point #2. The
main concern was brought up by point #1 which even NetBSD isn't like
Linux. So with this patch, we leave NetBSD as the lonely one that
behaves differently here with #2.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch to support P-state transitions on ia64. This driver is based on ACPI,
and uses the ACPI processor driver interface to find out the P-state support
information for the processor. This driver plugs into generic cpufreq
infrastructure.
Once this driver is loaded successfully, ondemand/userspace governor can be
used to change the CPU frequency dynamically based on load or on request from
userspace process.
Refer :
ACPI specification -
http://www.acpi.info
P-state related PAL calls -
http://developer.intel.com/design/itanium/downloads/24869909.pdf
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Currently, region numbers are defined in several files, with several
names. For example, we have REGION_KERNEL in asm/page.h and
RGN_KERNEL in pgtable.h
We also have address definitions that should depend on the
RGN_XXX macros, but are currently just long constants.
The following patch reorganises all the definitions so that they have
the same form (RGN_XXX), are in one place, and that addresses that
depend on RGN_XXX are derived from them.
(This is a necessary but not sufficient patch to allow UML-like
operation on IA64).
Thanks to David Mosberger for catching the change I missed in mmu_context.h.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The PFM_LOAD_CONTEXT may fail silently and cause a session
to remain reserved even though it should not. This can happen
when the commands succeeds in reserving the session but fails
when it actually tries to attach to the load_pid. In that case,
the command has failed but will return 0. More importantly,
the session will remain reserved. This patch fixes the problem.
Signed-off-by: <stephane.eranian@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
this changeset broke the "nohalt" kernel boot option.
8df5a500a3
default_idle() is looking at new variable can_do_pal_halt. However,
that variable did not get cleared upon "nohalt" boot option. Result
is that "nohalt" option is ignored until perfmon is exercised.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
error condition is passed along by acpi_register_gsi().
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Current acpi_register_gsi() function has no way to indicate errors to its
callers even though acpi_register_gsi() can fail to register gsi because of
some reasons (out of memory, lack of interrupt vectors, incorrect BIOS, and so
on). As a result, caller of acpi_register_gsi() cannot handle the case that
acpi_register_gsi() fails. I think failure of acpi_register_gsi() should be
handled properly.
This series of patches changes acpi_register_gsi() to return negative value on
error, and also changes callers of acpi_register_gsi() to handle failure of
acpi_register_gsi().
This patch changes the type of return value of acpi_register_gsi() from
"unsigned int" to "int" to indicate an error. If acpi_register_gsi() fails to
register gsi, it returns negative value.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
This removes sys_set_zone_reclaim() for now. While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity. I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.
Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]
Secondly, it was added based on a single datapoint from Martin:
http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2
where Martin characterizes the numbers the following way:
' Run-to-run variability for "make -j" is huge, so these numbers aren't
terribly useful except to see that with reclaim the benchmark still
finishes in a reasonable amount of time. '
in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?
I'd really suggest to first walk the walk and see what's needed to get
stable & predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!
The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.
The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes. We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
unwind.c can read the wrong unat bits from switch_stack.
sw->caller_unat is the value of ar.unat when the task was blocked.
sw->ar_unat is the value of ar.unat after doing st8.spill for r4-7.
IOW, ar_unat is caller_unat with 4 bits changed.
unw_access_gr() uses sw->ar_unat for r4-7 (correct), but it also uses
sw->ar_unat for other scratch registers (incorrect). sw->ar_unat
should only be used for r4-7, everything else should use
sw->caller_unat, unless modified by unwind info. Using sw->ar_unat
risks picking up the 4 bits that were overwritten when r4-7 were saved.
Also this line is wrong
unw.sw_off[unw.preg_index[UNW_REG_PFS]] = SW(AR_UNAT);
and should be
unw.sw_off[unw.preg_index[UNW_REG_PFS]] = SW(AR_PFS);
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
machine_restart, machine_halt and machine_power_off are machine
specific hooks deep into the reboot logic, that modules
have no business messing with. Usually code should be calling
kernel_restart, kernel_halt, kernel_power_off, or
emergency_restart. So don't export machine_restart,
machine_halt, and machine_power_off so we can catch buggy users.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Check with PAL to see what the i-cache line size is for
each level of the cache, and so use the correct stride
when flushing the cache.
Acked-by: David Mosberger
Signed-off-by: Tony Luck <tony.luck@intel.com>
ACPI 3.0 added a Correctable Platform Error Interrupt (CPEI)
Processor Overide flag to MADT.Platform_Interrupt_Source.
Record the processor that was provided as hint from ACPI.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Current assign_irq_vector() will panic if interrupt vectors is running
out. But I think how to handle the case of lack of interrupt vectors
should be handled by the caller of this function. For example, some
PCI devices can raise the interrupt signal via both MSI and I/O
APIC. So even if the driver for these device fails to allocate a
vector for MSI, the driver still has a chance to use I/O APIC based
interrupt. But currently there is no chance for these driver to use
I/O APIC based interrupt because kernel will panic when
assign_irq_vector() fails to allocate interrupt vector.
The following patch changes assign_irq_vector() for ia64 to return
-ENOSPC on error instead of panic (as i386 and x86_64 versions do).
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
changing CONFIG_LOCALVERSION rebuilds too much, for no appearent reason.
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Jesse Barnes provided the original version of this patch months ago, but
other changes kept conflicting with it, so it got deferred. Greg Edwards
dug it out of obscurity just over a week ago, and almost immediately
another conflicting patch appeared (Bob Picco's memory-less nodes).
I've resolved the conflicts and got it running again. CONFIG_SGI_TIOCX
is set to "y" in defconfig, which causes a Tiger to not boot (oops in
tiocx_init). But that can be resolved later ... get this in now before it
gets stale again.
Signed-off-by: Tony Luck <tony.luck@intel.com>
restore_sigcontext calls ia64_set_local_fpu_owner() which requires that
preempt be disabled.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The following renames arch_init, a kprobes function for performing any
architecture specific initialization, to arch_init_kprobes in order to
cleanup the namespace.
Also, this patch adds arch_init_kprobes to sparc64 to fix the sparc64 kprobes
build from the last return probe patch.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There's another problem shown up by Ingo's recent patch to make
smp_processor_id() complain if it's called with preemption enabled.
local_finish_flush_tlb_mm() calls activate_context() in a situation
where it could be rescheduled to another processor. This patch
disables preemption around the call.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch greatly speeds up the handling of lfetch.fault instructions
which result in NaT consumption. Due to the NaT-page mapped at address
0, this is guaranteed to happen when lfetch.fault'ing a NULL pointer.
With this patch in place, we can even define prefetch()/prefetchw() as
lfetch.fault without significant performance degradation. More
importantly, it allows compilers to be more aggressive with using
lfetch.fault on pointers that might be NULL.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Resend 2 with changes per Bjorn Helgaas comments. Changes from original:
+ Change globals to vga_console_iobase/vga_console_membase and make them
unconditional.
+ Address style-related comments.
Patch to extend the PCDP vga setup code to support PCI io/mem translations
for the legacy vga ioport and ram spaces on architectures (e.g. altix) which
need them.
Summary of the changes:
drivers/firmware/pcdp.c
drivers/firmware/pcdp.h
-----------------------
+ add declaration for the spec-defined PCI interface struct (pcdp_if_pci)
as well as support macros.
+ extend setup_vga_console() to know about pcdp_if_pci and add a couple of
globals to hold the io and mem translation offsets if present.
arch/ia64/kernel/setup.c
------------------------
+ tweek early_console_setup() to allow multiple early console setup routines
to be called.
include/asm-ia64/vga.h
----------------------
+ make VGA_MAP_MEM vga_console_membase aware
Signed-off-by: Mark Maule <maule@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This is an ia64 implementation of acpi_register_ioapic() and
acpi_unregister_ioapic() interfaces.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch adds the following new interfaces for I/O xAPIC
hotplug. The implementation of these interfaces depends on each
architecture.
o int acpi_register_ioapic(acpi_handle handle, u64 phys_addr,
u32 gsi_base);
This new interface is to add a new I/O xAPIC specified by
phys_addr and gsi_base pair. phys_addr is the physical address
to which the I/O xAPIC is mapped and gsi_base is global system
interrupt base of the I/O xAPIC. acpi_register_ioapic returns
0 on success, or negative value on error.
o int acpi_unregister_ioapic(acpi_handle handle, u32 gsi_base);
This new interface is to remove a I/O xAPIC specified by
gsi_base. acpi_unregister_ioapic returns 0 on success, or
negative value on error.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Not safe to insert kprobes on IVT code.
This patch checks to see if the address on which Kprobes is being inserted is
in ivt code and if it is in ivt code then refuse to register kprobe.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Acked-by: David Mosberger <davidm@napali.hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Without the ability to atomically write 16 bytes, we can not update the
middle slot of a bundle, slot 1, unless we stop the machine first. This
patch will ensure the ability to robustly insert and remove a kprobe by
refusing to insert a kprobe on slot 1 until a mechanism is in place to
safely handle this case.
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch implements function return probes for ia64 using
the revised design. With this new design we no longer need to do some
of the odd hacks previous required on the last ia64 return probe port
that I sent out for comments.
Note that this new implementation still does not resolve the problem noted
by Keith Owens where backtrace data is lost after a return probe is hit.
Changes include:
* Addition of kretprobe_trampoline to act as a dummy function for instrumented
functions to return to, and for the return probe infrastructure to place
a kprobe on on, gaining control so that the return probe handler
can be called, and so that the instruction pointer can be moved back
to the original return address.
* Addition of arch_init(), allowing a kprobe to be registered on
kretprobe_trampoline
* Addition of trampoline_probe_handler() which is used as the pre_handler
for the kprobe inserted on kretprobe_implementation. This is the function
that handles the details for calling the return probe handler function
and returning control back at the original return address
* Addition of arch_prepare_kretprobe() which is setup as the pre_handler
for a kprobe registered at the beginning of the target function by
kernel/kprobes.c so that a return probe instance can be setup when
a caller enters the target function. (A return probe instance contains
all the needed information for trampoline_probe_handler to do it's job.)
* Hooks added to the exit path of a task so that we can cleanup any left-over
return probe instances (i.e. if a task dies while inside a targeted function
then the return probe instance was reserved at the beginning of the function
but the function never returns so we need to mark the instance as unused.)
Signed-off-by: Rusty Lynch <rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This updates the CFQ io scheduler to the new time sliced design (cfq
v3). It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes. It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls. The latter closely mimic
set/getpriority.
This import is based on my latest from -mm.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ia64 changes similar to kernel/sched.c.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do some basic initial tuning.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dead CPU notifies online CPU that it's dead using cpu_state variable.
After switching to physical cpu hotplug, we forgot setting the variable.
This patch fixes it. Currently only __cpu_die uses it. We changed other
locations for consistency in case others use it.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference being
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map and a long delay in order to ensure that we never have any
queued external interrupts on the APICs. There are additional changes to s390
and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch includes IA64 architecture specific changes(ported form i386) to
support temporary disarming on reentrancy of probes.
In case of reentrancy we single step without calling user handler.
Signed-of-by: Anil S Keshavamurth <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Once the jprobe instrumented function returns, it executes a jprobe_break
which is a break instruction with __IA64_JPROBE_BREAK value. The current
patch checks for this break value, before assuming that jprobe instrumented
function just completed.
The previous code was not checking for this value and that was a bug.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current kprobes does not yet handle register kprobes on some of the
following kind of instruction which needs to be emulated in a special way.
1) mov r1=ip
2) chk -- Speculation check instruction
This patch attempts to fail register_kprobes() when user tries to insert
kprobes on the above kind of instruction.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current Kprobes when patching the original instruction with the break
instruction tries to retain the original qualifying predicate(qp), however
for cmp.crel.ctype where ctype == unc, which is a special instruction
always needs to be executed irrespective of qp. Hence, if the instruction
we are patching is of this type, then we should not copy the original qp to
the break instruction, this is because we always want the break fault to
happen so that we can emulate the instruction.
This patch is based on the feedback given by David Mosberger
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
arch_prepare_kprobes() was doing lots of functionality
in just one single function. This patch
attempts to clean up arch_prepare_kprobes() by moving
specific sub task to the following (new)functions
1)valid_kprobe_addr() -->> validate the given kprobe address
2)get_kprobe_inst(slot..)->> Retrives the instruction for a given slot from the bundle
3)prepare_break_inst() -->> Prepares break instruction within the bundle
3a)update_kprobe_inst_flag()-->>Updates the internal flags, required
for proper emulation of the instruction at later
point in time.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a bug where a kprobe still fires when the instruction is predicated
off. So given the p6=0, and we have an instruction like:
(p6) move loc1=0
we should not be triggering the kprobe. This is handled by carrying over
the qp section of the original instruction into the break instruction.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A cleanup of the ia64 kprobes implementation such that all of the bundle
manipulation logic is concentrated in arch_prepare_kprobe().
With the current design for kprobes, the arch specific code only has a
chance to return failure inside the arch_prepare_kprobe() function.
This patch moves all of the work that was happening in arch_copy_kprobe()
and most of the work that was happening in arch_arm_kprobe() into
arch_prepare_kprobe(). By doing this we can add further robustness checks
in arch_arm_kprobe() and refuse to insert kprobes that will cause problems.
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is required to support kprobe on branch/call instructions.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds IA64 architecture specific JProbes support on top of Kprobes
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is an IA64 arch specific handling of Kprobes
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As many of you know that kprobes exist in the main line kernel for various
architecture including i386, x86_64, ppc64 and sparc64. Attached patches
following this mail are a port of Kprobes and Jprobes for IA64.
I have tesed this patches for kprobes and Jprobes and this seems to work fine.
I have tested this patch by inserting kprobes on various slots and various
templates including various types of branch instructions.
I have also tested this patch using the tool
http://marc.theaimsgroup.com/?l=linux-kernel&m=111657358022586&w=2 and the
kprobes for IA64 works great.
Here is list of TODO things and pathes for the same will appear soon.
1) Support kprobes on "mov r1=ip" type of instruction
2) Support Kprobes and Jprobes to exist on the same address
3) Support Return probes
3) Architecture independent cleanup of kprobes
This patch adds the kdebug die notification mechanism needed by Kprobes.
For break instruction on Branch type slot, imm21 is ignored and value
zero is placed in IIM register, hence we need to handle kprobes
for switch case zero.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Rusty Lynch <Rusty.lynch@intel.com>
From: Rusty Lynch <rusty.lynch@intel.com>
At the point in traps.c where we recieve a break with a zero value, we can
not say if the break was a result of a kprobe or some other debug facility.
This simple patch changes the informational string to a more correct "break
0" value, and applies to the 2.6.12-rc2-mm2 tree with all the kprobes
patches that were just recently included for the next mm cut.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains the ia64 uncached page allocator and the generic
allocator (genalloc). The uncached allocator was formerly part of the SN2
mspec driver but there are several other users of it so it has been split
off from the driver.
The generic allocator can be used by device driver to manage special memory
etc. The generic allocator is based on the allocator from the sym53c8xx_2
driver.
Various users on ia64 needs uncached memory. The SGI SN architecture requires
it for inter-partition communication between partitions within a large NUMA
cluster. The specific user for this is the XPC code. Another application is
large MPI style applications which use it for synchronization, on SN this can
be done using special 'fetchop' operations but it also benefits non SN
hardware which may use regular uncached memory for this purpose. Performance
of doing this through uncached vs cached memory is pretty substantial. This
is handled by the mspec driver which I will push out in a seperate patch.
Rather than creating a specific allocator for just uncached memory I came up
with genalloc which is a generic purpose allocator that can be used by device
drivers and other subsystems as they please. For instance to handle onboard
device memory. It was derived from the sym53c7xx_2 driver's allocator which
is also an example of a potential user (I am refraining from modifying sym2
right now as it seems to have been under fairly heavy development recently).
On ia64 memory has various properties within a granule, ie. it isn't safe to
access memory as uncached within the same granule as currently has memory
accessed in cached mode. The regular system therefore doesn't utilize memory
in the lower granules which is mixed in with device PAL code etc. The
uncached driver walks the EFI memmap and pulls out the spill uncached pages
and sticks them into the uncached pool. Only after these chunks have been
utilized, will it start converting regular cached memory into uncached memory.
Hence the reason for the EFI related code additions.
Signed-off-by: Jes Sorensen <jes@wildopensource.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the core of the (much simplified) early reclaim. The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.
One of the major uses of this is NUMA machines. With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.
This adds a zone tuneable to enable early zone reclaim. It is selected on a
per-zone basis and is turned on/off via syscall.
Adding some extra throttling on the reclaim was also required (patch
4/4). Without the machine would grind to a crawl when doing a "make -j"
kernel build. Even with this patch the System Time is higher on
average, but it seems tolerable. Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
wall user sys %cpu ctx sw. sleeps
---- ---- --- ---- ------ ------
No patch 1009 1384 847 258 298170 504402
w/patch, no reclaim 880 1376 667 288 254064 396745
w/patch & reclaim 1079 1385 926 252 291625 548873
These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot. Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.
I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.
Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).
The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes handling of accesses to ar.rsc via ptrace & restore_sigcontext
[With Thanks to Chris Wright for noticing the restore_sigcontext path]
Signed-off-by: Matthew Chapman <matthewc@hp.com>
Acked-by: David Mosberger <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The nested_dtlb_miss handler currently does not handle fault from
hugetlb address correctly. It walks the page table assuming PAGE_SIZE.
Thus when taking a fault triggered from hugetlb address, it would not
calculate the pgd/pmd/pte address correctly and thus result an incorrect
invocation of ia64_do_page_fault(). In there, kernel will signal SIGBUS
and application dies (The faulting address is perfectly legal and we
have a valid pte for the corresponding user hugetlb address as well).
This patch fix the described kernel bug. Since nested_dtlb_miss is a
rare event and a slow path anyway, I'm making the change without #ifdef
CONFIG_HUGETLB_PAGE for code readability. Tony, please apply.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
printk() calls should include appropriate KERN_* constant.
Signed-off-by: Christophe Lucas <clucas@rotomalug.org>
Signed-off-by: Domen Puncer <domen@coderock.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The latest assembler catches this typo. (reported by Jim Wilson).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
current->blocked will be set to the value of current->thread_info->flags if the
cmpxchg to update thread_info->flags fails. For performance reasons the store into
current->blocked was placed in the cmpxchg loop. However, the cmpxchg overwrites the
register holding the value to be stored. In the rare case of a retry the value of
thread_info->flags will be written into current->blocked.
The fix is to use another register so that the register containing the current->blocked
value is not overwritten.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There've been reports of problems with CONFIG_PREEMPT=y and the high
floating point partition. This is caused by the possibility of preemption
and rescheduling on a different processor while saving or restioirng the
high partition.
The only places where the FPU state is touched are in ptrace, in
switch_to(), and where handling a floating-point exception. In switch_to()
preemption is off. So it's only in trap.c and ptrace.c that we need to
prevent preemption.
Here is a patch that adds commentary to make the conditions clear, and adds
appropriate preempt_{en,dis}able() calls to make it so. In trap.c I use
preempt_enable_no_resched(), as we're about to return to user space where
the preemption flag will be checked anyway.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
break.b does not store the break number in cr.iim, instead it stores 0,
which makes all break.b instructions look like BUG(). Extract the
break number from the instruction itself.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Christian Hildner pointed out that the comment did not match what the
code does in cpu_init() when we set up the default control register.
Patch based on suggestions from Ken Chen.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Some bits of the kernel assume that gp always points to valid memory,
in particular PHYSICAL_MODE_ENTER() assumes that both gp and sp are
valid virtual addresses with associated physical pages. The IA64
module loader puts gp well past the end of the module, with no physical
backing. Offsets on gp are still valid, but physical mode addressing
breaks for modules. Ensure that gp always falls within the module
body. Also ensure that gp is 8 byte aligned.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The attached patch cleans up a compilation warning when ACPI
is turned off (i.e., when compiling for the Ski simulator).
Signed-off-by: Tony Luck <tony.luck@intel.com>
In IA64 kernel, sys_mmap calls do_mmap2 and do_mmap2 returns addr if
len=0, which means the mmap sys call succeeds.
Posix.1 says:
The mmap() function shall fail if:
[EINVAL] The value of len is zero.
Here is a patch to fix it.
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Acked-by: David Mosberger <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
I applied the penultimate version of the perfmon patch, which didn't have
the initialization of the new spinlock that was added.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Patch from Charles Spirakis
Some linux customers want to optimize their applications on the latest
hardware but are not yet willing to upgrade to the latest kernel. This
patch provides a way to plug in an alternate, basic, and GPL'ed PMU
subsystem to help with their monitoring needs or for specialty work. It
can also be used in case of serious unexpected bugs in perfmon. Mutual
exclusion between the two subsystems is guaranteed, hence no conflict
can arise from both subsystem being present.
Acked-by: Stephane Eranian <eranian@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Fix convert_to_non_syscall() so it arranges for the kernel to be left
via ia64_leave_kernel() rather than ia64_leave_syscall(). The latter
no longer tolerates being called with pSys=0 and pNonSys=1.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
acpi_request_vector() is called in ia64_mca_init() to get the cpe_vector.
The problem is that acpi_request_vector() looks in platform_intr_list[] to
get the vector, but platform_intr_list[] is not initialized with a valid
vector until later (in sn_setup()). Without a valid vector the code
defaults to polling mode.
This patch moves the call to acpi_request_vector() from ia64_mca_init()
to ia64_mca_late_init(), which is after platform_intr_list[] is initialized.
Signed-off-by: Russ Anderson (rja@sgi.com)
Signed-off-by: Tony Luck <tony.luck@intel.com>
convert_to_non_syscall() has the same problem that unwind_to_user()
used to have. Fix it likewise.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Some time ago, GAS was fixed to bring the .spillpsp directive in line
with the Intel assembler manual (there was some disagreement as to
whether or not there is a built-in 16-byte offset). Unfortunately,
there are two places in the kernel where this directive is used in
handwritten assembly files and those of course relied on the "buggy"
behavior. As a result, when using a "fixed" assembler, the kernel
picks up the UNaT bits from the wrong place (off by 16) and randomly
sets NaT bits on the scratch registers. This can be noticed easily by
looking at a coredump and finding various scratch registers with
unexpected NaT values. The patch below fixes this by using the
.spillsp directive instead, which works correctly no matter what
assembler is in use.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
I noticed this typo when trying to compile a kernel which had
CONFIG_HOTPLUG turned off. In that case, __devinit is no longer a
no-op and the compiler then detects a section-conflict. Fix by using
__devinitdata instead of __devinit.
Same patch also submitted by Darren Williams to fix compilation error
using sim_defconfig (which has CONFIG_HOTPLUG=n).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Darren Williams <dsw@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Without this patch, the stack is placed _below_ the current task
structure, which is risky at best.
Tony, I think this patch needs to go into 2.6.12, since it fixes a
real bug. Without it, INIT may case secondary errors, which would be
most unpleasant.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
- pfm_context_load(): change return value from EINVAL to EBUSY
when context is already loaded.
- pfm_check_task_state(): pass test if context state is MASKED.
It is safe to give access on PFM_CTX_MASKED because the PMU
state (PMD) is stable and saved in software state.
This helps multiplexing programs such as the example given
in libpfm-3.1.
Signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The pmu_active test is based on the values of PSR.up. THIS IS THE PROBLEM as
it does not take into account the lazy restore logic which is as follow (simplified):
context switch out:
save PMDs
clear psr.up
release ownership
context switch in:
if (ctx->last_cpu == smp_processor_id() && ctx->cpu_activation == cpu_activation) {
set psr.up
return
}
restore PMD
restore PMC
ctx->last_cpu = smp_processor_id();
ctx->activation = ++cpu_activation;
set psr.up
The key here is that on context switch out, we clear psr.up and on context switch in
we check if nobody else used the PMU on that processor since last time we came. In
that case, we assume the PMD/PMC are ours and we simply reactivate.
The Caliper problem is that between the moment we context switch out and the moment we
come back, nobody effectively used the PMU BUT the processor went idle. Normally this
would have no incidence but PAL_HALT does alter the PMU registers. In default_idle(),
the test on psr.up is not strong enough to cover this case and we go into PAL which
trashed the PMU resgisters. When we come back we falsely assume that this is our state
yet it is corrupted. Very nasty indeed.
To avoid the problem it is necessary to forbid going to PAL_HALT as soon as perfmon
installs some valid state in the PMU registers. This happens with an application
attaches a context to a thread or CPU. It is not enough to check the psr/dcr bits.
Hence I propose the attached patch. It adds a callback in process.c to modify the
condition to enter PAL on idle. Basically, now it is conditional to pal_halt=1 AND
perfmon saying it is okay.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Jack Steiner uncovered some opportunities for improvement in
the MCA recovery code.
1) Set bsp to save registers on the kernel stack.
2) Disable interrupts while in the MCA recovery code.
3) Change the way the user process is killed, to avoid
a panic in schedule.
Testing shows that these changes make the recovery code much
more reliable with the 2.6.12 kernel.
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Attached is a patch against David's audit.17 kernel that adds checks
for the TIF_SYSCALL_AUDIT thread flag to the ia64 system call and
signal handling code paths. The patch enables auditing of system
calls set up via fsys_bubble_down, as well as ensuring that
audit_syscall_exit() is called on return from sigreturn.
Neglecting to check for TIF_SYSCALL_AUDIT at these points results in
incorrect information in audit_context, causing frequent system panics
when system call auditing is enabled on an ia64 system.
I have tested this patch and have seen no problems with it.
[Original patch from Amy Griffis ported to current kernel by David Woodhouse]
From: Amy Griffis <amy.griffis@hp.com>
From: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Andi noted that during normal runtime cpu_idle_map is bounced around a lot,
and occassionally at a higher frequency than the timer interrupt wakeup
which we normally exit pm_idle from. So switch to a percpu variable.
I didn't move things to the slow path because it would involve adding
scheduler code to wakeup the idle thread on the cpus we're waiting for.
Signed-off-by: Zwane Mwaikambo <zwane@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch simplifies a couple places where we search for _PXM
values in ACPI namespace. Thanks,
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Patch below fixes 3 trivial typos which are caught by the new
assembler (v2.169.90). Please apply.
[Note: fix to memcpy that was also part of this patch was separately
applied from patches by H.J. and Andreas ... so the delta here only
has the other two fixes. -Tony]
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Convert most of the current code that uses _NSIG directly to instead use
valid_signal(). This avoids gcc -W warnings and off-by-one errors.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Attached is a patch against David's audit.17 kernel that adds checks
for the TIF_SYSCALL_AUDIT thread flag to the ia64 system call and
signal handling code paths.The patch enables auditing of system
calls set up via fsys_bubble_down, as well as ensuring that
audit_syscall_exit() is called on return from sigreturn.
Neglecting to check for TIF_SYSCALL_AUDIT at these points results in
incorrect information in audit_context, causing frequent system panics
when system call auditing is enabled on an ia64 system.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
We were calling ptrace_notify() after auditing the syscall and arguments,
but the debugger could have _changed_ them before the syscall was actually
invoked. Reorder the calls to fix that.
While we're touching ever call to audit_syscall_entry(), we also make it
take an extra argument: the architecture of the syscall which was made,
because some architectures allow more than one type of syscall.
Also add an explicit success/failure flag to audit_syscall_exit(), for
the benefit of architectures which return that in a condition register
rather than only returning a single register.
Change type of syscall return value to 'long' not 'int'.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Yanmin Zhang pointed out a sequence problem when saving the psr. David
Mosberger provided this patch (which gave up a cycle).
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch switches the srlz.i in ia64_leave_kernel() to srlz.d. As
per architecture manual, the former is needed only to ensure that the
clearing of PSR.IC is seen by the VHPT for subsequent instruction
fetches. However, since the remainder of the code (up to and
including the RFI instruction) is mapped by a pinned TLB entry, there
is no chance of an iTLB miss and we don't care whether or not the VHPT
sees PSR.IC cleared. Since srlz.d is substantially cheaper than
srlz.i, this should shave off a few cycles off the interrupt path
(unverified though; I'm not setup to measure this at the moment).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch changes comments & formatting only. There is no code
change.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Improvements come from eliminating srlz.i, not scheduling AR/CR-reads
too early (while there are others still pending), scheduling the
backing-store switch as well as possible, splitting the BBB bundle
into a MIB/MBB pair.
Why is it safe to eliminate the srlz.i? Observe
that we used to clear bits ~PSR_PRESERVED_BITS in PSR.L. Since
PSR_PRESERVED_BITS==PSR.{UP,MFL,MFH,PK,DT,PP,SP,RT,IC}, we
ended up clearing PSR.{BE,AC,I,DFL,DFH,DI,DB,SI,TB}. However,
PSR.BE : already is turned off in __kernel_syscall_via_epc()
PSR.AC : don't care (kernel normally turns PSR.AC on)
PSR.I : already turned off by the time fsys_bubble_down gets invoked
PSR.DFL: always 0 (kernel never turns it on)
PSR.DFH: don't care --- kernel never touches f32-f127 on its own
initiative
PSR.DI : always 0 (kernel never turns it on)
PSR.SI : always 0 (kernel never turns it on)
PSR.DB : don't care --- kernel never enables kernel-level breakpoints
PSR.TB : must be 0 already; if it wasn't zero on entry to
__kernel_syscall_via_epc, the branch to fsys_bubble_down
will trigger a taken branch; the taken-trap-handler then
converts the syscall into a break-based system-call.
In other words: all the bits we're clearying are either 0 already or
are don't cares! Thus, we don't have to write PSR.L at all and we
don't have to do a srlz.i either.
Good for another ~20 cycle improvement for EPC-based heavy-weight
syscalls.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Two other very minor changes: use "mov.i" instead of "mov" for reading
ar.pfs (for clarity; doesn't affect the code at all). Also, predicate
the load of r14 for consistency.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Avoid some stalls, which is good for about 2 cycles when invoking a
light-weight handler. When invoking a heavy-weight handler, this
helps by about 7 cycles, with most of the improvement coming from the
improved branch-prediction achieved by splitting the BBB bundle into
two MIB bundles.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch reorganizes break_fault() to optimistically assume that a
system-call is being performed from user-space (which is almost always
the case). If it turns out that (a) we're not being called due to a
system call or (b) we're being called from within the kernel, we fixup
the no-longer-valid assumptions in non_syscall() and .break_fixup(),
respectively.
With this approach, there are 3 major phases:
- Phase 1: Read various control & application registers, in
particular the current task pointer from AR.K6.
- Phase 2: Do all memory loads (load system-call entry,
load current_thread_info()->flags, prefetch
kernel register-backing store) and switch
to kernel register-stack.
- Phase 3: Call ia64_syscall_setup() and invoke
syscall-handler.
Good for 26-30 cycles of improvement on break-based syscall-path.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reschedule code to read ar.bsp as early as possible. To enable this,
don't bother clearing some of the registers when we're returning to
kernel stacks. Also, instead of trying to support the pNonSys case
(which makes no sense), do a bugcheck instead (with break 0). Finally,
remove a clear of r14 which is a left-over from the previous patch.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Using stf8 seemed like a clever idea at the time, but stf8 forces
the cache-line to be invalidated in the L1D (if it happens to be
there already). This patch eliminates a guaranteed L1D cache-miss
and, by itself, is good for a 1-2 cycle improvement for heavy-weight
syscalls.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Why is this a good idea? Clearing b7 to 0 is guaranteed to do us no
good and writing it with __kernel_syscall_via_epc() yields a 6 cycle
improvement _if_ the application performs another EPC-based system-
call without overwriting b7, which is not all that uncommon. Well
worth the minimal cost of 1 bundle of code.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Decreases syscall overhead by approximately 6 cycles.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This by itself is good for a 1-2 cycle speed up. Effect is bigger
when combined with the later patches.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
vector sharing patch had a typo ... mismatched spin_lock() with
a spin_unlock_irq(). Fix from Kenji Kaneshige.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Rohit and Suresh changed their mind about the order to print things
in /proc/cpuinfo, but didn't include the change in the version of
the patch they sent to me.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Current ia64 linux cannot handle greater than 184 interrupt sources
because of the lack of vectors. The following patch enables ia64 linux
to handle greater than 184 interrupt sources by allowing the same
vector number to be shared by multiple IOSAPIC's RTEs. The design of
this patch is besed on "Intel(R) Itanium(R) Processor Family Interrupt
Architecture Guide".
Even if you don't have a large I/O system, you can see the behavior of
vector sharing by changing IOSAPIC_LAST_DEVICE_VECTOR to fewer value.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Version 3 - rediffed to apply on top of Ashok's hotplug cpu
patch. /proc/cpuinfo output in step with x86.
This is an updated MC/MT identification patch based on the
previous discussions on list.
Add the Multi-core and Multi-threading detection for IPF.
- Add new core and threading related fields in /proc/cpuinfo.
Physical id
Core id
Thread id
Siblings
- setup the cpu_core_map and cpu_sibling_map appropriately
- Handles Hot plug CPU
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Gordon Jin <gordon.jin@intel.com>
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Sadly, I goofed in this syscall-tuning patch:
ChangeSet 1.1966.1.40 2005/01/22 13:31:05 davidm@hpl.hp.com
[IA64] Improve ia64_leave_syscall() for McKinley-type cores.
Optimize ia64_leave_syscall() a bit better for McKinley-type cores.
The patch looks big, but that's mostly due to renaming r16/r17 to r2/r3.
Good for a 13 cycle improvement.
The problem is that the size of the physical stacked registers was
loaded into the wrong register (r3 instead of r17). Since r17 by
coincidence always had the value 1, this had the effect of turning
rse_clear_invalid into a no-op. That poses the risk of leaking kernel
state back to user-land and is hence not acceptable.
The fix below is simple, but unfortunately it costs us about 28 cycles
in syscall overhead. ;-(
Unfortunately, there isn't much we can do about that since those
registers have to be cleared one way or another.
--david
Signed-off-by: Tony Luck <tony.luck@intel.com>
- make pfm_sysctl a global such that it is possible
to enable/disable debug printk in sampling formats
using PFM_DEBUG.
- remove unused pfm_debug_var variable
- fix a bug in pfm_handle_work where an BUG_ON() could
be triggered. There is a path where pfm_handle_work()
can be called with interrupts enabled, i.e., when
TIF_NEED_RESCHED is set. The fix correct the masking
and unmasking of interrupts in pfm_handle_work() such
that we restore the interrupt mask as it was upon entry.
signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Recently I noticed that clearing ar.ssd/ar.csd right before srlz.d is
causing significant stalling in the syscall path. The patch below
fixes that by moving the register-writes after srlz.d. On a Madison,
this drops break-based getpid() from 241 to 226 cycles (-15 cycles).
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Detect user space by the unwind frame with predicate PRED_USER_STACK
set, instead of a user space IP. Tighten up the last ditch check for
running off the top of the kernel stack.
Based on a suggestion by David Mosberger, reworked to fit the current
tree. This survives my stress test which used to break 2.6.9 kernels.
Unlike 2.6.11, the stress test now unwinds to the correct point, so
gdb can get the user space registers.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Call cpu_relax() in busy-waiting loops of the ITC-syncing code.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch is required to support cpu removal for IPF systems. Existing code
just fakes the real offline by keeping it run the idle thread, and polling
for the bit to re-appear in the cpu_state to get out of the idle loop.
For the cpu-offline to work correctly, we need to pass control of this CPU
back to SAL so it can continue in the boot-rendez mode. This gives the
SAL control to not pick this cpu as the monarch processor for global MCA
events, and addition does not wait for this cpu to checkin with SAL
for global MCA events as well. The handoff is implemented as documented in
SAL specification section 3.2.5.1 "OS_BOOT_RENDEZ to SAL return State"
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!