This is needed for large multinode IBM systems which have a sparse
APIC space in clustered mode, fully covering the available 8 bits.
The previous kernels would limit the local APIC number to 127,
which caused it to reject some of the CPUs at boot.
I increased the maximum and shrunk the apic_version array a bit
to make up for that (the version is only 8 bit, so don't need
an full int to store)
Cc: Chris McDermott <lcm@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With a NR_CPUS==128 kernel with CPU hotplug enabled we would waste 4MB
on per CPU data of all possible CPUs. The reason was that HOTPLUG
always set up possible map to NR_CPUS cpus and then we need to allocate
that much (each per CPU data is roughly ~32k now)
The underlying problem is that ACPI didn't tell us how many hotplug CPUs
the platform supports. So the old code just assumed all, which would
lead to this memory wastage.
This implements some new heuristics:
- If the BIOS specified disabled CPUs in the ACPI/mptables assume they
can be enabled later (this is bending the ACPI specification a bit,
but seems like a obvious extension)
- The user can overwrite it with a new additionals_cpus=NUM option
- Otherwise use half of the available CPUs or 2, whatever is more.
Cc: ashok.raj@intel.com
Cc: len.brown@intel.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should zap the low mappings, as soon as possible, so that we can catch
kernel bugs more effectively. Previously early boot had NULL mapped
and didn't trap on NULL references.
This patch introduces boot_level4_pgt, which will always have low identity
addresses mapped. Druing boot, all the processors will use this as their
level4 pgt. On BP, we will switch to init_level4_pgt as soon as we enter C
code and zap the low mappings as soon as we are done with the usage of
identity low mapped addresses. On AP's we will zap the low mappings as
soon as we jump to C code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's a patch that builds on Natalie Protasevich's IRQ compression
patch and tries to work for MPS boots as well as ACPI. It is meant for
a 4-node IBM x460 NUMA box, which was dying because it had interrupt
pins with GSI numbers > NR_IRQS and thus overflowed irq_desc.
The problem is that this system has 270 GSIs (which are 1:1 mapped with
I/O APIC RTEs) and an 8-node box would have 540. This is much bigger
than NR_IRQS (224 for both i386 and x86_64). Also, there aren't enough
vectors to go around. There are about 190 usable vectors, not counting
the reserved ones and the unused vectors at 0x20 to 0x2F. So, my patch
attempts to compress the GSI range and share vectors by sharing IRQs.
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... and with that all instances in arch/x86_64 are gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch adds boundary check for the MAX_GSI_NUM. Same as the update for
i386, the patch addresses a problem with ACPI SCI IRQ. The patch corrects
the code such that SCI IRQ is skipped and duplicate entry is avoided. The
VIA chipset uses 4-bit IRQ register for internal interrupt routing, and
therefore cannot handle IRQ numbers assigned to its devices. The patch
corrects this problem by allowing PCI IRQs below 16.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Various code needs this information now before the actual SMP bootup. Instead
of computing it on the fly while booting the other CPUs set it up now while
initial MPtable/MADT parsing.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I suggest to change the way IRQs are handed out to PCI devices.
Currently, each I/O APIC pin gets associated with an IRQ, no matter if the
pin is used or not. It is expected that each pin can potentually be
engaged by a device inserted into the corresponding PCI slot. However,
this imposes severe limitation on systems that have designs that employ
many I/O APICs, only utilizing couple lines of each, such as P64H2 chipset.
It is used in ES7000, and currently, there is no way to boot the system
with more that 9 I/O APICs.
The simple change below allows to boot a system with say 64 (or more) I/O
APICs, each providing 1 slot, which otherwise impossible because of the IRQ
gaps created for unused lines on each I/O APIC. It does not resolve the
problem with number of devices that exceeds number of possible IRQs, but
eases up a tension for IRQs on any large system with potentually large
number of devices.
I only implemented this for the ACPI boot, since if the system is this big
and using newer chipsets it is probably (better be!) an ACPI based system
:). The change is completely "mechanical" and does not alter any internal
structures or interrupt model/implementation. The patch works for both
i386 and x86_64 archs. It works with MSIs just fine, and should not
intervene with implementations like shared vectors, when they get worked
out and incorporated.
To illustrate, below is the interrupt distribution for 2-cell ES7000 with
20 I/O APICs, and an Ethernet card in the last slot, which should be eth1
and which was not configured because its IRQ exceeded allowable number (it
actially turned out huge - 480!):
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 65716 30012 30007 30002 30009 30010 30010 30010 IO-APIC-edge timer
4: 373 0 725 280 0 0 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 3 0 0 0 0 0 0 IO-APIC-edge ide0
16: 108 13 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
18: 0 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb3
19: 15 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
96: 4240 397 18 0 0 0 0 0 IO-APIC-level aic7xxx
97: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx
192: 847 0 0 0 0 0 0 0 IO-APIC-level eth0
NMI: 0 0 0 0 0 0 0 0
LOC: 273423 274528 272829 274228 274092 273761 273827 273694
ERR: 7
MIS: 0
Even though the system doesn't have that many devices, some don't get
enabled only because of IRQ numbering model.
This is the IRQ picture after the patch was applied:
zorro-tb2:~ # cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 44169 10004 10004 10001 10004 10003 10004 6135 IO-APIC-edge timer
4: 345 0 0 0 0 244 0 0 IO-APIC-edge serial
8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
14: 39 0 3 0 0 0 0 0 IO-APIC-edge ide0
17: 4425 0 9 0 0 0 0 0 IO-APIC-level aic7xxx
18: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx, uhci_hcd:usb3
21: 231 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
22: 26 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
24: 348 0 0 0 0 0 0 0 IO-APIC-level eth0
25: 6 192 0 0 0 0 0 0 IO-APIC-level eth1
NMI: 0 0 0 0 0 0 0 0
LOC: 107981 107636 108899 108698 108489 108326 108331 108254
ERR: 7
MIS: 0
Not only we see the card in the last I/O APIC, but we are not even close to
using up available IRQs, since we didn't waste any.
Signed-off-by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Define pcibus_to_node to be able to figure out which NUMA node contains a
given PCI device. This defines pcibus_to_node(bus) in
include/linux/topology.h and adjusts the macros for i386 and x86_64 that
already provided a way to determine the cpumask of a pci device.
x86_64 was changed to not build an array of cpumasks anymore. Instead an
array of nodes is build which can be used to generate the cpumask via
node_to_cpumask.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the assumption that LAPIC entries contain the BSP as its
first entry. This is a slight improvement to the temporary fix submitted by
Suresh Siddha.
- Removes assumption that LAPIC entries contain BSP first.
- Builds x86_acpiid_to_apicid[] and bios_cpu_apicid[] properly with BSP as
first entry.
- Made maxcpus=1 boot on these systems. Since the parsing earlier in
arch/x86_64/kernel/mpparse.c stopped after maxcpus entries, other entries
were not processed, this causes kernel not to boot on these systems.
TBD: x86_acpiid_to_apicid and bios_cpu_apicid[] seem to be exactly the
same. This could be removed, but might need more work to cleanup.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It is unnecessary on modern Intel or AMD systems, and that is all we support
on x86-64
Also causes problems on various systems
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!