Make zonelist creation policy selectable from sysctl/boot option v6.
This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.
[Default Order]
The kernel selects Node-based or Zone-based order automatically.
[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.
Pros.
* A user can expect local memory as much as possible.
Cons.
* lower zone will be exhansted before higher zone. This may cause OOM_KILL.
Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.
(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.
Pros.
* OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
* memory locality may not be best.
(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
command:
%echo N > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Node-based order.
command:
%echo Z > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Zone-based order.
Thanks to Lee Schermerhorn, he gives me much help and codes.
[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Beacuse SERIAL_PORT_DFNS is removed from include/asm-i386/serial.h and
include/asm-x86_64/serial.h. the serial8250_ports need to be probed late in
serial initializing stage. the console_init=>serial8250_console_init=>
register_console=>serial8250_console_setup will return -ENDEV, and console
ttyS0 can not be enabled at that time. need to wait till uart_add_one_port in
drivers/serial/serial_core.c to call register_console to get console ttyS0.
that is too late.
Make early_uart to use early_param, so uart console can be used earlier. Make
it to be bootconsole with CON_BOOT flag, so can use console handover feature.
and it will switch to corresponding normal serial console automatically.
new command line will be:
console=uart8250,io,0x3f8,9600n8
console=uart8250,mmio,0xff5e0000,115200n8
or
earlycon=uart8250,io,0x3f8,9600n8
earlycon=uart8250,mmio,0xff5e0000,115200n8
it will print in very early stage:
Early serial console at I/O port 0x3f8 (options '9600n8')
console [uart0] enabled
later for console it will print:
console handover: boot [uart0] -> real [ttyS0]
Signed-off-by: <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Needed to get fixed virtual address for USB debug and earlycon with mmio.
Signed-off-by: Eric W. Biderman <ebiderman@xmisson.com>
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for earlyprintk=ttyS0,9600 console=tty0 console=ttyS0,9600n8
the handover will happen from earlyser0 to tty0. but what we want is to
hand over to ttyS0.
Later with serial-convert-early_uart-to-earlycon-for-8250.patch,
console=tty0 console=uart8250,io,0x3f8,9600n8
will handover to ttyS0 instead of tty0.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change name to buf according to the usage as name + index
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some RS-232 devices require DTR to be asserted before they can be used. DTR
is normally asserted in uart_startup() when the port is opened. But we don't
actually open serial console ports, so assert DTR when the port is added.
BTW:
earlyprintk and early_uart are hard coded to set DTR/RTS.
rmk says
The only issue I can think of is the possibility for an attached modem to
auto-answer or maybe even auto-dial before the system is ready for it to do
so. Might have an undesirable cost implication for some running with such a
setup.
Apart from that, I can't think of any other side effect of this specific
patch.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Andi Kleen <ak@suse.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove all ids from the given idr tree. idr_destroy() only frees up
unused, cached idp_layers, but this function will remove all id mappings
and leave all idp_layers unused.
A typical clean-up sequence for objects stored in an idr tree, will use
idr_for_each() to free all objects, if necessay, then idr_remove_all() to
remove all ids, and idr_destroy() to free up the cached idr_layers.
Signed-off-by: Kristian Hoegsberg <krh@redhat.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds an iterator function for the idr data structure. Compared
to just iterating through the idr with an integer and idr_find, this
iterator is (almost, but not quite) linear in the number of elements, as
opposed to the number of integers in the range covered by the idr. This
makes a difference for sparse idrs, but more importantly, it's a nicer way
to iterate through the elements.
The drm subsystem is moving to idr for tracking contexts and drawables, and
with this change, we can use the idr exclusively for tracking these
resources.
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Kristian Hoegsberg <krh@redhat.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This version brings a number of new checks, fixes for flase
positives, plus a clarification of the output to better guide use. Of
note:
- checks for documentation for new __setup calls
- clearer reporting where braces and parenthesis are involved
- reports for closing brace and semi-colon spacing
- reports on unwanted externs
This patch includes an update to the documentation on checkpatch.pl
itself to clarify when it should be used and to indicate that it
is not intended as the final arbitor of style.
Full changelog:
Andy Whitcroft (19):
Version: 0.07
ensure we do not apply control brace checks to preprocesor directives
add {u,s}{8,16,32,64} to the type matcher
accept lack of spacing after the semicolons in for (;;)
report new externs in .c files
fix up typedef exclusion for function prototypes
else trailing statements check need to account for \ at end of line
add enums to the type matcher
add missing check descriptions
suppress double reporting of ** spacing
report on do{ spacing issues
include an example of the brace/parenthesis in output
check for spacing after closing braces
prevent double reports on pointer spacing issues
handle blank continuation lines on macros
classify all reports error, warning, or check
revamp hanging { checks and apply in context
no spaces after the last ; in a for is ok
check __setup has a corresponding addition to documentation
David Woodhouse (1):
limit character set used in patches and descriptions to UTF-8
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a correction for a macro which gives worst case compressed data
size by LZO1X.
This patch was provided by the LZO author (Markus Oberhumer).
Signed-off-by: Nitin Gupta <nitingupta910@gmail.com>
Cc: "Markus F.X.J. Oberhumer" <markus@oberhumer.com>
Cc: "Richard Purdie" <rpurdie@openedhand.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add entries to MAINTAINERS for I/OAT and DMAENGINE
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Cc: Chris Leech <christopher.leech@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have to check that also the second checkpoint list is non-empty before
dropping the transaction.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have to check that also the second checkpoint list is non-empty before
dropping the transaction.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> reported that he's noticed
nfsd read corruption in recent kernels, and did the hard work of
discovering that it's due to splice updating the file position twice.
This means that the next operation would start further ahead than it
should.
nfsd_vfs_read()
splice_direct_to_actor()
while(len) {
do_splice_to() [update sd->pos]
-> generic_file_splice_read() [read from sd->pos]
nfsd_direct_splice_actor()
-> __splice_from_pipe() [update sd->pos]
There's nothing wrong with the core splice code, but the direct
splicing is an addon that calls both input and output paths.
So it has to take care in locally caching offset so it remains correct.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes some code that became dead code after the ATARI_ACSI
removal.
It also indirectly fixes the following bug introduced by
commit c2bcf3b897:
config ATARI_SLM
tristate "Atari SLM laser printer support"
- depends on ATARI && ATARI_ACSI!=n
+ depends on ATARI
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
the pci device list for umem was not using PCI_DEVICE, so the
subvendor/subdevice fields were not set to ANY, so matching
didn't work properly.
Change to use PCI_DEVICE.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch removes the documentation for the removed legacy CDROM drivers.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Mirror the logic in 8250 for proper console write locking
when SYSRQ is triggered or an OOPS is in progress.
Signed-off-by: David S. Miller <davem@davemloft.net>
When cpu_up() fails, we can discern the most likely cause.
If cpu_present() is false, this means the cpu did not appear
in the MD. If -ENODEV is the error return value, then
the processor did not boot properly into the kernel.
Pass this information back in the dr-cpu response packet.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we hot-plug in new cpus, the core_id and proc_id of existing
cpus can change. So in order to set the cpu groups correctly we
need to clear the maps out completely first.
Signed-off-by: David S. Miller <davem@davemloft.net>
dr-cpu unconfigure requests will walk throught he enabled
IRQs and trigger ->set_affinity so that the going-down
cpu no longer has INOs targetted to it.
Signed-off-by: David S. Miller <davem@davemloft.net>
Take a page from the powerpc folks and just calculate the
delay factor directly.
Since frequency scaling chips use a system-tick register,
the value is going to be the same system-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the move of ldom_startcpu_cpuid() into smp.c some other
things need to follow along:
1) smp.c is not a driver so we can't use "PFX" macro in the
printk calls.
2) smp.c now needs asm/io.h and asm/hvtramp.h, ds.c no longer
does
3) kimage_addr_to_ra() also needs to move into smp.c
While we're here, update copyright info and my email address
in smp.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not select HOTPLUG_CPU from SUN_LDOMS, that causes
HOTPLUG_CPU to be selected even on non-SMP which is
illegal.
Only build hvtramp.o when SMP, just like trampoline.o
Protect dr-cpu code in ds.c with HOTPLUG_CPU.
Likewise move ldom_startcpu_cpuid() to smp.c and protect
it and the call site with SUN_LDOMS && HOTPLUG_CPU.
Signed-off-by: David S. Miller <davem@davemloft.net>
The VIO drivers register themselves unconditionally just
like those of any other bus type, so to avoid crashes
on non-VIO systems we need to always register vio_bus_type.
Signed-off-by: David S. Miller <davem@davemloft.net>
Only adding cpus is supports at the moment, removal
will come next.
When new cpus are configured, the machine description is
updated. When we get the configure request we pass in a
cpu mask of to-be-added cpus to the mdesc CPU node parser
so it only fetches information for those cpus. That code
also proceeds to update the SMT/multi-core scheduling bitmaps.
cpu_up() does all the work and we return the status back
over the DS channel.
CPUs via dr-cpu need to be booted straight out of the
hypervisor, and this requires:
1) A new trampoline mechanism. CPUs are booted straight
out of the hypervisor with MMU disabled and running in
physical addresses with no mappings installed in the TLB.
The new hvtramp.S code sets up the critical cpu state,
installs the locked TLB mappings for the kernel, and
turns the MMU on. It then proceeds to follow the logic
of the existing trampoline.S SMP cpu bringup code.
2) All calls into OBP have to be disallowed when domaining
is enabled. Since cpus boot straight into the kernel from
the hypervisor, OBP has no state about that cpu and therefore
cannot handle being invoked on that cpu.
Luckily it's only a handful of interfaces which can be called
after the OBP device tree is obtained. For example, rebooting,
halting, powering-off, and setting options node variables.
CPU removal support will require some infrastructure changes
here. Namely we'll have to process the requests via a true
kernel thread instead of in a workqueue. workqueues run on
a per-cpu thread, but when unconfiguring we might need to
force the thread to execute on another cpu if the current cpu
is the one being removed. Removal of a cpu also causes the kernel
to destroy that cpu's workqueue running thread.
Another issue on removal is that we may have interrupts still
pointing to the cpu-to-be-removed. So new code will be needed
to walk the active INO list and retarget those cpus as-needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a special domain services capability for setting
variables in the OBP options node. Guests don't have permanent
store for the OBP variables like a normal system, so they are
instead maintained in the LDOM control node or in the SC.
Signed-off-by: David S. Miller <davem@davemloft.net>
Property values cannot be referenced outside of
mdesc_grab()/mdesc_release() pairs. The only major
offender was the VIO bus layer, easily fixed.
Add some commentary to mdesc.h describing these rules.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have to be able to handle MD updates, having an in-tree
set of data structures representing the MD objects actually makes
things more painful.
The MD itself is easy to parse, and we can implement the existing
interfaces using direct parsing of the MD binary image.
The MD is now reference counted, so accesses have to now take the
form:
handle = mdesc_grab();
... operations on MD ...
mdesc_release(handle);
The only remaining issue are cases where code holds on to references
to MD property values. mdesc_get_property() returns a direct pointer
to the property value, most cases just pull in the information they
need and discard the pointer, but there are few that use the pointer
directly over a long lifetime. Those will be fixed up in a subsequent
changeset.
A preliminary handler for MD update events from domain services is
there, it is rudimentry but it works and handles all of the reference
counting. It does not check the generation number of the MDs,
and it does not generate a "add/delete" list for notification to
interesting parties about MD changes but that will be forthcoming.
Signed-off-by: David S. Miller <davem@davemloft.net>
All of the interrupts say "LDX RX" and "LDX TX" currently
which is next to useless. Put a device specific prefix
before "RX" and "TX" instead which makes it much more
useful.
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides the existing usage for power-button interrupts, we'll
want to make use of this code for domain-services where the
LDOM manager can send reboot requests to the guest node.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) LDC_MODE_RELIABLE is deprecated an unused by anything, plus
it and LDC_MODE_STREAM were mis-numbered.
2) read_stream() should try to read as much as possible into
the per-LDC stream buffer area, so do not trim the read_nonraw()
length by the caller's size parameter.
3) Send data ACKs when necessary in read_nonraw().
4) In read_nonraw() when we get a pure ACK, advance the RX head
unconditionally past it.
5) Provide the ACKID field in the ldcdgb() packet dump in read_nonraw().
This helps debugging stream mode LDC channel problems.
6) Decrease verbosity of rx_data_wait() so that it is more useful.
A debugging message each loop iteration is too much.
7) In process_data_ack() stop the loop checking when we hit lp->tx_tail
not lp->tx_head.
8) Set the seqid field properly in send_data_nack().
Signed-off-by: David S. Miller <davem@davemloft.net>
This is also a partial workaround for a bug in the LDOM firmware which
double-transmits RX inos during high load. Without this, such an
event causes the kernel to loop forever in the interrupt call chain
ACK'ing but never actually running the IRQ handler (and thus clearing
the interrupt condition in the device).
There is still a bad potential effect when double INOs occur,
not covered by this changeset. Namely, if the INO is already on
the per-cpu INO vector list, we still blindly re-insert it and
thus we can end up losing interrupts already linked in after
it.
We could deal with that by traversing the list before insertion,
but that's too expensive for this edge case.
Signed-off-by: David S. Miller <davem@davemloft.net>
Virtual devices on Sun Logical Domains are built on top
of a virtual channel framework. This, with help of hypervisor
interfaces, provides a link layer protocol with basic
handshaking over which virtual device clients and servers
communicate.
Built on top of this is a VIO device protocol which has it's
own handshaking and message types. At this layer attributes
are exchanged (disk size, network device addresses, etc.)
descriptor rings are registered, and data transfers are
triggers and replied to.
Signed-off-by: David S. Miller <davem@davemloft.net>
while changing task_stime() i noticed a whitespace style problem in
array.c - fix it. While at it, fix all the other style problems too,
most of them in the scheduler-stats related portions of array.c.
There is no change in functionality:
text data bss dec hex filename
4356 28 0 4384 1120 array.o-before
4356 28 0 4384 1120 array.o-after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
improve the comments around the wmult array (which controls the weight
of niced tasks). Clarify that to achieve a 10% difference in CPU
utilization, a weight multiplier of 1.25 has to be used.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alexey Dobriyan noticed that task_stime() contains a piece of dead code.
(which is a remnant of earlier versions of this code) Remove that code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>