I suspect Christoph tested his code only in the NUMA configuration, for
the combination of SLAB+non-NUMA the zero-sized kmalloc's would not work.
Of course, this would only trigger in configurations where those zero-
sized allocations happen (not very common), so that may explain why it
wasn't more widely noticed.
Seen by by Andi Kleen under qemu, and there seems to be a report by
Michael Tsirkin on it too.
Cc: Andi Kleen <ak@suse.de>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lguest does some fairly lowlevel things to support a host, which
normal modules don't need:
math_state_restore:
When the guest triggers a Device Not Available fault, we need
to be able to restore the FPU
__put_task_struct:
We need to hold a reference to another task for inter-guest
I/O, and put_task_struct() is an inline function which calls
__put_task_struct.
access_process_vm:
We need to access another task for inter-guest I/O.
map_vm_area & __get_vm_area:
We need to map the switcher shim (ie. monitor) at 0xFFC01000.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page-writeback accounting is presently performed in the page-flags macros.
This is inconsistent and a bit ugly and makes it awkward to implement
per-backing_dev under-writeback page accounting.
So move this accounting down to the callsite(s).
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use appropriate accessor function to set compound page destructor
function.
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fix to that race in alloc_fresh_huge_page() which could give an illegal
node ID did not need nid_lock at all: the fix was to replace static int nid
by static int prev_nid and do the work on local int nid. nid_lock did make
sure that racers strictly roundrobin the nodes, but that's not something we
need to enforce strictly. Kill nid_lock.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've noticed lots of failures of vmalloc_32 on machines where it
shouldn't have failed unless it was doing an atomic operation.
Looking closely, I noticed that:
#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA32
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA
#else
#define GFP_VMALLOC32 GFP_KERNEL
#endif
Which seems to be incorrect, it should always -or- in the DMA flags
on top of GFP_KERNEL, thus this patch.
This fixes frequent errors launchin X with the nouveau DRM for example.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Work around a possible bug in the FRV compiler.
What appears to be happening is that gcc resolves the
__builtin_constant_p() in kmalloc() to true, but then fails to reduce the
therefore constant conditions in the if-statements it guards to constant
results.
When compiling with -O2 or -Os, one single spurious error crops up in
cpuup_callback() in mm/slab.c. This can be avoided by making the memsize
variable const.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/hugetlb.c: In function `dequeue_huge_page':
mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
the old mm into the new mm.
We create the new mm before the binfmt code runs, and place the new stack at
the very top of the address space. Once the binfmt code runs and figures out
where the stack should be, we move it downwards.
It is a bit peculiar in that we have one task with two mm's, one of which is
inactive.
[a.p.zijlstra@chello.nl: limit stack size]
Signed-off-by: Ollie Wild <aaw@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
[bunk@stusta.de: unexport bprm_mm_init]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename some file_ra_state variables and remove some accessors.
It results in much simpler code.
Kudos to Rusty!
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split ondemand readahead interface into two functions. I think this makes it
a little clearer for non-readahead experts (like Rusty).
Internally they both call ondemand_readahead(), but the page argument is
changed to an obvious boolean flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Share the same page flag bit for PG_readahead and PG_reclaim.
One is used only on file reads, another is only for emergency writes. One
is used mostly for fresh/young pages, another is for old pages.
Combinations of possible interactions are:
a) clear PG_reclaim => implicit clear of PG_readahead
it will delay an asynchronous readahead into a synchronous one
it actually does _good_ for readahead:
the pages will be reclaimed soon, it's readahead thrashing!
in this case, synchronous readahead makes more sense.
b) clear PG_readahead => implicit clear of PG_reclaim
one(and only one) page will not be reclaimed in time
it can be avoided by checking PageWriteback(page) in readahead first
c) set PG_reclaim => implicit set of PG_readahead
will confuse readahead and make it restart the size rampup process
it's a trivial problem, and can mostly be avoided by checking
PageWriteback(page) first in readahead
d) set PG_readahead => implicit set of PG_reclaim
PG_readahead will never be set on already cached pages.
PG_reclaim will always be cleared on dirtying a page.
so not a problem.
In summary,
a) we get better behavior
b,d) possible interactions can be avoided
c) racy condition exists that might affect readahead, but the chance
is _really_ low, and the hurt on readahead is trivial.
Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the old readahead algorithm.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert filemap reads to use on-demand readahead.
The new call scheme is to
- call readahead on non-cached page
- call readahead on look-ahead page
- update prev_index when finished with the read request
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. Also it is full integrated with adaptive readahead.
It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead
In this way it eliminated the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows to
one single window.
HEURISTICS
The logic deals with four cases:
- sequential-next
found a consistent readahead window, so push it forward
- random
standalone small read, so read as is
- sequential-first
create a new readahead window for a sequential/oversize request
- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up
In each case, three parameters are determined:
- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)
BEHAVIORS
The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:
- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.
- Interleaved reads are supported in a minimal way.
Their chances of being detected and proper handled are still low.
- Readahead thrashings are better handled.
The current readahead leads to tiny average I/O sizes, because it
never turn back for the thrashed pages. They have to be fault in
by do_generic_mapping_read() one by one. Whereas the on-demand
readahead will redo readahead for them.
OVERHEADS
The new code reduced the overheads of
- excessively calling the readahead routine on small sized reads
(the current readahead code insists on seeing all requests)
- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits,
unfortunately most files are < 1MB, so never see that chance)
That accounts for speedup of
- 0.3% on 1-page sequential reads on sparse file
- 1.2% on 1-page cache hot sequential reads
- 3.2% on 256-page cache hot sequential reads
- 1.3% on cache hot `tar /lib`
However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's 1% overheads for 1-page random reads on
sparse file.
PERFORMANCE
The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk
The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times
iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k
2.6.20 on-demand gain
first run
" Initial write " 61437.27 64521.53 +5.0%
" Rewrite " 47893.02 48335.20 +0.9%
" Read " 62111.84 62141.49 +0.0%
" Re-read " 62242.66 62193.17 -0.1%
" Reverse Read " 50031.46 49989.79 -0.1%
" Stride read " 8657.61 8652.81 -0.1%
" Random read " 13914.28 13898.23 -0.1%
" Mixed workload " 19069.27 19033.32 -0.2%
" Random write " 14849.80 14104.38 -5.0%
" Pwrite " 62955.30 65701.57 +4.4%
" Pread " 62209.99 62256.26 +0.1%
second run
" Initial write " 60810.31 66258.69 +9.0%
" Rewrite " 49373.89 57833.66 +17.1%
" Read " 62059.39 62251.28 +0.3%
" Re-read " 62264.32 62256.82 -0.0%
" Reverse Read " 49970.96 50565.72 +1.2%
" Stride read " 8654.81 8638.45 -0.2%
" Random read " 13901.44 13949.91 +0.3%
" Mixed workload " 19041.32 19092.04 +0.3%
" Random write " 14019.99 14161.72 +1.0%
" Pwrite " 64121.67 68224.17 +6.4%
" Pread " 62225.08 62274.28 +0.1%
In summary, writes are unstable, reads are pretty close on average:
access pattern 2.6.20 on-demand gain
Read 62085.61 62196.38 +0.2%
Re-read 62253.49 62224.99 -0.0%
Reverse Read 50001.21 50277.75 +0.6%
Stride read 8656.21 8645.63 -0.1%
Random read 13907.86 13924.07 +0.1%
Mixed workload 19055.29 19062.68 +0.0%
Pread 62217.53 62265.27 +0.1%
aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
2.6.20 on-demand delta
sequential 92.57s 92.54s -0.0%
random 311.87s 312.15s +0.1%
sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run
threads 2.6.20 on-demand delta
first run
1 59.1974s 59.2262s +0.0%
2 58.0575s 58.2269s +0.3%
4 48.0545s 47.1164s -2.0%
8 41.0684s 41.2229s +0.4%
16 35.8817s 36.4448s +1.6%
32 32.6614s 32.8240s +0.5%
64 23.7601s 24.1481s +1.6%
128 24.3719s 23.8225s -2.3%
256 23.2366s 22.0488s -5.1%
second run
1 59.6720s 59.5671s -0.2%
8 41.5158s 41.9541s +1.1%
64 25.0200s 23.9634s -4.2%
256 22.5491s 20.9486s -7.1%
Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:
sum all up 495.046s 491.514s -0.7%
sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run
10000-transactions run
threads 2.6.20 on-demand gain
1 62.81 64.56 +2.8%
2 67.97 70.93 +4.4%
4 81.81 85.87 +5.0%
8 94.60 97.89 +3.5%
16 99.07 104.68 +5.7%
32 95.93 104.28 +8.7%
64 96.48 103.68 +7.5%
5000-transactions run
1 48.21 48.65 +0.9%
8 68.60 70.19 +2.3%
64 70.57 74.72 +5.9%
2000-transactions run
1 37.57 38.04 +1.3%
2 38.43 38.99 +1.5%
4 45.39 46.45 +2.3%
8 51.64 52.36 +1.4%
16 54.39 55.18 +1.5%
32 52.13 54.49 +4.5%
64 54.13 54.61 +0.9%
That's interesting results. Some investigations show that
- MySQL is accessing the db file non-uniformly: some parts are
more hot than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row, the latter one triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) there. Many of them will be hit, and trigger
more readahead pages. Which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas is more likely to be hit.
- The more overall read density, the more possible gain.
That also explains the adaptive readahead tricks for clustered random reads.
readahead thrashing: 3 times better
===================================
We boot kernel with "mem=128m single", and start a 100KB/s stream on every
second, until reaching 200 streams.
max throughput min avg I/O size
2.6.20: 5MB/s 16KB
on-demand: 15MB/s 140KB
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extend struct file_ra_state to support the on-demand readahead logic. Also
define some helpers for it.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define two convenient macros for read-ahead:
- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD
Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
page sizes like 64k.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add look-ahead support to __do_page_cache_readahead().
It works by
- mark the Nth backwards page with PG_readahead,
(which instructs the page's first reader to invoke readahead)
- and only do the marking for newly allocated pages.
(to prevent blindly doing readahead on already cached pages)
Look-ahead is a technique to achieve I/O pipelining:
While the application is working through a chunk of cached pages, the kernel
reads-ahead the next chunk of pages _before_ time of need. It effectively
hides low level I/O latencies to high level applications.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new page flag: PG_readahead.
It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until
it runs out of cached pages!
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
re-dirty through such a mapping will not generate a fault, PG_dirty will
not reflect the dirty state and the dirty count will be skewed. This
implies that msync() is also currently broken for nonlinear mappings.
The easiest solution is to emulate remap_file_pages on non-linear mappings
with simple mmap() for non ram-backed filesystems. Applications continue
to work (albeit slower), as long as the number of remappings remain below
the maximum vma count.
However all currently known real uses of non-linear mappings are for ram
backed filesystems, which this patch doesn't affect.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix msync data loss and (less importantly) dirty page accounting
inaccuracies due to the race remaining in clear_page_dirty_for_io().
The deleted comment explains what the race was, and the added comments
explain how it is fixed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch completes Linus's wish that the fault return codes be made into
bit flags, which I agree makes everything nicer. This requires requires
all handle_mm_fault callers to be modified (possibly the modifications
should go further and do things like fault accounting in handle_mm_fault --
however that would be for another patch).
[akpm@linux-foundation.org: fix alpha build]
[akpm@linux-foundation.org: fix s390 build]
[akpm@linux-foundation.org: fix sparc build]
[akpm@linux-foundation.org: fix sparc64 build]
[akpm@linux-foundation.org: fix ia64 build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Still apparently needs some ARM and PPC loving - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).
This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.
struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.
The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__do_fault() was calling ->page_mkwrite() with the page lock held, which
violates the locking rules for that callback. Release and retake the page
lock around the callback to avoid deadlocking file systems which manually
take it.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.
->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).
Having the populate handler install the pte itself is likewise a nasty thing
to be doing.
This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.
The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.
After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.
NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.
[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate/release a chunk of vmalloc address space:
alloc_vm_area reserves a chunk of address space, and makes sure all
the pagetables are constructed for that address range - but no pages.
free_vm_area releases the address space range.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: "Andi Kleen" <ak@muc.de>
Add a kstrndup function, modelled on strndup. Like strndup this
returns a string copied into its own allocated memory, but it copies
no more than the specified number of bytes from the source.
Remove private strndup() from irda code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@mandriva.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Panagiotis Issaris <takis@issaris.org>
Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Our original NFSv4 delegation policy was to give out a read delegation on any
open when it was possible to.
Since the lifetime of a delegation isn't limited to that of an open, a client
may quite reasonably hang on to a delegation as long as it has the inode
cached. This becomes an obvious problem the first time a client's inode cache
approaches the size of the server's total memory.
Our first quick solution was to add a hard-coded limit. This patch makes a
mild incremental improvement by varying that limit according to the server's
total memory size, allowing at most 4 delegations per megabyte of RAM.
My quick back-of-the-envelope calculation finds that in the worst case (where
every delegation is for a different inode), a delegation could take about
1.5K, which would make the worst case usage about 6% of memory. The new limit
works out to be about the same as the old on a 1-gig server.
[akpm@linux-foundation.org: Don't needlessly bloat vmlinux]
[akpm@linux-foundation.org: Make it right for highmem machines]
Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
currently the export_operation structure and helpers related to it are in
fs.h. fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.
[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSYM_NAME_LEN is peculiar in that it does not include the space for the
trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating
buffer. This is nonsense and error-prone. Moreover, when the caller
forgets that it's very likely to subtly bite back by corrupting the stack
because the last position of the buffer is always cleared to zero.
This patch increments KSYM_NAME_LEN by one and updates code accordingly.
* off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro
is fixed.
* Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together,
MODULE_NAME_LEN was treated as if it didn't include space for the
trailing '\0'. Fix it.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Paulo Marques <pmarques@grupopie.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The bounce buffer logic is included on systems that do not need it. If a
system does not have zones like ZONE_DMA and ZONE_HIGHMEM that can lead to
the use of bounce buffers then there is no need to reserve memory pools etc
etc. This is true f.e. for SGI Altix.
Also nicifies the Makefile and gets rid of the tricky "and" there.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the freezer treats all tasks as freezable, except for the kernel
threads that explicitly set the PF_NOFREEZE flag for themselves. This
approach is problematic, since it requires every kernel thread to either
set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
care for the freezing of tasks at all.
It seems better to only require the kernel threads that want to or need to
be frozen to use some freezer-related code and to remove any
freezer-related code from the other (nonfreezable) kernel threads, which is
done in this patch.
The patch causes all kernel threads to be nonfreezable by default (ie. to
have PF_NOFREEZE set by default) and introduces the set_freezable()
function that should be called by the freezable kernel threads in order to
unset PF_NOFREEZE. It also makes all of the currently freezable kernel
threads call set_freezable(), so it shouldn't cause any (intentional)
change of behaviour to appear. Additionally, it updates documentation to
describe the freezing of tasks more accurately.
[akpm@linux-foundation.org: build fixes]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is a bug to set a page dirty if it is not uptodate unless it has
buffers. If the page has buffers, then the page may be dirty (some buffers
dirty) but not uptodate (some buffers not uptodate). The exception to this
rule is if the set_page_dirty caller is racing with truncate or invalidate.
A buffer can not be set dirty if it is not uptodate.
If either of these situations occurs, it indicates there could be some data
loss problem. Some of these warnings could be a harmless one where the
page or buffer is set uptodate immediately after it is dirtied, however we
should fix those up, and enforce this ordering.
Bring the order of operations for truncate into line with those of
invalidate. This will prevent a page from being able to go !uptodate while
we're holding the tree_lock, which is probably a good thing anyway.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that
embedded systems start to use NUMA we may need this.
Put an #ifdef around places where NUMA only code uses fields only valid
for CONFIG_SLUB_DEBUG.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysfs can do a gazillion things when called. Make sure that we do not call
any sysfs functions while holding the slub_lock.
Just protect the essentials:
1. The list of all slab caches
2. The kmalloc_dma array
3. The ref counters of the slabs.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The objects per slab increase with the current patches in mm since we allow up
to order 3 allocs by default. More patches in mm actually allow to use 2M or
higher sized slabs. For slab validation we need per object bitmaps in order
to check a slab. We end up with up to 64k objects per slab resulting in a
potential requirement of 8K stack space. That does not look good.
Allocate the bit arrays via kmalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past. But with __GFP_ZERO it is possible now to do zeroing
while allocating.
Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever
we can.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It becomes now easy to support the zeroing allocs with generic inline
functions in slab.h. Provide inline definitions to allow the continued use of
kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing
functions from the slab allocators and util.c.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can get to the length of the object through the kmem_cache_structure. The
additional parameter does no good and causes the compiler to generate bad
code.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do proper spacing and we only need to do this in steps of 8.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to caculate the dma slab size ourselves. We can simply
lookup the size of the corresponding non dma slab.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_index is a long series of comparisons. The attempt to replace
kmalloc_index with something more efficient like ilog2 failed due to compiler
issues with constant folding on gcc 3.3 / powerpc.
kmalloc_index()'es long list of comparisons works fine for constant folding
since all the comparisons are optimized away. However, SLUB also uses
kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
This leads to a large set of comparisons in get_slab().
The patch here allows to get rid of that list of comparisons in get_slab():
1. If the requested size is larger than 192 then we can simply use
fls to determine the slab index since all larger slabs are
of the power of two type.
2. If the requested size is smaller then we cannot use fls since there
are non power of two caches to be considered. However, the sizes are
in a managable range. So we divide the size by 8. Then we have only
24 possibilities left and then we simply look up the kmalloc index
in a table.
Code size of slub.o decreases by more than 200 bytes through this patch.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We modify the kmalloc_cache_dma[] array without proper locking. Do the proper
locking and undo the dma cache creation if another processor has already
created it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rarely used dma functionality in get_slab() makes the function too
complex. The compiler begins to spill variables from the working set onto the
stack. The created function is only used in extremely rare cases so make sure
that the compiler does not decide on its own to merge it back into get_slab().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add #ifdefs around data structures only needed if debugging is compiled into
SLUB.
Add inlines to small functions to reduce code size.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A kernel convention for many allocators is that if __GFP_ZERO is passed to an
allocator then the allocated memory should be zeroed.
This is currently not supported by the slab allocators. The inconsistency
makes it difficult to implement in derived allocators such as in the uncached
allocator and the pool allocators.
In addition the support zeroed allocations in the slab allocators does not
have a consistent API. There are no zeroing allocator functions for NUMA node
placement (kmalloc_node, kmem_cache_alloc_node). The zeroing allocations are
only provided for default allocs (kzalloc, kmem_cache_zalloc_node).
__GFP_ZERO will make zeroing universally available and does not require any
addititional functions.
So add the necessary logic to all slab allocators to support __GFP_ZERO.
The code is added to the hot path. The gfp flags are on the stack and so the
cacheline is readily available for checking if we want a zeroed object.
Zeroing while allocating is now a frequent operation and we seem to be
gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the
allocators. Move ZERO_SIZE_PTR related stuff into slab.h.
Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
WARN_ON_ONCE(size == 0) that is still remaining in SLAB.
Make slub return NULL like the other allocators if a too large memory segment
is requested via __kmalloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of a kmalloc object is readily available via ksize(). ksize is
provided by all allocators and thus we can implement krealloc in a generic
way.
Implement krealloc in mm/util.c and drop slab specific implementations of
krealloc.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function we are calling to initialize object debug state during early NUMA
bootstrap sets up an inactive object giving it the wrong redzone signature.
The bootstrap nodes are active objects and should have active redzone
signatures.
Currently slab validation complains and reverts the object to active state.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently SLUB has no provision to deal with too high page orders that may
be specified on the kernel boot line. If an order higher than 6 (on a 4k
platform) is generated then we will BUG() because slabs get more than 65535
objects.
Add some logic that decreases order for slabs that have too many objects.
This allow booting with slab sizes up to MAX_ORDER.
For example
slub_min_order=10
will boot with a default slab size of 4M and reduce slab sizes for small
object sizes to lower orders if the number of objects becomes too big.
Large slab sizes like that allow a concentration of objects of the same
slab cache under as few as possible TLB entries and thus potentially
reduces TLB pressure.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have to do an GFP_ATOMIC allocation because the list_lock is
already taken when we first allocate memory for tracking allocation
information. It would be better if we could avoid atomic allocations.
Allocate a size of the tracking table that is usually sufficient (one page)
before we take the list lock. We will then only do the atomic allocation
if we need to resize the table to become larger than a page (mostly only
needed under large NUMA because of the tracking of cpus and nodes otherwise
the table stays small).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry() instead of list_for_each().
Get rid of for_all_slabs(). It had only one user. So fold it into the
callback. This also gets rid of cpu_slab_flush.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changes the error reporting format to loosely follow lockdep.
If data corruption is detected then we generate the following lines:
============================================
BUG <slab-cache>: <problem>
--------------------------------------------
INFO: <more information> [possibly multiple times]
<object dump>
FIX <slab-cache>: <remedial action>
This also adds some more intelligence to the data corruption detection. Its
now capable of figuring out the start and end.
Add a comment on how to configure SLUB so that a production system may
continue to operate even though occasional slab corruption occur through
a misbehaving kernel component. See "Emergency operations" in
Documentation/vm/slub.txt.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I can never remember what the function to register to receive VM pressure
is called. I have to trace down from __alloc_pages() to find it.
It's called "set_shrinker()", and it needs Your Help.
1) Don't hide struct shrinker. It contains no magic.
2) Don't allocate "struct shrinker". It's not helpful.
3) Call them "register_shrinker" and "unregister_shrinker".
4) Call the function "shrink" not "shrinker".
5) Reduce the 17 lines of waffly comments to 13, but document it properly.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Chinner <dgc@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we are out of memory of a suitable size we enter reclaim. The current
reclaim algorithm targets pages in LRU order, which is great for fairness at
order-0 but highly unsuitable if you desire pages at higher orders. To get
pages of higher order we must shoot down a very high proportion of memory;
>95% in a lot of cases.
This patch set adds a lumpy reclaim algorithm to the allocator. It targets
groups of pages at the specified order anchored at the end of the active and
inactive lists. This encourages groups of pages at the requested orders to
move from active to inactive, and active to free lists. This behaviour is
only triggered out of direct reclaim when higher order pages have been
requested.
This patch set is particularly effective when utilised with an
anti-fragmentation scheme which groups pages of similar reclaimability
together.
This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
the foundation. Credit to Mel Gorman for sanitity checking.
Mel said:
The patches have an application with hugepage pool resizing.
When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can
be resized with greater reliability. Testing on a desktop machine with 2GB
of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own
was very slow as the success rate was quite low. Without lumpy-reclaim,
each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
[akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
[bunk@stusta.de: static declarations for internal functions]
[a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a new parameter for sizing ZONE_MOVABLE called
movablecore=. While kernelcore= is used to specify the minimum amount of
memory that must be available for all allocation types, movablecore= is
used to specify the minimum amount of memory that is used for migratable
allocations. The amount of memory used for migratable allocations
determines how large the huge page pool could be dynamically resized to at
runtime for example.
How movablecore is actually handled is that the total number of pages in
the system is calculated and a value is set for kernelcore that is
kernelcore == totalpages - movablecore
Both kernelcore= and movablecore= can be safely specified at the same time.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the kernelcore= parameter for x86.
Once all patches are applied, a new command-line parameter exist and a new
sysctl. This patch adds the necessary documentation.
From: Yasunori Goto <y-goto@jp.fujitsu.com>
When "kernelcore" boot option is specified, kernel can't boot up on ia64
because of an infinite loop. In addition, the parsing code can be handled
in an architecture-independent manner.
This patch uses common code to handle the kernelcore= parameter. It is
only available to architectures that support arch-independent zone-sizing
(i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
ignore the boot parameter.
[bunk@stusta.de: make cmdline_parse_kernelcore() static]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
can be used to satisfy hugepage allocations even when the system has been
running a long time. This allows an administrator to resize the hugepage pool
at runtime depending on the size of ZONE_MOVABLE.
This patch adds a new sysctl called hugepages_treat_as_movable. When a
non-zero value is written to it, future allocations for the huge page pool
will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
introduce additional external fragmentation of note as huge pages are always
the largest contiguous block we care about.
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
that is only usable by allocations that specify both __GFP_HIGHMEM and
__GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
single memory partition while allowing movable allocations to be satisfied
from either partition. The patches may be applied with the list-based
anti-fragmentation patches that groups pages together based on mobility.
The size of the zone is determined by a kernelcore= parameter specified at
boot-time. This specifies how much memory is usable by non-movable
allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
When selecting a zone to take pages from for ZONE_MOVABLE, there are two
things to consider. First, only memory from the highest populated zone is
used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
the amount of memory usable by the kernel will be spread evenly throughout
NUMA nodes where possible. If the nodes are not of equal size, the amount of
memory usable by the kernel on some nodes may be greater than others.
By default, the zone is not as useful for hugetlb allocations because they are
pinned and non-migratable (currently at least). A sysctl is provided that
allows huge pages to be allocated from that zone. This means that the huge
page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
the system assuming that pages are not mlocked. Despite huge pages being
non-movable, we do not introduce additional external fragmentation of note as
huge pages are always the largest contiguous block we care about.
Credit goes to Andy Whitcroft for catching a large variety of problems during
review of the patches.
This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
memory continues to be placed in their existing destination as there is no
mechanism to redirect them to a specific zone.
[y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
[akpm@linux-foundation.org: various fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is often known at allocation time whether a page may be migrated or not.
This patch adds a flag called __GFP_MOVABLE and a new mask called
GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated
using the page migration mechanism or reclaimed by syncing with backing
storage and discarding.
An API function very similar to alloc_zeroed_user_highpage() is added for
__GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
flags used by alloc_zeroed_user_highpage() are not changed because it would
change the semantics of an existing API. After this patch is applied there
are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
be marked deprecated if this patch is merged.
Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
shmem.c to keep all flag modifications to inode->mapping in the
shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
Hugh Dickens.
Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
concept. Credit to Hugh Dickens for catching issues with shmem swap vector
and ramfs allocations.
[akpm@linux-foundation.org: build fix]
[hugh@veritas.com: __GFP_ZERO cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_generic_mapping_read currently samples the i_size at the start and doesn't
do so again unless it needs to call ->readpage to load a page. After
->readpage it has to re-sample i_size as a truncate may have caused that page
to be filled with zeros, and the read() call should not see these.
However there are other activities that might cause ->readpage to be called on
a page between the time that do_generic_mapping_read samples i_size and when
it finds that it has an uptodate page. These include at least read-ahead and
possibly another thread performing a read.
So do_generic_mapping_read must sample i_size *after* it has an uptodate page.
Thus the current sampling at the start and after a read can be replaced with
a sampling before the copy-out.
The same change applied to __generic_file_splice_read.
Note that this fixes any race with truncate_complete_page, but does not fix a
possible race with truncate_partial_page. If a partial truncate happens after
do_generic_mapping_read samples i_size and before the copy_out, the nuls that
truncate_partial_page place in the page could be copied out incorrectly.
I think the best fix for that is to *not* zero out parts of the page in
truncate_partial_page, but rather to zero out the tail of a page when
increasing i_size.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (209 commits)
[POWERPC] Create add_rtc() function to enable the RTC CMOS driver
[POWERPC] Add H_ILLAN_ATTRIBUTES hcall number
[POWERPC] xilinxfb: Parameterize xilinxfb platform device registration
[POWERPC] Oprofile support for Power 5++
[POWERPC] Enable arbitary speed tty ioctls and split input/output speed
[POWERPC] Make drivers/char/hvc_console.c:khvcd() static
[POWERPC] Remove dead code for preventing pread() and pwrite() calls
[POWERPC] Remove unnecessary #undef printk from prom.c
[POWERPC] Fix typo in Ebony default DTS
[POWERPC] Check for NULL ppc_md.init_IRQ() before calling
[POWERPC] Remove extra return statement
[POWERPC] pasemi: Don't auto-select CONFIG_EMBEDDED
[POWERPC] pasemi: Rename platform
[POWERPC] arch/powerpc/kernel/sysfs.c: Move NUMA exports
[POWERPC] Add __read_mostly support for powerpc
[POWERPC] Modify sched_clock() to make CONFIG_PRINTK_TIME more sane
[POWERPC] Create a dummy zImage if no valid platform has been selected
[POWERPC] PS3: Bootwrapper support.
[POWERPC] powermac i2c: Use mutex
[POWERPC] Schedule removal of arch/ppc
...
Fixed up conflicts manually in:
Documentation/feature-removal-schedule.txt
arch/powerpc/kernel/pci_32.c
arch/powerpc/kernel/pci_64.c
include/asm-powerpc/pci.h
and asked the powerpc people to double-check the result..
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: (68 commits)
sh: sh-rtc support for SH7709.
sh: Revert __xdiv64_32 size change.
sh: Update r7785rp defconfig.
sh: Export div symbols for GCC 4.2 and ST GCC.
sh: fix race in parallel out-of-tree build
sh: Kill off dead mach.c for hp6xx.
sh: hd64461.h cleanup and added comments.
sh: Update the alignment when 4K stacks are used.
sh: Add a .bss.page_aligned section for 4K stacks.
sh: Don't let SH-4A clobber SH-4 CFLAGS.
sh: Add parport stub for SuperIO ports.
sh: Drop -Wa,-dsp for DSP tuning.
sh: Update dreamcast defconfig.
fb: pvr2fb: A few more __devinit annotations for PCI.
fb: pvr2fb: Fix up section mismatch warnings.
sh: Select IPR-IRQ for SH7091.
sh: Correct __xdiv64_32/div64_32 return value size.
sh: Fix timer-tmu build for SH-3.
sh: Add cpu and mach links to CLEAN_FILES.
sh: Preliminary support for the SH-X3 CPU.
...
Christian Borntraeger points out that mempool_free() doesn't noop when
handed NULL. This is inconsistent with the other free-like functions
in the kernel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Limiting smaller allocation failures by fault injection helps to find real
possible bugs. Because higher order allocations are likely to fail and
zero-order allocations are not likely to fail.
This patch adds min-order parameter to fail_page_alloc. It specifies the
minimum page allocation order to be injected failures.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some users have been having problems with utilities like cp or dd dumping
core when they try to copy a file that's too large for the destination
filesystem (typically, > 4gb). Apparently, some defunct standards required
SIGXFSZ to be sent in such circumstances, but SUS only requires/allows it
for when a written file exceeds the process's resource limits. I'd like to
limit SIGXFSZs to the bare minimum required by SUS.
Patch sent per http://lkml.org/lkml/2007/4/10/302
Signed-off-by: Micah Cowan <micahcowan@ubuntu.com>
Acked-by: Alan Cox <alan@redhat.com>
Cc: <reiserfs-dev@namesys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make some offending drivers depend on it and set CONFIG_ARCH_NO_VIRT_TO_BUS
for ppc64 so that we don't build those drivers.
This gets PowerPC allmodconfig and allyesconfig much closer to building.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@ftp.linux.org.uk>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Be consistent with VM mmap, implement expand_stack(). We can't actually do
anything other than return an error in the no MMU case though.
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a straightforward split of do_mmap_pgoff() into two functions:
- do_mmap_pgoff() checks the parameters, and calculates the vma
flags. Then it calls
- mmap_region(), which does the actual mapping
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The do_loop_readv_writev implementation of readv breaks out of the loop as
soon as a single read request didn't fill it's buffer:
if (nr != len)
break;
The generic_file_aio_read version doesn't. So if it hits EOF before the end
of the list of buffers, it will try again on the next buffer. If the file was
extended in the mean time, this will produce a bad result.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug in mm/mlock.c on 32-bit architectures that prevents a user from
locking more than 4GB of shared memory, or allocating more than 4GB of
shared memory in hugepages, when rlim[RLIMIT_MEMLOCK] is set to
RLIM_INFINITY.
Signed-off-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently slob is disabled if we're using sparsemem, due to an earlier
patch from Goto-san. Slob and static sparsemem work without any trouble as
it is, and the only hiccup is a missing slab_is_available() in the case of
sparsemem extreme. With this, we're rid of the last set of restrictions
for slob usage.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds preliminary NUMA support to SLOB, primarily aimed at systems with
small nodes (tested all the way down to a 128kB SRAM block), whether
asymmetric or otherwise.
We follow the same conventions as SLAB/SLUB, preferring current node
placement for new pages, or with explicit placement, if a node has been
specified. Presently on UP NUMA this has the side-effect of preferring
node#0 allocations (since numa_node_id() == 0, though this could be
reworked if we could hand off a pfn to determine node placement), so
single-CPU NUMA systems will want to place smaller nodes further out in
terms of node id. Once a page has been bound to a node (via explicit node
id typing), we only do block allocations from partial free pages that have
a matching node id in the page flags.
The current implementation does have some scalability problems, in that all
partial free pages are tracked in the global freelist (with contention due
to the single spinlock). However, these are things that are being reworked
for SMP scalability first, while things like per-node freelists can easily
be built on top of this sort of functionality once it's been added.
More background can be found in:
http://marc.info/?l=linux-mm&m=118117916022379&w=2http://marc.info/?l=linux-mm&m=118170446306199&w=2http://marc.info/?l=linux-mm&m=118187859420048&w=2
and subsequent threads.
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the new madvise_need_mmap_write() call we can avoid an extra case
statement and function call as follows.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
start_cpu_timer() should be __cpuinit (which also matches what it's
callers are).
__devinit didn't cause problems, it simply wasted a few bytes of memory
for the common CONFIG_HOTPLUG_CPU=n case.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently zone_spanned_pages_in_node() and zone_absent_pages_in_node() are
non-static for ARCH_POPULATES_NODE_MAP and static otherwise. However, only
the non-static versions are __meminit annotated, despite only being called
from __meminit functions in either case.
zone_init_free_lists() is currently non-static and not __meminit annotated
either, despite only being called once in the entire tree by
init_currently_empty_zone(), which too is __meminit. So make it static and
properly annotated.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This symbol got orphaned quite a while ago.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. which modpost started warning about.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enabling debugging fails to build due to the nodemask variable in
do_mbind() having changed names, and then oopses on boot due to the
assumption that the nodemask can be dereferenced -- which doesn't work out
so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by
numa_default_policy().
This fixes it up, and switches from PDprintk() to pr_debug() while
we're at it.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_user_pages() can try to allocate a nearly unlimited amount of memory on
behalf of a user process, even if that process has been OOM killed. The
OOM kill occurs upon return to user space via a SIGKILL, but
get_user_pages() will try allocate all its memory before returning. Change
get_user_pages() to check for TIF_MEMDIE, and if set then return
immediately.
Signed-off-by: Ethan Solomita <solo@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the default system init memory policy to use a dynamically
created node map instead of defaulting to all online nodes. Nodes of a
certain size (>= 16MB) are judged to be suitable for interleave, and are added
to the map. If all nodes are smaller in size, the largest one is
automatically selected.
Without this, tiny nodes find themselves out of memory before we even make it
to userspace. Systems with large nodes will notice no change.
Only the system init policy is effected by this change, the regular
MPOL_DEFAULT policy is still switched to later on in the boot process as
normal.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new configuration variable
CONFIG_SLUB_DEBUG_ON
If set then the kernel will be booted by default with slab debugging
switched on. Similar to CONFIG_SLAB_DEBUG. By default slab debugging
is available but must be enabled by specifying "slub_debug" as a
kernel parameter.
Also add support to switch off slab debugging for a kernel that was
built with CONFIG_SLUB_DEBUG_ON. This works by specifying
slub_debug=-
as a kernel parameter.
Dave Jones wanted this feature.
http://marc.info/?l=linux-kernel&m=118072189913045&w=2
[akpm@linux-foundation.org: clean up switch statement]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
invalidate_mapping_pages() can sometimes take a long time (millions of pages
to free). Long enough for the softlockup detector to trigger.
We used to have a cond_resched() in there but I took it out because the
drop_caches code calls invalidate_mapping_pages() under inode_lock.
The patch adds a nasty flag and puts the cond_resched() back.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable
for both linear and nonlinear pages with a userspace test harness (using
direct IO and truncate, respectively).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
That static `nid' index needs locking. Without it we can end up calling
alloc_pages_node() with an illegal node ID and the kernel crashes.
Acked-by: gurudas pai <gurudas.pai@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the shrink_list name on some files under mm/ directory.
Signed-off-by: Anderson Briglia <anderson.briglia@indt.org.br>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the core slob allocator's minimum alignment restrictions, and instead
introduce the alignment restrictions at the slab API layer. This lets us heed
the ARCH_KMALLOC/SLAB_MINALIGN directives, and also use __alignof__ (unsigned
long) for the default alignment (which should allow relaxed alignment
architectures to take better advantage of SLOB's small minimum alignment).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the bigblock lists in favour of using compound pages and going directly
to the page allocator. Allocation size is stored in page->private, which also
makes ksize more accurate than it previously was.
Saves ~.5K of code, and 12-24 bytes overhead per >= PAGE_SIZE allocation.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Improve slob by turning the freelist into a list of pages using struct page
fields, then each page has a singly linked freelist of slob blocks via a
pointer in the struct page.
- The first benefit is that the slob freelists can be indexed by a smaller
type (2 bytes, if the PAGE_SIZE is reasonable).
- Next is that freeing is much quicker because it does not have to traverse
the entire freelist. Allocation can be slightly faster too, because we can
skip almost-full freelist pages completely.
- Slob pages are then freed immediately when they become empty, rather than
having a periodic timer try to free them. This gives efficiency and memory
consumption improvement.
Then, we don't encode seperate size and next fields into each slob block,
rather we use the sign bit to distinguish between "size" or "next". Then
size 1 blocks contain a "next" offset, and others contain the "size" in
the first unit and "next" in the second unit.
- This allows minimum slob allocation alignment to go from 8 bytes to 2
bytes on 32-bit and 12 bytes to 2 bytes on 64-bit. In practice, it is
best to align them to word size, however some architectures (eg. cris)
could gain space savings from turning off this extra alignment.
Then, make kmalloc use its own slob_block at the front of the allocation
in order to encode allocation size, rather than rely on not overwriting
slob's existing header block.
- This reduces kmalloc allocation overhead similarly to alignment reductions.
- Decouples kmalloc layer from the slob allocator.
Then, add a page flag specific to slob pages.
- This means kfree of a page aligned slob block doesn't have to traverse
the bigblock list.
I would get benchmarks, but my test box's network doesn't come up with
slob before this patch. I think something is timing out. Anyway, things
are faster after the patch.
Code size goes up about 1K, however dynamic memory usage _should_ be
lower even on relatively small memory systems.
Future todo item is to restore the cyclic free list search, rather than
to always begin at the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_large_system_hash() is called at boot time to allocate space for
several large hash tables.
Lately, TCP hash table was changed and its bucketsize is not a power-of-two
anymore.
On most setups, alloc_large_system_hash() allocates one big page (order >
0) with __get_free_pages(GFP_ATOMIC, order). This single high_order page
has a power-of-two size, bigger than the needed size.
We can free all pages that wont be used by the hash table.
On a 1GB i386 machine, this patch saves 128 KB of LOWMEM memory.
TCP established hash table entries: 32768 (order: 6, 393216 bytes)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This entry prints a header in .start callback. This is OK, but the more
elegant solution would be to move this into the .show callback and use
seq_list_start_head() in .start one.
I have left it as is in order to make the patch just switch to new API and
noting more.
[adobriyan@sw.ru: Wrong pointer was used as kmem_cache pointer]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace a hand coded version of DIV_ROUND_UP().
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nid is initialized to numa_node_id() but will either be overwritten in
the loop or not used in the conditional. So remove the initialization.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make zonelist creation policy selectable from sysctl/boot option v6.
This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.
[Default Order]
The kernel selects Node-based or Zone-based order automatically.
[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.
Pros.
* A user can expect local memory as much as possible.
Cons.
* lower zone will be exhansted before higher zone. This may cause OOM_KILL.
Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.
(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.
Pros.
* OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
* memory locality may not be best.
(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.
*node(0)'s memory allocation order:
node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.
*node(1)'s memory allocation order:
node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.
bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.
command:
%echo N > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Node-based order.
command:
%echo Z > /proc/sys/vm/numa_zonelist_order
Will rebuild zonelist in Zone-based order.
Thanks to Lee Schermerhorn, he gives me much help and codes.
[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new security check on mmap operations to see if the user is attempting
to mmap to low area of the address space. The amount of space protected is
indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
0, preserving existing behavior.
This patch uses a new SELinux security class "memprotect." Policy already
contains a number of allow rules like a_t self:process * (unconfined_t being
one of them) which mean that putting this check in the process class (its
best current fit) would make it useless as all user processes, which we also
want to protect against, would be allowed. By taking the memprotect name of
the new class it will also make it possible for us to move some of the other
memory protect permissions out of 'process' and into the new class next time
we bump the policy version number (which I also think is a good future idea)
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
This patch removes xip_file_sendfile, the sendfile implementation for
xip without replacement. Those customers that use xip on s390 are not
using sendfile() as far as we know, and so far s390 is the only platform
this could potentially be used on so far.
Having sendfile is not a popular feature for execute in place file
systems, however we have a working implementation of splice_read() based
on fs/splice.c if anyone asks for it.
At this point in time, it does not seem preferable to merge
splice_read() for xip because it causes extra maintenence effort due to
code duplication and it requires struct page behind the xip memory
segment. We'd like to get rid of that in favor of supporting flash based
embedded platforms (Monta Vista work) soon.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs
to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice,
loop and sendfile in the simplest way, using generic_file_splice_read and
generic_file_splice_write (with the aid of shmem_prepare_write).
We could make some efficiency tweaks later, if there's a real need;
but this is stable and works well as is.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Fix a post-2.6.21 regression.
read_cache_page_async() has two invocations of mark_page_accessed() which will
launch pages right onto the active list.
Remove the first one, keeping the latter one. This avoids marking unwanted
pages active (in the retry loop).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier
time period where kmem_cache_open was usable outside of slub.
(Fixes powerpc build error)
Signed-off-by: Chrsitoph Lameter <clameter@sgi.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Line up the vmstat_text with zone_stat_item
enum zone_stat_item {
/* First 128 byte cacheline (assuming 64 bit words) */
NR_FREE_PAGES,
NR_INACTIVE,
NR_ACTIVE,
We current have nr_active and nr_inactive reversed.
[ "OK with patch, though using initializers canbe handy to prevent such
things in future:
static const char * const vmstat_text[] = {
[NR_FREE_PAGES] = "nr_free_pages",
..."
- Alexey ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b46b8f19c9 fixed a couple of bugs
by switching the redzone to 64 bits. Unfortunately, it neglected to
ensure that the _second_ redzone, after the slab object, is aligned
correctly. This caused illegal instruction faults on sparc32, which for
some reason not entirely clear to me are not trapped and fixed up.
Two things need to be done to fix this:
- increase the object size, rounding up to alignof(long long) so
that the second redzone can be aligned correctly.
- If SLAB_STORE_USER is set but alignof(long long)==8, allow a
full 64 bits of space for the user word at the end of the buffer,
even though we may not _use_ the whole 64 bits.
This patch should be a no-op on any 64-bit architecture or any 32-bit
architecture where alignof(long long) == 4. Of the others, it's tested
on ppc32 by myself and a very similar patch was tested on sparc32 by
Mark Fortescue, who reported the new problem.
Also, fix the conditions for FORCED_DEBUG, which hadn't been adjusted to
the new sizes. Again noticed by Mark.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we move the local_irq_enable() to the end of the function then
add_partial() in early_kmem_cache_node_alloc() will be called
with interrupts disabled like during regular operations.
This makes lockdep happy.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Andre Noll <maan@systemlinux.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We agreed to remove the WARN_ON_ONCE before 2.6.22 is released.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
validate_anon_vma gave a useful check on the integrity of the anon_vma list
when Andrea was developing obj rmap; but it was not enabled in SLES9
itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that
its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.
That limit was just an arbitrary number to protect against an infinite
loop. We could raise it to something enormous (depending on sizeof struct
vma and size of memory?); but I rather think validate_anon_vma has outlived
its usefulness, and is better just removed - which gives a magnificent
performance boost to anything like Petr's test program ;)
Of course, a very long anon_vma list is bad news for preemption latency,
and I believe there has been one recent report of such: let's not forget
that, but validate_anon_vma only makes it worse not better.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Petr Vandrovec <petr@vmware.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If slabs are allocated or freed from a large set of call sites (typical for
the kmalloc area) then we may create more output than fits into a single
PAGE and sysfs only gives us one page. The output should be truncated.
This patch fixes the checks to do the truncation properly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function expand_upwards() did not guarded against wrapping
around to address 0. This fixes the adjtimex02 testcase from
the Linux Test Project on a 32bit PARISC kernel.
[expand_upwards is only used on parisc and ia64; it looks like it does
the right thing on both. --kyle]
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
because the object size is padded to reach ARCH_KMALLOC_MINALIGN. Thus the
size of the small slabs is all the same.
No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
for some reason wants a 128 byte alignment.
This patch increases the size of the smallest cache if
ARCH_KMALLOC_MINALIGN is greater than 8. In that case more and more of the
smallest caches are disabled.
If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array. So count them separately.
This approach was tested by Havard yesterday.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some changes done a while ago to avoid pounding on ptep_set_access_flags and
update_mmu_cache in some race situations break sun4c which requires
update_mmu_cache() to always be called on minor faults.
This patch reworks ptep_set_access_flags() semantics, implementations and
callers so that it's now responsible for returning whether an update is
necessary or not (basically whether the PTE actually changed). This allow
fixing the sparc implementation to always return 1 on sun4c.
[akpm@linux-foundation.org: fixes, cleanups]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Miller <davem@davemloft.net>
Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The data structure to manage the information gathered about functions
allocating and freeing objects is allocated when the list_lock has already
been taken. We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building with memory hotplug enabled and cpu hotplug disabled, we
end up with the following section mismatch:
WARNING: mm/built-in.o(.text+0x4e58): Section mismatch: reference to
.init.text: (between 'free_area_init_node' and '__build_all_zonelists')
This happens as a result of:
-> free_area_init_node()
-> free_area_init_core()
-> zone_pcp_init() <-- all __meminit up to this point
-> zone_batchsize() <-- marked as __cpuinit fo
This happens because CONFIG_HOTPLUG_CPU=n sets __cpuinit to __init, but
CONFIG_MEMORY_HOTPLUG=y unsets __meminit.
Changing zone_batchsize() to __devinit fixes this.
__devinit is the only thing that is common between CONFIG_HOTPLUG_CPU=y and
CONFIG_MEMORY_HOTPLUG=y. In the long run, perhaps this should be moved to
another section identifier completely. Without this, memory hot-add
of offline nodes (via hotadd_new_pgdat()) will oops if CPU hotplug is
not also enabled.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
This makes unmap_vm_area static and a wrapper around a new
exported unmap_kernel_range that takes an explicit range instead
of a vm_area struct.
This makes it more versatile for code that wants to play with kernel
page tables outside of the standard vmalloc area.
(One example is some rework of the PowerPC PCI IO space mapping
code that depends on that patch and removes some code duplication
and horrible abuse of forged struct vm_struct).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Instead of returning the smallest available object return ZERO_SIZE_PTR.
A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
is not deferenced. The dereference of ZERO_SIZE_PTR causes a distinctive
fault. kfree can handle a ZERO_SIZE_PTR in the same way as NULL.
This enables functions to use zero sized object. e.g. n = number of objects.
objects = kmalloc(n * sizeof(object));
for (i = 0; i < n; i++)
objects[i].x = y;
kfree(objects);
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cache_free_alien must be called regardless if we use alien caches or not.
cache_free_alien() will do the right thing if there are no alien caches
available.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap reports that a tmpfs, mounted with NUMA mpol= specifying an
offline node, crashes as soon as data is allocated upon it. Now restrict it
to online nodes, where before it restricted to MAX_NUMNODES.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Tested-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This enables simple hotplug support for sparsemem users. Presently
this only permits memory being added in to node 0 on ZONE_NORMAL.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Hotplug callbacks are performed with interrupts enabled. Slub requires
interrupts to be disabled for flushing caches.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zone->present_pages is updated in online_pages(). But, __add_zone() can be
called twice or more before calling online_pages(). So,
init_currenty_empty_zone() can be called unnecessary times. It is cause of
memory leak of zone's wait_table.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On systems with huge amount of physical memory, VFS cache and memory memmap
may eat all available system memory under 4G, then the system may fail to
allocate swiotlb bounce buffer.
There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose
not cover sparsemem model.
This patch add fix to sparsemem model by first try to allocate memmap above
4G.
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need this patch in ASAP. Patch fixes the mysterious hang that remained
on some particular configurations with lockdep on after the first fix that
moved the #idef CONFIG_SLUB_DEBUG to the right location. See
http://marc.info/?t=117963072300001&r=1&w=2
The kmem_cache_node cache is very special because it is needed for NUMA
bootstrap. Under certain conditions (like for example if lockdep is
enabled and significantly increases the size of spinlock_t) the structure
may become exactly the size as one of the larger caches in the kmalloc
array.
That early during bootstrap we cannot perform merging properly. The unique
id for the kmem_cache_node cache will match one of the kmalloc array.
Sysfs will complain about a duplicate directory entry. All of this occurs
while the console is not yet fully operational. Thus boot may appear to be
silently failing.
The kmem_cache_node cache is very special. During early boostrap the main
allocation function is not operational yet and so we have to run our own
small special alloc function during early boot. It is also special in that
it is never freed.
We really do not want any merging on that cache. Set the refcount -1 and
forbid merging of slabs that have a negative refcount.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check for super sized slabs where we can no longer move the free
pointer behind the object for debugging purposes etc is accessing a
field that is not setup yet. We must use objsize here since the size of
the slab has not been determined yet.
The effect of this is that a global slab shrink via "slabinfo -s" will
show errors about offsets being wrong if booted with slub_debug.
Potentially there are other troubles with huge slabs under slub_debug
because the calculated free pointer offset is truncated.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc.c:931: warning: 'setup_nr_node_ids' defined but not used
This is now the only (!) compiler warning I get in my UML build :)
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
#ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
DESTROY_BY_RCU and ctor.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fix:
mm/slab: fix section mismatch warning
mm: fix section mismatch warnings
init/main: use __init_refok to fix section mismatch
kbuild: introduce __init_refok/__initdata_refok to supress section mismatch warnings
all-archs: consolidate .data section definition in asm-generic
all-archs: consolidate .text section definition in asm-generic
kbuild: add "Section mismatch" warning whitelist for powerpc
kbuild: make better section mismatch reports on i386, arm and mips
kbuild: make modpost section warnings clearer
kconfig: search harder for curses library in check-lxdialog.sh
kbuild: include limits.h in sumversion.c for PATH_MAX
powerpc: Fix the MODALIAS generation in modpost for of devices
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
modpost had two cases hardcoded for mm/
Shift over to __init_refok and kill the
hardcoded function names in modpost.
This has the drawback that the functions
will always be kept no matter configuration.
With previous code the function were placed in
init section if configuration allowed it.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Re-introduce rmap verification patches that Hugh removed when he removed
PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
anonymous pages, because PG_locked and PTL together already do.
These checks were important in discovering and fixing a rare rmap corruption
in SLES9.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vunmap doesn't seem to be used outside of mm/vmalloc.c, and has
no prototype in any header so let's make it static
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have a maze of configuration variables that determine the
maximum slab size. Worst of all it seems to vary between SLAB and SLUB.
So define a common maximum size for kmalloc. For conveniences sake we use
the maximum size ever supported which is 32 MB. We limit the maximum size
to a lower limit if MAX_ORDER does not allow such large allocations.
For many architectures this patch will have the effect of adding large
kmalloc sizes. x86_64 adds 5 new kmalloc sizes. So a small amount of
memory will be needed for these caches (contemporary SLAB has dynamically
sizeable node and cpu structure so the waste is less than in the past)
Most architectures will then be able to allocate object with sizes up to
MAX_ORDER. We have had repeated breakage (in fact whenever we doubled the
number of supported processors) on IA64 because one or the other struct
grew beyond what the slab allocators supported. This will avoid future
issues and f.e. avoid fixes for 2k and 4k cpu support.
CONFIG_LARGE_ALLOCS is no longer necessary so drop it.
It fixes sparc64 with SLAB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate functionality into the #ifdef section.
Extract tracing into one subroutine.
Move object debug processing into the #ifdef section so that the
code in __slab_alloc and __slab_free becomes minimal.
Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The atomicity when handling flags in SLUB is not necessary since both flags
used by SLUB are not updated in a racy way. Flag updates are either done
during slab creation or destruction or under slab_lock. Some of these flags
do not have the non atomic variants that we need. So define our own.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub warns on this, and we're working on making kmalloc(0) return NULL.
Let's make slab warn as well so our testers detect such callers more
rapidly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use inline functions to access the per cpu bit. Intoduce the notion of
"freezing" a slab to make things more understandable.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no user of destructors left. There is no reason why we should keep
checking for destructors calls in the slab allocators.
The RFC for this patch was discussed at
http://marc.info/?l=linux-kernel&m=117882364330705&w=2
Destructors were mainly used for list management which required them to take a
spinlock. Taking a spinlock in a destructor is a bit risky since the slab
allocators may run the destructors anytime they decide a slab is no longer
needed.
Patch drops destructor support. Any attempt to use a destructor will BUG().
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SLOB allocator should implement SLAB_DESTROY_BY_RCU correctly, because
even on UP, RCU freeing semantics are not equivalent to simply freeing
immediately. This also allows SLOB to be used on SMP.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We call alloc_page where we should be calling __page_cache_alloc.
__page_cache_alloc performs cpuset memory spreading. alloc_page does not.
There is no reason that pages allocated via find_or_create should be
exempt.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_cache_create() was swapping ctor and dtor in calling find_mergeable():
though it caused no bug, and probably never would, even if destructors are
retained; but fix it so as not to generate anxiety ;)
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up massive code duplication between mpage_writepages() and
generic_writepages().
The new generic function, write_cache_pages() takes a function pointer
argument, which will be called for each page to be written.
Maybe cifs_writepages() too can use this infrastructure, but I'm not
touching that with a ten-foot pole.
The upcoming page writeback support in fuse will also want this.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VM statistics updates do not matter if the kernel is in idle powersaving
mode. So allow the timer to be deferred.
It would be better though if we could switch the timer between deferrable
and nondeferrable based on differentials present. The timer would start
out nondeferrable and if we find that there were no updates in the last
statistics interval then we would switch the timer to deferrable. If the
timer later finds again that there are differentials then go to
nondeferrable again.
And yet another way would be to run the timer shortly before going to idle?
The solution here means that the VM counters may be slightly off during
idle since differentials may be still pending while the timer is deferred.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following bug was uncovered by compiling with '-W' flag:
CC mm/thrash.o
mm/thrash.c: In function âgrab_swap_tokenâ:
mm/thrash.c:52: warning: comparison of unsigned expression < 0 is always false
Variable token_priority is unsigned, so decrementing first and then
checking the result does not work; fixed by reversing the test, patch
attached (compile tested only).
I am not sure if likely() makes much sense in this new situation, but
I'll let somebody else to make a decision on that.
Signed-off-by: Mika Kukkonen <mikukkon@iki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This was in SLUB in order to head off trouble while the nr_cpu_ids
functionality was not merged. Its merged now so no need to still have this.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since it is referenced by memmap_init_zone (which is __meminit) via the
early_pfn_in_nid macro when CONFIG_NODES_SPAN_OTHER_NODES is set (which
basically means PowerPC 64).
This removes a section mismatch warning in those circumstances.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid atomic overhead in slab_alloc and slab_free
SLUB needs to use the slab_lock for the per cpu slabs to synchronize with
potential kfree operations. This patch avoids that need by moving all free
objects onto a lockless_freelist. The regular freelist continues to exist
and will be used to free objects. So while we consume the
lockless_freelist the regular freelist may build up objects.
If we are out of objects on the lockless_freelist then we may check the
regular freelist. If it has objects then we move those over to the
lockless_freelist and do this again. There is a significant savings in
terms of atomic operations that have to be performed.
We can even free directly to the lockless_freelist if we know that we are
running on the same processor. So this speeds up short lived objects.
They may be allocated and freed without taking the slab_lock. This is
particular good for netperf.
In order to maximize the effect of the new faster hotpath we extract the
hottest performance pieces into inlined functions. These are then inlined
into kmem_cache_alloc and kmem_cache_free. So hotpath allocation and
freeing no longer requires a subroutine call within SLUB.
[I am not sure that it is worth doing this because it changes the easy to
read structure of slub just to reduce atomic ops. However, there is
someone out there with a benchmark on 4 way and 8 way processor systems
that seems to show a 5% regression vs. Slab. Seems that the regression is
due to increased atomic operations use vs. SLAB in SLUB). I wonder if
this is applicable or discernable at all in a real workload?]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6fe6900e1e introduced a nasty bug
in read_cache_page_async().
It added a "mark_page_accessed(page)" at the final return path in
read_cache_page_async(). But in error cases, 'page' holds the error
code, and you can't mark it accessed.
[ and Glauber de Oliveira Costa points out that we can use a return
instead of adding more goto's ]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
Fix trivial comment conflict in kernel/relay.c.
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes. This requires SLUB to have
a whole subsystem in order to be compatible with SLAB. Moving node draining
out of the slab allocators avoids a section of code in SLUB.
Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.
Add a expire counter there. If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.
The expire counter will be decremented on each vm stats update pass until it
reaches zero. Then we will drain one batch from the pageset. The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty. So we will drain a batch every 3 seconds.
Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues. The number of pcp is determined by the
number of processors and nodes in a system. A system with 4 processors and 2
nodes has 8 pcps which is okay. But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it configurable. Code in mm makes the vm statistics intervals
independent from the cache reaper use that opportunity to make it
configurable.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmstat is currently using the cache reaper to periodically bring the
statistics up to date. The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB. This patch removes the vmstat calls from the
slab allocators and provides its own handling.
The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick. This will lead to some overlap in large
systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.
However, the vm stats update only accesses per node information. It is only
necessary to stagger the vm statistics updates per processor in each node. Vm
counter updates occurring on distant nodes will not cause cacheline
contention.
We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick. That may be useful to keep a large amount of the second free
of timer activity. Maybe the timer folks will have some feedback on this one?
[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset(). There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones. Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call. The diffstat below shows the entire
patchset.
[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Nate Diller <nate.diller@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shutdown the cache_reaper if the cpu is brought down and set the
cache_reap.func to NULL. Otherwise hotplug shuts down the reaper for good.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export a couple of core functions for AFS write support to use:
find_get_pages_contig()
find_get_pages_tag()
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cpuset is configured, it breaks the strict hugetlb page reservation as
the accounting is done on a global variable. Such reservation is
completely rubbish in the presence of cpuset because the reservation is not
checked against page availability for the current cpuset. Application can
still potentially OOM'ed by kernel with lack of free htlb page in cpuset
that the task is in. Attempt to enforce strict accounting with cpuset is
almost impossible (or too ugly) because cpuset is too fluid that task or
memory node can be dynamically moved between cpusets.
The change of semantics for shared hugetlb mapping with cpuset is
undesirable. However, in order to preserve some of the semantics, we fall
back to check against current free page availability as a best attempt and
hopefully to minimize the impact of changing semantics that cpuset has on
hugetlb.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The internal hugetlb resv_huge_pages variable can permanently leak nonzero
value in the error path of hugetlb page fault handler when hugetlb page is
used in combination of cpuset. The leaked count can permanently trap N
number of hugetlb pages in unusable "reserved" state.
Steps to reproduce the bug:
(1) create two cpuset, user1 and user2
(2) reserve 50 htlb pages in cpuset user1
(3) attempt to shmget/shmat 50 htlb page inside cpuset user2
(4) kernel oom the user process in step 3
(5) ipcrm the shm segment
At this point resv_huge_pages will have a count of 49, even though
there are no active hugetlbfs file nor hugetlb shared memory segment
in the system. The leak is permanent and there is no recovery method
other than system reboot. The leaked count will hold up all future use
of that many htlb pages in all cpusets.
The culprit is that the error path of alloc_huge_page() did not
properly undo the change it made to resv_huge_page, causing
inconsistent state.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Martin Bligh <mbligh@google.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No "blank" (or "*") line is allowed between the function name and lines for
it parameter(s).
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases SLUB is creating uselessly slabs that are larger than
slub_max_order. Also the layout of some of the slabs was not satisfactory.
Go to an iterarive approach.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components
of SLUB. Thus SLUB will be able to replace SLOB. SLUB can arrange objects in
a denser way than SLOB and the code size should be minimal without debugging
and sysfs support.
Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG.
CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB. SLUB enables
debugging via a boot parameter. SLUB debug code should always be present.
CONFIG_SLUB_DEBUG can be modified in the embedded config section.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the tracking definitions and the check_valid_pointer() function away from
the debugging related functions.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trace in both slab_alloc and slab_free has a lot of common code. Use a single
function for both.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This replaces the PageError() checking. DebugSlab is clearer and allows for
future changes to the page bit used. We also need it to support
CONFIG_SLUB_DEBUG.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the resiliency check into the SYSFS section after validate_slab that is
used by the resiliency check. This will avoid a forward declaration.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scanning of objects happens in a number of functions. Consolidate that code.
DECLARE_BITMAP instead of coding the declaration for bitmaps.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update comments throughout SLUB to reflect the new developments. Fix up
various awkward sentences.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Its only purpose was to bring some sort of symmetry to sysfs usage when
dealing with bootstrapping per cpu flushing. Since we do not time out slabs
anymore we have no need to run finish_bootstrap even without sysfs. Fold it
back into slab_sysfs_init and drop the initcall for the !SYFS case.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We really do not need all this gaga there.
ksize gives us all the information we need to figure out if the object can
cope with the new size.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We needlessly duplicate code. Also make check_valid_pointer inline.
Signed-off-by: Christoph LAemter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If no redzoning is selected then we do not need padding before the next
object.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLUB currently assumes that the cacheline size is static. However, i386 f.e.
supports dynamic cache line size determination.
Use cache_line_size() instead of L1_CACHE_BYTES in the allocator.
That also explains the purpose of SLAB_HWCACHE_ALIGN. So we will need to keep
that one around to allow dynamic aligning of objects depending on boot
determination of the cache line size.
[akpm@linux-foundation.org: need to define it before we use it]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves SH over to the generic quicklists. As per x86_64,
we have special mappings for the PGDs, so these go on their
own list..
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
This implements deferred IO support in fbdev. Deferred IO is a way to delay
and repurpose IO. This implementation is done using mm's page_mkwrite and
page_mkclean hooks in order to detect, delay and then rewrite IO. This
functionality is used by hecubafb.
[adaplas]
This is useful for graphics hardware with no directly addressable/mappable
framebuffer. Implementing this will allow the "framebuffer" to be accesible
from user space via mmap().
Signed-off-by: Jaya Kumar <jayakumar.lkml@gmail.com>
Signed-off-by: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Same story as with cat /proc/*/wchan race vs rmmod race, only
/proc/slab_allocators want more info than just symbol name.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the die notifier handling to common code. Previous
various architectures had exactly the same code for it. Note that the new
code is compiled unconditionally, this should be understood as an appel to
the other architecture maintainer to implement support for it aswell (aka
sprinkling a notify_die or two in the proper place)
arm had a notifiy_die that did something totally different, I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used at. avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup: setting an outstanding error on a mapping was open coded too many
times. Factor it out in mapping_set_error().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is add white list into modpost.c for some functions and
ia64's section to fix section mismatchs.
sparse_index_alloc() and zone_wait_table_init() calls bootmem allocator
at boot time, and kmalloc/vmalloc at hotplug time. If config
memory hotplug is on, there are references of bootmem allocater(init text)
from them (normal text). This is cause of section mismatch.
Bootmem is called by many functions and it must be
used only at boot time. I think __init of them should keep for
section mismatch check. So, I would like to register sparse_index_alloc()
and zone_wait_table_init() into white list.
In addition, ia64's .machvec section is function table of some platform
dependent code. It is mixture of .init.text and normal text. These
reference of __init functions are valid too.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is to fix many section mismatches of code related to memory hotplug.
I checked compile with memory hotplug on/off on ia64 and x86-64 box.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two problems with the existing redzone implementation.
Firstly, it's causing misalignment of structures which contain a 64-bit
integer, such as netfilter's 'struct ipt_entry' -- causing netfilter
modules to fail to load because of the misalignment. (In particular, the
first check in
net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks())
On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8.
With slab debugging, we use 32-bit redzones. And allocated slab objects
aren't sufficiently aligned to hold a structure containing a uint64_t.
By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable
redzone checks on those architectures. By using 64-bit redzones we avoid that
loss of debugging, and also fix the other problem while we're at it.
When investigating this, I noticed that on 64-bit platforms we're using a
32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set
aside for the redzone. Which means that the four bytes immediately before
or after the allocated object at 0x00,0x00,0x00,0x00 for LE and BE
machines, respectively. Which is probably not the most useful choice of
poison value.
One way to fix both of those at once is just to switch to 64-bit
redzones in all cases.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The newly merged SLUB allocator patches had been generated before the
removal of "struct subsystem", and ended up applying fine, but wouldn't
build based on the current tree as a result.
Fix up that merge error - not that SLUB is likely really ready for
showtime yet, but at least I can fix the trivial stuff.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[SERIAL] sunsu: Fix section mismatch warnings.
[SPARC64]: pgtable_cache_init() should be __init.
[SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/prom.c
[SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/pci.c
[SPARC64]: Fix section mismatch warnings in arch/sparc64/kernel/console.c
[MM]: sparse_init() should be __init.
[SPARC64]: Update defconfig.
[VIDEO]: Add Sun XVR-2500 framebuffer driver.
[VIDEO]: Add Sun XVR-500 framebuffer driver.
[SPARC64]: SUN4U PCI-E controller support.
[SPARC]: Fix comment typo in smp4m_blackbox_current().
[SCSI] SUNESP: sun_esp.c needs linux/delay.h
Fix up conflict in arch/sparc64/mm/init.c manually due to removal of
pgtable_cache_init() through the -mm patches (even though that patch was
also by David ;)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we can miss freeze_process()->signal_wake_up() in kswapd() if it
happens between try_to_freeze() and prepare_to_wait(). To prevent this
from happening we should check freezing(current) before calling schedule().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc. with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLOB doesn't calculate correct page order when page size is not 4KB. This
patch fixes it with using get_order() instead of find_order() which is SLOB
version of get_order().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no user remaining and I have never seen any use of that flag.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB_CTOR atomic is never used which is no surprise since I cannot imagine
that one would want to do something serious in a constructor or destructor.
In particular given that the slab allocators run with interrupts disabled.
Actions in constructors and destructors are by their nature very limited
and usually do not go beyond initializing variables and list operations.
(The i386 pgd ctor and dtors do take a spinlock in constructor and
destructor..... I think that is the furthest we go at this point.)
There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
establishes a certain symmetry.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
SLAB.
I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again? The callback is
performed before each freeing of an object.
I would think that it is much easier to check the object state manually
before the free. That also places the check near the code object
manipulation of the object.
Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on. If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code. But there is no such code
in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e. add debug code before kfree).
There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches. Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.
This is the last slab flag that SLUB did not support. Remove the check for
unimplemented flags from SLUB.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the hugetlbfs specific hacks in toplevel get_unmapped_area() now that
all archs and hugetlbfs itself do the right thing for both cases.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic arch_get_unmapped_area() now handles MAP_FIXED. Now that all
implementations have been fixed, change the toplevel get_unmapped_area() to
call into arch or drivers for the MAP_FIXED case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: William Irwin <bill.irwin@oracle.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes a deadlock in the OOM killer for allocations that are not
__GFP_HARDWALL.
Before the OOM killer checks for the allocation constraint, it takes
callback_mutex.
constrained_alloc() iterates through each zone in the allocation zonelist
and calls cpuset_zone_allowed_softwall() to determine whether an allocation
for gfp_mask is possible. If a zone's node is not in the OOM-triggering
task's mems_allowed, it is not exiting, and we did not fail on a
__GFP_HARDWALL allocation, cpuset_zone_allowed_softwall() attempts to take
callback_mutex to check the nearest exclusive ancestor of current's cpuset.
This results in deadlock.
We now take callback_mutex after iterating through the zonelist since we
don't need it yet.
Cc: Andi Kleen <ak@suse.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin J. Bligh <mbligh@mbligh.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current panic_on_oom may not work if there is a process using
cpusets/mempolicy, because other nodes' memory may remain. But some people
want failover by panic ASAP even if they are used. This patch makes new
setting for its request.
This is tested on my ia64 box which has 3 nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ethan Solomita <solo@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently failslab injects failures into ____cache_alloc(). But with enabling
CONFIG_NUMA it's not enough to let actual slab allocator functions (kmalloc,
kmem_cache_alloc, ...) return NULL.
This patch moves fault injection hook inside of __cache_alloc() and
__cache_alloc_node(). These are lower call path than ____cache_alloc() and
enable to inject faulures to slab allocators with CONFIG_NUMA.
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch was recently posted to lkml and acked by Pekka.
The flag SLAB_MUST_HWCACHE_ALIGN is
1. Never checked by SLAB at all.
2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB
3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.
The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.
The flag is confusing, inconsistent and has no purpose.
Remove it.
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Avoid down_write of the mmap_sem in madvise when we can help it.
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On x86_64 this cuts allocation overhead for page table pages down to a
fraction (kernel compile / editing load. TSC based measurement of times spend
in each function):
no quicklist
pte_alloc 1569048 4.3s(401ns/2.7us/179.7us)
pmd_alloc 780988 2.1s(337ns/2.7us/86.1us)
pud_alloc 780072 2.2s(424ns/2.8us/300.6us)
pgd_alloc 260022 1s(920ns/4us/263.1us)
quicklist:
pte_alloc 452436 573.4ms(8ns/1.3us/121.1us)
pmd_alloc 196204 174.5ms(7ns/889ns/46.1us)
pud_alloc 195688 172.4ms(7ns/881ns/151.3us)
pgd_alloc 65228 9.8ms(8ns/150ns/6.1us)
pgd allocations are the most complex and there we see the most dramatic
improvement (may be we can cut down the amount of pgds cached somewhat?). But
even the pte allocations still see a doubling of performance.
1. Proven code from the IA64 arch.
The method used here has been fine tuned for years and
is NUMA aware. It is based on the knowledge that accesses
to page table pages are sparse in nature. Taking a page
off the freelists instead of allocating a zeroed pages
allows a reduction of number of cachelines touched
in addition to getting rid of the slab overhead. So
performance improves. This is particularly useful if pgds
contain standard mappings. We can save on the teardown
and setup of such a page if we have some on the quicklists.
This includes avoiding lists operations that are otherwise
necessary on alloc and free to track pgds.
2. Light weight alternative to use slab to manage page size pages
Slab overhead is significant and even page allocator use
is pretty heavy weight. The use of a per cpu quicklist
means that we touch only two cachelines for an allocation.
There is no need to access the page_struct (unless arch code
needs to fiddle around with it). So the fast past just
means bringing in one cacheline at the beginning of the
page. That same cacheline may then be used to store the
page table entry. Or a second cacheline may be used
if the page table entry is not in the first cacheline of
the page. The current code will zero the page which means
touching 32 cachelines (assuming 128 byte). We get down
from 32 to 2 cachelines in the fast path.
3. x86_64 gets lightweight page table page management.
This will allow x86_64 arch code to faster repopulate pgds
and other page table entries. The list operations for pgds
are reduced in the same way as for i386 to the point where
a pgd is allocated from the page allocator and when it is
freed back to the page allocator. A pgd can pass through
the quicklists without having to be reinitialized.
64 Consolidation of code from multiple arches
So far arches have their own implementation of quicklist
management. This patch moves that feature into the core allowing
an easier maintenance and consistent management of quicklists.
Page table pages have the characteristics that they are typically zero or in a
known state when they are freed. This is usually the exactly same state as
needed after allocation. So it makes sense to build a list of freed page
table pages and then consume the pages already in use first. Those pages have
already been initialized correctly (thus no need to zero them) and are likely
already cached in such a way that the MMU can use them most effectively. Page
table pages are used in a sparse way so zeroing them on allocation is not too
useful.
Such an implementation already exits for ia64. Howver, that implementation
did not support constructors and destructors as needed by i386 / x86_64. It
also only supported a single quicklist. The implementation here has
constructor and destructor support as well as the ability for an arch to
specify how many quicklists are needed.
Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more than one
quicklist is necessary then we can define NR_QUICK for additional lists. F.e.
i386 needs two and thus has
config NR_QUICK
int
default 2
If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:
quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
Page table pages can be freed using:
quicklist_free(<quicklist-nr>, <destructor>, <page>)
Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.
If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.
Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure that the check function really only check things and do not perform
activities. Extract the tracing and object seeding out of the two check
functions and place them into slab_alloc and slab_free
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At kmem_cache_shrink check if we have any empty slabs on the partial
if so then remove them.
Also--as an anti-fragmentation measure--sort the partial slabs so that
the most fully allocated ones come first and the least allocated last.
The next allocations may fill up the nearly full slabs. Having the
least allocated slabs last gives them the maximum chance that their
remaining objects may be freed. Thus we can hopefully minimize the
partial slabs.
I think this is the best one can do in terms antifragmentation
measures. Real defragmentation (meaning moving objects out of slabs with
the least free objects to those that are almost full) can be implemted
by reverse scanning through the list produced here but that would mean
that we need to provide a callback at slab cache creation that allows
the deletion or moving of an object. This will involve slab API
changes, so defer for now.
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables listing the callers who allocated or freed objects in a
cache.
For example to list the allocators for kmalloc-128 do
cat /sys/slab/kmalloc-128/alloc_calls
7 sn_io_slot_fixup+0x40/0x700
7 sn_io_slot_fixup+0x80/0x700
9 sn_bus_fixup+0xe0/0x380
6 param_sysfs_setup+0xf0/0x280
276 percpu_populate+0xf0/0x1a0
19 __register_chrdev_region+0x30/0x360
8 expand_files+0x2e0/0x6e0
1 sys_epoll_create+0x60/0x200
1 __mounts_open+0x140/0x2c0
65 kmem_alloc+0x110/0x280
3 alloc_disk_node+0xe0/0x200
33 as_get_io_context+0x90/0x280
74 kobject_kset_add_dir+0x40/0x140
12 pci_create_bus+0x2a0/0x5c0
1 acpi_ev_create_gpe_block+0x120/0x9e0
41 con_insert_unipair+0x100/0x1c0
1 uart_open+0x1c0/0xba0
1 dma_pool_create+0xe0/0x340
2 neigh_table_init_no_netlink+0x260/0x4c0
6 neigh_parms_alloc+0x30/0x200
1 netlink_kernel_create+0x130/0x320
5 fz_hash_alloc+0x50/0xe0
2 sn_common_hubdev_init+0xd0/0x6e0
28 kernel_param_sysfs_setup+0x30/0x180
72 process_zones+0x70/0x2e0
cat /sys/slab/kmalloc-128/free_calls
558 <not-available>
3 sn_io_slot_fixup+0x600/0x700
84 free_fdtable_rcu+0x120/0x260
2 seq_release+0x40/0x60
6 kmem_free+0x70/0xc0
24 free_as_io_context+0x20/0x200
1 acpi_get_object_info+0x3a0/0x3e0
1 acpi_add_single_object+0xcf0/0x1e40
2 con_release_unimap+0x80/0x140
1 free+0x20/0x40
SLAB_STORE_USER must be enabled for a slab cache by either booting with
"slab_debug" or enabling user tracking specifically for the slab of interest.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We leave a mininum of partial slabs on nodes when we search for
partial slabs on other node. Define a constant for that value.
Then modify slub to keep MIN_PARTIAL slabs around.
This avoids bad situations where a function frees the last object
in a slab (which results in the page being returned to the page
allocator) only to then allocate one again (which requires getting
a page back from the page allocator if the partial list was empty).
Keeping a couple of slabs on the partial list reduces overhead.
Empty slabs are added to the end of the partial list to insure that
partially allocated slabs are consumed first (defragmentation).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This enables validation of slab. Validation means that all objects are
checked to see if there are redzone violations, if padding has been
overwritten or any pointers have been corrupted. Also checks the consistency
of slab counters.
Validation enables the detection of metadata corruption without the kernel
having to execute code that actually uses (allocs/frees) and object. It
allows one to make sure that the slab metainformation and the guard values
around an object have not been compromised.
A single slabcache can be checked by writing a 1 to the "validate" file.
i.e.
echo 1 >/sys/slab/kmalloc-128/validate
or use the slabinfo tool to check all slabs
slabinfo -v
Error messages will show up in the syslog.
Note that validation can only reach slabs that are on a list. This means that
we are usually restricted to partial slabs and active slabs unless
SLAB_STORE_USER is active which will build a full slab list and allows
validation of slabs that are fully in use. Booting with "slub_debug" set will
enable SLAB_STORE_USER and then full diagnostic are available.
Note that we attempt to push cpu slabs back to the lists when we start the
check. If the cpu slab is reactivated before we get to it (another processor
grabs it before we get to it) then it cannot be checked.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If slab tracking is on then build a list of full slabs so that we can verify
the integrity of all slabs and are also able to built list of alloc/free
callers.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Object tracking did not work the right way for several call chains. Fix this up
by adding a new parameter to slub_alloc and slub_free that specifies the
caller address explicitly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch adds PageTail(page) and PageHead(page) to check if a page is the
head or the tail of a compound page. This is done by masking the two bits
describing the state of a compound page and then comparing them. So one
comparision and a branch instead of two bit checks and two branches.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.
Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages. Right now we have to be careful f.e.
if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field. This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.
Having page->private available for SLUB would allow more meta information in
the page struct. I can probably avoid the 16 bit ints that I have in there
right now.
Also if page->private is available then a compound page may be equipped with
buffer heads. This may free up the way for filesystems to support larger
blocks than page size.
We add PageTail as an alias of PageReclaim. Compound pages cannot currently
be reclaimed. Because of the alias one needs to check PageCompound first.
The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2
[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Makes SLUB behave like SLAB in this area to avoid issues....
Throw a stack dump to alert people.
At some point the behavior should be switched back. NULL is no memory as
far as I can tell and if the use asked for 0 bytes then he need to get no
memory.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Structures may contain u64 items on 32 bit platforms that are only able to
address 64 bit items on 64 bit boundaries. Change the mininum alignment of
slabs to conform to those expectations.
ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structure
are mixed in the general slabs.
ARCH_SLAB_MINALIGN is changed because currently there is no consistent
specification of object alignment. We may have that in the future when the
KMEM_CACHE and related macros are used to generate slabs. These pass the
alignment of the structure generated by the compiler to the slab.
With KMEM_CACHE etc we could align structures that do not contain 64
bit values to 32 bit boundaries potentially saving some memory.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.
A. Management of object queues
A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
each allocating CPU and use objects from a slab directly instead of
queueing them up.
B. Storage overhead of object queues
SLAB Object queues exist per node, per CPU. The alien cache queue even
has a queue array that contain a queue for each processor on each
node. For very large systems the number of queues and the number of
objects that may be caught in those queues grows exponentially. On our
systems with 1k nodes / processors we have several gigabytes just tied up
for storing references to objects for those queues This does not include
the objects that could be on those queues. One fears that the whole
memory of the machine could one day be consumed by those queues.
C. SLAB meta data overhead
SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block. SLUB keeps
all meta data in the corresponding page_struct. Objects can be naturally
aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
boundaries and can fit tightly into a 4k page with no bytes left over.
SLAB cannot do this.
D. SLAB has a complex cache reaper
SLUB does not need a cache reaper for UP systems. On SMP systems
the per CPU slab may be pushed back into partial list but that
operation is simple and does not require an iteration over a list
of objects. SLAB expires per CPU, shared and alien object queues
during cache reaping which may cause strange hold offs.
E. SLAB has complex NUMA policy layer support
SLUB pushes NUMA policy handling into the page allocator. This means that
allocation is coarser (SLUB does interleave on a page level) but that
situation was also present before 2.6.13. SLABs application of
policies to individual slab objects allocated in SLAB is
certainly a performance concern due to the frequent references to
memory policies which may lead a sequence of objects to come from
one node after another. SLUB will get a slab full of objects
from one node and then will switch to the next.
F. Reduction of the size of partial slab lists
SLAB has per node partial lists. This means that over time a large
number of partial slabs may accumulate on those lists. These can
only be reused if allocator occur on specific nodes. SLUB has a global
pool of partial slabs and will consume slabs from that pool to
decrease fragmentation.
G. Tunables
SLAB has sophisticated tuning abilities for each slab cache. One can
manipulate the queue sizes in detail. However, filling the queues still
requires the uses of the spin lock to check out slabs. SLUB has a global
parameter (min_slab_order) for tuning. Increasing the minimum slab
order can decrease the locking overhead. The bigger the slab order the
less motions of pages between per CPU and partial lists occur and the
better SLUB will be scaling.
G. Slab merging
We often have slab caches with similar parameters. SLUB detects those
on boot up and merges them into the corresponding general caches. This
leads to more effective memory use. About 50% of all caches can
be eliminated through slab merging. This will also decrease
slab fragmentation because partial allocated slabs can be filled
up again. Slab merging can be switched off by specifying
slub_nomerge on boot up.
Note that merging can expose heretofore unknown bugs in the kernel
because corrupted objects may now be placed differently and corrupt
differing neighboring objects. Enable sanity checks to find those.
H. Diagnostics
The current slab diagnostics are difficult to use and require a
recompilation of the kernel. SLUB contains debugging code that
is always available (but is kept out of the hot code paths).
SLUB diagnostics can be enabled via the "slab_debug" option.
Parameters can be specified to select a single or a group of
slab caches for diagnostics. This means that the system is running
with the usual performance and it is much more likely that
race conditions can be reproduced.
I. Resiliency
If basic sanity checks are on then SLUB is capable of detecting
common error conditions and recover as best as possible to allow the
system to continue.
J. Tracing
Tracing can be enabled via the slab_debug=T,<slabcache> option
during boot. SLUB will then protocol all actions on that slabcache
and dump the object contents on free.
K. On demand DMA cache creation.
Generally DMA caches are not needed. If a kmalloc is used with
__GFP_DMA then just create this single slabcache that is needed.
For systems that have no ZONE_DMA requirement the support is
completely eliminated.
L. Performance increase
Some benchmarks have shown speed improvements on kernbench in the
range of 5-10%. The locking overhead of slub is based on the
underlying base allocation size. If we can reliably allocate
larger order pages then it is possible to increase slub
performance much further. The anti-fragmentation patches may
enable further performance increases.
Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
SLUB Boot options
slub_nomerge Disable merging of slabs
slub_min_order=x Require a minimum order for slab caches. This
increases the managed chunk size and therefore
reduces meta data and locking overhead.
slub_min_objects=x Mininum objects per slab. Default is 8.
slub_max_order=x Avoid generating slabs larger than order specified.
slub_debug Enable all diagnostics for all caches
slub_debug=<options> Enable selective options for all caches
slub_debug=<o>,<cache> Enable selective options for a certain set of
caches
Available Debug options
F Double Free checking, sanity and resiliency
R Red zoning
P Object / padding poisoning
U Track last free / alloc
T Trace all allocs / frees (only use for individual slabs).
To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.
[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is only ever used prior to free_initmem().
(It will cause a warning when we run the section checking, but that's a
false-positive and it simply changes the source of an existing warning, which
is also a false-positive)
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
read or write. This function iterates through every zone and calls
spin_lock_irqsave() on the zone LRU lock. When reading min_free_kbytes,
this is a total waste of time that disables interrupts on the local
processor. It might even be noticable machines with large numbers of zones
if a process started constantly reading min_free_kbytes.
This patch only calls setup_per_zone_pages_min() only on write. Tested on
an x86 laptop and it did the right thing.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
possible nodes. This patch dynamically sizes the 'struct kmem_cache' to
allocate only needed space.
I moved nodelists[] field at the end of struct kmem_cache, and use the
following computation in kmem_cache_init()
cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
nr_node_ids * sizeof(struct kmem_list3 *);
On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
(This is because on x86_64, MAX_NUMNODES is 64)
On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can avoid allocating empty shared caches and avoid unecessary check of
cache->limit. We save some memory. We avoid bringing into CPU cache
unecessary cache lines.
All accesses to l3->shared are already checking NULL pointers so this patch is
safe.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing comment in mm/slab.c is *perfect*, so I reproduce it :
/*
* CPU bound tasks (e.g. network routing) can exhibit cpu bound
* allocation behaviour: Most allocs on one cpu, most free operations
* on another cpu. For these cases, an efficient object passing between
* cpus is necessary. This is provided by a shared array. The array
* replaces Bonwick's magazine layer.
* On uniprocessor, it's functionally equivalent (but less efficient)
* to a larger limit. Thus disabled by default.
*/
As most shiped linux kernels are now compiled with CONFIG_SMP, there is no way
a preprocessor #if can detect if the machine is UP or SMP. Better to use
num_possible_cpus().
This means on UP we allocate a 'size=0 shared array', to be more efficient.
Another patch can later avoid the allocations of 'empty shared arrays', to
save some memory.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
prev_offset. Also update of prev_index in do_generic_mapping_read() is now
moved close to the update of prev_offset.
[wfg@mail.ustc.edu.cn: fix it]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce ra.offset and store in it an offset where the previous read
ended. This way we can detect whether reads are really sequential (and
thus we should not mark the page as accessed repeatedly) or whether they
are random and just happen to be in the same page (and the page should
really be marked accessed again).
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a macro for suppressing gcc from generating a warning about a
probable uninitialized state of a variable.
Example:
- spinlock_t *ptl;
+ spinlock_t *uninitialized_var(ptl);
Not a happy solution, but those warnings are obnoxious.
- Using the usual pointlessly-set-it-to-zero approach wastes several
bytes of text.
- Using a macro means we can (hopefully) do something else if gcc changes
cause the `x = x' hack to stop working
- Using a macro means that people who are worried about hiding true bugs
can easily turn it off.
Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Identical block is duplicated twice: contrary to the comment, we have been
re-reading the page *twice* in filemap_nopage rather than once.
If any retry logic or anything is needed, it belongs in lower levels anyway.
Only retry once. Linus agrees.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generally we work under the assumption that memory the mem_map array is
contigious and valid out to MAX_ORDER_NR_PAGES block of pages, ie. that if we
have validated any page within this MAX_ORDER_NR_PAGES block we need not check
any other. This is not true when CONFIG_HOLES_IN_ZONE is set and we must
check each and every reference we make from a pfn.
Add a pfn_valid_within() helper which should be used when scanning pages
within a MAX_ORDER_NR_PAGES block when we have already checked the validility
of the block normally with pfn_valid(). This can then be optimised away when
we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the badness of a process is zero then oom_adj>0 has no effect. This
patch makes sure that the oom_adj shift actually increases badness points
appropriately.
Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com>
Cc: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.
I didn't have a great look down the call chains, but this appears to fixes 7
possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in
ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in
block2mtd. All depending on whether the filler is async and/or can return
with a !uptodate page.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
loop as detailed by Michael Richardson in the following post:
<http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
those cases.
Cc: Michael Richardson <mcr@sandelman.ca>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minimum gcc version is 3.2 now. However, with likely profiling, even
modern gcc versions cannot always eliminate the call.
Replace the placeholder functions with the more conventional empty static
inlines, which should be optimal for everyone.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can use the global ZVC counters to establish the exact size of the LRU
and the free pages. This allows a more accurate determination of the dirty
ratio.
This patch will fix the broken ratio calculations if large amounts of
memory are allocated to huge pags or other consumers that do not put the
pages on to the LRU.
Notes:
- I did not add NR_SLAB_RECLAIMABLE to the calculation of the
dirtyable pages. Those may be reclaimable but they are at this
point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered
then a huge number of reclaimable pages would stop writeback
from occurring.
- This patch used to be in mm as the last one in a series of patches.
It was removed when Linus updated the treatment of highmem because
there was a conflict. I updated the patch to follow Linus' approach.
This patch is neede to fulfill the claims made in the beginning of the
patchset that is now in Linus' tree.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nr_cpu_ids value is currently only calculated in smp_init. However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init. So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.
Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take. If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated. If it is set to the maximum possible then we only waste
some memory for early boot users.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new mm function apply_to_page_range() which applies a given function to
every pte in a given virtual address range in a given mm structure. This is a
generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
code in every place that a sequence of PTEs must be accessed.
Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen subsystems, for
example: to ensure that pagetables have been allocated for a virtual address
range, and to construct batched special pagetable update requests to map I/O
memory (in ioremap()).
[akpm@linux-foundation.org: fix warning, unpleasantly]
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@waste.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This introduce krealloc() that reallocates memory while keeping the contents
unchanged. The allocator avoids reallocation if the new size fits the
currently used cache. I also added a simple non-optimized version for
mm/slob.c for compatibility.
[akpm@linux-foundation.org: fix warnings]
Acked-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Set use_alien_caches to 0 on non NUMA platforms. And avoid calling the
cache_free_alien() when use_alien_caches is not set. This will avoid the
cache miss that happens while dereferencing slabp to get nodeid.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space. These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.
Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.
This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings. Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.
[ Zach - vmi.c will need some attention after this patch. It wasn't
immediately obvious to me what needs to be done. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Add hooks to allow a paravirt implementation to track the lifetime of
an mm. Paravirtualization requires three hooks, but only two are
needed in common code. They are:
arch_dup_mmap, which is called when a new mmap is created at fork
arch_exit_mmap, which is called when the last process reference to an
mm is dropped, which typically happens on exit and exec.
The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures. It's called when an mm is first used.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Ugly ifdef, but should handle all 64bit platforms that have suitable
zones. On some like Altix it's probably impossible without IOMMU
use to get memory <4GB this way, but they have to live with that.
Signed-off-by: Andi Kleen <ak@suse.de>
Do this really early in the 2.6.22-rc series, so that we'll get
feedback. And don't change by half measures. Just cut the default
dirty limit to a quarter of what it was, and see if anybody even
notices.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page_test_and_clear_dirty primitive really consists of two
operations, page_test_dirty and the page_clear_dirty. The combination
of the two is not an atomic operation, so it makes more sense to have
two separate operations instead of one.
In addition to the improved readability of the s390 version of
SetPageUptodate, it now avoids the page_test_dirty operation which is
an insert-storage-key-extended (iske) instruction which is an expensive
operation.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
NR_FILE_PAGES must be accounted for depending on the zone that the page
belongs to. If we replace the page in the radix tree then we may have to
shift the count to another zone.
Suggested-by: Ethan Solomita <solo@google.com>
Eventually-typed-in-by: Christoph Lameter <clameter@sgi.com>
Cc: Martin Bligh <mbligh@mbligh.org>
Cc: <stable@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog
to see lots of other processes killed with "No available memory
(MPOL_BIND)". memhog is killed correctly once we initialize nodemask in
constrained_alloc().
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_kill_task() calls __oom_kill_task() to OOM kill a selected task.
When finding other threads that share an mm with that task, we need to
kill those individual threads and not the same one.
(Bug introduced by f2a2a7108a)
Acked-by: William Irwin <bill.irwin@oracle.com>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
num_physpages is not exported out in mm/nommu.c, so the ip_conntrack module
link will fail.
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6:
[S390] cio: Fix handling of interrupt for csch().
[S390] page_mkclean data corruption.
Mention the slab name when listing corrupt objects. Although the function
that released the memory is mentioned, that is frequently ambiguous as such
functions often release several pieces of memory.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The git commit c2fda5fed8 which
added the page_test_and_clear_dirty call to page_mkclean and the
git commit 7658cc2892 which fixes
the "nasty and subtle race in shared mmap'ed page writeback"
problem in clear_page_dirty_for_io cause data corruption on s390.
The effect of the two changes is that for every call to
clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
the per page dirty bit is set set_page_dirty is called. Strangly
clear_page_dirty_for_io is called for not-uptodate pages, e.g.
over this call-chain:
[<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
[<000000000007c494>] generic_writepages+0x258/0x3e0
[<000000000007c692>] do_writepages+0x76/0x7c
[<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
[<00000000000c831a>] sync_sb_inodes+0x23e/0x398
[<00000000000c8802>] writeback_inodes+0x12e/0x140
[<000000000007b9ee>] wb_kupdate+0xd2/0x178
[<000000000007cca2>] pdflush+0x162/0x23c
The bad news now is that page_test_and_clear_dirty might claim
that a not-uptodate page is dirty since SetPageUptodate which
resets the per page dirty bit has not yet been called. The page
writeback that follows clobbers the data on disk.
The simplest solution to this problem is to move the call to
page_test_and_clear_dirty under the "if (page_mapped(page))".
If a file backed page is mapped it is uptodate.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix the bug, that reading into xip mapping from /dev/zero fills the user
page table with ZERO_PAGE() entries. Later on, xip cannot tell which pages
have been ZERO_PAGE() filled by access to a sparse mapping, and which ones
origin from /dev/zero. It will unmap ZERO_PAGE from all mappings when
filling the sparse hole with data. xip does now use its own zeroed page
for its sparse mappings. Please apply.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_madvise has down_write of mmap_sem, then madvise_remove calls
vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise
deadlocks from that ordering.
madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since
madvise_remove doesn't split or merge vmas, it's easy to handle this case with
a NULL prev, without restructuring sys_madvise. (Though sad to retake
mmap_sem when it's unlikely to be needed, and certainly down_read is
sufficient for MADV_REMOVE, unlike the other madvices.)
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem_truncate_range has its own truncate_inode_pages_range, to free any pages
racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when
this might have happened. But holepunching gets no chance to clear that flag
at the start of vmtruncate_range, so it's always set (unless a truncate came
just before), so holepunch almost always does this second
truncate_inode_pages_range.
shmem holepunch has unlikely swap<->file races hereabouts whatever we do
(without a fuller rework than is fit for this release): I was going to skip
the second truncate in the punch_hole case, but Miklos points out that would
make holepunch correctness more vulnerable to swapoff. So keep the second
truncate, but follow it by an unmap_mapping_range to eliminate the
disconnected pages (freed from pagecache while still mapped in userspace) that
it might have left behind.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi observes that during truncation of shmem page directories,
info->lock is released to improve latency (after lowering i_size and
next_index to exclude races); but this is quite wrong for holepunching, which
receives no such protection from i_size or next_index, and is left vulnerable
to races with shmem_unuse, shmem_getpage and shmem_writepage.
Hold info->lock throughout when holepunching? No, any user could prevent
rescheduling for far too long. Instead take info->lock just when needed: in
shmem_free_swp when removing the swap entries, and whenever removing a
directory page from the level above. But so long as we remove before
scanning, we can safely skip taking the lock at the lower levels, except at
misaligned start and end of the hole.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare
circumstances, because shmem_truncate_range() erroneously removes partially
truncated directory pages at the end of the range: later reclaim on pages
pointing to these removed directories triggers the BUG. Indeed, and it can
also cause data loss beyond the hole.
Fix this as in the patch proposed by Miklos, but distinguish between "limit"
(how far we need to search: ignore truncation's next_index optimization in the
holepunch case - if there are races it's more consistent to act on the whole
range specified) and "upper_limit" (how far we can free directory pages:
generally we must be careful to keep partially punched pages, but can relax at
end of file - i_size being held stable by i_mutex).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cs>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a small problem in handling page bounce.
At the moment blk_max_pfn equals max_pfn, which is in fact not maximum
possible _number_ of a page frame, but the _amount_ of page frames. For
example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not
0xFFFF.
request_queue structure has a member q->bounce_pfn and queue needs bounce
pages for the pages _above_ this limit. This routine is handled by
blk_queue_bounce(), where the following check is produced:
if (q->bounce_pfn >= blk_max_pfn)
return;
Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
equals 0x10000. In such situation the check above fails and for each bio
we always fall down for iterating over pages tied to the bio.
I want to notice, that for quite a big range of device drivers (ide, md,
...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
bounce_pfn. BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
then the check above doesn't fail. But for other drivers, which obtain
reuired value from drivers, it fails. For example sata_nv uses
ATA_DMA_MASK or dev->dma_mask.
I propose to use (max_pfn - 1) for blk_max_pfn. And the same for
blk_max_low_pfn. The patch also cleanses some checks related with
bounce_pfn.
Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Supply a get_unmapped_area() to fix NOMMU SYSV SHM support.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looking at oom_kill.c, found that the intention to not kill the selected
process if any of its children/siblings has OOM_DISABLE set, is not being
met.
Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: William Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current NFS client congestion logic is severly broken, it marks the
backing device congested during each nfs_writepages() call but doesn't
mirror this in nfs_writepage() which makes for deadlocks. Also it
implements its own waitqueue.
Replace this by a more regular congestion implementation that puts a cap on
the number of active writeback pages and uses the bdi congestion waitqueue.
Also always use an interruptible wait since it makes sense to be able to
SIGKILL the process even for mounts without 'intr'.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes a user-triggerable oops that was reported by Leonid
Ananiev as archived at http://lkml.org/lkml/2007/2/8/337.
dio writes invalidate clean pages that intersect the written region so that
subsequent buffered reads go to disk to read the new data. If this fails
the interface tries to tell the caller that the cache is inconsistent by
returning EIO.
Before this patch we had the problem where this invalidation failure would
clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.
Both fs/aio.c and bio completion call aio_complete() and we reference freed
memory, usually oopsing.
This patch addresses this problem by invalidating before the write so that
we can cleanly return -EIO before ->direct_IO() has had a chance to return
-EIOCBQUEUED.
There is a compromise here. During the dio write we can fault in mmap()ed
pages which intersect the written range with get_user_pages() if the user
provided them for the source buffer. This is a crazy thing to do, but we
can make it mostly work in most cases by trying the invalidation again.
The compromise is that we won't return an error if this second invalidation
fails if it's an AIO write and we have -EIOCBQUEUED.
This was tested by having two processes race performing large O_DIRECT and
buffered ordered writes. Within minutes ext3 would see a race between
ext3_releasepage() and jbd holding a reference on ordered data buffers and
would cause invalidation to fail, panicing the box. The test can be found
in the 'aio_dio_bugs' test group in test.kernel.org/autotest. After this
patch the test passes.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the
call covers a region from the start of a vma, and extending past that vma.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we do not check for vma flags if sys_move_pages is called to move
individual pages. If sys_migrate_pages is called to move pages then we
check for vm_flags that indicate a non migratable vma but that still
includes VM_LOCKED and we can migrate mlocked pages.
Extract the vma_migratable check from mm/mempolicy.c, fix it and put it
into migrate.h so that is can be used from both locations.
Problem was spotted by Lee Schermerhorn
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem's super_operations were missed from the recent const-ification;
and simple_fill_super()'s, which can share with get_sb_pseudo()'s.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix invalidate_inode_pages2_range() so that it does not immediately exit
just because a single page in the specified range could not be removed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_lock_anon_vma() uses spin_lock() to block RCU. This doesn't work with
PREEMPT_RCU, we have to do rcu_read_lock() explicitely. Otherwise, it is
theoretically possible that slab returns anon_vma's memory to the system
before we do spin_unlock(&anon_vma->lock).
[ Hugh points out that this only matters for PREEMPT_RCU, which isn't merged
yet, and may never be. Regardless, this patch is conceptually the
right thing to do, even if it doesn't matter at this point. - Linus ]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
throttle_vm_writeout() is designed to wait for the dirty levels to subside.
But if the caller holds IO or FS locks, we might be holding up that writeout.
So change it to take a single nap to give other devices a chance to clean some
memory, then return.
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code is seemingly trying to make sure that rb_next() brings us to
successive increasing vma entries.
But the two variables, prev and pend, used to perform these checks, are
never advanced.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename PG_checked to PG_owner_priv_1 to reflect its availablilty as a
private flag for use by the owner/allocator of the page. In the case of
pagecache pages (which might be considered to be owned by the mm),
filesystems may use the flag.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem_{nopage,mmap} are no longer used in ipc/shm.c
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The alien cache is a per cpu per node array allocated for every slab on the
system. Currently we size this array for all nodes that the kernel does
support. For IA64 this is 1024 nodes. So we allocate an array with 1024
objects even if we only boot a system with 4 nodes.
This patch uses "nr_node_ids" to determine the number of possible nodes
supported by a hardware configuration and only allocates an alien cache
sized for possible nodes.
The initialization of nr_node_ids occurred too late relative to the bootstrap
of the slab allocator and so I moved the setup_nr_node_ids() into
free_area_init_nodes().
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
highest_possible_node_id() is currently used to calculate the last possible
node idso that the network subsystem can figure out how to size per node
arrays.
I think having the ability to determine the maximum amount of nodes in a
system at runtime is useful but then we should name this entry
correspondingly, it should return the number of node_ids, and the the value
needs to be setup only once on bootup. The node_possible_map does not
change after bootup.
This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id(). nr_node_ids is calculated on bootup when the
page allocators pagesets are initialized.
[deweerdt@free.fr: fix oops]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bind_zonelist() can create zero-length zonelist if there is a
memory-less-node. This patch checks the length of zonelist. If length is
0, returns -EINVAL.
tested on ia64/NUMA with memory-less-node.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When NFSD receives a write request, the data is typically in a number of
1448 byte segments and writev is used to collect them together.
Unfortunately, generic_file_buffered_write passes these to the filesystem
one at a time, so an e.g. 32K over-write becomes a series of partial-page
writes to each page, causing the filesystem to have to pre-read those pages
- wasted effort.
generic_file_buffered_write handles one segment of the vector at a time as
it has to pre-fault in each segment to avoid deadlocks. When writing from
kernel-space (and nfsd does) this is not an issue, so
generic_file_buffered_write does not need to break and iovec from nfsd into
little pieces.
This patch avoids the splitting when get_fs is KERNEL_DS as it is
from NFSd.
This issue was introduced by commit 6527c2bdf1
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Norman Weathers <norman.r.weathers@conocophillips.com>
Cc: Vladimir V. Saveliev <vs@namesys.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paper bag time. Thanks to Randy for noticing that I didn't actually assign
'present' to anything.
Unfortunately my original patch passed the few simple test cases I gave it,
purely by coincidence.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix mincore-anon patch to compile with CONFIG_SWAP=n
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many struct inode_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make mincore work for anon mappings, nonlinear, and migration entries.
Based on patch from Linus Torvalds <torvalds@linux-foundation.org>.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a NOPFN_REFAULT return code for vm_ops->nopfn() equivalent to
NOPAGE_REFAULT for vmops->nopage() indicating that the handler requests a
re-execution of the faulting instruction
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn
functionality by installing their own pte and returning NULL.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change a hard-coded constant 0 to the symbolic equivalent NOTIFY_DONE in
the ratelimit_handler() CPU notifier handler function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:
* make multi-line initial descriptions single line
* denote some function names, constants and structs as such
* change erroneous opening '/*' to '/**' in a few places
* reword some text for clarity
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().
Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It makes no sense to me to export invalidate_inode_pages() and not
invalidate_mapping_pages() and I actually need invalidate_mapping_pages()
because of its range specification ability...
akpm: also remove the export of invalidate_inode_pages() by making it an
inlined wrapper.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmem_cache_free() was missing the check for freeing held locks.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kernel unmaps an address range, it needs to transfer PTE state into
page struct. Currently, kernel transfer access bit via
mark_page_accessed(). The call to mark_page_accessed in the unmap path
doesn't look logically correct.
At unmap time, calling mark_page_accessed will causes page LRU state to be
bumped up one step closer to more recently used state. It is causing quite
a bit headache in a scenario when a process creates a shmem segment, touch
a whole bunch of pages, then unmaps it. The unmapping takes a long time
because mark_page_accessed() will start moving pages from inactive to
active list.
I'm not too much concerned with moving the page from one list to another in
LRU. Sooner or later it might be moved because of multiple mappings from
various processes. But it just doesn't look logical that when user asks a
range to be unmapped, it's his intention that the process is no longer
interested in these pages. Moving those pages to active list (or bumping
up a state towards more active) seems to be an over reaction. It also
prolongs unmapping latency which is the core issue I'm trying to solve.
As suggested by Peter, we should still preserve the info on pte young
pages, but not more.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ken Chen <kenchen@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shmem backed file does not have page writeback, nor it participates in
backing device's dirty or writeback accounting. So using generic
__set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit
overkill. It unnecessarily prolongs shm unmap latency.
For example, on a densely populated large shm segment (sevearl GBs), the
unmapping operation becomes painfully long. Because at unmap, kernel
transfers dirty bit in PTE into page struct and to the radix tree tag. The
operation of tagging the radix tree is particularly expensive because it
has to traverse the tree from the root to the leaf node on every dirty
page. What's bothering is that radix tree tag is used for page write back.
However, shmem is memory backed and there is no page write back for such
file system. And in the end, we spend all that time tagging radix tree and
none of that fancy tagging will be used. So let's simplify it by introduce
a new aops __set_page_dirty_no_writeback and this will speed up shm unmap.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
channel management. Other functionality may still expect GFP_DMA to
provide memory below 16M. So we need to make sure that CONFIG_ZONE_DMA is
set independent of CONFIG_GENERIC_ISA_DMA. Undo the modifications to
mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
theses explicitly in each arches Kconfig.
Reviews must occur for each arch in order to determine if ZONE_DMA can be
switched off. It can only be switched off if we know that all devices
supported by a platform are capable of performing DMA transfers to all of
memory (Some arches already support this: uml, avr32, sh sh64, parisc and
IA64/Altix).
In order to switch ZONE_DMA off conditionally, one would have to establish
a scheme by which one can assure that no drivers are enabled that are only
capable of doing I/O to a part of memory, or one needs to provide an
alternate means of performing an allocation from a specific range of memory
(like provided by alloc_pages_range()) and insure that all drivers use that
call. In that case the arches alloc_dma_coherent() may need to be modified
to call alloc_pages_range() instead of relying on GFP_DMA.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make ZONE_DMA optional in core code.
- ifdef all code for ZONE_DMA and related definitions following the example
for ZONE_DMA32 and ZONE_HIGHMEM.
- Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
0.
- Modify the VM statistics to work correctly without a DMA zone.
- Modify slab to not create DMA slabs if there is no ZONE_DMA.
[akpm@osdl.org: cleanup]
[jdike@addtoit.com: build fix]
[apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch simply defines CONFIG_ZONE_DMA for all arches. We later do special
things with CONFIG_ZONE_DMA after the VM and an arch are prepared to work
without ZONE_DMA.
CONFIG_ZONE_DMA can be defined in two ways depending on how an architecture
handles ISA DMA.
First if CONFIG_GENERIC_ISA_DMA is set by the arch then we know that the arch
needs ZONE_DMA because ISA DMA devices are supported. We can catch this in
mm/Kconfig and do not need to modify arch code.
Second, arches may use ZONE_DMA in an unknown way. We set CONFIG_ZONE_DMA for
all arches that do not set CONFIG_GENERIC_ISA_DMA in order to insure backwards
compatibility. The arches may later undefine ZONE_DMA if their arch code has
been verified to not depend on ZONE_DMA.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset follows up on the earlier work in Andrew's tree to reduce the
number of zones. The patches allow to go to a minimum of 2 zones. This one
allows also to make ZONE_DMA optional and therefore the number of zones can be
reduced to one.
ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons
why we would not want to have ZONE_DMA
1. Some arches do not need ZONE_DMA at all.
2. With the advent of IOMMUs DMA zones are no longer needed.
The necessity of DMA zones may drastically be reduced
in the future. This patchset allows a compilation of
a kernel without that overhead.
3. Devices that require ISA DMA get rare these days. All
my systems do not have any need for ISA DMA.
4. The presence of an additional zone unecessarily complicates
VM operations because it must be scanned and balancing
logic must operate on its.
5. With only ZONE_NORMAL one can reach the situation where
we have only one zone. This will allow the unrolling of many
loops in the VM and allows the optimization of varous
code paths in the VM.
6. Having only a single zone in a NUMA system results in a
1-1 correspondence between nodes and zones. Various additional
optimizations to critical VM paths become possible.
Many systems today can operate just fine with a single zone. If you look at
what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs
are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
will be empty instead).
On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why
constantly look at an empty zone in /proc/zoneinfo and empty slab in
/proc/slabinfo? Non i386 also frequently have no need for ZONE_DMA and zones
stay empty.
The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).
The RFC posted earlier (see
http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
of #ifdefs in them. An effort has been made to minize the number of #ifdefs
and make this as compact as possible. The job was made much easier by the
ongoing efforts of others to extract common arch specific functionality.
I have been running this for awhile now on my desktop and finally Linux is
using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:
christoph@pentium940:~$ cat /proc/zoneinfo
Node 0, zone Normal
pages free 4435
min 1448
low 1810
high 2172
active 241786
inactive 210170
scanned 0 (a: 0 i: 0)
spanned 524224
present 524224
nr_anon_pages 61680
nr_mapped 14271
nr_file_pages 390264
nr_slab_reclaimable 27564
nr_slab_unreclaimable 1793
nr_page_table_pages 449
nr_dirty 39
nr_writeback 0
nr_unstable 0
nr_bounce 0
cpu: 0 pcp: 0
count: 156
high: 186
batch: 31
cpu: 0 pcp: 1
count: 9
high: 62
batch: 15
vm stats threshold: 20
cpu: 1 pcp: 0
count: 177
high: 186
batch: 31
cpu: 1 pcp: 1
count: 12
high: 62
batch: 15
vm stats threshold: 20
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 0
This patch:
In two places in the VM we use ZONE_DMA to refer to the first zone. If
ZONE_DMA is optional then other zones may be first. So simply replace
ZONE_DMA with zone 0.
This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
number related to a pgdat. However, we still need a zonetable to index all
the zones for each node if this is a NUMA system. Therefore define
ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.
[apw@shadowen.org: fix mismerge]
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Values are available via ZVC sums.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Values are readily available via ZVC per node and global sums.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function is unnecessary now. We can use the summing features of the ZVCs to
get the values we need.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nr_free_pages is now a simple access to a global variable. Make it a macro
instead of a function.
The nr_free_pages now requires vmstat.h to be included. There is one
occurrence in power management where we need to add the include. Directly
refrer to global_page_state() there to clarify why the #include was added.
[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global and per zone counter sums are in arrays of longs. Reorder the ZVCs
so that the most frequently used ZVCs are put into the same cacheline. That
way calculations of the global, node and per zone vm state touches only a
single cacheline. This is mostly important for 64 bit systems were one 128
byte cacheline takes only 8 longs.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.
[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The determination of the dirty ratio to determine writeback behavior is
currently based on the number of total pages on the system.
However, not all pages in the system may be dirtied. Thus the ratio is always
too low and can never reach 100%. The ratio may be particularly skewed if
large hugepage allocations, slab allocations or device driver buffers make
large sections of memory not available anymore. In that case we may get into
a situation in which f.e. the background writeback ratio of 40% cannot be
reached anymore which leads to undesired writeback behavior.
This patchset fixes that issue by determining the ratio based on the actual
pages that may potentially be dirty. These are the pages on the active and
the inactive list plus free pages.
The problem with those counts has so far been that it is expensive to
calculate these because counts from multiple nodes and multiple zones will
have to be summed up. This patchset makes these counters ZVC counters. This
means that a current sum per zone, per node and for the whole system is always
available via global variables and not expensive anymore to calculate.
The patchset results in some other good side effects:
- Removal of the various functions that sum up free, active and inactive
page counts
- Cleanup of the functions that display information via the proc filesystem.
This patch:
The use of a ZVC for nr_inactive and nr_active allows a simplification of some
counter operations. More ZVC functionality is used for sums etc in the
following patches.
[akpm@osdl.org: UP build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After do_wp_page has tested page_mkwrite, it must release old_page after
acquiring page table lock, not before: at some stage that ordering got
reversed, leaving a (very unlikely) window in which old_page might be
truncated, freed, and reused in the same position.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
early_node_map[] on every call. This is an excessive amount of sorting and
that can be avoided. This patch always searches the whole early_node_map[]
in find_min_pfn_for_node() instead of returning the first value found. The
map is then only sorted once when required. Successfully boot tested on a
number of machines.
[akpm@osdl.org: cleanup]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the pointer passed to cache_reap to determine the work pointer and
consolidate exit paths.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up __cache_alloc and __cache_alloc_node functions a bit. We no
longer need to do NUMA_BUILD tricks and the UMA allocation path is much
simpler. No functional changes in this patch.
Note: saves few kernel text bytes on x86 NUMA build due to using gotos in
__cache_alloc_node() and moving __GFP_THISNODE check in to
fallback_alloc().
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Christoph Lameter <christoph@lameter.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PageSlab debug check in kfree_debugcheck() is broken for compound
pages. It is also redundant as we already do BUG_ON for non-slab pages in
page_get_cache() and page_get_slab() which are always called before we free
any actual objects.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a utility function install_special_mapping, for creating a
special vma using a fixed set of preallocated pages as backing, such as for a
vDSO. This consolidates some nearly identical code used for vDSO mapping
reimplemented for different architectures.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also split that long line up - people like to send us wordwrapped oom-kill
traces.
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__unmap_hugepage_range() is buggy that it does not preserve dirty state of
huge_pte when unmapping hugepage range. It causes data corruption in the
event of dop_caches being used by sys admin. For example, an application
creates a hugetlb file, modify pages, then unmap it. While leaving the
hugetlb file alive, comes along sys admin doing a "echo 3 >
/proc/sys/vm/drop_caches".
drop_pagecache_sb() will happily free all pages that aren't marked dirty if
there are no active mapping. Later when application remaps the hugetlb
file back and all data are gone, triggering catastrophic flip over on
application.
Not only that, the internal resv_huge_pages count will also get all messed
up. Fix it up by marking page dirty appropriately.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove find_trylock_page as per the removal schedule.
Signed-off-by: Nick Piggin <npiggin@suse.de>
[ Let's see if anybody screams ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit e80ee884ae.
Pawel Sikora had a boot-time oops due to it - because the sign change
invalidates the following comparisons, since 'free_pages' can be
negative.
The micro-optimization just isn't worth it.
Bisected-by: Pawel Sikora <pluto@agmk.net>
Acked-by: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When expanding the stack, we don't currently check if the VMA will cross
into an area of the address space that is reserved for hugetlb pages.
Subsequent faults on the expanded portion of such a VMA will confuse the
low-level MMU code, resulting in an OOPS. Check for this.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.
Instead of complicating that move_pte, just forget the minor optimization
when mremapping, and change the one thing which needed it for correctness
- filemap_xip use ZERO_PAGE(0) throughout instead of according to address.
[ "There is no block device driver one could use for XIP on mips
platforms" - Carsten Otte ]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes balance_dirty_page() always base its calculations on the
amount of non-highmem memory in the machine, rather than try to base it
on total memory and then falling back on non-highmem memory if the
mapping it was writing wasn't highmem capable.
This not only fixes a situation where two different writers can have
wildly different notions about what is a "balanced" dirty state, but it
also means that people with highmem machines don't run into an OOM
situation when regular memory fills up with dirty pages.
We used to try to handle the latter case by scaling down the dirty_ratio
if the machine had a lot of highmem pages in page_writeback_init(), but
it wasn't aggressive enough for some situations, and since basing the
dirty ratio on highmem memory was broken in the first place, let's just
stop doing so.
(A variation of this theme fixed Justin Piszcz's OOM problem when
copying an 18GB file on a RAID setup).
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFS can handle the case where invalidate_inode_pages2_range() fails, so the
premise behind commit 8258d4a574 is now gone.
Remove the WARN_ON_ONCE() which is causing users grief as we can see from
http://bugzilla.kernel.org/show_bug.cgi?id=7826
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes core dumps to include the vDSO vma, which is left out now.
It removes the special-case core writing macros, which were not doing the
right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
vma; there is no need for the fixmap page to be installed. It handles the
CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
get_gate_vma after real vmas in the same way the /proc/PID/maps code does.
This changes core dumps so they no longer include the non-PT_LOAD phdrs from
the vDSO. I made the change to add them in the first place, but in turned out
that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
to leave them out, and just let the phdrs inside the vDSO image speak for
themselves.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the initialization of gate_vma.vm_flags and
gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in
/proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used
(CONFIG_COMPAT_VDSO on i386).
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not pretty, but it appears that ext3 with data=journal will clean
pages without ever actually telling the VM that they are clean. This,
in turn, will result in the VM (and balance_dirty_pages() in particular)
to never realize that the pages got cleaned, and wait forever for an
event that already happened.
Technically, this seems to be a problem with ext3 itself, but it used to
be hidden by 'try_to_free_buffers()' noticing this situation on its own,
and just working around the filesystem problem.
This commit re-instates that hack, in order to avoid a regression for
the 2.6.20 release. This fixes bugzilla 7844:
http://bugzilla.kernel.org/show_bug.cgi?id=7844
Peter Zijlstra points out that we should probably retain the debugging
code that this removes from cancel_dirty_page(), and I agree, but for
the imminent release we might as well just silence the warning too
(since it's not a new bug: anything that triggers that warning has been
around forever).
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently one can specify an arbitrary node mask to mbind that includes
nodes not allowed. If that is done with an interleave policy then we will
go around all the nodes. Those outside of the currently allowed cpuset
will be redirected to the border nodes. Interleave will then create
imbalances at the borders of the cpuset.
This patch restricts the nodes to the currently allowed cpuset.
The RFC for this patch was discussed at
http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we issue a bounce trace when __blk_queue_bounce() is called,
but that merely means that the device has a lower dma mask than the
higher pages in the system. The bio itself may still be lower pages. So
move the bounce trace into __blk_queue_bounce(), when we know there will
actually be page bouncing.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NFS: Fix race in nfs_release_page()
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.
Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.
[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix an oops experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime. It alters the call paths to make sure
that the callers explicitly say whether the call is being made on behalf of
a hotplug even, or happening at boot-time.
It has been compile tested on ppc64, ia64, s390, i386 and x86_64.
Acked-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since get_user_pages() may be used with processes other than the
current process and calls flush_anon_page(), flush_anon_page() has to
cope in some way with non-current processes.
It may not be appropriate, or even desirable to flush a region of
virtual memory cache in the current process when that is different to
the process that we want the flush to occur for.
Therefore, pass the vma into flush_anon_page() so that the architecture
can work out whether the 'vmaddr' is for the current process or not.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
be zero.
Fix that up, and add a helper function for this operation too.
Also, recalculate lru_pages each time around the inner loop to get the
balancing correct.
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These days, if you swapoff when there isn't enough memory, OOM killer gives
"BUG: scheduling while atomic" and the machine hangs: badness() needs to do
its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held
here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any
moment, but that doesn't really matter).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Both process_zones() and drain_node_pages() check for populated zones
before touching pagesets. However, __drain_pages does not do so,
This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.
Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets. Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.
If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.
If the cpu is then brought up again then we call process_zones() which will
ignore the unpopulated zone. So the pageset pointers will still be NULL.
If the cpu is then again brought down then __drain_pages will attempt to
drain pages by following the NULL pageset pointer for unpopulated zones.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from
fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt
failed. But it wouldn't have freed them if __GFP_NO_GROW, so make sure
fallback_alloc doesn't waste its time on that case.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At the moment the inode/dentry cache hash tables (common by way of
alloc_large_system_hash()) are incorrectly sized by their respective
detection logic when we attempt to use large base pages on systems with
little memory.
This results in odd behaviour when using a 64kB PAGE_SIZE, such as:
Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)
The mount cache hash table is seemingly the only one that gets this right
by directly taking PAGE_SIZE in to account.
The following patch attempts to catch the bogus values and round it up to
at least 0-order.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the kernels later than 2.6.19 there is a regression that makes swsusp
fail if the resume device is not explicitly specified.
It can be fixed by adding an additional parameter to
mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
*) corresponding to the first available swap back to the caller.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a rather obvious buglet. Noticed while instrumenting the VM using
/proc/vmstat.
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Recent cleanup of slab.h broke SLOB allocator: the routine kmem_cache_init
has now the __init attribute for both slab.c and slob.c. This routine
cannot be removed after init in the case of slob.c -- it serves as a timer
callback.
Provide a separate timer callback routine, call it once from kmem_cache_init,
keep the __init attribute on the latter.
Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
constrained_alloc(), which is called to detect where oom is from, checks
passed zone_list(). If zone_list doesn't include all nodes, it thinks oom
is from mempolicy.
But there is memory-less-node. memory-less-node's zones are never included
in zonelist[].
contstrained_alloc() should get memory_less_node into count. Otherwise, it
always thinks 'oom is from mempolicy'. This means that current process
dies at any time. This patch fix it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The VM layer (on the face of it, fairly reasonably) expected that when
it does a ->writepage() call to the filesystem, it would write out the
full page at that point in time. Especially since it had earlier marked
the whole page dirty with "set_page_dirty()".
But that isn't actually the case: ->writepage() does not actually write
a page, it writes the parts of the page that have been explicitly marked
dirty before, *and* that had not got written out for other reasons since
the last time we told it they were dirty.
That last caveat is the important one.
Which _most_ of the time ends up being the whole page (since we had
called "set_page_dirty()" on the page earlier), but if the filesystem
had done any dirty flushing of its own (for example, to honor some
internal write ordering guarantees), it might end up doing only a
partial page IO (or none at all) when ->writepage() is actually called.
That is the correct thing in general (since we actually often _want_
only the known-dirty parts of the page to be written out), but the
shared dirty page handling had implicitly forgotten about these details,
and had a number of cases where it was doing just the "->writepage()"
part, without telling the low-level filesystem that the whole page might
have been re-dirtied as part of being mapped writably into user space.
Since most of the time the FS did actually write out the full page, we
didn't notice this for a loong time, and this needed some really odd
patterns to trigger. But it caused occasional corruption with rtorrent
and with the Debian "apt" database, because both use shared mmaps to
update the end result.
This fixes it. Finally. After way too much hair-pulling.
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Martin J. Bligh <mbligh@google.com>
Acked-by: Martin Michlmayr <tbm@cyrius.com>
Acked-by: Martin Johansson <martin@fatbob.nu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Andrei Popa <andrei.popa@i-neo.ro>
Cc: High Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Kenneth Cheng <kenneth.w.chen@intel.com>
Cc: Tobias Diedrich <ranma@tdiedrich.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make cancel_dirty_page() act more like all the other dirty and writeback
accounting functions: test for "mapping" being NULL, and do the
NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()).
Also, add it to the exports, so that modular filesystems can use it.
Acked-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add more debugging in the rmap code in an attempt to locate to source of
the occasional "mapcount went negative" assertions.
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The version of mm/vmscan.c in Linus' current tree has swapped parameters in
the shrink_all_zones declaration and call, used by the various
suspend-to-disk implementations. This doesn't seem to have any great
adverse effect, but it's clearly wrong.
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The declaration of kmem_ptr_validate in slab.h does not match the
one in slab.c. Remove the fastcall attribute (this is the only use in
slab.c).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>