android_kernel_xiaomi_sm8350

Author	SHA1	Message	Date
KOSAKI Motohiro	31f1de46b9	mempolicy: silently restrict nodemask to allowed nodes Kosaki Motohito noted that "numactl --interleave=all ..." failed in the presence of memoryless nodes. This patch attempts to fix that problem. Some background: numactl --interleave=all calls set_mempolicy(2) with a fully populated [out to MAXNUMNODES] nodemask. set_mempolicy() [in do_set_mempolicy()] calls contextualize_policy() which requires that the nodemask be a subset of the current task's mems_allowed; else EINVAL will be returned. A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY] i.e., nodes with memory. So, a fully populated nodemask will be declared invalid if it includes memoryless nodes. NOTE: the same thing will occur when running in a cpuset with restricted mem_allowed--for the same reason: node mask contains dis-allowed nodes. mbind(2), on the other hand, just masks off any nodes in the nodemask that are not included in the caller's mems_allowed. In each case [mbind() and set_mempolicy()], mpol_check_policy() will complain [again, resulting in EINVAL] if the nodemask contains any memoryless nodes. This is somewhat redundant as mpol_new() will remove memoryless nodes for interleave policy, as will bind_zonelist()--called by mpol_new() for BIND policy. Proposed fix: 1) modify contextualize_policy logic to: a) remember whether the incoming node mask is empty. b) if not, restrict the nodemask to allowed nodes, as is currently done in-line for mbind(). This guarantees that the resulting mask includes only nodes with memory. NOTE: this is a [benign, IMO] change in behavior for set_mempolicy(). Dis-allowed nodes will be silently ignored, rather than returning an error. c) fold this code into mpol_check_policy(), replace 2 calls to contextualize_policy() to call mpol_check_policy() directly and remove contextualize_policy(). 2) In existing mpol_check_policy() logic, after "contextualization": a) MPOL_DEFAULT: require that in coming mask "was_empty" b) MPOL_{BIND\|INTERLEAVE}: require that contextualized nodemask contains at least one node. c) add a case for MPOL_PREFERRED: if in coming was not empty and resulting mask IS empty, user specified invalid nodes. Return EINVAL. c) remove the now redundant check for memoryless nodes 3) remove the now redundant masking of policy nodes for interleave policy from mpol_new(). 4) Now that mpol_check_policy() contextualizes the nodemask, remove the in-line nodes_and() from sys_mbind(). I believe that this restores mbind() to the behavior before the memoryless-nodes patch series. E.g., we'll no longer treat an invalid nodemask with MPOL_PREFERRED as local allocation. [ Patch history: v1 -> v2: - Communicate whether or not incoming node mask was empty to mpol_check_policy() for better error checking. - As suggested by David Rientjes, remove the now unused cpuset_nodes_subset_current_mems_allowed() from cpuset.h v2 -> v3: - As suggested by Kosaki Motohito, fold the "contextualization" of policy nodemask into mpol_check_policy(). Looks a little cleaner. ] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-11 20:48:29 -08:00
Jonathan Corbet	900cf086fd	Be more robust about bad arguments in get_user_pages() So I spent a while pounding my head against my monitor trying to figure out the vmsplice() vulnerability - how could a failure to check for read access turn into a root exploit? It turns out that it's a buffer overflow problem which is made easy by the way get_user_pages() is coded. In particular, "len" is a signed int, and it is only checked at the end of a do {} while() loop. So, if it is passed in as zero, the loop will execute once and decrement len to -1. At that point, the loop will proceed until the next invalid address is found; in the process, it will likely overflow the pages array passed in to get_user_pages(). I think that, if get_user_pages() has been asked to grab zero pages, that's what it should do. Thus this patch; it is, among other things, enough to block the (already fixed) root exploit and any others which might be lurking in similar code. I also think that the number of pages should be unsigned, but changing the prototype of this function probably requires some more careful review. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-11 20:44:44 -08:00
David Rientjes	60c12b1202	memcontrol: add vm_match_cgroup() mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer is pointing to a specific cgroup. Instead of returning the pointer, we can just do the test itself in a new macro: vm_match_cgroup(mm, cgroup) returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it returns zero. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-09 11:08:33 -08:00
Nick Piggin	b1d0e4f535	mm: special mapping nopage Convert special mapping install from nopage to fault. Because the "vm_file" is NULL for the special mapping, the generic VM code has messed up "vm_pgoff" thinking that it's an anonymous mapping and the offset does't matter. For that reason, we need to undo the vm_pgoff offset that got added into vmf->pgoff. [ We _really_ should clean that up - either by making this whole special mapping code just use a real file entry rather than that ugly array of "struct page" pointers, or by just making the VM code realize that even if vm_file is NULL it may not be a regular anonymous mmap. - Linus ] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 18:57:39 -08:00
Martin Schwidefsky	2f569afd9c	CONFIG_HIGHPTE vs. sub-page page tables. Background: I've implemented 1K/2K page tables for s390. These sub-page page tables are required to properly support the s390 virtualization instruction with KVM. The SIE instruction requires that the page tables have 256 page table entries (pte) followed by 256 page status table entries (pgste). The pgstes are only required if the process is using the SIE instruction. The pgstes are updated by the hardware and by the hypervisor for a number of reasons, one of them is dirty and reference bit tracking. To avoid wasting memory the standard pte table allocation should return 1K/2K (31/64 bit) and 2K/4K if the process is using SIE. Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means the s390 version for pte_alloc_one cannot return a pointer to a struct page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one cannot return a pointer to a pte either, since that would require more than 32 bit for the return value of pte_alloc_one (and the pte * would not be accessible since its not kmapped). Solution: The only solution I found to this dilemma is a new typedef: a pgtable_t. For s390 pgtable_t will be a (pte ) - to be introduced with a later patch. For everybody else it will be a (struct page ). The additional problem with the initialization of the ptl lock and the NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and a destructor pgtable_page_dtor. The page table allocation and free functions need to call these two whenever a page table page is allocated or freed. pmd_populate will get a pgtable_t instead of a struct page pointer. To get the pgtable_t back from a pmd entry that has been installed with pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page call in free_pte_range and apply_to_pte_range. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:42 -08:00
Andrew Morton	b76db73540	mount-options-fix-tmpfs-fix Documentation/SubmitCheckist, please. Cc: Hugh Dickins <hugh@veritas.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:41 -08:00
akpm@linux-foundation.org	680d794bab	mount options: fix tmpfs Add .show_options super operation to tmpfs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:41 -08:00
Christoph Hellwig	36e7891442	kill do_generic_mapping_read do_generic_mapping_read was used by gfs2 for internals reads, but this use of the interface was rather suboptimal (as was the whole interface) and has been replaced by an internal helper now. This patch kills do_generic_mapping_read and surrounding damage in preparation of additional cleanups for the buffered read path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:39 -08:00
Jan Kara	2004dc8eec	Use pgoff_t instead of unsigned long Convert variables containing page indexes to pgoff_t. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:32 -08:00
Harvey Harrison	edde08f2a8	misc: removal of final callers using fastcall Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:31 -08:00
Nishanth Aravamudan	a3d0c6aa1b	hugetlb: add locking for overcommit sysctl When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules require that all counter modifications occur under the hugetlb_lock. Add a callback into the hugetlb code similar to the one for nr_hugepages. Grab the lock around the manipulation of nr_overcommit_hugepages in proc_doulongvec_minmax(). Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:23 -08:00
Ingo Molnar	3adbefee6f	SLUB: fix checkpatch warnings fix checkpatch --file mm/slub.c errors and warnings. $ q-code-quality-compare errors lines of code errors/KLOC mm/slub.c [before] 22 4204 5.2 mm/slub.c [after] 0 4210 0 no code changed: text data bss dec hex filename 22195 8634 136 30965 78f5 slub.o.before 22195 8634 136 30965 78f5 slub.o.after md5: 93cdfbec2d6450622163c590e1064358 slub.o.before.asm 93cdfbec2d6450622163c590e1064358 slub.o.after.asm [clameter: rediffed against Pekka's cleanup patch, omitted moves of the name of a function to the start of line] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-07 17:52:39 -08:00
Nick Piggin	a76d354629	Use non atomic unlock Slub can use the non-atomic version to unlock because other flags will not get modified with the lock held. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-07 17:47:42 -08:00
Christoph Lameter	8ff12cfc00	SLUB: Support for performance statistics The statistics provided here allow the monitoring of allocator behavior but at the cost of some (minimal) loss of performance. Counters are placed in SLUB's per cpu data structure. The per cpu structure may be extended by the statistics to grow larger than one cacheline which will increase the cache footprint of SLUB. There is a compile option to enable/disable the inclusion of the runtime statistics and its off by default. The slabinfo tool is enhanced to support these statistics via two options: -D Switches the line of information displayed for a slab from size mode to activity mode. -A Sorts the slabs displayed by activity. This allows the display of the slabs most important to the performance of a certain load. -r Report option will report detailed statistics on Example (tbench load): slabinfo -AD ->Shows the most active slabs Name Objects Alloc Free %Fast skbuff_fclone_cache 33 111953835 111953835 99 99 :0000192 2666 5283688 5281047 99 99 :0001024 849 5247230 5246389 83 83 vm_area_struct 1349 119642 118355 91 22 :0004096 15 66753 66751 98 98 :0000064 2067 25297 23383 98 78 dentry 10259 28635 18464 91 45 :0000080 11004 18950 8089 98 98 :0000096 1703 12358 10784 99 98 :0000128 762 10582 9875 94 18 :0000512 184 9807 9647 95 81 :0002048 479 9669 9195 83 65 anon_vma 777 9461 9002 99 71 kmalloc-8 6492 9981 5624 99 97 :0000768 258 7174 6931 58 15 So the skbuff_fclone_cache is of highest importance for the tbench load. Pretty high load on the 192 sized slab. Look for the aliases slabinfo -a \| grep 000192 :0000192 <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili Likely skbuff_head_cache. Looking into the statistics of the skbuff_fclone_cache is possible through slabinfo skbuff_fclone_cache ->-r option implied if cache name is mentioned .... Usual output ... Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 111953360 111946981 99 99 Slowpath 1044 7423 0 0 Page Alloc 272 264 0 0 Add partial 25 325 0 0 Remove partial 86 264 0 0 RemoteObj/SlabFrozen 350 4832 0 0 Total 111954404 111954404 Flushes 49 Refill 0 Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%) Looks good because the fastpath is overwhelmingly taken. skbuff_head_cache: Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 5297262 5259882 99 99 Slowpath 4477 39586 0 0 Page Alloc 937 824 0 0 Add partial 0 2515 0 0 Remove partial 1691 824 0 0 RemoteObj/SlabFrozen 2621 9684 0 0 Total 5301739 5299468 Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%) Descriptions of the output: Total: The total number of allocation and frees that occurred for a slab Fastpath: The number of allocations/frees that used the fastpath. Slowpath: Other allocations Page Alloc: Number of calls to the page allocator as a result of slowpath processing Add Partial: Number of slabs added to the partial list through free or alloc (occurs during cpuslab flushes) Remove Partial: Number of slabs removed from the partial list as a result of allocations retrieving a partial slab or by a free freeing the last object of a slab. RemoteObj/Froz: How many times were remotely freed object encountered when a slab was about to be deactivated. Frozen: How many times was free able to skip list processing because the slab was in use as the cpuslab of another processor. Flushes: Number of times the cpuslab was flushed on request (kmem_cache_shrink, may result from races in __slab_alloc) Refill: Number of times we were able to refill the cpuslab from remotely freed objects for the same slab. Deactivate: Statistics how slabs were deactivated. Shows how they were put onto the partial list. In general fastpath is very good. Slowpath without partial list processing is also desirable. Any touching of partial list uses node specific locks which may potentially cause list lock contention. Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-07 17:47:41 -08:00
Christoph Lameter	1f84260c8c	SLUB: Alternate fast paths using cmpxchg_local Provide an alternate implementation of the SLUB fast paths for alloc and free using cmpxchg_local. The cmpxchg_local fast path is selected for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an interrupt enable/disable sequence. This is known to be true for both x86 platforms so set FAST_CMPXCHG_LOCAL for both arches. Currently another requirement for the fastpath is that the kernel is compiled without preemption. The restriction will go away with the introduction of a new per cpu allocator and new per cpu operations. The advantages of a cmpxchg_local based fast path are: 1. Potentially lower cycle count (30%-60% faster) 2. There is no need to disable and enable interrupts on the fast path. Currently interrupts have to be disabled and enabled on every slab operation. This is likely avoiding a significant percentage of interrupt off / on sequences in the kernel. 3. The disposal of freed slabs can occur with interrupts enabled. The alternate path is realized using #ifdef's. Several attempts to do the same with macros and inline functions resulted in a mess (in particular due to the strange way that local_interrupt_save() handles its argument and due to the need to define macros/functions that sometimes disable interrupts and sometimes do something else). [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-07 17:47:41 -08:00
Christoph Lameter	683d0baad3	SLUB: Use unique end pointer for each slab page. We use a NULL pointer on freelists to signal that there are no more objects. However the NULL pointers of all slabs match in contrast to the pointers to the real objects which are in different ranges for different slab pages. Change the end pointer to be a pointer to the first object and set bit 0. Every slab will then have a different end pointer. This is necessary to ensure that end markers can be matched to the source slab during cmpxchg_local. Bring back the use of the mapping field by SLUB since we would otherwise have to call a relatively expensive function page_address() in __slab_alloc(). Use of the mapping field allows avoiding a call to page_address() in various other functions as well. There is no need to change the page_mapping() function since bit 0 is set on the mapping as also for anonymous pages. page_mapping(slab_page) will therefore still return NULL although the mapping field is overloaded. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-07 17:47:41 -08:00
Christoph Lameter	5bb983b0cc	SLUB: Deal with annoying gcc warning on kfree() gcc 4.2 spits out an annoying warning if one casts a const void * pointer to a void * pointer. No warning is generated if the conversion is done through an assignment. Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-07 17:47:41 -08:00
Bernhard Walle	72a7fe3967	Introduce flags for reserve_bootmem() This patchset adds a flags variable to reserve_bootmem() and uses the BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions between crashkernel area and already used memory. This patch: Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE. If that flag is set, the function returns with -EBUSY if the memory already has been reserved in the past. This is to avoid conflicts. Because that code runs before SMP initialisation, there's no race condition inside reserve_bootmem_core(). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix powerpc build] Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:25 -08:00
Balbir Singh	3c541e14bf	Memory controller remove control_type feature Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt that control_type might not be a good thing to implement right away. We can add this flexibility at a later point when required. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	072c56c13e	per-zone and reclaim enhancements for memory controller: per-zone-lock for cgroup Now, lru is per-zone. Then, lru_lock can be (should be) per-zone, too. This patch implementes per-zone lru lock. lru_lock is placed into mem_cgroup_per_zone struct. lock can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); &mz->lru_lock or mz = page_cgroup_zoneinfo(page_cgroup); &mz->lru_lock Signed-off-by: KAMEZAWA hiroyuki <kmaezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	1ecaab2bd2	per-zone and reclaim enhancements for memory controller: per zone lru for cgroup This patch implements per-zone lru for memory cgroup. This patch makes use of mem_cgroup_per_zone struct for per zone lru. LRU can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); &mz->active_list &mz->inactive_list or mz = page_cgroup_zoneinfo(page_cgroup); &mz->active_list &mz->inactive_list Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	1cfb419b39	per-zone and reclaim enhancements for memory controller: modifies vmscan.c for isolate globa/cgroup lru activity When using memory controller, there are 2 levels of memory reclaim. 1. zone memory reclaim because of system/zone memory shortage. 2. memory cgroup memory reclaim because of hitting limit. These two can be distinguished by sc->mem_cgroup parameter. (scan_global_lru() macro) This patch tries to make memory cgroup reclaim routine avoid affecting system/zone memory reclaim. This patch inserts if (scan_global_lru()) and hook to memory_cgroup reclaim support functions. This patch can be a help for isolating system lru activity and group lru activity and shows what additional functions are necessary. * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup. * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in cgroup. * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to be scanned in this priority in mem_cgroup. * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages to be scanned in this priority in mem_cgroup. * mem_cgroup_all_unreclaimable() .. checks cgroup's page is all unreclaimable or not. * mem_cgroup_get_reclaim_priority() ... * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporal) * mem_cgroup_remember_reclaim_priority() .... record reclaim priority as zone->prev_priority. This value is used for calc reclaim_mapped. [akpm@linux-foundation.org: fix unused var warning] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	cc38108e1b	per-zone and reclaim enhancements for memory controller: calculate the number of pages to be scanned per cgroup Define function for calculating the number of scan target on each Zone/LRU. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	6c48a1d040	per-zone and reclaim enhancements for memory controller: remember reclaim priority in memory cgroup Functions to remember reclaim priority per cgroup (as zone->prev_priority) [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: more build fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki	5932f3671b	per-zone and reclaim enhancements for memory controller: calculate active/inactive imbalance per cgroup Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:21 -08:00
KAMEZAWA Hiroyuki	58ae83db2a	per-zone and reclaim enhancements for memory controller: calculate mapper_ratio per cgroup Define function for calculating mapped_ratio in memory cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:21 -08:00
KAMEZAWA Hiroyuki	6d12e2d8dd	per-zone and reclaim enhancements for memory controller: per-zone active inactive counter This patch adds per-zone status in memory cgroup. These values are often read (as per-zone value) by page reclaiming. In current design, per-zone stat is just a unsigned long value and not an atomic value because they are modified only under lru_lock. (So, atomic_ops is not necessary.) This patch adds ACTIVE and INACTIVE per-zone status values. For handling per-zone status, this patch adds struct mem_cgroup_per_zone { ... } and some helper functions. This will be useful to add per-zone objects in mem_cgroup. This patch turns memory controller's early_init to be 0 for calling kmalloc() in initialization. Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:21 -08:00
KAMEZAWA Hiroyuki	c0149530d0	per-zone and reclaim enhancements for memory controller: nid/zid helper function for cgroup Add macro to get node_id and zone_id of page_cgroup. Will be used in per-zone-xxx patches and others. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:21 -08:00
KAMEZAWA Hiroyuki	91a45470f7	per-zone and reclaim enhancements for memory controller: add scan_global_lru macro This is used to detect which scan_control scans global lru or mem_cgroup lru. And compiled to be static value (1) when memory controller is not configured. This may make the meaning obvious. Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:21 -08:00
KAMEZAWA Hiroyuki	df878fb04d	memory cgroup enhancements: implicit force_empty() at rmdir Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at rmdir(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	d2ceb9b7dd	memory cgroup enhancements: add memory.stat file Show accounted information of memory cgroup by memory.stat file [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	d52aa412d4	memory cgroup enhancements: add status accounting function for memory cgroup Add statistics account infrastructure for memory controller. All account information is stored per-cpu and caller will not have to take lock or use atomic ops. This will be used by memory.stat file later. CACHE includes swapcache now. I'd like to divide it to PAGECACHE and SWAPCACHE later. This patch adds 3 functions for accounting. * __mem_cgroup_stat_add() ... for usual routine. * __mem_cgroup_stat_add_safe ... for calling under irq_disabled section. * mem_cgroup_read_stat() ... for reading stat value. * renamed PAGECACHE to CACHE (because it may include swapcache now) [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible] [akpm@linux-foundation.org: uninline things] [akpm@linux-foundation.org: remove dead code] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	3564c7c451	memory cgroup enhancements: remember "a page is on active list of cgroup or not" Remember page_cgroup is on active_list or not in page_cgroup->flags. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
Hugh Dickins	82369553d6	memcgroup: fix hang with shmem/tmpfs The memcgroup regime relies upon a cgroup reclaiming pages from itself within add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs rely upon using add_to_page_cache while holding a spinlock: when it cannot wait. The consequence is that when a cgroup reaches its limit, shmem_getpage just hangs - unless there is outside memory pressure too, neither kswapd nor radix_tree_preload get it out of the retry loop. In most cases we can mem_cgroup_cache_charge the page waitably first, to attach the page_cgroup in advance, so add_to_page_cache will do no more than increment a count; then mem_cgroup_uncharge_page after (in both success and failure cases) to balance the books again. And where there used to be a congestion_wait for kswapd (recently made redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page to go through a cycle of allocation and freeing, without accounting to any particular page, and without updating the statistics vector. This brings the cgroup below its limit so the next try usually succeeds. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
Hugh Dickins	3be91277e7	memcgroup: tidy up mem_cgroup_charge_common Tidy up mem_cgroup_charge_common before extending it. Adjust some comments, but mainly clean up its loop: I've an aversion to loops full of continues, then a break or a goto at the bottom. And the is_atomic test should be on the __GFP_WAIT bit, not GFP_ATOMIC bits. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
Balbir Singh	ac44d354d5	Memory controller use rcu_read_lock() in mem_cgroup_cache_charge() Hugh Dickins noticed that we were using rcu_dereference() without rcu_read_lock() in the cache charging routine. The patch below fixes this problem Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	217bc3194d	memory cgroup enhancements: remember "a page is charged as page cache" Add a flag to page_cgroup to remember "this page is charged as cache." cache here includes page caches and swap cache. This is useful for implementing precise accounting in memory cgroup. TODO: distinguish page-cache and swap-cache Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	cc8475822f	memory cgroup enhancements: force_empty interface for dropping all account in empty cgroup This patch adds an interface "memory.force_empty". Any write to this file will drop all charges in this cgroup if there is no task under. %echo 1 > /....../memory.force_empty will drop all charges of memory cgroup if cgroup's tasks is empty. This is useful to invoke rmdir() against memory cgroup successfully. Tested and worked well on x86_64/fake-NUMA system. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	417eead304	memory cgroup enhancements: fix zone handling in try_to_free_mem_cgroup_page Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all necessary zones, it is not necessary to use for_each_online_node. And this for_each_online_node() makes reclaim routine start always from node 0. This is not good. This patch makes reclaim start from caller's node and just use usual (default) zonelist order. [akpm@linux-foundation.org: fix warning] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
Hugh Dickins	fa1de9008c	memcgroup: revert swap_state mods If we're charging rss and we're charging cache, it seems obvious that we should be charging swapcache - as has been done. But in practice that doesn't work out so well: both swapin readahead and swapoff leave the majority of pages charged to the wrong cgroup (the cgroup that happened to read them in, rather than the cgroup to which they belong). (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up as a problem: no allocation was ever done there, every page read being already charged to the cgroup which initiated the swapoff.) It all works rather better if we leave the charging to do_swap_page and unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to what it was before the memory-controller patches. This also speeds up significantly a contained process working at its limit: because it no longer needs to keep waiting for swap writeback to complete. Is it unfair that swap pages become uncharged once they're unmapped, even though they're still clearly private to particular cgroups? For a short while, yes; but PageReclaim arranges for those pages to go to the end of the inactive list and be reclaimed soon if necessary. shmem/tmpfs pages are a distinct case: their charging also benefits from this change, but their second life on the lists as swapcache pages may prove more unfair - that I need to check next. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Pavel Emelianov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
Hugh Dickins	436c6541b1	memcgroup: fix zone isolation OOM mem_cgroup_charge_common shows a tendency to OOM without good reason, when a memhog goes well beyond its rss limit but with plenty of swap available. Seen on x86 but not on PowerPC; seen when the next patch omits swapcache from memcgroup, but we presume it can happen without. mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM avoidance. Already it has to scan beyond the nr_to_scan limit when it finds a !LRU page or an active page when handling inactive or an inactive page when handling active. It needs to do exactly the same when it finds a page from the wrong zone (the x86 tests had two zones, the PowerPC tests had only one). Don't increment scan and then decrement it in these cases, just move the incrementation down. Fix recent off-by-one when checking against nr_to_scan. Cut out "Check if the meta page went away from under us", presumably left over from early debugging: no amount of such checks could save us if this list really were being updated without locking. This change does make the unlimited scan while holding two spinlocks even worse - bad for latency and bad for containment; but that's a separate issue which is better left to be fixed a little later. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Pavel Emelianov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	ff7283fa3a	bugfix for memory cgroup controller: avoid !PageLRU page in mem_cgroup_isolate_pages This patch makes mem_cgroup_isolate_pages() to be - ignore !PageLRU pages. - fixes the bug that isolation makes no progress if page_zone(page) != zone page once find. (just increment scan in this case.) kswapd and memory migration removes a page from list when it handles a page for reclaiming/migration. Because __isolate_lru_page() doesn't moves page !PageLRU pages, it will be safe to avoid touching !PageLRU() page and its page_cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:20 -08:00
KAMEZAWA Hiroyuki	ae41be3742	bugfix for memory cgroup controller: migration under memory controller fix While using memory control cgroup, page-migration under it works as following. == 1. uncharge all refs at try to unmap. 2. charge regs again remove_migration_ptes() == This is simple but has following problems. == The page is uncharged and charged back again if mapped. - This means that cgroup before migration can be different from one after migration - If page is not mapped but charged as page cache, charge is just ignored (because not mapped, it will not be uncharged before migration) This is memory leak. == This patch tries to keep memory cgroup at page migration by increasing one refcnt during it. 3 functions are added. mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup mem_cgroup_page_migration() --- copy page->page_cgroup from old page to new page. During migration - old page is under PG_locked. - new page is under PG_locked, too. - both old page and new page is not on LRU. These 3 facts guarantee that page_cgroup() migration has no race. Tested and worked well in x86_64/fake-NUMA box. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
KAMEZAWA Hiroyuki	9175e0311e	bugfix for memory controller: add helper function for assigning cgroup to page This patch adds following functions. - clear_page_cgroup(page, pc) - page_cgroup_assign_new_page_group(page, pc) Mainly for cleanup. A manner "check page->cgroup again after lock_page_cgroup()" is implemented in straight way. A comment in mem_cgroup_uncharge() will be removed by force-empty patch Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Rik van Riel	f1a9ee758d	kswapd should only wait on IO if there is IO The current kswapd (and try_to_free_pages) code has an oddity where the code will wait on IO, even if there is no IO in flight. This problem is notable especially when the system scans through many unfreeable pages, causing unnecessary stalls in the VM. Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path will sleep if a significant number of pages are encountered that should be written out. This gives kswapd a chance to write out those pages, while the direct reclaim task sleeps. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
David Rientjes	fef1bdd68c	oom: add sysctl to enable task memory dump Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a dump of all system tasks (excluding kernel threads) when performing an OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu, oom_adj score, and name. This is helpful for determining why there was an OOM condition and which rogue task caused it. It is configurable so that large systems, such as those with several thousand tasks, do not incur a performance penalty associated with dumping data they may not desire. If an OOM was triggered as a result of a memory controller, the tasklist shall be filtered to exclude tasks that are not a member of the same cgroup. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
David Rientjes	4c4a221489	memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct task, const struct mem_cgroup mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Badari Pulavarty	4c6bc8dd5a	mem-controller gfp-mask fix Need to strip __GFP_HIGHMEM flag while passing to mem_container_cache_charge(). Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Balbir Singh	35c754d79f	memory controller BUG_ON() Move mem_controller_cache_charge() above radix_tree_preload(). radix_tree_preload() disables preemption, even though the gfp_mask passed contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, thus we hit a BUG_ON() in kmem_cache_alloc(). This patch moves mem_controller_cache_charge() to above radix_tree_preload() for cache charging. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Hugh Dickins	044d66c1d2	memcgroup: reinstate swapoff mod This patch reinstates the "swapoff: scan ptes preemptibly" mod we started with: in due course it should be rendered down into the earlier patches, leaving us with a more straightforward mem_cgroup_charge mod to unuse_pte, allocating with GFP_KERNEL while holding no spinlock and no atomic kmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Pavel Emelianov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
David Rientjes	3062fc67da	memcontrol: move mm_cgroup to header file Inline functions must preceed their use, so mm_cgroup() should be defined in linux/memcontrol.h. include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after being called include/linux/memcontrol.h:48: warning: previous declaration of 'mm_cgroup' was here [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Balbir Singh	e1a1cd590e	Memory controller: make charging gfp mask aware Nick Piggin pointed out that swap cache and page cache addition routines could be called from non GFP_KERNEL contexts. This patch makes the charging routine aware of the gfp context. Charging might fail if the cgroup is over it's limit, in which case a suitable error is returned. This patch was tested on a Powerpc box. I am still looking at being able to test the path, through which allocations happen in non GFP_KERNEL contexts. [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Balbir Singh	bed7161a51	Memory controller: make page_referenced() cgroup aware Make page_referenced() cgroup aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other cgroups do not hold pages in a particular cgroup hostage. It is required to ensure that shared pages are freed from a cgroup when they are not actively referenced from the cgroup that brought them in Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Balbir Singh	8697d33194	Memory controller: add switch to control what type of pages to limit Choose if we want cached pages to be accounted or not. By default both are accounted for. A new set of tunables are added. echo -n 1 > mem_control_type switches the accounting to account for only mapped pages echo -n 3 > mem_control_type switches the behaviour back [bunk@kernel.org: mm/memcontrol.c: clenups] [akpm@linux-foundation.org: fix sparc32 build] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Pavel Emelianov	c7ba5c9e81	Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:19 -08:00
Balbir Singh	0eea103017	Memory controller improve user interface Change the interface to use bytes instead of pages. Page sizes can vary across platforms and configurations. A new strategy routine has been added to the resource counters infrastructure to format the data as desired. Suggested by David Rientjes, Andrew Morton and Herbert Poetzl Tested on a UML setup with the config for memory control enabled. [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Balbir Singh	66e1707bc3	Memory controller: add per cgroup LRU and reclaim Add the page_cgroup to the per cgroup LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the cgroup on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become cgroup aware. [akpm@linux-foundation.org: fix warning] [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member] [bunk@kernel.org: make do_try_to_free_pages() static] [hugh@veritas.com: memcgroup: fix try_to_free order] [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Balbir Singh	67e465a77b	Memory controller: task migration Allow tasks to migrate from one cgroup to the other. We migrate mm_struct's mem_cgroup only when the thread group id migrates. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Balbir Singh	8a9f3ccd24	Memory controller: memory accounting Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_cgroup is protected with the last bit of the page_cgroup pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_cgroup to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page cgroup. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan. [hugh@veritas.com: fix swapoff breakage] [akpm@linux-foundation.org: fix locking] Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: <Valdis.Kletnieks@vt.edu> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Pavel Emelianov	78fb74669e	Memory controller: accounting setup Basic setup routines, the mm_struct has a pointer to the cgroup that it belongs to and the the page has a page_cgroup associated with it. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Balbir Singh	8cdea7c054	Memory controller: cgroups setup Setup the memory cgroup and add basic hooks and controls to integrate and work with the cgroup. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Hugh Dickins	59bd26582d	memcgroup: temporarily revert swapoff mod This patch precisely reverts the "swapoff: scan ptes preemptibly" patch just presented. It's a temporary measure to allow existing memory controller patches to apply without rejects: in due course they should be rendered down into one sensible patch, and this reversion disappear. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-07 08:42:18 -08:00
Linus Torvalds	3e6bdf473f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86 * git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86: x86: fix deadlock, make pgd_lock irq-safe virtio: fix trivial build bug x86: fix mttr trimming x86: delay CPA self-test and repeat it x86: fix 64-bit sections generic: add __FINITDATA x86: remove suprious ifdefs from pageattr.c x86: mark the .rodata section also NX x86: fix iret exception recovery on 64-bit cpuidle: dubious one-bit signed bitfield in cpuidle.h x86: fix sparse warnings in powernow-k8.c x86: fix sparse error in traps_32.c x86: trivial sparse/checkpatch in quirks.c x86 ptrace: disallow null cs/ss MAINTAINERS: RDC R-321x SoC maintainer brk randomization: introduce CONFIG_COMPAT_BRK brk: check the lower bound properly x86: remove X2 workaround x86: make spurious fault handler aware of large mappings x86: make traps on entry code be debuggable in user space, 64-bit	2008-02-06 13:54:09 -08:00
Ingo Molnar	32a932332c	brk randomization: introduce CONFIG_COMPAT_BRK based on similar patch from: Pavel Machek <pavel@ucw.cz> Introduce CONFIG_COMPAT_BRK. If disabled then the kernel is free (but not obliged to) randomize the brk area. Heap randomization breaks ancient binaries, so we keep COMPAT_BRK enabled by default. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-06 22:39:44 +01:00
Jiri Kosina	4cc6028d40	brk: check the lower bound properly There is a check in sys_brk(), that tries to make sure that we do not underflow the area that is dedicated to brk heap. The check is however wrong, as it assumes that brk area starts immediately after the end of the code (+bss), which is wrong for example in environments with randomized brk start. The proper way is to check whether the address is not below the start_brk address. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-06 22:39:44 +01:00
Eric Dumazet	b324215190	PERCPU : __percpu_alloc_mask() can dynamically size percpu_data storage Instead of allocating a fix sized array of NR_CPUS pointers for percpu_data, we can use nr_cpu_ids, which is generally < NR_CPUS. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:04 -08:00
Linus Torvalds	b297d520b9	Merge branch 'dmapool' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'dmapool' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: pool: Improve memory usage for devices which can't cross boundaries Change dmapool free block management dmapool: Tidy up includes and add comments dmapool: Validate parameters to dma_pool_create Avoid taking waitqueue lock in dmapool dmapool: Fix style problems Move dmapool.c to mm/ directory	2008-02-05 19:05:48 -08:00
Paul Mundt	f905bc447c	nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces. This builds on top of the earlier vmalloc_32_user() work introduced by `b50731732f`, as we now have places in the nommu allmodconfig that hit up against these missing APIs. As vmalloc_32_user() is already implemented, this is moved over to vmalloc_user() and simply made a wrapper. As all current nommu platforms are 32-bit addressable, there's no special casing we have to do for ZONE_DMA and things of that nature as per GFP_VMALLOC32. remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether we permit the remap or not, which means that we also have to rework the vmalloc_user() code to grovel for the VMA and set the flag. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Acked-by: David McCullough <david_mccullough@securecomputing.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:21 -08:00
Serge E. Hallyn	97829955ad	oom_kill: remove uid==0 checks Root processes are considered more important when out of memory and killing proceses. The check for CAP_SYS_ADMIN was augmented with a check for uid==0 or euid==0. There are several possible ways to look at this: 1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN alone. However CAP_SYS_RESOURCE is the one that really means "give me extra resources" so allow for that as well. 2. Any privileged code should be protected, but uid is not an indication of privilege. So we should check whether any capabilities are raised. 3. uid==0 makes processes on the host as well as in containers more important, so we should keep the existing checks. 4. uid==0 makes processes only on the host more important, even without any capabilities. So we should be keeping the (uid==0\|\|euid==0) check but only when userns==&init_user_ns. I'm following number 1 here. Signed-off-by: Serge Hallyn <serue@us.ibm.com> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Andrew Morgan	e338d263a7	Add 64-bit capability support to the kernel The patch supports legacy (32-bit) capability userspace, and where possible translates 32-bit capabilities to/from userspace and the VFS to 64-bit kernel space capabilities. If a capability set cannot be compressed into 32-bits for consumption by user space, the system call fails, with -ERANGE. FWIW libcap-2.00 supports this change (and earlier capability formats) http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/ [akpm@linux-foundation.org: coding-syle fixes] [akpm@linux-foundation.org: use get_task_comm()] [ezk@cs.sunysb.edu: build fix] [akpm@linux-foundation.org: do not initialise statics to 0 or NULL] [akpm@linux-foundation.org: unused var] [serue@us.ibm.com: export __cap_ symbols] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
David P. Quigley	4249259404	VFS/Security: Rework inode_getsecurity and callers to return resulting buffer This patch modifies the interface to inode_getsecurity to have the function return a buffer containing the security blob and its length via parameters instead of relying on the calling function to give it an appropriately sized buffer. Security blobs obtained with this function should be freed using the release_secctx LSM hook. This alleviates the problem of the caller having to guess a length and preallocate a buffer for this function allowing it to be used elsewhere for Labeled NFS. The patch also removed the unused err parameter. The conversion is similar to the one performed by Al Viro for the security_getprocattr hook. Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Acked-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:20 -08:00
Matt Mackall	20cecbae44	slob: reduce external fragmentation by using three free lists By putting smaller objects on their own list, we greatly reduce overall external fragmentation and increase repeatability. This reduces total SLOB overhead from > 50% to ~6% on a simple boot test. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Matt Mackall	679299b32d	slob: fix free block merging at head of subpage We weren't merging freed blocks at the beginning of the free list. Fixing this showed a 2.5% efficiency improvement in a userspace test harness. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Fengguang Wu	8bc3be2751	writeback: speed up writeback of big dirty files After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate that more io should be done. With it the big dirty file no longer has to wait for the next kupdate invokation 5s later. In sync_sb_inodes() we only set more_io on super_blocks we actually visited. This avoids the interaction between two pdflush deamons. Also in __sync_single_inode() we don't blindly keep requeuing the io if the filesystem cannot progress. Failing to do so may lead to 100% iowait. Tested-by: Mike Snitzer <snitzer@gmail.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Sam Ravnborg	a322f8ab66	mm: fix section mismatch warning in sparse.c Fix following warning: WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node() static sparse_early_usemap_alloc() were used only by sparse_init() and with sparse_init() annotated _init it is safe to annotate sparse_early_usemap_alloc with __init too. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Nick Piggin	0ed361dec3	mm: fix PageUptodate data race After running SetPageUptodate, preceeding stores to the page contents to actually bring it uptodate may not be ordered with the store to set the page uptodate. Therefore, another CPU which checks PageUptodate is true, then reads the page contents can get stale data. Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after PageUptodate. Many places that test PageUptodate, do so with the page locked, and this would be enough to ensure memory ordering in those places if SetPageUptodate were only called while the page is locked. Unfortunately that is not always the case for some filesystems, but it could be an idea for the future. Also bring the handling of anonymous page uptodateness in line with that of file backed page management, by marking anon pages as uptodate when they _are_ uptodate, rather than when our implementation requires that they be marked as such. Doing allows us to get rid of the smp_wmb's in the page copying functions, which were especially added for anonymous pages for an analogous memory ordering problem. Both file and anonymous pages are handled with the same barriers. FAQ: Q. Why not do this in flush_dcache_page? A. Firstly, flush_dcache_page handles only one side (the smb side) of the ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away memory barriers in a completely unrelated function is nasty; at least in the PageUptodate macros, they are located together with (half) the operations involved in the ordering. Thirdly, the smp_wmb is only required when first bringing the page uptodate, wheras flush_dcache_page should be called each time it is written to through the kernel mapping. It is logically the wrong place to put it. Q. Why does this increase my text size / reduce my performance / etc. A. Because it is adding the necessary instructions to eliminate the data-race. Q. Can it be improved? A. Yes, eg. if you were to create a rule that all SetPageUptodate operations run under the page lock, we could avoid the smp_rmb places where PageUptodate is queried under the page lock. Requires audit of all filesystems and at least some would need reworking. That's great you're interested, I'm eagerly awaiting your patches. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Shaohua Li	62e1c55300	page migraton: handle orphaned pages Orphaned page might have fs-private metadata, the page is truncated. As the page hasn't mapping, page migration refuse to migrate the page. It appears the page is only freed in page reclaim and if zone watermark is low, the page is never freed, as a result migration always fail. I thought we could free the metadata so such page can be freed in migration and make migration more reliable. [akpm@linux-foundation.org: go direct to try_to_free_buffers()] Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Masatake YAMATO	b5beb1caff	check ADVICE of fadvise64_64 even if get_xip_page is given I've written some test programs in ltp project. During writing I met an problem which I cannot solve in user land. So I wrote a patch for linux kernel. Please, include this patch if acceptable. The test program tests the 4th parameter of fadvise64_64: long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice); My test case calls fadvise64_64 with invalid advice value and checks errno is set to EINVAL. About the advice parameter man page says: ... Permissible values for advice include: POSIX_FADV_NORMAL ... POSIX_FADV_SEQUENTIAL ... POSIX_FADV_RANDOM ... POSIX_FADV_NOREUSE ... POSIX_FADV_WILLNEED ... POSIX_FADV_DONTNEED ... ERRORS ... EINVAL An invalid value was specified for advice. However, I got a bug report that the system call invocations in my test case returned 0 unexpectedly. I've inspected the kernel code: asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) { struct file file = fget(fd); struct address_space mapping; struct backing_dev_info bdi; loff_t endbyte; / inclusive / pgoff_t start_index; pgoff_t end_index; unsigned long nrpages; int ret = 0; if (!file) return -EBADF; if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) { ret = -ESPIPE; goto out; } mapping = file->f_mapping; if (!mapping \|\| len < 0) { ret = -EINVAL; goto out; } if (mapping->a_ops->get_xip_page) / no bad return value, but ignore advice */ goto out; ... out: fput(file); return ret; } I found the advice parameter is just ignored in the case mapping->a_ops->get_xip_page is given. This behavior is different from what is written on the man page. Is this o.k.? get_xip_page is given if CONFIG_EXT2_FS_XIP is true. Anyway I cannot find the easy way to detect get_xip_page field is given or CONFIG_EXT2_FS_XIP is true from the user space. I propose the following patch which checks the advice parameter even if get_xip_page is given. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Larry Woodman	e6f3602d2c	Include count of pagecache pages in show_mem() output The show_mem() output does not include the total number of pagecache pages. This would be helpful when analyzing the debug information in the /var/log/messages file after OOM kills occur. This patch includes the total pagecache pages in that output. Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Bjorn Steinbrink	a2b345642f	Fix dirty page accounting leak with ext3 data=journal In `46d2277c79` ("Clean up and make try_to_free_buffers() not race with dirty pages"), try_to_free_buffers was changed to bail out if the page was dirty. That in turn caused truncate_complete_page to leak massive amounts of memory, because the dirty bit was only cleared after the call to try_to_free_buffers. So the call to cancel_dirty_page was moved up to have the dirty bit cleared early in `3e67c0987d` ("truncate: clear page dirtiness before running try_to_free_buffers()"). The problem with that fix is, that the page can be redirtied after cancel_dirty_page was called, eg. like this: truncate_complete_page() cancel_dirty_page() // PG_dirty cleared, decr. dirty pages do_invalidatepage() ext3_invalidatepage() journal_invalidatepage() journal_unmap_buffer() __dispose_buffer() __journal_unfile_buffer() __journal_temp_unlink_buffer() mark_buffer_dirty(); // PG_dirty set, incr. dirty pages And then we end up with dirty pages being wrongly accounted. As a result, in `ecdfc9787f` ("Resurrect 'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers were reverted, so the original reason for the massive memory leak is gone, and we can also revert the move of the call to cancel_dirty_page from truncate_complete_page and get the accounting right again. I'm not sure if it matters, but opposed to the final check in __remove_from_page_cache, this one also cares about the task io accounting, so maybe we want to use this instead, although it's not quite the clean fix either. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl> Cc: Jan Kara <jack@ucw.cz> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Osterried <osterried@jesse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Qi Yong	ae1276b934	set_page_refcounted() VM_BUG_ON fix The current PageTail semantic is that a PageTail page is first a PageCompound page. So remove the redundant PageCompound test in set_page_refcounted(). Signed-off-by: Qi Yong <qiyong@fc-cn.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:19 -08:00
Harvey Harrison	920c7a5d0c	mm: remove fastcall from mm/ fastcall is always defined to be empty, remove it [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Andi Kleen	1e548deb5d	page allocator: remove unused arguments in zone_init_free_lists() Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Hugh Dickins	5a9bbdcd29	mm: don't waste swap on locked pages try_to_unmap always fails on a page found in a VM_LOCKED vma (unless migrating), and recycles it back to the active list. But if it's an anonymous page, we've already allocated swap to it: just wasting swap. Spot locked pages in page_referenced_one and treat them as referenced. Signed-off-by: Hugh Dickins <hugh@veritas.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ethan Solomita <solo@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	9eccf2a816	vmstat: remove prefetch Remove the prefetch logic in order to avoid touching impossible per cpu areas. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Bron Gondwana	195cf453d2	mm/page-writeback: highmem_is_dirtyable option Add vm.highmem_is_dirtyable toggle A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of approximately 2Gb size which contains a hash format that is written randomly by the dbclean process. On 2.6.16 this process took a few minutes. With lowmem only accounting of dirty ratios, this takes about 12 hours of 100% disk IO, all random writes. Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to add the highmem back to the total available memory count. [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build] Signed-off-by: Bron Gondwana <brong@fastmail.fm> Cc: Ethan Solomita <solo@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	3dfa5721f1	Page allocator: get rid of the list of cold pages We have repeatedly discussed if the cold pages still have a point. There is one way to join the two lists: Use a single list and put the cold pages at the end and the hot pages at the beginning. That way a single list can serve for both types of allocations. The discussion of the RFC for this and Mel's measurements indicate that there may not be too much of a point left to having separate lists for hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Robert Bragg	5dc3318528	mm: don't allow ioremapping of ranges larger than vmalloc space When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the vmlist search routine in __get_vm_area_node can mistakenly allow a driver to ioremap a range larger than vmalloc space. If at the time of the ioremap all existing vmlist areas sit below the determined alignment then the search routine continues past all entries and exits the for loop - straight into the found: label - without ever testing for integer wrapping or that the requested size fits. We were seeing a driver successfully ioremap 128M of flash even though there was only 120M of vmalloc space. From that point the system was left with the remainder of the first 16M of space to vmalloc/ioremap within. Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	a7f75e2586	vmstat: small revisions to refresh_cpu_vm_stats() 1. Add comments explaining how the function can be called. 2. Collect global diffs in a local array and only spill them once into the global counters when the zone scan is finished. This means that we only touch each global counter once instead of each time we fold cpu counters into zone counters. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Martin Schwidefsky	08e7d9b557	arch_rebalance_pgtables call In order to change the layout of the page tables after an mmap has crossed the adress space limit of the current page table layout a architecture hook in get_unmapped_area is needed. The arguments are the address of the new mapping and the length of it. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Benjamin Herrenschmidt	5e5419734c	add mm argument to pte/pmd/pud/pgd_free (with Martin Schwidefsky <schwidefsky@de.ibm.com>) The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as first argument. The free functions do not get the mm_struct argument. This is 1) asymmetrical and 2) to do mm related page table allocations the mm argument is needed on the free function as well. [kamalesh@linux.vnet.ibm.com: i386 fix] [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:18 -08:00
Christoph Lameter	9f8f217253	Page allocator: clean up pcp draining functions - Add comments explaing how drain_pages() works. - Eliminate useless functions - Rename drain_all_local_pages to drain_all_pages(). It does drain all pages not only those of the local processor. - Eliminate useless interrupt off / on sequences. drain_pages() disables interrupts on its own. The execution thread is pinned to processor by the caller. So there is no need to disable interrupts. - Put drain_all_pages() declaration in gfp.h and remove the declarations from suspend.h and from mm/memory_hotplug.c - Make software suspend call drain_all_pages(). The draining of processor local pages is may not the right approach if software suspend wants to support SMP. If they call drain_all_pages then we can make drain_pages() static. [akpm@linux-foundation.org: fix build] Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Nick Piggin	e2848a0efe	radix-tree: avoid atomic allocations for preloaded insertions Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Adrian Bunk	e31d9eb5c1	make __vmalloc_area_node() static __vmalloc_area_node() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Balbir Singh	625d9573d0	Remove unused code from mm/tiny-shmem.c This code in mm/tiny-shmem.c is under #if 0 - remove it. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Adrian Bunk	f61eaf9fc5	mm/page-writeback.c: make a function static task_dirty_limit() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Matt Mackall	1e88328111	maps4: make page monitoring /proc file optional Make /proc/ page monitoring configurable This puts the following files under an embedded config option: /proc/pid/clear_refs /proc/pid/smaps /proc/pid/pagemap /proc/kpagecount /proc/kpageflags [akpm@linux-foundation.org: Kconfig fix] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:17 -08:00
Matt Mackall	e6473092bd	maps4: introduce a generic page walker Introduce a general page table walker Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Matt Mackall	698dd4ba6b	maps4: move is_swap_pte Move is_swap_pte helper function to swapops.h for use by pagemap code Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Christoph Hellwig	61d5048f14	clean up vmtruncate vmtruncate is a twisted maze of gotos, this patch cleans it up to have a proper if else for the two major cases of extending and truncating truncate and thus makes it a lot more readable while keeping exactly the same functinality. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Hugh Dickins	1b1b32f2c6	tmpfs: fix shmem_swaplist races Intensive swapoff testing shows shmem_unuse spinning on an entry in shmem_swaplist pointing to itself: how does that come about? Days pass... First guess is this: shmem_delete_inode tests list_empty without taking the global mutex (so the swapping case doesn't slow down the common case); but there's an instant in shmem_unuse_inode's list_move_tail when the list entry may appear empty (a rare case, because it's actually moving the head not the the list member). So there's a danger of leaving the inode on the swaplist when it's freed, then reinitialized to point to itself when reused. Fix that by skipping the list_move_tail when it's a no-op, which happens to plug this. But this same spinning then surfaces on another machine. Ah, I'd never suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we still hold page lock, which would hold off inode deletion if the page were in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache doesn't wait on locked pages). Hmm: we could put the the inode on swaplist earlier, but then shmem_unuse_inode could never prune unswapped inodes. Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode; though I am a little uneasy about the iput which has to follow - it works, and I see nothing wrong with it, but it is surprising that shmem inode deletion may now occur below shmem_writepage. Revisit this fix later? And while we're looking at these races: the way shmem_unuse tests swapped without holding info->lock looks unsafe, if we've more than one swap area: a racing shmem_writepage on another page of the same inode could be putting it in swapcache, just as we're deciding to remove the inode from swaplist - there's a danger of going on swap without being listed, so a later swapoff would hang, being unable to locate the entry. Move that test and removal down into shmem_unuse_inode, once info->lock is held. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:16 -08:00
Hugh Dickins	b409f9fcf0	tmpfs: radix_tree_preloading Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache or swap cache, without any radix tree preload: so tending to deplete emergency reserves of memory. GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's being called under memory pressure, so must not wait for more memory to become available. But shmem_unuse_inode now has a window in which it can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its add_to_page_cache. shmem_getpage is not so straightforward: its filepage/swappage integrity relies upon exchanging between caches under spinlock, and it would need a lot of restructuring to place the preloads correctly. Instead, follow its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in add_to_page_cache, and begin each circuit of the repeat loop with a sleeping radix_tree_preload, followed immediately by radix_tree_preload_end - that won't guarantee success in the next add_to_page_cache, but doesn't need to. And we can then remove that bothersome congestion_wait: when needed, it'll automatically get done in the course of the radix_tree_preload. Signed-off-by: Hugh Dickins <hugh@veritas.com> Looks-good-to: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	2e0e26c76a	tmpfs: open a window in shmem_unuse_inode There are a couple of reasons (patches follow) why it would be good to open a window for sleep in shmem_unuse_inode, between its search for a matching swap entry, and its handling of the entry found. shmem_unuse_inode must then use igrab to hold the inode against deletion in that window, and its corresponding iput might result in deletion: so it had better unlock_page before the iput, and might as well release the page too. Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse into shmem_unuse_inode, in the case when it finds a match. Let try_to_unuse break on error in the shmem_unuse case, as it does in the unuse_mm case: though at this point in the series, no error to break on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	cb5f7b9a47	tmpfs: make shmem_unuse more preemptible shmem_unuse is at present an unbroken search through every swap vector page of every tmpfs file which might be swapped, all under shmem_swaplist_lock. This dates from long ago, when the caller held mmlist_lock over it all too: long gone, but there's never been much pressure for preemptible swapoff. Make it a little more preemptible, replacing shmem_swaplist_lock by shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a cond_resched_lock (on info->lock) at one convenient point in the shmem_unuse_inode loop, where it has no outstanding kmap_atomic. If we're serious about preemptible swapoff, there's much further to go e.g. I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM case dictate preemptiblility for other configs. But as in the earlier patch to make swapoff scan ptes preemptibly, my hidden agenda is really towards making memcgroups work, hardly about preemptibility at all. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	a0ee5ec520	tmpfs: allocate on read when stacked tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or size=0). But if a stacked filesystem such as unionfs gets pages from a sparse tmpfs file by reading holes, and then writes to them, it can easily exceed any such limit at present. So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed, pessimistically mark such pages as dirty, so they cannot get reclaimed and unaccounted by mistake. The venerable shmem_recalc_inode code (originally to account for the reclaim of clean pages) suffices to get the accounting right when swappages are dropped in favour of more uptodate filepages. This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage, caused by unionfs writing to a very sparse tmpfs file: to minimize memory allocation in swapout, tmpfs requires the swap vector be allocated upfront, which wasn't always happening in this stacked case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	d9fe526a83	tmpfs: allow filepage alongside swappage tmpfs has long allowed for a fresh filepage to be created in pagecache, just before shmem_getpage gets the chance to match it up with the swappage which already belongs to that offset. But unionfs_writepage now does a find_or_create_page, divorced from shmem_getpage, which leaves conflicting filepage and swappage outstanding indefinitely, when unionfs is over tmpfs. Therefore shmem_writepage (where a page is swizzled from file to swap) must now be on the lookout for existing swap, ready to free it in favour of the more uptodate filepage, instead of BUGging on that clash. And when the add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate filepage, otherwise swapoff would hang. Whereas when add_to_page_cache fails in shmem_getpage, it should retry in the same way it already does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	73b1262fa4	tmpfs: move swap swizzling into shmem move_to_swap_cache and move_from_swap_cache functions (which swizzle a page between tmpfs page cache and swap cache, to avoid page copying) are only used by shmem.c; and our subsequent fix for unionfs needs different treatments in the two instances of move_from_swap_cache. Move them from swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making add_to_swap_cache externally visible. shmem.c likes to say set_page_dirty where swap_state.c liked to say SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback makes moot (and implies we should lose that "shift page from clean_pages to dirty_pages list" comment: it's on neither). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	f000944d03	tmpfs: shuffle add_to_swap_caches add_to_swap_cache doesn't amount to much: merge it into its sole caller read_swap_cache_async. But we'll be needing to call __add_to_swap_cache from shmem.c, so promote it to the new add_to_swap_cache. Both were static, so there's no interface confusion to worry about. And lose that inappropriate "Anon pages are already on the LRU" comment in the merging: they're not already on the LRU, as Nick Piggin noticed. Signed-off-by: Hugh Dickins <hugh@veritas.com> No-problems-with: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	bb63be0a09	tmpfs: move swap_state stats update Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help fix, it's best to move the swap swizzling functions from swap_state.c to shmem.c. As a preliminary to that, move swap stats updating down into __add_to_swap_cache, which will remain internal to swap_state.c. Well, actually, just move down the incrementation of add_total: remove noent_race and exist_race completely, they are relics of my 2.4.11 testing. Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Michael Marineau	818db35992	tmpfs: fix mounts when size is less than the page size When tmpfs is mounted with a size less than one page, the number of blocks is set to 0 which makes the tmpfs mount unlimited. This can lead to a quick and surprising death if someone typos a tmpfs mount command and writes too much. tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0, as Documentation/filesystems/tmpfs.txt says. Hugh: do this by rounding size up instead of down in all cases: which slightly expands other odd-sized tmpfs mounts, but in a consistent way. Signed-off-by: Michael Marineau <mike@marineau.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Pavel Emelyanov	5b04c6890f	shmem: factor out sbi->free_inodes manipulations The shmem_sb_info structure has a number of free_inodes. This value is altered in appropriate places under spinlock and with the sbi->max_inodes != 0 check. Consolidate these manipulations into two helpers. This is minus 42 bytes of shmem.o and minus 4 :) lines of code. [akpm@linux-foundation.org: fix error return values] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	2e441889c3	swapoff: scan ptes preemptibly Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce latency in swapoff by scanning the page table preemptibly: so long as unuse_pte is careful to recheck that entry under pte lock. (To tell the truth, this patch was not inspired by any cries for lower latency here: rather, this restructuring permits a future memory controller patch to allocate with GFP_KERNEL in unuse_pte, where before it could not. But it would be wrong to tuck this change away inside a memcgroup patch.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	8952898b0d	swapin: fix valid_swaphandles defect valid_swaphandles is supposed to do a quick pass over the swap map entries neigbouring the entry which swapin_readahead is targetting, to determine for it a range worth reading all together. But since it always starts its search from the beginning of the swap "cluster", a reject (free entry) there immediately curtails the readaround, and every swapin_readahead from that cluster is for just a single page. Instead scan forwards and backwards around the target entry. Use better names for some variables: a swap_info pointer is usually called "si" not "swapdev". And at the end, if only the target page should be read, return count of 0 to disable readaround, to avoid the unnecessarily repeated call to read_swap_cache_async. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	5402b976ae	shmem_file_write is redundant With the old aops, writing to a tmpfs file had to use its own special method: the generic method would pass in a fresh page to prepare_write when the right page was there in swapcache - which was inefficient to handle, even once we'd concocted the code to handle it. With the new aops, the generic method uses shmem_write_end, which lets shmem_getpage find the right page: so now abandon shmem_file_write in favour of the generic method. Yes, that does do several things that tmpfs hasn't really needed (notably balance_dirty_pages_ratelimited, which ramfs also calls); but more use of common code is preferable. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	d3602444e1	shmem_getpage return page locked In the new aops, write_begin is supposed to return the page locked: though I've seen no ill effects, that's been overlooked in the case of shmem_write_begin, and should be fixed. Then shmem_write_end must unlock the page: do so _after_ updating i_size, as we found to be important in other filesystems (though since shmem pages don't go the usual writeback route, they never suffered from that corruption). For shmem_write_begin to return the page locked, we need shmem_getpage to return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's simplify the interface and return it locked even when SGP_READ. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:15 -08:00
Hugh Dickins	27d54b398e	shmem: SGP_QUICK and SGP_FAULT redundant Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no users now. Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and shmem_getpage is about to return with page always locked). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Hugh Dickins	02098feaa4	swapin needs gfp_mask for loop on tmpfs Building in a filesystem on a loop device on a tmpfs file can hang when swapping, the loop thread caught in that infamous throttle_vm_writeout. In theory this is a long standing problem, which I've either never seen in practice, or long ago suppressed the recollection, after discounting my load and my tmpfs size as unrealistically high. But now, with the new aops, it has become easy to hang on one machine. Loop used to grab_cache_page before the old prepare_write to tmpfs, which seems to have been enough to free up some memory for any swapin needed; but the new write_begin lets tmpfs find or allocate the page (much nicer, since grab_cache_page missed tmpfs pages in swapcache). When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which has __GFP_IO\|__GFP_FS stripped off, and throttle_vm_writeout is designed to break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in, read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the mapping_gfp_mask - hence the hang. So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to swapin_readahead to read_swap_cache_async to add_to_swap_cache. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Hugh Dickins	46017e9548	swapin_readahead: move and rearrange args swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c beside its kindred read_swap_cache_async. Why were its args in a different order? rearrange them. And since it was always followed by a read_swap_cache_async of the target page, fold that in and return struct page*. Then CONFIG_SWAP=n no longer needs valid_swaphandles and read_swap_cache_async stubs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Hugh Dickins	c4cc6d07b2	swapin_readahead: excise NUMA bogosity For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA code, advancing addr, and stepping on to the next vma at the boundary, to line up the mempolicy for each page allocation. It _might_ be a good idea to allocate swap more according to vma layout; but the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is allocated as needed for pages as they sink to the bottom of the inactive LRUs. Sometimes that may match vma layout, but not so often that it's worth going to these misleading vma->vm_next lengths: rip all that out. Originally I intended to retain the incrementation of addr, but correct its initial value: valid_swaphandles generally supplies an offset below the target addr (this is readaround rather than readahead), but addr has not been adjusted accordingly, so in the interleave case it has usually been allocating the target page from the "wrong" node (though that may not matter very much). But look at the equivalent shmem_swapin code: either by oversight or by design, though it has all the apparatus for choosing a new mempolicy per page, it uses the same idx throughout, choosing the same mempolicy and interleave node for each page of the cluster. Which is actually a much better strategy: each node has its own LRUs and its own kswapd, so if you're betting on any particular relationship between swap and node, the best bet is that nearby swap entries belong to pages from the same node - even when the mempolicy of the target page is to interleave. And examining a map of nodes corresponding to swap entries on a numa=fake system bears this out. (We could later tweak swap allocation to make it even more likely, but this patch is merely about removing cruft.) So, neither adjust nor increment addr in swapin_readahead, and then shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up once per cluster, and so few fields of pvma are used, let's skip the memset - from shmem_alloc_page also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	bf53d6f8fa	vmalloc: clean up page array indexing The page array is repeatedly indexed both in vunmap and vmalloc_area_node(). Add a temporary variable to make it easier to read (and easier to patch later). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	9e2779fa28	is_vmalloc_addr(): Check if an address is within the vmalloc boundaries Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	b3bdda02aa	vmalloc: add const to void* parameters Make vmalloc functions work the same way as kfree() and friends that take a const void * argument. [akpm@linux-foundation.org: fix consts, coding-style] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:14 -08:00
Christoph Lameter	48667e7a43	Move vmalloc_to_page() to mm/vmalloc. We already have page table manipulation for vmalloc in vmalloc.c. Move the vmalloc_to_page() function there as well. Move the definitions for vmalloc related functions in mm.h to a newly created section. A better place would be vmalloc.h but mm.h is basic and may depend on these functions. An alternative would be to include vmalloc.h in mm.h (like done for vmstat.h). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:13 -08:00
Christoph Lameter	eebd2aa355	Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user Simplify page cache zeroing of segments of pages through 3 functions zero_user_segments(page, start1, end1, start2, end2) Zeros two segments of the page. It takes the position where to start and end the zeroing which avoids length calculations and makes code clearer. zero_user_segment(page, start, end) Same for a single segment. zero_user(page, start, length) Length variant for the case where we know the length. We remove the zero_user_page macro. Issues: 1. Its a macro. Inline functions are preferable. 2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this special case everywhere makes the code needlessly complex. The parameter for zeroing is always KM_USER0 except in one single case that we open code. Avoiding KM_USER0 makes a lot of code not having to be dealing with the special casing for HIGHMEM anymore. Dealing with kmap is only necessary for HIGHMEM configurations. In those configurations we use KM_USER0 like we do for a series of other functions defined in highmem.h. Since KM_USER0 is depends on HIGHMEM the existing zero_user_page function could not be a macro. zero_user_* functions introduced here can be be inline because that constant is not used when these functions are called. Also extract the flushing of the caches to be outside of the kmap. [akpm@linux-foundation.org: fix nfs and ntfs build] [akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:13 -08:00
Oleg Nesterov	8a459e44ad	sys_remap_file_pages: fix ->vm_file accounting Fix ->vm_file accounting, mmap_region() may do do_munmap(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-05 09:44:07 -08:00
Linus Torvalds	2c8296f8cf	Merge branch 'slub-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm * 'slub-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm: Explain kmem_cache_cpu fields SLUB: Do not upset lockdep SLUB: Fix coding style violations Add parameter to add_partial to avoid having two functions SLUB: rename defrag to remote_node_defrag_ratio Move count_partial before kmem_cache_shrink SLUB: Fix sysfs refcounting slub: fix shadowed variable sparse warnings	2008-02-04 12:14:55 -08:00
root	ba84c73c7a	SLUB: Do not upset lockdep inconsistent {softirq-on-W} -> {in-softirq-W} usage. swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes: (&n->list_lock){-+..}, at: [<ffffffff802935c1>] add_partial+0x31/0xa0 {softirq-on-W} state was registered at: [<ffffffff80259fb8>] __lock_acquire+0x3e8/0x1140 [<ffffffff80259838>] debug_check_no_locks_freed+0x188/0x1a0 [<ffffffff8025ad65>] lock_acquire+0x55/0x70 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff805c76de>] _spin_lock+0x1e/0x30 [<ffffffff802935c1>] add_partial+0x31/0xa0 [<ffffffff80296f9c>] kmem_cache_open+0x1cc/0x330 [<ffffffff805c7984>] _spin_unlock_irq+0x24/0x30 [<ffffffff802974f4>] create_kmalloc_cache+0x64/0xf0 [<ffffffff80295640>] init_alloc_cpu_cpu+0x70/0x90 [<ffffffff8080ada5>] kmem_cache_init+0x65/0x1d0 [<ffffffff807f1b4e>] start_kernel+0x23e/0x350 [<ffffffff807f112d>] _sinittext+0x12d/0x140 [<ffffffffffffffff>] 0xffffffffffffffff This change isn't really necessary for correctness, but it prevents lockdep from getting upset and then disabling itself. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:02 -08:00
Pekka Enberg	064287807c	SLUB: Fix coding style violations This fixes most of the obvious coding style violations in mm/slub.c as reported by checkpatch. Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:02 -08:00
Christoph Lameter	7c2e132c54	Add parameter to add_partial to avoid having two functions Add a parameter to add_partial instead of having separate functions. The parameter allows a more detailed control of where the slab pages is placed in the partial queues. If we put slabs back to the front then they are likely immediately used for allocations. If they are put at the end then we can maximize the time that the partial slabs spent without being subject to allocations. When deactivating slab we can put the slabs that had remote objects freed (we can see that because objects were put on the freelist that requires locks) to them at the end of the list so that the cachelines of remote processors can cool down. Slabs that had objects from the local cpu freed to them (objects exist in the lockless freelist) are put in the front of the list to be reused ASAP in order to exploit the cache hot state of the local cpu. Patch seems to slightly improve tbench speed (1-2%). Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:02 -08:00
Christoph Lameter	9824601ead	SLUB: rename defrag to remote_node_defrag_ratio The NUMA defrag works by allocating objects from partial slabs on remote nodes. Rename it to remote_node_defrag_ratio to be clear about this. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:02 -08:00
Christoph Lameter	f61396aed9	Move count_partial before kmem_cache_shrink Move the counting function for objects in partial slabs so that it is placed before kmem_cache_shrink. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-02-04 10:56:01 -08:00
Christoph Lameter	151c602f79	SLUB: Fix sysfs refcounting If CONFIG_SYSFS is set then free the kmem_cache structure when sysfs tells us its okay. Otherwise there is the danger (as pointed out by Al Viro) that sysfs thinks the kobject still exists after kmem_cache_destroy() removed it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>	2008-02-04 10:56:01 -08:00
Harvey Harrison	e374d48356	slub: fix shadowed variable sparse warnings Introduce 'len' at outer level: mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one mm/slub.c:3393:6: originally declared here No need to declare new node: mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one mm/slub.c:3491:6: originally declared here No need to declare new x: mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one mm/slub.c:3492:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>	2008-02-04 10:56:01 -08:00
Linus Torvalds	f5bb3a5e9d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits) Jesper Juhl is the new trivial patches maintainer Documentation: mention email-clients.txt in SubmittingPatches fs/binfmt_elf.c: spello fix do_invalidatepage() comment typo fix Documentation/filesystems/porting fixes typo fixes in net/core/net_namespace.c typo fix in net/rfkill/rfkill.c typo fixes in net/sctp/sm_statefuns.c lib/: Spelling fixes kernel/: Spelling fixes include/scsi/: Spelling fixes include/linux/: Spelling fixes include/asm-m68knommu/: Spelling fixes include/asm-frv/: Spelling fixes fs/: Spelling fixes drivers/watchdog/: Spelling fixes drivers/video/: Spelling fixes drivers/ssb/: Spelling fixes drivers/serial/: Spelling fixes drivers/scsi/: Spelling fixes ...	2008-02-04 07:58:52 -08:00
Nick Piggin	2f98735c9c	vm audit: add VM_DONTEXPAND to mmap for drivers that need it Drivers that register a ->fault handler, but do not range-check the offset argument, must set VM_DONTEXPAND in the vm_flags in order to prevent an expanding mremap from overflowing the resource. I've audited the tree and attempted to fix these problems (usually by adding VM_DONTEXPAND where it is not obvious). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-04 07:55:38 -08:00
Fengguang Wu	28bc44d7d1	do_invalidatepage() comment typo fix Fix a typo in the comment for do_invalidatepage(). Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2008-02-03 18:04:10 +02:00
Nick Piggin	124d3b7041	fix writev regression: pan hanging unkillable and un-straceable Frederik Himpe reported an unkillable and un-straceable pan process. Zero length iovecs can go into an infinite loop in writev, because the iovec iterator does not always advance over them. The sequence required to trigger this is not trivial. I think it requires that a zero-length iovec be followed by a non-zero-length iovec which causes a pagefault in the atomic usercopy. This causes the writev code to drop back into single-segment copy mode, which then tries to copy the 0 bytes of the zero-length iovec; a zero length copy looks like a failure though, so it loops. Put a test into iov_iter_advance to catch zero-length iovecs. We could just put the test in the fallback path, but I feel it is more robust to skip over zero-length iovecs throughout the code (iovec iterator may be used in filesystems too, so it should be robust). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-03 07:55:39 +11:00
Linus Torvalds	75659ca0c1	Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits) Remove commented-out code copied from NFS NFS: Switch from intr mount option to TASK_KILLABLE Add wait_for_completion_killable Add wait_event_killable Add schedule_timeout_killable Use mutex_lock_killable in vfs_readdir Add mutex_lock_killable Use lock_page_killable Add lock_page_killable Add fatal_signal_pending Add TASK_WAKEKILL exit: Use task_is_* signal: Use task_is_* sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL ptrace: Use task_is_* power: Use task_is_* wait: Use TASK_NORMAL proc/base.c: Use task_is_* proc/array.c: Use TASK_REPORT perfmon: Use task_is_* ... Fixed up conflicts in NFS/sunrpc manually..	2008-02-01 11:45:47 +11:00
Andi Kleen	03252919b7	x86: print which shared library/executable faulted in segfault etc. messages v3 They now look like: hal-resmgr[13791]: segfault at 3c rip 2b9c8caec182 rsp 7fff1e825d30 error 4 in libacl.so.1.1.0[2b9c8caea000+6000] This makes it easier to pinpoint bugs to specific libraries. And printing the offset into a mapping also always allows to find the correct fault point in a library even with randomized mappings. Previously there was no way to actually find the correct code address inside the randomized mapping. Relies on earlier patch to shorten the printk formats. They are often now longer than 80 characters, but I think that's worth it. [includes fix from Eric Dumazet to check d_path error value] Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:33:18 +01:00
Nick Piggin	95c354fe9f	spinlock: lockbreak cleanup The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:31:20 +01:00
Jiri Kosina	c1d171a002	x86: randomize brk Randomize the location of the heap (brk) for i386 and x86_64. The range is randomized in the range starting at current brk location up to 0x02000000 offset for both architectures. This, together with pie-executable-randomization.patch and pie-executable-randomization-fix.patch, should make the address space randomization on i386 and x86_64 complete. Arjan says: This is known to break older versions of some emacs variants, whose dumper code assumed that the last variable declared in the program is equal to the start of the dynamically allocated memory region. (The dumper is the code where emacs effectively dumps core at the end of it's compilation stage; this coredump is then loaded as the main program during normal use) iirc this was 5 years or so; we found this way back when I was at RH and we first did the security stuff there (including this brk randomization). It wasn't all variants of emacs, and it got fixed as a result (I vaguely remember that emacs already had code to deal with it for other archs/oses, just ifdeffed wrongly). It's a rare and wrong assumption as a general thing, just on x86 it mostly happened to be true (but to be honest, it'll break too if gcc does something fancy or if the linker does a non-standard order). Still its something we should at least document. Note 2: afaik it only broke the emacs build. I'm not 100% sure about that (it IS 5 years ago) though. [ akpm@linux-foundation.org: deuglification ] Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:30:40 +01:00
Paul Mundt	d5f68c6dbd	sh: Bump number of quicklists for SH-5. Sync up with the SH definitions. Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2008-01-28 13:18:55 +09:00
Peter Zijlstra	fa717060f1	sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:27 +01:00
Gautham R Shenoy	95402b3829	cpu-hotplug: replace per-subsystem mutexes with get_online_cpus() This patch converts the known per-subsystem mutexes to get_online_cpus put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug notification events. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:02 +01:00
Linus Torvalds	b47711bfbc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6: selinux: make mls_compute_sid always polyinstantiate security/selinux: constify function pointer tables and fields security: add a secctx_to_secid() hook security: call security_file_permission from rw_verify_area security: remove security_sb_post_mountroot hook Security: remove security.h include from mm.h Security: remove security_file_mmap hook sparse-warnings (NULL as 0). Security: add get, set, and cloning of superblock security information security/selinux: Add missing "space"	2008-01-25 08:44:29 -08:00
Linus Torvalds	df8dc74e8a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6 This can be broken down into these major areas: - Documentation updates (language translations and fixes, as well as kobject and kset documenatation updates.) - major kset/kobject/ktype rework and fixes. This cleans up the kset and kobject and ktype relationship and architecture, making sense of things now, and good documenation and samples are provided for others to use. Also the attributes for kobjects are much easier to handle now. This cleaned up a LOT of code all through the kernel, making kobjects easier to use if you want to. - struct bus_type has been reworked to now handle the lifetime rules properly, as the kobject is properly dynamic. - struct driver has also been reworked, and now the lifetime issues are resolved. - the block subsystem has been converted to use struct device now, and not "raw" kobjects. This patch has been in the -mm tree for over a year now, and finally all the issues are worked out with it. Older distros now properly work with new kernels, and no userspace updates are needed at all. - nozomi driver is added. This has also been in -mm for a long time, and many people have asked for it to go in. It is now in good enough shape to do so. - lots of class_device conversions to use struct device instead. The tree is almost all cleaned up now, only SCSI and IB is the remaining code to fix up... * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (196 commits) Driver core: coding style fixes Kobject: fix coding style issues in kobject c files Kobject: fix coding style issues in kobject.h Driver core: fix coding style issues in device.h spi: use class iteration api scsi: use class iteration api rtc: use class iteration api power supply : use class iteration api ieee1394: use class iteration api Driver Core: add class iteration api Driver core: Cleanup get_device_parent() in device_add() and device_move() UIO: constify function pointer tables Driver Core: constify the name passed to platform_device_register_simple driver core: fix build with SYSFS=n sysfs: make SYSFS_DEPRECATED depend on SYSFS Driver core: use LIST_HEAD instead of call to INIT_LIST_HEAD in __init kobject: add sample code for how to use ksets/ktypes/kobjects kobject: add sample code for how to use kobjects in a simple manner. kobject: update the kobject/kset documentation kobject: remove old, outdated documentation. ...	2008-01-25 08:35:13 -08:00
Pekka Enberg	556a169dab	slab: fix bootstrap on memoryless node If the node we're booting on doesn't have memory, bootstrapping kmalloc() caches resorts to fallback_alloc() which requires ->nodelists set for all nodes. Fix that by calling set_up_list3s() for CACHE_CACHE in kmem_cache_init(). As kmem_getpages() is called with GFP_THISNODE set, this used to work before because of breakage in 2.6.22 and before with GFP_THISNODE returning pages from the wrong node if a node had no memory. So it may have worked accidentally and in an unsafe manner because the pages would have been associated with the wrong node which could trigger bug ons and locking troubles. Tested-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> [ With additional one-liner by Olaf Hering - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-25 08:30:36 -08:00
Greg Kroah-Hartman	1eada11c88	Kobject: convert mm/slub.c to use kobject_init/add_ng() This converts the code to use the new kobject functions, cleaning up the logic in doing so. Cc: Christoph Lameter <clameter@sgi.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:31 -08:00
Greg Kroah-Hartman	0ff21e4663	kobject: convert kernel_kset to be a kobject kernel_kset does not need to be a kset, but a much simpler kobject now that we have kobj_attributes. We also rename kernel_kset to kernel_kobj to catch all users of this symbol with a build error instead of an easy-to-ignore build warning. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:24 -08:00
Greg Kroah-Hartman	081248de0a	kset: move /sys/slab to /sys/kernel/slab /sys/kernel is where these things should go. Also updated the documentation and tool that used this directory. Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:16 -08:00
Greg Kroah-Hartman	27c3a314d5	kset: convert slub to use kset_create Dynamically create the kset instead of declaring it statically. Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:15 -08:00
Greg Kroah-Hartman	3514faca19	kobject: remove struct kobj_type from struct kset We don't need a "default" ktype for a kset. We should set this explicitly every time for each kset. This change is needed so that we can make ksets dynamic, and cleans up one of the odd, undocumented assumption that the kset/kobject/ktype model has. This patch is based on a lot of help from Kay Sievers. Nasty bug in the block code was found by Dave Young <hidave.darkstar@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:10 -08:00
Richard Knutsson	88c3f7a8f2	Security: remove security_file_mmap hook sparse-warnings (NULL as 0). Fixing: CHECK mm/mmap.c mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer mm/mmap.c:1623:29: warning: Using plain integer as NULL pointer mm/mmap.c:1944:29: warning: Using plain integer as NULL pointer Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se> Signed-off-by: James Morris <jmorris@namei.org>	2008-01-25 11:29:48 +11:00
Mel Gorman	9c09a95cf4	slab: partially revert list3 changes Partial revert the changes made by `04231b3002` to the kmem_list3 management. On a machine with a memoryless node, this BUG_ON was triggering static void ____cache_alloc_node(struct kmem_cache cachep, gfp_t flags, int nodeid) { struct list_head entry; struct slab slabp; struct kmem_list3 l3; void obj; int x; l3 = cachep->nodelists[nodeid]; BUG_ON(!l3); Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-24 08:07:27 -08:00
Larry Woodman	c5c99429fa	fix hugepages leak due to pagetable page sharing The shared page table code for hugetlb memory on x86 and x86_64 is causing a leak. When a user of hugepages exits using this code the system leaks some of the hugepages. ------------------------------------------------------- Part of /proc/meminfo just before database startup: HugePages_Total: 5500 HugePages_Free: 5500 HugePages_Rsvd: 0 Hugepagesize: 2048 kB Just before shutdown: HugePages_Total: 5500 HugePages_Free: 4475 HugePages_Rsvd: 0 Hugepagesize: 2048 kB After shutdown: HugePages_Total: 5500 HugePages_Free: 4988 HugePages_Rsvd: 0 Hugepagesize: 2048 kB ---------------------------------------------------------- The problem occurs durring a fork, in copy_hugetlb_page_range(). It locates the dst_pte using huge_pte_alloc(). Since huge_pte_alloc() calls huge_pmd_share() it will share the pmd page if can, yet the main loop in copy_hugetlb_page_range() does a get_page() on every hugepage. This is a violation of the shared hugepmd pagetable protocol and creates additional referenced to the hugepages causing a leak when the unmap of the VMA occurs. We can skip the entire replication of the ptes when the hugepage pagetables are shared. The attached patch skips copying the ptes and the get_page() calls if the hugetlbpage pagetable is shared. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-24 08:07:27 -08:00
Anton Salikhmetov	8f7b3d156d	Update ctime and mtime for memory-mapped files Update ctime and mtime for memory-mapped files at a write access on a present, read-only PTE, as well as at a write on a non-present PTE. Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-23 09:58:55 -08:00
Carsten Otte	9723198c21	#ifdef very expensive debug check in page fault path This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page that verifies that a pfn is valid. This patch increases performance of the page fault microbenchmark in lmbench by 13% and overall dbench performance by 7% on s390x. pfn_valid() is an expensive operation on s390 that needs a high double digit amount of CPU cycles. Nick Piggin suggested that pfn_valid() involves an array lookup on systems with sparsemem, and therefore is an expensive operation there too. The check looks like a clear debug thing to me, it should never trigger on regular kernels. And if a pte is created for an invalid pfn, we'll find out once the memory gets accessed later on anyway. Please consider inclusion of this patch into mm. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-17 15:38:59 -08:00
Sam Ravnborg	1d6f4e60e7	mm: fix section mismatch warning in page_alloc.c With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw following warning: WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback') The culprit was zone_batchsize() which were annotated __devinit but used from process_zones() which is annotated __cpuinit. zone_batchsize() are used from another function annotated __meminit so the only valid option is to drop the annotation of zone_batchsize() so we know it is always valid to use it. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-17 15:38:58 -08:00
Linus Torvalds	c23f72cae9	Revert "writeback: introduce writeback_control.more_io to indicate more io" This reverts commit `2e6883bdf4`, as requested by Fengguang Wu. It's not quite fully baked yet, and while there are patches around to fix the problems it caused, they should get more testing. Says Fengguang: "I'll resend them both for -mm later on, in a more complete patchset". See http://bugzilla.kernel.org/show_bug.cgi?id=9738 for some of this discussion. Requested-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 21:21:29 -08:00
Ken Chen	68842c9b94	hugetlbfs: fix quota leak In the error path of both shared and private hugetlb page allocation, the file system quota is never undone, leading to fs quota leak. Fix them up. [akpm@linux-foundation.org: cleanup, micro-optimise] Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 08:52:23 -08:00
Christoph Lameter	96990a4ae9	quicklists: Only consider memory that can be used with GFP_KERNEL Quicklists calculates the size of the quicklists based on the number of free pages. This must be the number of free pages that can be allocated with GFP_KERNEL. node_page_state() includes the pages in ZONE_HIGHMEM and ZONE_MOVABLE which may lead the quicklists to become too large causing OOM. Signed-off-by: Christoph Lameter <clameter@sgi.com> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-14 08:52:22 -08:00
Thomas Bogendoerfer	467bc461d2	Fix crash with FLAT_MEMORY and ARCH_PFN_OFFSET != 0 When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in memmap_init_zone(). This bug got introduced by commit `c713216dee` Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:36 -08:00
Akinobu Mita	c51b1a160b	xip: fix get_zeroed_page with __GFP_HIGHMEM The use of get_zeroed_page() with __GFP_HIGHMEM is invalid. Use alloc_page() with __GFP_ZERO instead of invalid get_zeroed_page(). (This patch is only compile tested) Cc: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:36 -08:00
Linus Torvalds	158a962422	Unify /proc/slabinfo configuration Both SLUB and SLAB really did almost exactly the same thing for /proc/slabinfo setup, using duplicate code and per-allocator #ifdef's. This just creates a common CONFIG_SLABINFO that is enabled by both SLUB and SLAB, and shares all the setup code. Maybe SLOB will want this some day too. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-02 13:04:48 -08:00
Pekka J Enberg	57ed3eda97	slub: provide /proc/slabinfo This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work. [ mingo@elte.hu: build fix. ] Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-01 11:32:02 -08:00
Christoph Lameter	76be895001	SLUB: Improve hackbench speed Increase the mininum number of partial slabs to keep around and put partial slabs to the end of the partial queue so that they can add more objects. Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-21 15:51:07 -08:00
Linus Torvalds	3a6927906f	Do dirty page accounting when removing a page from the page cache Krzysztof Oledzki noticed a dirty page accounting leak on some of his machines, causing the machine to eventually lock up when the kernel decided that there was too much dirty data, but nobody could actually write anything out to fix it. The culprit turns out to be filesystems (cough ext3 with data=journal cough) that re-dirty the page when the "->invalidatepage()" callback is called. Fix it up by doing a final dirty page accounting check when we actually remove the page from the page cache. This fixes bugzilla entry 9182: http://bugzilla.kernel.org/show_bug.cgi?id=9182 Tested-by: Ingo Molnar <mingo@elte.hu> Tested-by: Krzysztof Oledzki <olel@ans.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-19 14:05:13 -08:00
Christoph Lameter	3811dbf671	SLUB: remove useless masking of GFP_ZERO Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already masked out in new_slab() (See how it calls allocate_slab). No need to do it twice. This reverts the SLUB parts of `7fd272550b`. Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Nishanth Aravamudan	368d2c6358	Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" This reverts commit `54f9f80d65` ("hugetlb: Add hugetlb_dynamic_pool sysctl") Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool sysctl is not needed, as its semantics can be expressed by 0 in the overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl (pool enabled). (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change) Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Nishanth Aravamudan	d1c3fb1f8f	hugetlb: introduce nr_overcommit_hugepages sysctl hugetlb: introduce nr_overcommit_hugepages sysctl While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I became convinced that having a boolean sysctl was insufficient: 1) To support per-node control of hugepages, I have previously submitted patches to add a sysfs attribute related to nr_hugepages. However, with a boolean global value and per-mount quota enforcement constraining the dynamic pool, adding corresponding control of the dynamic pool on a per-node basis seems inconsistent to me. 2) Administration of the hugetlb dynamic pool with multiple hugetlbfs mount points is, arguably, more arduous than it needs to be. Each quota would need to be set separately, and the sum would need to be monitored. To ease the administration, and to help make the way for per-node control of the static & dynamic hugepage pool, I added a separate sysctl, nr_overcommit_hugepages. This value serves as a high watermark for the overall hugepage pool, while nr_hugepages serves as a low watermark. The boolean sysctl can then be removed, as the condition nr_overcommit_hugepages > 0 indicates the same administrative setting as hugetlb_dynamic_pool == 1 Quotas still serve as local enforcement of the size of the pool on a per-mount basis. A few caveats: 1) There is a race whereby the global surplus huge page counter is incremented before a hugepage has allocated. Another process could then try grow the pool, and fail to convert a surplus huge page to a normal huge page and instead allocate a fresh huge page. I believe this is benign, as no memory is leaked (the actual pages are still tracked correctly) and the counters won't go out of sync. 2) Shrinking the static pool while a surplus is in effect will allow the number of surplus huge pages to exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. Successfully tested on x86_64 with the current libhugetlbfs snapshot, modified to use the new sysctl. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:17 -08:00
Mel Gorman	81eabcbe0b	mm: fix page allocation for larger I/O segments In some cases the IO subsystem is able to merge requests if the pages are adjacent in physical memory. This was achieved in the allocator by having expand() return pages in physically contiguous order in situations were a large buddy was split. However, list-based anti-fragmentation changed the order pages were returned in to avoid searching in buffered_rmqueue() for a page of the appropriate migrate type. This patch restores behaviour of rmqueue_bulk() preserving the physical order of pages returned by the allocator without incurring increased search costs for anti-fragmentation. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mark Lord <mlord@pobox.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
WANG Cong	bbd0682596	mm/sparse.c: improve the error handling for sparse_add_one_section() Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I see no reason to check 'usemap' until holding the 'pgdat_resize_lock'. [geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST] Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
WANG Cong	af0cd5a7c3	mm/sparse.c: check the return value of sparse_index_alloc() Since sparse_index_alloc() can return NULL on memory allocation failure, we must deal with the failure condition when calling it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Geoff Levand	a5ee6daa52	sparsemem: make SPARSEMEM_VMEMMAP selectable SPARSEMEM_VMEMMAP needs to be a selectable config option to support building the kernel both with and without sparsemem vmemmap support. This selection is desirable for platforms which could be configured one way for platform specific builds and the other for multi-platform builds. Signed-off-by: Miguel Botón <mboton@gmail.com> Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-17 19:28:16 -08:00
Adam Litke	72fad7139b	hugetlb: handle write-protection faults in follow_hugetlb_page The follow_hugetlb_page() fix I posted (merged as git commit `5b23dbe817`) missed one case. If the pte is present, but not writable and write access is requested by the caller to get_user_pages(), the code will do the wrong thing. Rather than calling hugetlb_fault to make the pte writable, it notes the presence of the pte and continues. This simple one-liner makes sure we also fault on the pte for this case. Please apply. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Dave Kleikamp <shaggy@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-10 19:43:55 -08:00
Linus Torvalds	7fd272550b	Avoid double memclear() in SLOB/SLUB Both slob and slub react to __GFP_ZERO by clearing the allocation, which means that passing the GFP_ZERO bit down to the page allocator is just wasteful and pointless. Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-09 10:17:52 -08:00
Matthew Wilcox	0b94e97a25	Use lock_page_killable Replacing lock_page with lock_page_killable in do_generic_mapping_read() allows us to kill `cat' of a file on an NFS-mounted filesystem Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-06 17:35:48 -05:00
Matthew Wilcox	2687a3569e	Add lock_page_killable This routine is like lock_page, but can be interrupted by a fatal signal Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-06 17:35:41 -05:00
Linus Torvalds	ad658cec23	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6: VM/Security: add security hook to do_brk Security: round mmap hint address above mmap_min_addr security: protect from stack expantion into low vm addresses Security: allow capable check to permit mmap or low vm space SELinux: detect dead booleans SELinux: do not clear f_op when removing entries	2007-12-05 09:26:52 -08:00
Eric Paris	ecaf18c15a	VM/Security: add security hook to do_brk Given a specifically crafted binary do_brk() can be used to get low pages available in userspace virtual memory and can thus be used to circumvent the mmap_min_addr low memory protection. Add security checks in do_brk(). Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:21 -08:00
Vegard Nossum	294a80a8ed	SLUB's ksize() fails for size > 2048 I can't pass memory allocated by kmalloc() to ksize() if it is allocated by SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2. The error of ksize() seems to be that it does not check if the allocation was made by SLUB or the page allocator. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:20 -08:00
Nick Piggin	369b8f5a70	mm: fix XIP file writes Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:20 -08:00
Tetsuo Handa	f8fcc93319	Add EXPORT_SYMBOL(ksize); mm/slub.c exports ksize(), but mm/slob.c and mm/slab.c don't. It's used by binfmt_flat, which can be built as a module. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:18 -08:00
Denis Cheng	4b01a0b161	mm/backing-dev.c: fix percpu_counter_destroy call bug in bdi_init this call should use the array index j, not i. But with this approach, just one int i is enough, int j is not needed. Signed-off-by: Denis Cheng <crquan@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-12-05 09:21:18 -08:00
Eric Paris	5a211a5dea	VM/Security: add security hook to do_brk Given a specifically crafted binary do_brk() can be used to get low pages available in userspace virtually memory and can thus be used to circumvent the mmap_min_addr low memory protection. Add security checks in do_brk(). Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:25:30 +11:00
Eric Paris	7cd94146cd	Security: round mmap hint address above mmap_min_addr If mmap_min_addr is set and a process attempts to mmap (not fixed) with a non-null hint address less than mmap_min_addr the mapping will fail the security checks. Since this is just a hint address this patch will round such a hint address above mmap_min_addr. gcj was found to try to be very frugal with vm usage and give hint addresses in the 8k-32k range. Without this patch all such programs failed and with the patch they happily get a higher address. This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist without it and there would be no security check possible no matter what. So we should not bother compiling in this rounding if it is just a waste of time. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:25:10 +11:00
Eric Paris	8869477a49	security: protect from stack expantion into low vm addresses Add security checks to make sure we are not attempting to expand the stack into memory protected by mmap_min_addr Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2007-12-06 00:24:48 +11:00
Matthew Wilcox	e34f44b351	pool: Improve memory usage for devices which can't cross boundaries The previous implementation simply refused to allocate more than a boundary's worth of data from an entire page. Some users didn't know this, so specified things like SMP_CACHE_BYTES, not realising the horrible waste of memory that this was. It's fairly easy to correct this problem, just by ensuring we don't cross a boundary within a page. This even helps drivers like EHCI (which can't cross a 4k boundary) on machines with larger page sizes. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: David S. Miller <davem@davemloft.net>	2007-12-04 10:39:58 -05:00
Matthew Wilcox	a35a345514	Change dmapool free block management Use a list of free blocks within a page instead of using a bitmap. Update documentation to reflect this. As well as being a slight reduction in memory allocation, locked ops and lines of code, it speeds up a transaction processing benchmark by 0.4%. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-04 10:39:57 -05:00
Matthew Wilcox	6182a0943a	dmapool: Tidy up includes and add comments We were missing a copyright statement and license, so add GPLv2, David Brownell's copyright and my copyright. The asm/io.h include was superfluous, but we were missing a few other necessary includes. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-04 10:39:57 -05:00
Matthew Wilcox	399154be2d	dmapool: Validate parameters to dma_pool_create Check that 'align' is a power of two, like the API specifies. Align 'size' to 'align' correctly -- the current code has an off-by-one. The ALIGN macro in kernel.h doesn't. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: David S. Miller <davem@davemloft.net>	2007-12-04 10:39:56 -05:00
Matthew Wilcox	2cae367e48	Avoid taking waitqueue lock in dmapool With one trivial change (taking the lock slightly earlier on wakeup from schedule), all uses of the waitq are under the pool lock, so we can use the locked (or __) versions of the wait queue functions, and avoid the extra spinlock. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: David S. Miller <davem@davemloft.net>	2007-12-04 10:39:56 -05:00
Matthew Wilcox	e87aa77374	dmapool: Fix style problems Run Lindent and fix all issues reported by checkpatch.pl Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-04 10:39:55 -05:00
Matthew Wilcox	141e9d4b54	Move dmapool.c to mm/ directory Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-04 10:39:54 -05:00
Matthew Wilcox	80cbd911ca	Fix kmem_cache_free performance regression in slab The database performance group have found that half the cycles spent in kmem_cache_free are spent in this one call to BUG_ON. Moving it into the CONFIG_SLAB_DEBUG-only function cache_free_debugcheck() is a performance win of almost 0.5% on their particular benchmark. The call was added as part of commit `ddc2e812d5` with the comment that "overhead should be minimal". It may have been minimal at the time, but it isn't now. [ Quoth Pekka Enberg: "I don't think the BUG_ON per se caused the performance regression but rather the virt_to_head_page() changes to virt_to_cache() that were added later." ] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-30 08:08:05 -08:00
KAMEZAWA Hiroyuki	e0dc3a53de	memory hotplug fix: fix section mismatch in vmammap_allock_block() Fixes section mismatch below. WARNING: vmlinux.o(.text+0x946b5): Section mismatch: reference to .init.text:' __alloc_bootmem_node (between 'vmemmap_alloc_block' and 'vmemmap_pgd_populate') Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-29 09:24:54 -08:00
Mel Gorman	ba72cb8cb0	Fix boot problem with iSeries lacking hugepage support Ordinarily the size of a pageblock is determined at compile-time based on the hugepage size. On PPC64, the hugepage size is determined at runtime based on what is supported by the machine. With legacy machines such as iSeries that do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order being set to -PAGE_SHIFT and a crash results shortly afterwards. This patch adds a function to select a sensible value for pageblock order by default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT is a sensible value before using the hugepage size; if it is not MAX_ORDER-1 is used. This is a fix for 2.6.24. Credit goes to Stephen Rothwell for identifying the bug and testing candidate patches. Additional credit goes to Andy Whitcroft for spotting a problem with respects to IA-64 before releasing. Additional credit to David Gibson for testing with the libhugetlbfs test suite. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-29 09:24:51 -08:00
Hugh Dickins	09f345da75	prep_zero_page: remove bogus BUG_ON 2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait" assertion. Presumably added under the misconception that clear_highpage uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-28 11:04:28 -08:00
Hugh Dickins	e84e2e132c	tmpfs: restore missing clear_highpage tmpfs was misconverted to __GFP_ZERO in 2.6.11. There's an unusual case in which shmem_getpage receives the page from its caller instead of allocating. We must cover this case by clear_highpage before SetPageUptodate, as before. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-28 11:04:28 -08:00
Christian Borntraeger	ce7e9fae8d	[S390] Optimize storage key handling for anonymous pages page_mkclean used to call page_clear_dirty for every given page. This is different to all other architectures, where the dirty bit in the PTEs is only resetted, if page_mapping() returns a non-NULL pointer. We can move the page_test_dirty/page_clear_dirty sequence into the 2nd if to avoid unnecessary iske/sske sequences, which are expensive. This change also helps kvm for s390 as the host must transfer the dirty bit into the guest status bits. By moving the page_clear_dirty operation into the 2nd if, the vm will only call page_clear_dirty for pages where it walks the mapping anyway. There it calls ptep_clear_flush for writable ptes, so we can transfer the dirty bit to the guest. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2007-11-20 11:13:46 +01:00
Linus Torvalds	8c0863403f	dirty page balancing: Get rid of broken unmapped_ratio logic This code harks back to the days when we didn't count dirty mapped pages, which led us to try to balance the number of dirty unmapped pages by how much unmapped memory there was in the system. That makes no sense any more, since now the dirty counts include the mapped pages. Not to mention that the math doesn't work with HIGHMEM machines anyway, and causes the unmapped_ratio to potentially turn negative (which we do catch thanks to clamping it at a minimum value, but I mention that as an indication of how broken the code is). The code also was written at a time when the default dirty ratio was much larger, and the unmapped_ratio logic effectively capped that large dirty ratio a bit. Again, we've since lowered the dirty ratio rather aggressively, further lessening the point of that code. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-15 16:41:52 -08:00
Nick Piggin	d32ddd8f20	slob: fix memory corruption Previously, it would be possible for prev->next to point to &free_slob_pages, and thus we would try to move a list onto itself, and bad things would happen. It seems a bit hairy to be doing list operations with the list marker as an entry, rather than a head, but... this resolves the following crash: http://bugzilla.kernel.org/show_bug.cgi?id=9379 Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-15 08:36:27 -08:00
Balbir Singh	20a1022d4a	Swap delay accounting, include lock_page() delays The delay incurred in lock_page() should also be accounted in swap delay accounting Reported-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:44 -08:00
Randy Dunlap	42614fcde7	vmstat: fix section mismatch warning Mark start_cpu_timer() as __cpuinit instead of __devinit. Fixes this section warning: WARNING: vmlinux.o(.text+0x60e53): Section mismatch: reference to .init.text:start_cpu_timer (between 'vmstat_cpuup_callback' and 'vmstat_show') Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:42 -08:00
Adrian Bunk	be21f0ab0d	fix mm/util.c:krealloc() Commit `ef8b4520bd` added one NULL check for "p" in krealloc(), but that doesn't seem to be enough since there doesn't seem to be any guarantee that memcpy(ret, NULL, 0) works (spotted by the Coverity checker). For making it clearer what happens this patch also removes the pointless min(). Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:41 -08:00
Ken Chen	45c682a68a	hugetlb: fix i_blocks accounting For administrative purpose, we want to query actual block usage for hugetlbfs file via fstat. Currently, hugetlbfs always return 0. Fix that up since kernel already has all the information to track it properly. Signed-off-by: Ken Chen <kenchen@google.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adrian Bunk	8cde045c7e	mm/hugetlb.c: make a function static return_unused_surplus_pages() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	90d8b7e612	hugetlb: enforce quotas during reservation for shared mappings When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved to guarantee no problems will occur later when instantiating pages. If quotas are in force, page instantiation could fail due to a race with another process or an oversized (but approved) shared mapping. To prevent these scenarios, debit the quota for the full reservation amount up front and credit the unused quota when the reservation is released. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	9a119c056d	hugetlb: allow bulk updating in hugetlb_*_quota() Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to allow bulk updating of the sbinfo->free_blocks counter. This will be used by the next patch in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	2fc39cec6a	hugetlb: debit quota in alloc_huge_page Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota() seem out of place. The alloc/free API is unbalanced because we handle the hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota(). Move the get inside alloc_huge_page to clean up this disparity. This patch has been kept apart from the previous patch because of the somewhat dodgy ERR_PTR() use herein. Moving the quota logic means that alloc_huge_page() has two failure modes. Quota failure must result in a SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR() doesn't like the small positive errnos we have in VM_FAULT_* so they must be negated before they are used. Does anyone take issue with the way I am using PTR_ERR. If so, what are your thoughts on how to clean this up (without needing an if,else if,else block at each alloc_huge_page() callsite)? Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	c79fb75e5a	hugetlb: fix quota management for private mappings The hugetlbfs quota management system was never taught to handle MAP_PRIVATE mappings when that support was added. Currently, quota is debited at page instantiation and credited at file truncation. This approach works correctly for shared pages but is incomplete for private pages. In addition to hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but this function does not respect quotas. Private huge pages are treated very much like normal, anonymous pages. They are not "backed" by the hugetlbfs file and are not stored in the mapping's radix tree. This means that private pages are invisible to truncate_hugepages() so that function will not credit the quota. This patch (based on a prototype provided by Ken Chen) moves quota crediting for all pages into free_huge_page(). page->private is used to store a pointer to the mapping to which this page belongs. This is used to credit quota on the appropriate hugetlbfs instance. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adam Litke	348ea204cc	hugetlb: split alloc_huge_page into private and shared components Hugetlbfs implements a quota system which can limit the amount of memory that can be used by the filesystem. Before allocating a new huge page for a file, the quota is checked and debited. The quota is then credited when truncating the file. I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED mappings. Before detailing the problems and my proposed solutions, we should agree on a definition of quotas that properly addresses both private and shared pages. Since the purpose of quotas is to limit total memory consumption on a per-filesystem basis, I argue that all pages allocated by the fs (private and shared) should be charged against quota. Private Mappings ================ The current code will debit quota for private pages sometimes, but will never credit it. At a minimum, this causes a leak in the quota accounting which renders the accounting essentially useless as it is. Shared pages have a one to one mapping with a hugetlbfs file and are easy to account by debiting on allocation and crediting on truncate. Private pages are anonymous in nature and have a many to one relationship with their hugetlbfs files (due to copy on write). Because private pages are not indexed by the mapping's radix tree, thier quota cannot be credited at file truncation time. Crediting must be done when the page is unmapped and freed. Shared Pages ============ I discovered an issue concerning the interaction between the MAP_SHARED reservation system and quotas. Since quota is not checked until page instantiation, an over-quota mmap/reservation will initially succeed. When instantiating the first over-quota page, the program will receive SIGBUS. This is inconsistent since the reservation is supposed to be a guarantee. The solution is to debit the full amount of quota at reservation time and credit the unused portion when the reservation is released. This patch series brings quotas back in line by making the following modifications: * Private pages - Debit quota in alloc_huge_page() - Credit quota in free_huge_page() * Shared pages - Debit quota for entire reservation at mmap time - Credit quota for instantiated pages in free_huge_page() - Credit quota for unused reservation at munmap time This patch: The shared page reservation and dynamic pool resizing features have made the allocation of private vs. shared huge pages quite different. By splitting out the private/shared-specific portions of the process into their own functions, readability is greatly improved. alloc_huge_page now calls the proper helper and performs common operations. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Adam Litke	5b23dbe817	hugetlb: follow_hugetlb_page() for write access When calling get_user_pages(), a write flag is passed in by the caller to indicate if write access is required on the faulted-in pages. Currently, follow_hugetlb_page() ignores this flag and always faults pages for read-only access. This can cause data corruption because a device driver that calls get_user_pages() with write set will not expect COW faults to occur on the returned pages. This patch passes the write flag down to follow_hugetlb_page() and makes sure hugetlb_fault() is called with the right write_access parameter. [ezk@cs.sunysb.edu: build fix] Signed-off-by: Adam Litke <agl@us.ibm.com> Reviewed-by: Ken Chen <kenchen@google.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Yasunori Goto	887c3cb188	Add IORESOUCE_BUSY flag for System RAM i386 and x86-64 registers System RAM as IORESOURCE_MEM \| IORESOURCE_BUSY. But ia64 registers it as IORESOURCE_MEM only. In addition, memory hotplug code registers new memory as IORESOURCE_MEM too. This difference causes a failure of memory unplug of x86-64. This patch fixes it. This patch adds IORESOURCE_BUSY to avoid potential overlap mapping by PCI device. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Luck, Tony" <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Peter Zijlstra	5fce25a9df	mm: speed up writeback ramp-up on clean systems We allow violation of bdi limits if there is a lot of room on the system. Once we hit half the total limit we start enforcing bdi limits and bdi ramp-up should happen. Doing it this way avoids many small writeouts on an otherwise idle system and should also speed up the ramp-up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
KAMEZAWA Hiroyuki	dbc0e4cefd	memory hotremove: unset migrate type "ISOLATE" after removal We should unset migrate type "ISOLATE" when we successfully removed memory. But current code has BUG and cannot works well. This patch also includes bugfix? to change get_pageblock_flags to get_pageblock_migratetype(). Thanks to Badari Pulavarty for finding this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
Lee Schermerhorn	3ad33b2436	Migration: find correct vma in new_vma_page() We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For anon-regions, we just fail to migrate any pages beyond the 1st vma in the range. This occurs because do_mbind() collects a list of pages to migrate by calling check_range(). check_range() walks the task's mm, spanning vmas as necessary, to collect the migratable pages into a list. Then, do_mbind() calls migrate_pages() passing the list of pages, a function to allocate new pages based on vma policy [new_vma_page()], and a pointer to the first vma of the range. For each page in the list, new_vma_page() calls page_address_in_vma() passing the page and the vma [first in range] to obtain the address to get for alloc_page_vma(). The page address is needed to get interleaving policy correct. If the pages in the list come from multiple vmas, eventually, new_page_address() will pass that page to page_address_in_vma() with the incorrect vma. For !PageAnon pages, this will result in a bug check in rmap.c:vma_address(). For anon pages, vma_address() will just return EFAULT and fail the migration. This patch modifies new_vma_page() to check the return value from page_address_in_vma(). If the return value is EFAULT, new_vma_page() searchs forward via vm_next for the vma that maps the page--i.e., that does not return EFAULT. This assumes that the pages in the list handed to migrate_pages() is in address order. This is currently case. The patch documents this assumption in a new comment block for new_vma_page(). If new_vma_page() cannot locate the vma mapping the page in a forward search in the mm, it will pass a NULL vma to alloc_page_vma(). This will result in the allocation using the task policy, if any, else system default policy. This situation is unlikely, but the patch documents this behavior with a comment. Note, this patch results in restarting from the first vma in a multi-vma range each time new_vma_page() is called. If this is not acceptable, we can make the vma argument a pointer, both in new_vma_page() and it's caller unmap_and_move() so that the value held by the loop in migrate_pages() always passes down the last vma in which a page was found. This will require changes to all new_page_t functions passed to migrate_pages(). Is this necessary? For this patch to work, we can't bug check in vma_address() for pages outside the argument vma. This patch removes the BUG_ON(). All other callers [besides new_vma_page()] already check the return status. Tested on x86_64, 4 node NUMA platform. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:38 -08:00
Akinobu Mita	cc550defe9	slab: fix typo in allocation failure handling This patch fixes wrong array index in allocation failure handling. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:36 -08:00
Linus Torvalds	44048d700b	Revert "Bias the placement of kernel pages at lower PFNs" This reverts commit `5adc5be7cd`. Alexey Dobriyan reports that it causes huge slowdowns under some loads, in his case a "mkfs.ext2" on a 30G partition. With the placement bias, the mkfs took over four minutes, with it reverted it's back to about ten seconds for Alexey. Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-12 14:14:44 -08:00
Denis Cheng	efe44183f6	SLUB: killed the unused "end" variable Since the macro "for_each_object" introduced, the "end" variable becomes unused anymore. Signed-off-by: Denis Cheng <crquan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-12 10:32:29 -08:00
Linus Torvalds	221d46841b	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: lguest: tidy up documentation kernel/futex.c: make 3 functions static unexport access_process_vm lguest: make async_hcall() static	2007-11-05 11:39:00 -08:00
Christoph Lameter	05aa345034	SLUB: Fix memory leak by not reusing cpu_slab Fix the memory leak that may occur when we attempt to reuse a cpu_slab that was allocated while we reenabled interrupts in order to be able to grow a slab cache. The per cpu freelist may contain objects and in that situation we may overwrite the per cpu freelist pointer loosing objects. This only occurs if we find that the concurrently allocated slab fits our allocation needs. If we simply always deactivate the slab then the freelist will be properly reintegrated and the memory leak will go away. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-05 11:37:12 -08:00
Adrian Bunk	02c3530da6	unexport access_process_vm This patch removes the no longer used EXPORT_SYMBOL_GPL(access_process_vm). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2007-11-05 21:53:39 +11:00
Linus Torvalds	5307cc1aa5	Remove broken ptrace() special-case code from file mapping The kernel has for random historical reasons allowed ptrace() accesses to access (and insert) pages into the page cache above the size of the file. However, Nick broke that by mistake when doing the new fault handling in commit `54cb8821de` ("mm: merge populate and nopage into fault (fixes nonlinear)". The breakage caused a hang with gdb when trying to access the invalid page. The ptrace "feature" really isn't worth resurrecting, since it really is wrong both from a portability _and_ from an internal page cache validity standpoint. So this removes those old broken remnants, and fixes the ptrace() hang in the process. Noticed and bisected by Duane Griffin, who also supplied a test-case (quoth Nick: "Well that's probably the best bug report I've ever had, thanks Duane!"). Cc: Duane Griffin <duaneg@dghda.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-31 09:19:46 -07:00
Zach Brown	bdb76ef5a4	dio: fix cache invalidation after sync writes Commit commit `65b8291c40` ("dio: invalidate clean pages before dio write") introduced a bug which stopped dio from ever invalidating the page cache after writes. It still invalidated it before writes so most users were fine. Karl Schendel reported ( http://lkml.org/lkml/2007/10/26/481 ) hitting this bug when he had a buffered reader immediately reading file data after an O_DIRECT wirter had written the data. The kernel issued read-ahead beyond the position of the reader which overlapped with the O_DIRECT writer. The failure to invalidate after writes caused the reader to see stale data from the read-ahead. The following patch is originally from Karl. The following commentary is his: The below 3rd try takes on your suggestion of just invalidating no matter what the retval from the direct_IO call. I ran it thru the test-case several times and it has worked every time. The post-invalidate is probably still too early for async-directio, but I don't have a testcase for that; just sync. And, this won't be any worse in the async case. I added a test to the aio-dio-regress repository which mimics Karl's IO pattern. It verifed the bad behaviour and that the patch fixed it. I agree with Karl, this still doesn't help the case where a buffered reader follows an AIO O_DIRECT writer. That will require a bit more work. This gives up on the idea of returning EIO to indicate to userspace that stale data remains if the invalidation failed. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Karl Schendel <kschendel@datallegro.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 12:14:06 -07:00
Hugh Dickins	487e9bf25c	fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE It's possible to provoke unionfs (not yet in mainline, though in mm and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)). I expect it's possible to provoke the 2.6.23 ecryptfs in the same way (but the 2.6.24 ecryptfs no longer calls lower level's ->writepage). This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could leak from tmpfs via write_cache_pages and unionfs to userspace. There's already a fix (`e423003028` - writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so far as it goes; but insufficient because it doesn't address the underlying issue, that shmem_writepage expects to be called only by vmscan (relying on backing_dev_info capabilities to prevent the normal writeback path from ever approaching it). That's an increasingly fragile assumption, and ramdisk_writepage (the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check wbc->for_reclaim before returning it. Make the same check in shmem_writepage, thereby sidestepping the page_mapped BUG also. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Erez Zadok <ezk@cs.sunysb.edu> Cc: <stable@kernel.org> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 08:06:55 -07:00
Glauber de Oliveira Costa	8bca44bbd3	mm/sparse-vmemmap.c: make sure init_mm is included mm/sparse-vmemmap.c uses init_mm in some places. However, it is not present in any of the headers currently included in the file. init_mm is defined as extern in sched.h, so we add it to the headers list Up to now, this problem was masked by the fact that functions like set_pte_at() and pmd_populate_kernel() are usually macros that expand to simpler variants that does not use the first parameter at all. Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 08:06:55 -07:00
Linus Torvalds	6a22c57b8d	Revert "x86_64: allocate sparsemem memmap above 4G" This reverts commit `2e1c49db4c`. First off, testing in Fedora has shown it to cause boot failures, bisected down by Martin Ebourne, and reported by Dave Jobes. So the commit will likely be reverted in the 2.6.23 stable kernels. Secondly, in the 2.6.24 model, x86-64 has now grown support for SPARSEMEM_VMEMMAP, which disables the relevant code anyway, so while the bug is not visible any more, it's become invisible due to the code just being irrelevant and no longer enabled on the only architecture that this ever affected. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Martin Ebourne <fedora@ebourne.me.uk> Cc: Zou Nan hai <nanhai.zou@intel.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 14:05:37 -07:00
David Howells	f2b8544f5f	NOMMU: mm/nommu.c needs linux/module.h mm/nommu.c needs to #include linux/module.h for it to understand EXPORT_*() macros. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 07:53:26 -07:00
Linus Torvalds	cbf67812b2	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: compat_ioctl: fix block device compat ioctl regression [BLOCK] Fix bad sharing of tag busy list on queues with shared tag maps Fix a build error when BLOCK=n block: use lock bitops for the tag map. cciss: update copyright notices cfq_get_queue: fix possible NULL pointer access blk_sync_queue() should cancel request_queue->unplug_work cfq_exit_queue() should cancel cfq_data->unplug_work block layer: remove a unused argument of drive_stat_acct()	2007-10-29 07:49:28 -07:00
Al Viro	27bb628a1d	missing atomic_read_long() in slub.c nr_slabs is atomic_long_t, not atomic_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 07:41:32 -07:00
Emil Medve	3a424f2d56	Fix a build error when BLOCK=n mm/filemap.c: In function '__filemap_fdatawrite_range': mm/filemap.c:200: error: implicit declaration of function 'mapping_cap_writeback_dirty' This happens when we don't use/have any block devices and a NFS root filesystem is used. mapping_cap_writeback_dirty() is defined in linux/backing-dev.h which used to be provided in mm/filemap.c by linux/blkdev.h until commit `f5ff8422bb` (Fix warnings with !CONFIG_BLOCK). Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-29 11:33:06 +01:00
Hugh Dickins	1ddd439ef9	fix mprotect vma_wants_writenotify prot Fix mprotect bug in recent commit `3ed75eb8f1` (setup vma->vm_page_prot by vm_get_page_prot()): the vma_wants_writenotify case was setting the same prot as when not. Nothing wrong with the use of protection_map[] in mmap_region(), but use vm_get_page_prot() there too in the same ~VM_SHARED way. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Coly Li <coyli@suse.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-23 08:32:06 -07:00
Christoph Hellwig	3965516440	exportfs: make struct export_operations const Now that nfsd has stopped writing to the find_exported_dentry member we an mark the export_operations const Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: <linux-ext4@vger.kernel.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Chinner <dgc@sgi.com> Cc: Timothy Shimmin <tes@sgi.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Chris Mason <mason@suse.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-22 08:13:21 -07:00
Christoph Hellwig	480b116c98	shmem: new export ops I'm not sure what people were thinking when adding support to export tmpfs, but here's the conversion anyway: Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-22 08:13:20 -07:00
Yasunori Goto	b9049e2344	memory hotplug: make kmem_cache_node for SLUB on memory online avoid panic Fix a panic due to access NULL pointer of kmem_cache_node at discard_slab() after memory online. When memory online is called, kmem_cache_nodes are created for all SLUBs for new node whose memory are available. slab_mem_going_online_callback() is called to make kmem_cache_node() in callback of memory online event. If it (or other callbacks) fails, then slab_mem_offline_callback() is called for rollback. In memory offline, slab_mem_going_offline_callback() is called to shrink all slub cache, then slab_mem_offline_callback() is called later. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: locking fix] [akpm@linux-foundation.org: build fix] Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-22 08:13:17 -07:00
Yasunori Goto	7b78d335ac	memory hotplug: rearrange memory hotplug notifier Current memory notifier has some defects yet. (Fortunately, nothing uses it.) This patch is to fix and rearrange for them. - Add information of start_pfn, nr_pages, and node id if node status is changes from/to memoryless node for callback functions. Callbacks can't do anything without those information. - Add notification going-online status. It is necessary for creating per node structure before the node's pages are available. - Move GOING_OFFLINE status notification after page isolation. It is good place for return memory like cache for callback, because returned page is not used again. - Make CANCEL events for rollingback when error occurs. - Delete MEM_MAPPING_INVALID notification. It will be not used. - Fix compile error of (un)register_memory_notifier(). Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-22 08:13:17 -07:00
Al Viro	e91a810e88	oom_kill bug Wrong order of arguments Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-20 15:04:06 -07:00
Philipp Marek	ad3d0a3827	small documentation fixes Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-20 02:46:58 +02:00
Gabriel Craciunescu	e9534b3fd7	Typo fixes retrun -> return Typo fixes retrun -> return Signed-off-by: Gabriel Craciunescu <nix.or.die@googlemail.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-20 02:13:26 +02:00
Simon Arlott	183ff22bb6	spelling fixes: mm/ Spelling fixes in mm/. Signed-off-by: Simon Arlott <simon@fire.lp0.eu> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-20 01:27:18 +02:00
Robert P. J. Day	8518609dee	Explain clearly why kmalloc() can't use __GFP_HIGHMEM. Fix the wishy-washy comment to clearly explain why kmalloc() can't use the __GFP_HIGHMEM zone modifier. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:11:38 +02:00
Pavel Emelyanov	ba25f9dcc4	Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
Pavel Emelyanov	bac0abd617	Isolate some explicit usage of task->tgid With pid namespaces this field is now dangerous to use explicitly, so hide it behind the helpers. Also the pid and pgrp fields o task_struct and signal_struct are to be deprecated. Unfortunately this patch cannot be sent right now as this leads to tons of warnings, so start isolating them, and deprecate later. Actually the p->tgid == pid has to be changed to has_group_leader_pid(), but Oleg pointed out that in case of posix cpu timers this is the same, and thread_group_leader() is more preferable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:40 -07:00
Pavel Emelyanov	228ebcbe63	Uninline find_task_by_xxx set of functions The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:40 -07:00
Pavel Emelyanov	b488893a39	pid namespaces: changes to show virtual ids to user This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:40 -07:00
Matthias Kaehlcke	7b1915a989	mm/oom_kill.c: Use list_for_each_entry instead of list_for_each mm/oom_kill.c: Convert list_for_each to list_for_each_entry in oom_kill_process() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:38 -07:00
Serge E. Hallyn	b460cbc581	pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Paul Menage	8793d854ed	Task Control Groups: make cpusets a client of cgroups Remove the filesystem support logic from the cpusets system and makes cpusets a cgroup subsystem The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get passed through to the cgroup filesystem with the appropriate options to emulate the old cpuset filesystem behaviour. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Randy Dunlap	8f731f7d83	kernel-api docbook: fix content problems Fix kernel-api docbook contents problems. docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: of list entry Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra' Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req' Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private' Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:35 -07:00
Coly Li	3ed75eb8f1	setup vma->vm_page_prot by vm_get_page_prot() This patch uses vm_get_page_prot() to setup vma->vm_page_prot. Though inside vm_get_page_prot() the protection flags is AND with (VM_READ\|VM_WRITE\|VM_EXEC\|VM_SHARED), it does not hurt correct code. Signed-off-by: Coly Li <coyli@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:34 -07:00
Benjamin Herrenschmidt	1c7037db50	remove unused flush_tlb_pgtables Nobody uses flush_tlb_pgtables anymore, this patch removes all remaining traces of it from all archs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:34 -07:00
Linus Torvalds	53253383fd	Include <linux/backing-dev.h> in mm/filemap.c It gets it indirectly from blkdev.h when CONFIG_BLOCK is enabled, but it needs it unconditionally for the definition of mapping_cap_writeback_dirty. Noticed and bisected down to `4af3c9cc4f` ("Drop some headers from mm.h") by Avuton Olrich. Cc: Avuton Olrich <avuton@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:47:32 -07:00
Stephen Hemminger	c80544dc0b	sparse pointer use of zero as null Get rid of sparse related warnings from places that use integer as NULL pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Jeff Garzik <jeff@garzik.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Ian Kent <raven@themaw.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:31 -07:00
Akinobu Mita	12d00f6a12	cpu hotplug: slab: fix memory leak in cpu hotplug error path This patch fixes memory leak in error path. In reality, we don't need to call cpuup_canceled(cpu) for now. But upcoming cpu hotplug error handling change needs this. Cc: Christoph Lameter <clameter@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:21 -07:00
Akinobu Mita	fbf1e473bd	cpu hotplug: slab: cleanup cpuup_callback() cpuup_callback() is too long. This patch factors out CPU_UP_CANCELLED and CPU_UP_PREPARE handlings from cpuup_callback(). Cc: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:21 -07:00
Linus Torvalds	fb9fc39517	Merge branch 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen * 'xen-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: xfs: eagerly remove vmap mappings to avoid upsetting Xen xen: add some debug output for failed multicalls xen: fix incorrect vcpu_register_vcpu_info hypercall argument xen: ask the hypervisor how much space it needs reserved xen: lock pte pages while pinning/unpinning xen: deal with stale cr3 values when unpinning pagetables xen: add batch completion callbacks xen: yield to IPI target if necessary Clean up duplicate includes in arch/i386/xen/ remove dead code in pgtable_cache_init paravirt: clean up lazy mode handling paravirt: refactor struct paravirt_ops into smaller pv_*_ops	2007-10-17 11:10:11 -07:00
Adrian Bunk	cbfee34520	security/ cleanups This patch contains the following cleanups that are now possible: - remove the unused security_operations->inode_xattr_getsuffix - remove the no longer used security_operations->unregister_security - remove some no longer required exit code - remove a bunch of no longer used exports Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Serge E. Hallyn	b53767719b	Implement file posix capabilities Implement file posix capabilities. This allows programs to be given a subset of root's powers regardless of who runs them, without having to use setuid and giving the binary all of root's powers. This version works with Kaigai Kohei's userspace tools, found at http://www.kaigai.gr.jp/index.php. For more information on how to use this patch, Chris Friedhoff has posted a nice page at http://www.friedhoff.org/fscaps.html. Changelog: Nov 27: Incorporate fixes from Andrew Morton (security-introduce-file-caps-tweaks and security-introduce-file-caps-warning-fix) Fix Kconfig dependency. Fix change signaling behavior when file caps are not compiled in. Nov 13: Integrate comments from Alexey: Remove CONFIG_ ifdef from capability.h, and use %zd for printing a size_t. Nov 13: Fix endianness warnings by sparse as suggested by Alexey Dobriyan. Nov 09: Address warnings of unused variables at cap_bprm_set_security when file capabilities are disabled, and simultaneously clean up the code a little, by pulling the new code into a helper function. Nov 08: For pointers to required userspace tools and how to use them, see http://www.friedhoff.org/fscaps.html. Nov 07: Fix the calculation of the highest bit checked in check_cap_sanity(). Nov 07: Allow file caps to be enabled without CONFIG_SECURITY, since capabilities are the default. Hook cap_task_setscheduler when !CONFIG_SECURITY. Move capable(TASK_KILL) to end of cap_task_kill to reduce audit messages. Nov 05: Add secondary calls in selinux/hooks.c to task_setioprio and task_setscheduler so that selinux and capabilities with file cap support can be stacked. Sep 05: As Seth Arnold points out, uid checks are out of place for capability code. Sep 01: Define task_setscheduler, task_setioprio, cap_task_kill, and task_setnice to make sure a user cannot affect a process in which they called a program with some fscaps. One remaining question is the note under task_setscheduler: are we ok with CAP_SYS_NICE being sufficient to confine a process to a cpuset? It is a semantic change, as without fsccaps, attach_task doesn't allow CAP_SYS_NICE to override the uid equivalence check. But since it uses security_task_setscheduler, which elsewhere is used where CAP_SYS_NICE can be used to override the uid equivalence check, fixing it might be tough. task_setscheduler note: this also controls cpuset:attach_task. Are we ok with CAP_SYS_NICE being used to confine to a cpuset? task_setioprio task_setnice sys_setpriority uses this (through set_one_prio) for another process. Need same checks as setrlimit Aug 21: Updated secureexec implementation to reflect the fact that euid and uid might be the same and nonzero, but the process might still have elevated caps. Aug 15: Handle endianness of xattrs. Enforce capability version match between kernel and disk. Enforce that no bits beyond the known max capability are set, else return -EPERM. With this extra processing, it may be worth reconsidering doing all the work at bprm_set_security rather than d_instantiate. Aug 10: Always call getxattr at bprm_set_security, rather than caching it at d_instantiate. [morgan@kernel.org: file-caps clean up for linux/capability.h] [bunk@kernel.org: unexport cap_inode_killpriv] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Randy Dunlap	8d63494f78	remap_file_pages: kernel-doc corrections Fix kernel-doc for sys_remap_file_pages() and add info to the 'prot' NOTE. Rename __prot parameter to prot. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Dave Hansen	ce8d2cdf3d	r/o bind mounts: filesystem helpers for custom 'struct file's Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using the following script to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. This patch: Some filesystems forego the vfs and may_open() and create their own 'struct file's. This patch creates a couple of helper functions which can be used by these filesystems, and will provide a unified place which the r/o bind mount code may patch. Also, rename an existing, static-scope init_file() to a less generic name. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:04 -07:00
Fengguang Wu	369f2389e7	writeback: remove unnecessary wait in throttle_vm_writeout() We don't want to introduce pointless delays in throttle_vm_writeout() when the writeback limits are not yet exceeded, do we? Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Greg KH <greg@kroah.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:02 -07:00
Joern Engel	1c0eeaf569	introduce I_SYNC I_LOCK was used for several unrelated purposes, which caused deadlock situations in certain filesystems as a side effect. One of the purposes now uses the new I_SYNC bit. Also document the various bits and change their order from historical to logical. [bunk@stusta.de: make fs/inode.c:wake_up_inode() static] Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:02 -07:00
Fengguang Wu	2e6883bdf4	writeback: introduce writeback_control.more_io to indicate more io After making dirty a 100M file, the normal behavior is to start the writeback for all data after 30s delays. But sometimes the following happens instead: - after 30s: ~4M - after 5s: ~4M - after 5s: all remaining 92M Some analyze shows that the internal io dispatch queues goes like this: s_io s_more_io ------------------------- 1) 100M,1K 0 2) 1K 96M 3) 0 96M 1) initial state with a 100M file and a 1K file 2) 4M written, nr_to_write <= 0, so write more 3) 1K written, nr_to_write > 0, no more writes(BUG) nr_to_write > 0 in (3) fools the upper layer to think that data have all been written out. The big dirty file is actually still sitting in s_more_io. We cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes in s_dirty. It is also not an option to draw inodes from both s_more_io and s_dirty, an let the loop go on: this might lead to live locks, and might also starve other superblocks in sync time(well kupdate may still starve some superblocks, that's another bug). We have to return when a full scan of s_io completes. So nr_to_write > 0 does not necessarily mean that "all data are written". This patch introduces a flag writeback_control.more_io to indicate this situation. With it the big dirty file no longer has to wait for the next kupdate invocation 5s later. Cc: David Chinner <dgc@sgi.com> Cc: Ken Chen <kenchen@google.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:02 -07:00
Robert P. J. Day	bda5b655fe	Delete gcc-2.95 compatible structure definition. Since nothing earlier than gcc-3.2 is supported for kernel compilation, that 2.95 hack can be removed. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:58 -07:00
Alexey Dobriyan	4af3c9cc4f	Drop some headers from mm.h mm.h doesn't use directly anything from mutex.h and backing-dev.h, so remove them and add them back to files which need them. Cross-compile tested on many configs and archs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:55 -07:00
Alexey Dobriyan	040b5c6f95	SLAB_PANIC more (proc, posix-timers, shmem) These aren't modular, so SLAB_PANIC is OK. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Andrew Morton	e423003028	writeback: don't propagate AOP_WRITEPAGE_ACTIVATE This is a writeback-internal marker but we're propagating it all the way back to userspace!. Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
Nick Piggin	7a4050791b	mm: document tree_lock->zone.lock lockorder zone->lock is quite an "inner" lock and mostly constrained to page alloc as well, so like slab locks, it probably isn't something that is critically important to document here. However unlike slab locks, zone lock could be used more widely in future, and page_alloc.c might possibly have more business to do tricky things with pagecache than does slab. So... I don't think it hurts to document it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	d773ed6b85	mm: test and set zone reclaim lock before starting reclaim Introduces new zone flag interface for testing and setting flags: int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag) Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is called, this flag is test and set before starting zone reclaim. Zone reclaim starts in __alloc_pages() when a zone's watermark fails and the system is in zone_reclaim_mode. If it's already in reclaim, there's no need to start again so it is simply considered full for that allocation attempt. There is a change of behavior with regard to concurrent zone shrinking. It is now possible for try_to_free_pages() or kswapd to already be shrinking a particular zone when __alloc_pages() starts zone reclaim. In this case, it is possible for two concurrent threads to invoke shrink_zone() for a single zone. This change forbids a zone to be in zone reclaim twice, which was always the behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking when starting zone reclaim. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	ae74138da6	oom: convert zone_scan_lock from mutex to spinlock There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if the lock can't be acquired; it will be available soon enough once the zonelist scanning is done. All other threads waiting for the OOM killer are also contingent on the exiting task being able to acquire the lock in clear_zonelist_oom() so it doesn't make sense to put it to sleep. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	3ff566963c	oom: do not take callback_mutex Since no task descriptor's 'cpuset' field is dereferenced in the execution of the OOM killer anymore, it is no longer necessary to take callback_mutex. [akpm@linux-foundation.org: restore cpuset_lock for other patches] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	bbe373f2c6	oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	7213f5066f	oom: suppress extraneous stack and memory dump Suppresses the extraneous stack and memory dump when a parallel OOM killing has been found. There's no need to fill the ring buffer with this information if its already been printed and the condition that triggered the previous OOM killer has not yet been alleviated. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	fe071d7e8a	oom: add oom_kill_allocating_task sysctl Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	ff0ceb9deb	oom: serialize out of memory calls A final allocation attempt with a very high watermark needs to be attempted before invoking out_of_memory(). OOM killer serialization needs to occur before this final attempt, otherwise tasks attempting to OOM-lock all zones in its zonelist may spin and acquire the lock unnecessarily after the OOM condition has already been alleviated. If the final allocation does succeed, the zonelist is simply OOM-unlocked and __alloc_pages() returns the page. Otherwise, the OOM killer is invoked. If the task cannot acquire OOM-locks on all zones in its zonelist, it is put to sleep and the allocation is retried when it gets rescheduled. One of its zones is already marked as being in the OOM killer so it'll hopefully be getting some free memory soon, at least enough to satisfy a high watermark allocation attempt. This prevents needlessly killing a task when the OOM condition would have already been alleviated if it had simply been given enough time. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
David Rientjes	098d7f128a	oom: add per-zone locking OOM killer synchronization should be done with zone granularity so that memory policy and cpuset allocations may have their corresponding zones locked and allow parallel kills for other OOM conditions that may exist elsewhere in the system. DMA allocations can be targeted at the zone level, which would not be possible if locking was done in nodes or globally. Synchronization shall be done with a variation of "trylocks." The goal is to put the current task to sleep and restart the failed allocation attempt later if the trylock fails. Otherwise, the OOM killer is invoked. Each zone in the zonelist that __alloc_pages() was called with is checked for the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present, the "trylock" to serialize the OOM killer fails and returns zero. Otherwise, all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function returns non-zero. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
David Rientjes	e815af95f9	oom: change all_unreclaimable zone member to flags Convert the int all_unreclaimable member of struct zone to unsigned long flags. This can now be used to specify several different zone flags such as all_unreclaimable and reclaim_in_progress, which can now be removed and converted to a per-zone flag. Flags are set and cleared as follows: zone_set_flag(struct zone zone, zone_flags_t flag) zone_clear_flag(struct zone zone, zone_flags_t flag) Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED, which have the same semantics as the old zone->all_unreclaimable and zone->reclaim_in_progress, respectively. Also converts all current users that set or clear either flag to use the new interface. Helper functions are defined to test the flags: int zone_is_all_unreclaimable(const struct zone zone) int zone_is_reclaim_locked(const struct zone zone) All flag operators are of the atomic variety because there are currently readers that are implemented that do not take zone->lock. [akpm@linux-foundation.org: add needed include] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
David Rientjes	70e24bdf6d	oom: move constraints to enum The OOM killer's CONSTRAINT definitions are really more appropriate in an enum, so define them in include/linux/oom.h. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
David Rientjes	5a3135c2e7	oom: move prototypes to appropriate header file Move the OOM killer's extern function prototypes to include/linux/oom.h and include it where necessary. [clg@fr.ibm.com: build fix] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Christoph Lameter	4ba9b9d0ba	Slab API: remove useless ctor parameter and reorder parameters Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void object, struct kmem_cache s, unsigned long flags) to ctor(struct kmem_cache s, void object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Christoph Lameter	b811c202a0	SLUB: simplify IRQ off handling Move irq handling out of new slab into __slab_alloc. That is useful for Mathieu's cmpxchg_local patchset and also allows us to remove the crude local_irq_off in early_kmem_cache_alloc(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	3e26c149c3	mm: dirty balancing for tasks Based on ideas of Andrew: http://marc.info/?l=linux-kernel&m=102912915020543&w=2 Scale the bdi dirty limit inversly with the tasks dirty rate. This makes heavy writers have a lower dirty limit than the occasional writer. Andrea proposed something similar: http://lwn.net/Articles/152277/ The main disadvantage to his patch is that he uses an unrelated quantity to measure time, which leaves him with a workload dependant tunable. Other than that the two approaches appear quite similar. [akpm@linux-foundation.org: fix warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	04fbfdc14e	mm: per device dirty threshold Scale writeback cache per backing device, proportional to its writeout speed. By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely: - mutual interference starvation (for any number of BDIs); - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts). It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress. A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided. So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A DBI that has a large dirty limit but does not have any dirty pages outstanding is a waste. What is done is to keep a floating proportion between the DBIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices. [akpm@linux-foundation.org: fix warnings] [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	69cb51d18c	mm: count writeback pages per BDI Count per BDI writeback pages. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	c9e51e4180	mm: count reclaimable pages per BDI Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	b2e8fb6efa	mm: scalable bdi statistics counters Provide scalable per backing_dev_info statistics counters. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	e0bf68ddec	mm: bdi init hooks provide BDI constructor/destructor hooks [akpm@linux-foundation.org: compile fix] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	c4dc4beed2	nfs: remove congestion_end() These patches aim to improve balance_dirty_pages() and directly address three issues: 1) inter device starvation 2) stacked device deadlocks 3) inter process starvation 1 and 2 are a direct result from removing the global dirty limit and using per device dirty limits. By giving each device its own dirty limit is will no longer starve another device, and the cyclic dependancy on the dirty limit is broken. In order to efficiently distribute the dirty limit across the independant devices a floating proportion is used, this will allocate a share of the total limit proportional to the device's recent activity. 3 is done by also scaling the dirty limit proportional to the current task's recent dirty rate. This patch: nfs: remove congestion_end(). It's redundant, clear_bdi_congested() already wakes the waiters. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:44 -07:00
Jeremy Fitzhardinge	74260714c5	xen: lock pte pages while pinning/unpinning When a pagetable is created, it is made globally visible in the rmap prio tree before it is pinned via arch_dup_mmap(), and remains in the rmap tree while it is unpinned with arch_exit_mmap(). This means that other CPUs may race with the pinning/unpinning process, and see a pte between when it gets marked RO and actually pinned, causing any pte updates to fail with write-protect faults. As a result, all pte pages must be properly locked, and only unlocked once the pinning/unpinning process has finished. In order to avoid taking spinlocks for the whole pagetable - which may overflow the PREEMPT_BITS portion of preempt counter - it locks and pins each pte page individually, and then finally pins the whole pagetable. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickens <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Keir Fraser <keir@xensource.com> Cc: Jan Beulich <jbeulich@novell.com>	2007-10-16 11:51:30 -07:00
Linus Torvalds	92d15c2ccb	Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block * 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits) Fix memory leak in dm-crypt SPARC64: sg chaining support SPARC: sg chaining support PPC: sg chaining support PS3: sg chaining support IA64: sg chaining support x86-64: enable sg chaining x86-64: update pci-gart iommu to sg helpers x86-64: update nommu to sg helpers x86-64: update calgary iommu to sg helpers swiotlb: sg chaining support i386: enable sg chaining i386 dma_map_sg: convert to using sg helpers mmc: need to zero sglist on init Panic in blk_rq_map_sg() from CCISS driver remove sglist_len remove blk_queue_max_phys_segments in libata revert sg segment size ifdefs Fixup u14-34f ENABLE_SG_CHAINING qla1280: enable use_sg_chaining option ...	2007-10-16 10:09:16 -07:00
Adrian Bunk	e2fc88d064	mm/vmstat.c: cleanups This patch contains the following cleanups: - make the needlessly global setup_vmstat() static - remove the unused refresh_vm_stats() Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Adrian Bunk	dbcb0f19c8	mm/mempolicy.c: cleanups This patch contains the following cleanups: - every file should include the headers containing the prototypes for its global functions - make the follosing needlessly global functions static: - migrate_to_node() - do_mbind() - sp_alloc() - mpol_rebind_policy() [akpm@linux-foundation.org: fix uninitialised var warning] Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Adrian Bunk	d8dc74f212	mm/shmem.c: make 3 functions static This patch makes three needlessly global functions static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Adam Litke	af767cbdd7	hugetlb: fix dynamic pool resize failure case When gather_surplus_pages() fails to allocate enough huge pages to satisfy the requested reservation, it frees what it did allocate back to the buddy allocator. put_page() should be called instead of update_and_free_page() to ensure that pool counters are updated as appropriate and the page's refcount is decremented. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Nishanth Aravamudan	63b4613c3f	hugetlb: fix hugepage allocation with memoryless nodes Anton found a problem with the hugetlb pool allocation when some nodes have no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee worked on versions that tried to fix it, but none were accepted. Christoph has created a set of patches which allow for GFP_THISNODE allocations to fail if the node has no memory. Currently, alloc_fresh_huge_page() returns NULL when it is not able to allocate a huge page on the current node, as specified by its custom interleave variable. The callers of this function, though, assume that a failure in alloc_fresh_huge_page() indicates no hugepages can be allocated on the system period. This might not be the case, for instance, if we have an uneven NUMA system, and we happen to try to allocate a hugepage on a node with less memory and fail, while there is still plenty of free memory on the other nodes. To correct this, make alloc_fresh_huge_page() search through all online nodes before deciding no hugepages can be allocated. Add a helper function for actually allocating the hugepage. Use a new global nid iterator to control which nid to allocate on. Note: we expect particular semantics for __GFP_THISNODE, which are now enforced even for memoryless nodes. That is, there is should be no fallback to other nodes. Therefore, we rely on the nid passed into alloc_pages_node() to be the nid the page comes from. If this is incorrect, accounting will break. Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2 memoryless nodes). Before on the ppc64 box: Trying to clear the hugetlb pool Done. 0 free Trying to resize the pool to 100 Node 0 HugePages_Free: 25 Node 1 HugePages_Free: 75 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. Initially 100 free Trying to resize the pool to 200 Node 0 HugePages_Free: 50 Node 1 HugePages_Free: 150 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. 200 free After: Trying to clear the hugetlb pool Done. 0 free Trying to resize the pool to 100 Node 0 HugePages_Free: 50 Node 1 HugePages_Free: 50 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. Initially 100 free Trying to resize the pool to 200 Node 0 HugePages_Free: 100 Node 1 HugePages_Free: 100 Node 2 HugePages_Free: 0 Node 3 HugePages_Free: 0 Done. 200 free Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Adam Litke	6b0c880dfe	hugetlb: fix pool resizing corner case When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we are careful to keep enough pages around to satisfy reservations. But the calculation is flawed for the following scenario: Action Pool Counters (Total, Free, Resv) ====== ============= Set pool to 1 page 1 1 0 Map 1 page MAP_PRIVATE 1 1 0 Touch the page to fault it in 1 0 0 Set pool to 3 pages 3 2 0 Map 2 pages MAP_SHARED 3 2 2 Set pool to 2 pages 2 1 2 <-- Mistake, should be 3 2 2 Touch the 2 shared pages 2 0 1 <-- Program crashes here The last touch above will terminate the process due to lack of huge pages. This patch corrects the calculation so that it factors in pages being used for private mappings. Andrew, this is a standalone fix suitable for mainline. It is also now corrected in my latest dynamic pool resizing patchset which I will send out soon. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Ken Chen <kenchen@google.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:03 -07:00
Adam Litke	54f9f80d65	hugetlb: Add hugetlb_dynamic_pool sysctl The maximum size of the huge page pool can be controlled using the overall size of the hugetlb filesystem (via its 'size' mount option). However in the common case the this will not be set as the pool is traditionally fixed in size at boot time. In order to maintain the expected semantics, we need to prevent the pool expanding by default. This patch introduces a new sysctl controlling dynamic pool resizing. When this is enabled the pool will expand beyond its base size up to the size of the hugetlb filesystem. It is disabled by default. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:02 -07:00
Adam Litke	e4e574b767	hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings Shared mappings require special handling because the huge pages needed to fully populate the VMA must be reserved at mmap time. If not enough pages are available when making the reservation, allocate all of the shortfall at once from the buddy allocator and add the pages directly to the hugetlb pool. If they cannot be allocated, then fail the mapping. The page surplus is accounted for in the same way as for private mappings; faulted surplus pages will be freed at unmap time. Reserved, surplus pages that have not been used must be freed separately when their reservation has been released. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:02 -07:00
Adam Litke	7893d1d505	hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that the hugetlb pool will be exhausted or completely reserved when a hugepage is needed to satisfy a page fault. Before killing the process in this situation, try to allocate a hugepage directly from the buddy allocator. The explicitly configured pool size becomes a low watermark. When dynamically grown, the allocated huge pages are accounted as a surplus over the watermark. As huge pages are freed on a node, surplus pages are released to the buddy allocator so that the pool will shrink back to the watermark. Surplus accounting also allows for friendlier explicit pool resizing. When shrinking a pool that is fully in-use, increase the surplus so pages will be returned to the buddy allocator as soon as they are freed. When growing a pool that has a surplus, consume the surplus first and then allocate new pages. Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:02 -07:00

... 4 5 6 7 8 ...

2188 Commits