android_kernel_xiaomi_sm8350

Author	SHA1	Message	Date
Paul Jackson	4225399a66	[PATCH] cpuset: rebind vma mempolicies fix Fix more of longstanding bug in cpuset/mempolicy interaction. NUMA mempolicies (mm/mempolicy.c) are constrained by the current tasks cpuset to just the Memory Nodes allowed by that cpuset. The kernel maintains internal state for each mempolicy, tracking what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies. When a tasks cpuset memory placement changes, whether because the cpuset changed, or because the task was attached to a different cpuset, then the tasks mempolicies have to be rebound to the new cpuset placement, so as to preserve the cpuset-relative numbering of the nodes in that policy. An earlier fix handled such mempolicy rebinding for mempolicies attached to a task. This fix rebinds mempolicies attached to vma's (address ranges in a tasks address space.) Due to the need to hold the task->mm->mmap_sem semaphore while updating vma's, the rebinding of vma mempolicies has to be done when the cpuset memory placement is changed, at which time mmap_sem can be safely acquired. The tasks mempolicy is rebound later, when the task next attempts to allocate memory and notices that its task->cpuset_mems_generation is out-of-date with its cpusets mems_generation. Because walking the tasklist to find all tasks attached to a changing cpuset requires holding tasklist_lock, a spinlock, one cannot update the vma's of the affected tasks while doing the tasklist scan. In general, one cannot acquire a semaphore (which can sleep) while already holding a spinlock (such as tasklist_lock). So a list of mm references has to be built up during the tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem acquired, and the vma's in that mm rebound. Once the tasklist lock is dropped, affected tasks may fork new tasks, before their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to point to the cpuset being rebound (there can only be one; cpuset modifications are done under a global 'manage_sem' semaphore), and the mpol_copy code that is used to copy a tasks mempolicies during fork catches such forking tasks, and ensures their children are also rebound. When a task is moved to a different cpuset, it is easier, as there is only one task involved. It's mm->vma's are scanned, using the same mpol_rebind_policy() as used above. It may happen that both the mpol_copy hook and the update done via the tasklist scan update the same mm twice. This is ok, as the mempolicies of each vma in an mm keep track of what mems_allowed they are relative to, and safely no-op a second request to rebind to the same nodes. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	74cb21553f	[PATCH] cpuset: numa_policy_rebind cleanup Cleanup, reorganize and make more robust the mempolicy.c code to rebind mempolicies relative to the containing cpuset after a tasks memory placement changes. The real motivator for this cleanup patch is to lay more groundwork for the upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's after the containing cpuset memory placement changes. NUMA mempolicies are constrained by the cpuset their task is a member of. When either (1) a task is moved to a different cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be recalculated, relative to their new cpuset placement. The old code used an unreliable method of determining what was the old mems_allowed constraining the mempolicy. It just looked at the tasks mems_allowed value. This sort of worked with the present code, that just rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken, referring to the old nodes. But in an upcoming patch, the vma mempolicies will be rebound as well. Then the order in which the various task and vma mempolicies are updated will no longer be deterministic, and one can no longer count on the task->mems_allowed holding the old value for as long as needed. It's not even clear if the current code was guaranteed to work reliably for task mempolicies. So I added a mems_allowed field to each mempolicy, stating exactly what mems_allowed the policy is relative to, and updated synchronously and reliably anytime that the mempolicy is rebound. Also removed a useless wrapper routine, numa_policy_rebind(), and had its caller, cpuset_update_task_memory_state(), call directly to the rewritten policy_rebind() routine, and made that rebind routine extern instead of static, and added a "mpol_" prefix to its name, making it mpol_rebind_policy(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	909d75a3b7	[PATCH] cpuset: implement cpuset_mems_allowed Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code needed, to obtain the mems_allowed vector of a cpuset, and replaced the workaround in sys_migrate_pages() to call this new method. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:44 -08:00
Paul Jackson	cf2a473c40	[PATCH] cpuset: combine refresh_mems and update_mems The important code paths through alloc_pages_current() and alloc_page_vma(), by which most kernel page allocations go, both called cpuset_update_current_mems_allowed(), which in turn called refresh_mems(). -Both- of these latter two routines did a tasklock, got the tasks cpuset pointer, and checked for out of date cpuset->mems_generation. That was a silly duplication of code and waste of CPU cycles on an important code path. Consolidated those two routines into a single routine, called cpuset_update_task_memory_state(), since it updates more than just mems_allowed. Changed all callers of either routine to call the new consolidated routine. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:43 -08:00
Paul Jackson	3e0d98b9f1	[PATCH] cpuset: memory pressure meter Provide a simple per-cpuset metric of memory pressure, tracking the -rate- that the tasks in a cpuset call try_to_free_pages(), the synchronous (direct) memory reclaim code. This enables batch managers monitoring jobs running in dedicated cpusets to efficiently detect what level of memory pressure that job is causing. This is useful both on tightly managed systems running a wide mix of submitted jobs, which may choose to terminate or reprioritize jobs that are trying to use more memory than allowed on the nodes assigned them, and with tightly coupled, long running, massively parallel scientific computing jobs that will dramatically fail to meet required performance goals if they start to use more memory than allowed to them. This patch just provides a very economical way for the batch manager to monitor a cpuset for signs of memory pressure. It's up to the batch manager or other user code to decide what to do about it and take action. ==> Unless this feature is enabled by writing "1" to the special file /dev/cpuset/memory_pressure_enabled, the hook in the rebalance code of __alloc_pages() for this metric reduces to simply noticing that the cpuset_memory_pressure_enabled flag is zero. So only systems that enable this feature will compute the metric. Why a per-cpuset, running average: Because this meter is per-cpuset, rather than per-task or mm, the system load imposed by a batch scheduler monitoring this metric is sharply reduced on large systems, because a scan of the tasklist can be avoided on each set of queries. Because this meter is a running average, instead of an accumulating counter, a batch scheduler can detect memory pressure with a single read, instead of having to read and accumulate results for a period of time. Because this meter is per-cpuset rather than per-task or mm, the batch scheduler can obtain the key information, memory pressure in a cpuset, with a single read, rather than having to query and accumulate results over all the (dynamically changing) set of tasks in the cpuset. A per-cpuset simple digital filter (requires a spinlock and 3 words of data per-cpuset) is kept, and updated by any task attached to that cpuset, if it enters the synchronous (direct) page reclaim code. A per-cpuset file provides an integer number representing the recent (half-life of 10 seconds) rate of direct page reclaims caused by the tasks in the cpuset, in units of reclaims attempted per second, times 1000. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:42 -08:00
Paul Jackson	5966514db6	[PATCH] cpuset: mempolicy one more nodemask conversion Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous conversion had left one routine using bitmaps, since it involved a corresponding change to kernel/cpuset.c Fix that interface by replacing with a simple macro that calls nodes_subset(), or if !CONFIG_CPUSET, returns (1). Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:42 -08:00
Matt Mackall	10cef60295	[PATCH] slob: introduce the SLOB allocator configurable replacement for slab allocator This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is disabled, the kernel falls back to using the 'SLOB' allocator. SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer, similar to the original Linux kmalloc allocator that SLAB replaced. It's signicantly smaller code and is more memory efficient. But like all similar allocators, it scales poorly and suffers from fragmentation more than SLAB, so it's only appropriate for small systems. It's been tested extensively in the Linux-tiny tree. I've also stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not recommended). Here's a comparison for otherwise identical builds, showing SLOB saving nearly half a megabyte of RAM: $ size vmlinux* text data bss dec hex filename 3336372 529360 190812 4056544 3de5e0 vmlinux-slab 3323208 527948 190684 4041840 3dac70 vmlinux-slob $ size mm/{slab,slob}.o text data bss dec hex filename 13221 752 48 14021 36c5 mm/slab.o 1896 52 8 1956 7a4 mm/slob.o /proc/meminfo: SLAB SLOB delta MemTotal: 27964 kB 27980 kB +16 kB MemFree: 24596 kB 25092 kB +496 kB Buffers: 36 kB 36 kB 0 kB Cached: 1188 kB 1188 kB 0 kB SwapCached: 0 kB 0 kB 0 kB Active: 608 kB 600 kB -8 kB Inactive: 808 kB 812 kB +4 kB HighTotal: 0 kB 0 kB 0 kB HighFree: 0 kB 0 kB 0 kB LowTotal: 27964 kB 27980 kB +16 kB LowFree: 24596 kB 25092 kB +496 kB SwapTotal: 0 kB 0 kB 0 kB SwapFree: 0 kB 0 kB 0 kB Dirty: 4 kB 12 kB +8 kB Writeback: 0 kB 0 kB 0 kB Mapped: 560 kB 556 kB -4 kB Slab: 1756 kB 0 kB -1756 kB CommitLimit: 13980 kB 13988 kB +8 kB Committed_AS: 4208 kB 4208 kB 0 kB PageTables: 28 kB 28 kB 0 kB VmallocTotal: 1007312 kB 1007312 kB 0 kB VmallocUsed: 48 kB 48 kB 0 kB VmallocChunk: 1007264 kB 1007264 kB 0 kB (this work has been sponsored in part by CELF) From: Ingo Molnar <mingo@elte.hu> Fix 32-bitness bugs in mm/slob.c. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:41 -08:00
Matt Mackall	30992c97ae	[PATCH] slob: introduce mm/util.c for shared functions Add mm/util.c for functions common between SLAB and SLOB. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:41 -08:00
Ravikiran G Thirumalai	22fc6eccbf	[PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros ____cacheline_maxaligned_in_smp is currently used to align critical structures and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find L1_CACHE_SHIFT_MAX useless. However, we have been using ____cacheline_maxaligned_in_smp to align structures on the internode cacheline size. As per Andi's suggestion, following patch kills ____cacheline_maxaligned_in_smp and introduces INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches. Arches needing L3/Internode cacheline alignment can define INTERNODE_CACHE_SHIFT in the arch asm/cache.h. Patch replaces ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp With this patch, L1_CACHE_SHIFT_MAX can be killed Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:13:38 -08:00
Kirill Korotaev	2f659f462d	[PATCH] Optimise oom kill of current task When oom_killer kills current there's no need to call schedule_timeout_interruptible() since task must die ASAP. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:45 -08:00
Christoph Lameter	6ce3c4c0ff	[PATCH] Move page migration related functions near do_migrate_pages() Group page migration functions in mempolicy.c Add a forward declaration for migrate_page_add (like gather_stats()) and use our new found mobility to group all page migration related function around do_migrate_pages(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	48fce3429d	[PATCH] mempolicies: unexport get_vma_policy() Since the numa_maps functionality is now in mempolicy.c we no longer need to export get_vma_policy(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	132beacf97	[PATCH] Drop page table lock before calling migrate_page_add() migrate_page_add cannot be called with a spinlock held (calls isolate_lru_page which calles schedule_on_each_cpu). Drop ptl lock in check_pte_range before calling migrate_page_add(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	1a75a6c825	[PATCH] Fold numa_maps into mempolicies.c First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2 - Use the check_range() in mempolicy.c to gather statistics. - Improve the numa_maps code in general and fix some comments. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Christoph Lameter	38e35860db	[PATCH] mempolicies: private pointer in check_range and MPOL_MF_INVERT This was was first posted at http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2 (Part of this functionality is also contained in the direct migration pathset. The functionality here is more generic and independent of that patchset.) - Add internal flags MPOL_MF_INVERT to control check_range() behavior. - Replace the pagelist passed through by check_range by a general private pointer that may be used for other purposes. (The following patches will use that to merge numa_maps into mempolicy.c and to better group the page migration code in the policy layer) - Improve some comments. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Dave Jones	ef2bf0dc86	[PATCH] rmap: additional diagnostics in page_remove_rmap() We seem to be hitting this assertion failure too often for it to be hardware bugs. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:44 -08:00
Tobias Klauser	cd105df459	[PATCH] mm: clean up local variables Clean up a local variable with the same name as a variable in a larger block. Also move a variable into the block where it's actually used. Spotted by http://linuxicc.sourceforge.net/ Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:43 -08:00
Christoph Lameter	aea47ff363	[PATCH] mm: make hugepages obey cpusets. See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2 http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2 Make hugepages obey cpusets. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:43 -08:00
Christoph Lameter	d0d963281c	[PATCH] SwapMig: Switch error handling in migrate_pages to use -Exx Use -Exxx instead of numeric return codes and cleanup the code in migrate_pages() using -Exx error codes. Consolidate successful migration handling Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	d498471133	[PATCH] SwapMig: Extend parameters for migrate_pages() Extend the parameters of migrate_pages() to allow the caller control over the fate of successfully migrated or impossible to migrate pages. Swap migration and direct migration will have the same interface after this patch so that patches can be independently applied to the policy layer and the core migration code. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	ee27497df3	[PATCH] SwapMig: Drop unused pages immediately Drop unused pages immediately If a page is encountered that is only referenced by the migration code then there is no reason to swap or migrate the page. Release the page by calling move_to_lru(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	1480a540c9	[PATCH] SwapMig: add_to_swap() avoid atomic allocations Add gfp_mask to add_to_swap add_to_swap does allocations with GFP_ATOMIC in order not to interfere with swapping. During migration we may have use add_to_swap extensively which may lead to out of memory errors. This patch makes add_to_swap take a parameter that specifies the gfp mask. The page migration code can then make add_to_swap use GFP_KERNEL. Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	8419c31810	[PATCH] SwapMig: CONFIG_MIGRATION fixes Move move_to_lru, putback_lru_pages and isolate_lru in section surrounded by CONFIG_MIGRATION saving some codesize for single processor kernels. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	39743889aa	[PATCH] Swap Migration V5: sys_migrate_pages interface sys_migrate_pages implementation using swap based page migration This is the original API proposed by Ray Bryant in his posts during the first half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org. The intent of sys_migrate is to migrate memory of a process. A process may have migrated to another node. Memory was allocated optimally for the prior context. sys_migrate_pages allows to shift the memory to the new node. sys_migrate_pages is also useful if the processes available memory nodes have changed through cpuset operations to manually move the processes memory. Paul Jackson is working on an automated mechanism that will allow an automatic migration if the cpuset of a process is changed. However, a user may decide to manually control the migration. This implementation is put into the policy layer since it uses concepts and functions that are also needed for mbind and friends. The patch also provides a do_migrate_pages function that may be useful for cpusets to automatically move memory. sys_migrate_pages does not modify policies in contrast to Ray's implementation. The current code here is based on the swap based page migration capability and thus is not able to preserve the physical layout relative to it containing nodeset (which may be a cpuset). When direct page migration becomes available then the implementation needs to be changed to do a isomorphic move of pages between different nodesets. The current implementation simply evicts all pages in source nodeset that are not in the target nodeset. Patch supports ia64, i386 and x86_64. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:42 -08:00
Christoph Lameter	dc9aa5b9d6	[PATCH] Swap Migration V5: MPOL_MF_MOVE interface Add page migration support via swap to the NUMA policy layer This patch adds page migration support to the NUMA policy layer. An additional flag MPOL_MF_MOVE is introduced for mbind. If MPOL_MF_MOVE is specified then pages that do not conform to the memory policy will be evicted from memory. When they get pages back in new pages will be allocated following the numa policy. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	7cbe34cf86	[PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration support Include page migration if the system is NUMA or having a memory model that allows distinct areas of memory (SPARSEMEM, DISCONTIGMEM). And: - Only include lru_add_drain_per_cpu if building for an SMP system. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	49d2e9cc45	[PATCH] Swap Migration V5: migrate_pages() function This adds the basic page migration function with a minimal implementation that only allows the eviction of pages to swap space. Page eviction and migration may be useful to migrate pages, to suspend programs or for remapping single pages (useful for faulty pages or pages with soft ECC failures) The process is as follows: The function wanting to migrate pages must first build a list of pages to be migrated or evicted and take them off the lru lists via isolate_lru_page(). isolate_lru_page determines that a page is freeable based on the LRU bit set. Then the actual migration or swapout can happen by calling migrate_pages(). migrate_pages does its best to migrate or swapout the pages and does multiple passes over the list. Some pages may only be swappable if they are not dirty. migrate_pages may start writing out dirty pages in the initial passes over the pages. However, migrate_pages may not be able to migrate or evict all pages for a variety of reasons. The remaining pages may be returned to the LRU lists using putback_lru_pages(). Changelog V4->V5: - Use the lru caches to return pages to the LRU Changelog V3->V4: - Restructure code so that applying patches to support full migration does require minimal changes. Rename swapout_pages() to migrate_pages(). Changelog V2->V3: - Extract common code from shrink_list() and swapout_pages() Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: "Michael Kerrisk" <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	930d915252	[PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap Add PF_SWAPWRITE to control a processes permission to write to swap. - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd and pdflush - Set PF_SWAPWRITE flag for kswapd and pdflush Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Christoph Lameter	21eac81f25	[PATCH] Swap Migration V5: LRU operations This is the start of the `swap migration' patch series. Swap migration allows the moving of the physical location of pages between nodes in a numa system while the process is running. This means that the virtual addresses that the process sees do not change. However, the system rearranges the physical location of those pages. The main intent of page migration patches here is to reduce the latency of memory access by moving pages near to the processor where the process accessing that memory is running. The patchset allows a process to manually relocate the node on which its pages are located through the MF_MOVE and MF_MOVE_ALL options while setting a new memory policy. The pages of process can also be relocated from another process using the sys_migrate_pages() function call. Requires CAP_SYS_ADMIN. The migrate_pages function call takes two sets of nodes and moves pages of a process that are located on the from nodes to the destination nodes. Manual migration is very useful if for example the scheduler has relocated a process to a processor on a distant node. A batch scheduler or an administrator can detect the situation and move the pages of the process nearer to the new processor. sys_migrate_pages() could be used on non-numa machines as well, to force all of a particualr process's pages out to swap, if someone thinks that's useful. Larger installations usually partition the system using cpusets into sections of nodes. Paul has equipped cpusets with the ability to move pages when a task is moved to another cpuset. This allows automatic control over locality of a process. If a task is moved to a new cpuset then also all its pages are moved with it so that the performance of the process does not sink dramatically (as is the case today). Swap migration works by simply evicting the page. The pages must be faulted back in. The pages are then typically reallocated by the system near the node where the process is executing. For swap migration the destination of the move is controlled by the allocation policy. Cpusets set the allocation policy before calling sys_migrate_pages() in order to move the pages as intended. No allocation policy changes are performed for sys_migrate_pages(). This means that the pages may not faulted in to the specified nodes if no allocation policy was set by other means. The pages will just end up near the node where the fault occurred. There's another patch series in the pipeline which implements "direct migration". The direct migration patchset extends the migration functionality to avoid going through swap. The destination node of the relation is controllable during the actual moving of pages. The crutch of using the allocation policy to relocate is not necessary and the pages are moved directly to the target. Its also faster since swap is not used. And sys_migrate_pages() can then move pages directly to the specified node. Implement functions to isolate pages from the LRU and put them back later. This patch: An earlier implementation was provided by Hirokazu Takahashi <taka@valinux.co.jp> and IWAMOTO Toshihiro <iwamoto@valinux.co.jp> for the memory hotplug project. From: Magnus This breaks out isolate_lru_page() and putpack_lru_page(). Needed for swap migration. Signed-off-by: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:41 -08:00
Nick Piggin	48db57f8ff	[PATCH] mm: free_pages opt Try to streamline free_pages_bulk by ensuring callers don't pass in a 'count' that exceeds the list size. Some cleanups: Rename __free_pages_bulk to __free_one_page. Put the page list manipulation from __free_pages_ok into free_one_page. Make __free_pages_ok static. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Nick Piggin	23316bc86f	[PATCH] mm: cleanup zone_pcp Use zone_pcp everywhere even though NUMA code "knows" the internal details of the zone. Stop other people trying to copy, and it looks nicer. Also, only print the pagesets of online cpus in zoneinfo. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Rohit Seth	8ad4b1fb82	[PATCH] Make high and batch sizes of per_cpu_pagelists configurable As recently there has been lot of traffic on the right values for batch and high water marks for per_cpu_pagelists. This patch makes these two variables configurable through /proc interface. A new tunable /proc/sys/vm/percpu_pagelist_fraction is added. This entry controls the fraction of pages at most in each zone that are allocated for each per cpu page list. The min value for this is 8. It means that we don't allow more than 1/8th of pages in each zone to be allocated in any single per_cpu_pagelist. The batch value of each per cpu pagelist is also updated as a result. It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Andrew Morton	9d0243bca3	[PATCH] drop-pagecache Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel to discard as much pagecache and/or reclaimable slab objects as it can. THis operation requires root permissions. It won't drop dirty data, so the user should run `sync' first. Caveats: a) Holds inode_lock for exorbitant amounts of time. b) Needs to be taught about NUMA nodes: propagate these all the way through so the discarding can be controlled on a per-node basis. This is a debugging feature: useful for getting consistent results between filesystem benchmarks. We could possibly put it under a config option, but it's less than 300 bytes. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Christoph Lameter	bec6b0c89b	[PATCH] slab: remove nested #ifdef CONFIG_NUMA For some reason there is an #ifdef CONFIG_NUMA within another #ifdef CONFIG_NUMA in the page allocator. Remove innermost #ifdef CONFIG_NUMA Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:40 -08:00
Pekka Enberg	b28a02de8c	[PATCH] slab: fix code formatting The slab allocator code is inconsistent in coding style and messy. For this patch, I ran Lindent for mm/slab.c and fixed up goofs by hand. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	4d268eba11	[PATCH] slab: extract slab order calculation to separate function This patch moves the ugly loop that determines the 'optimal' size (page order) of cache slabs from kmem_cache_create() to a separate function and cleans it up a bit. Thanks to Matthew Wilcox for the help with this patch. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	85289f98dd	[PATCH] slab: extract slabinfo header printing to separate function This patch extracts slabinfo header printing to a separate function print_slabinfo_header() to make s_start() more readable. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Pekka Enberg	f9f7500521	[PATCH] slab: remove unused align parameter from alloc_percpu __alloc_percpu and alloc_percpu both take an 'align' argument which is completely ignored. snmp6_mib_init() in net/ipv6/af_inet6.c attempts to use it, but it will be ignored. Therefore, remove the 'align' argument and fixup the lone caller. Signed-off-by: Matthew Dobson <colpatch@us.ibm.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:39 -08:00
Andrew Morton	84c2008af0	[PATCH] revert "mm: page_state fixes" Hugh says: page_alloc_cpu_notify() specifically contains code to /* Add dead cpu's page_states to our own. */ which handles this more efficiently. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-08 20:12:38 -08:00
Arnd Bergmann	67207b9664	[PATCH] spufs: The SPU file system, base This is the current version of the spu file system, used for driving SPEs on the Cell Broadband Engine. This release is almost identical to the version for the 2.6.14 kernel posted earlier, which is available as part of the Cell BE Linux distribution from http://www.bsc.es/projects/deepcomputing/linuxoncell/. The first patch provides all the interfaces for running spu application, but does not have any support for debugging SPU tasks or for scheduling. Both these functionalities are added in the subsequent patches. See Documentation/filesystems/spufs.txt on how to use spufs. Signed-off-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2006-01-09 14:49:12 +11:00
Andrew Morton	22905f775d	identify multipage ->writepages() calls NFS needs to be able to distinguish between single-page ->writepage() calls and multipage ->writepages() calls. For the single-page writepage calls NFS can kick off the I/O within the context of ->writepage(). For multipage ->writepages calls, nfs_writepage() will leave the I/O pending and nfs_writepages() will kick off the I/O when it all has been queued up within NFS. Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2006-01-06 14:58:38 -05:00
Rafael J. Wysocki	3a291a20bd	[PATCH] mm: add a new function (needed for swap suspend) This adds the function get_swap_page_of_type() allowing us to specify an index in swap_info[] and select a swap_info_struct structure to be used for allocating a swap page. This function (or another one of similar functionality) will be necessary for implementing the image-writing part of swsusp in the user space. It can also be used for simplifying the current in-kernel implementation of the image-writing part of swsusp. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:43 -08:00
Anton Blanchard	c898ec16e8	[PATCH] allow flatmem to be disabled when only sparsemem is implemented On architectures that implement sparsemem but not discontigmem we want to be able to hide the flatmem option in some cases. On ppc64 for example, when we select NUMA we must not select flatmem. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:37 -08:00
David Howells	b0e15190ea	[PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU The attached patch makes the SYSV IPC shared memory facilities use the new ramfs facilities on a no-MMU kernel. The following changes are made: (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to allow the IPC SHM facilities to commune with the tiny-shmem and shmem code. (2) ramfs files now need resizing using do_truncate() rather than by modifying the inode size directly (see shmem_file_setup()). This causes ramfs to attempt to bind a block of pages of sufficient size to the inode. (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:32 -08:00
Nick Piggin	a74609fafa	[PATCH] mm: page_state opt Optimise page_state manipulations by introducing interrupt unsafe accessors to page_state fields. Callers must provide their own locking (either disable interrupts or not update from interrupt context). Switch over the hot callsites that can easily be moved under interrupts off sections. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:29 -08:00
Christoph Lameter	070f80326a	[PATCH] build_zonelists_node(): rename args Give j and r meaningful names. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	02a68a5ebc	[PATCH] Fix zone policy determination The use k in the inner loop means that the highest zone nr is always used if any zone of a node is populated. This means that the policy zone is not correctly determined on arches that do no use HIGHMEM like ia64. Change the loop to decrement k which also simplifies the BUG_ON. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	4be38e351c	[PATCH] mm: move determination of policy_zone into page allocator Currently the function to build a zonelist for a BIND policy has the side effect to set the policy_zone. This seems to be a bit strange. policy zone seems to not be initialized elsewhere and therefore 0. Do we police ZONE_DMA if no bind policy has been used yet? This patch moves the determination of the zone to apply policies to into the page allocator. We determine the zone while building the zonelist for nodes. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Christoph Lameter	1a93205bdf	[PATCH] mm: simplify build_zonelists_node by removing the case statement. Simplify build_zonelists_node by removing the case statement. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Con Kolivas	f3fe65122d	[PATCH] mm: add populated_zone() helper There are numerous places we check whether a zone is populated or not. Provide a helper function to check for populated zones and convert all checks for zone->present_pages. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Andrew Morton	80bfed904c	[PATCH] consolidate lru_add_drain() and lru_drain_cache() Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Rajesh Shah <rajesh.shah@intel.com> Cc: Li Shaohua <shaohua.li@intel.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:28 -08:00
Andrew Morton	210fe53030	[PATCH] vmscan: balancing fix Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: The shrink_zone() logic can, under some circumstances, cause far too many pages to be reclaimed. Say, we're scanning at high priority and suddenly hit a large number of reclaimable pages on the LRU. Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. Problem is, this change caused significant imbalance in inter-zone scan balancing by truncating scans of larger zones. Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone balancing algorithm would require that if we're scanning 100 pages of ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are reclaimed. Thus effectively causing smaller zones to be scanned relatively harder than large ones. Now I need to remember what the workload was which caused me to write this patch originally, then fix it up in a different way... Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	41e9b63b35	[PATCH] mm: pfault optimisation This atomic operation is superfluous: the pte will be added with the referenced bit set, and the page will be referenced through this mapping after the page fault handler returns anyway. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	9617d95e6e	[PATCH] mm: rmap optimisation Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:27 -08:00
Nick Piggin	224abf92b2	[PATCH] mm: bad_page optimisation Cut down size slightly by not passing bad_page the function name (it should be able to be determined by dump_stack()). And cut down the number of printks in bad_page. Also, cut down some branching in the destroy_compound_page path. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nick Piggin	9328b8faae	[PATCH] mm: dma32 zone statistics Add dma32 to zone statistics. Also attempt to arrange struct page_state a bit better (visually). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Andrew Morton	7756b9e4e3	[PATCH] kill last zone_reclaim() bits Remove the last bits of Martin's ill-fated sys_set_zone_reclaim(). Cc: Martin Hicks <mort@wildopensource.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nikita Danilov	bbfbb7cec9	[PATCH] find_lock_page(): call __lock_page() directly. As find_lock_page() already checks with TestSetPageLocked() that page is locked, there is no need to call lock_page() that will try-lock page again (chances of page being unlocked in between are small). Call __lock_page() directly, this saves one atomic operation. Also, mark truncate-while-slept path as unlikely while we are here. (akpm: ug. But this is actually a common path for normal old read()s against a page which is under readahead I/O so ho-hum.) Signed-off-by: Nikita Danilov <danilov@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
David Howells	a226f6c899	[PATCH] FRV: Clean up bootmem allocator's page freeing algorithm The attached patch cleans up the way the bootmem allocator frees pages. A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c that is called from mm/bootmem.c to turn pages over to the main allocator. All the bits of code to initialise pages (clearing PG_reserved and setting the page count) are moved to here. The checks on page validity are removed, on the assumption that the struct page arrays will have been prepared correctly. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Ravikiran G Thirumalai	008857c1a4	[PATCH] Cleanup bootmem allocator and fix alloc_bootmem_low Patch cleans up the alloc_bootmem fix for swiotlb. Patch removes alloc_bootmem__limit api and fixes alloc_boot_low api to do the right thing -- allocate from low32 memory. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:26 -08:00
Nick Piggin	085cc7d5de	[PATCH] mm: page_alloc cleanups Small cleanups that does not change generated code with the gcc's I've tested with. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	a86b1f5316	[PATCH] mm: page_state fixes read_page_state and __get_page_state only traverse online CPUs, which will cause results to fluctuate when CPUs are plugged in or out. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	2d92c5c915	[PATCH] mm: remove pcp low struct per_cpu_pages.low is useless. Remove it. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	13e7444b0e	[PATCH] mm: remove bad_range bad_range is supposed to be a temporary check. It would be a pity to throw it out. Make it depend on CONFIG_DEBUG_VM instead. CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid in the page allocator. Add that to page_is_buddy instead. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	92be2e33b1	[PATCH] mm: microopt conditions Micro optimise some conditionals where we don't need lazy evaluation. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	77a8a78834	[PATCH] mm: set_page_refs opt Inline set_page_refs. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Nick Piggin	c54ad30c78	[PATCH] mm: pagealloc opt Slightly optimise some page allocation and freeing functions by taking advantage of knowing whether or not interrupts are disabled. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:25 -08:00
Hugh Dickins	c484d41042	[PATCH] mm: free_pages_and_swap_cache opt Minor optimization (though it doesn't help in the PREEMPT case, severely constrained by small ZAP_BLOCK_SIZE). free_pages_and_swap_cache works in chunks of 16, calling release_pages which works in chunks of PAGEVEC_SIZE. But PAGEVEC_SIZE was dropped from 16 to 14 in 2.6.10, so we're now doing more spin_lock_irq'ing than necessary: use PAGEVEC_SIZE throughout. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:24 -08:00
Mike Kravetz	a94b3ab7ea	[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM could handle pSeries numa layouts. However, support for DISCONTIGMEM has been replaced by SPARSEMEM on powerpc. As a result, this config option and supporting code is no longer needed. I have already sent a patch to Paul that removes the option from powerpc specific code. This removes the arch independent piece. Doesn't really matter which is applied first. Signed-off-by: Mike Kravetz <kravetz@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:24 -08:00
Christoph Lameter	6bda666a03	[PATCH] hugepages: fold find_or_alloc_pages into huge_no_page() The number of parameters for find_or_alloc_page increases significantly after policy support is added to huge pages. Simplify the code by folding find_or_alloc_huge_page() into hugetlb_no_page(). Adam Litke objected to this piece in an earlier patch but I think this is a good simplification. Diffstat shows that we can get rid of almost half of the lines of find_or_alloc_page(). If we can find no consensus then lets simply drop this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	21abb1478a	[PATCH] Remove old node based policy interface from mempolicy.c mempolicy.c contains provisional interface for huge page allocation based on node numbers. This is in use in SLES9 but was never used (AFAIK) in upstream versions of Linux. Huge page allocations now use zonelists to figure out where to allocate pages. The use of zonelists allows us to find the closest hugepage which was the consideration of the NUMA distance for huge page allocations. Remove the obsolete functions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	5da7ca8607	[PATCH] Add NUMA policy support for huge pages. The huge_zonelist() function in the memory policy layer provides an list of zones ordered by NUMA distance. The hugetlb layer will walk that list looking for a zone that has available huge pages but is also in the nodeset of the current cpuset. This patch does not contain the folding of find_or_alloc_huge_page() that was controversial in the earlier discussion. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Christoph Lameter	96df9333c9	[PATCH] mm: dequeue a huge page near to this node This was discussed at http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2 This patch changes the dequeueing to select a huge page near the node executing instead of always beginning to check for free nodes from node 0. This will result in a placement of the huge pages near the executing processor improving performance. The existing implementation can place the huge pages far away from the executing processor causing significant degradation of performance. The search starting from zero also means that the lower zones quickly run out of memory. Selecting a huge page near the process distributed the huge pages better. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
David Gibson	1e8f889b10	[PATCH] Hugetlb: Copy on Write support Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be supported. This helps us to safely use hugetlb pages in many more applications. The patch makes the following changes. If needed, I also have it broken out according to the following paragraphs. 1. Add a pair of functions to set/clear write access on huge ptes. The writable check in make_huge_pte is moved out to the caller for use by COW later. 2. Hugetlb copy-on-write requires special case handling in the following situations: - copy_hugetlb_page_range() - Copied pages must be write protected so a COW fault will be triggered (if necessary) if those pages are written to. - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the page cache. MAP_PRIVATE pages still need to be locked however. 3. Provide hugetlb_cow() and calls from hugetlb_fault() and hugetlb_no_page() which handles the COW fault by making the actual copy. 4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps will be allowed. Make MAP_HUGETLB exempt from the depricated VM_RESERVED mapping check. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:23 -08:00
Adam Litke	86e5216f8d	[PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COW This patch splits the "no_page()" type activity into its own function, hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults and delegates to the appropriate handler depending on the type of fault. Right now we still have only hugetlb_no_page() but a later patch introduces a COW fault. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Adam Litke	85ef47f74a	[PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page find_lock_huge_page() isn't a great name, since it does extra things not analagous to find_lock_page(). Rename it find_or_alloc_huge_page() which is closer to the mark. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Adam Litke	f0916794f0	[PATCH] Hugetlb: Remove duplicate i_size check cleanup Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Badari Pulavarty	f6b3ec238d	[PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store Here is the patch to implement madvise(MADV_REMOVE) - which frees up a given range of pages & its associated backing store. Current implementation supports only shmfs/tmpfs and other filesystems return -ENOSYS. "Some app allocates large tmpfs files, then when some task quits and some client disconnect, some memory can be released. However the only way to release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli Databases want to use this feature to drop a section of their bufferpool (shared memory segments) - without writing back to disk/swap space. This feature is also useful for supporting hot-plug memory on UML. Concerns raised by Andrew Morton: - "We have no plan for holepunching! If we _do_ have such a plan (or might in the future) then what would the API look like? I think sys_holepunch(fd, start, len), so we should start out with that." - Using madvise is very weird, because people will ask "why do I need to mmap my file before I can stick a hole in it?" - None of the other madvise operations call into the filesystem in this manner. A broad question is: is this capability an MM operation or a filesytem operation? truncate, for example, is a filesystem operation which sometimes has MM side-effects. madvise is an mm operation and with this patch, it gains FS side-effects, only they're really, really significant ones." Comments: - Andrea suggested the fs operation too but then it's more efficient to have it as a mm operation with fs side effects, because they don't immediatly know fd and physical offset of the range. It's possible to fixup in userland and to use the fs operation but it's more expensive, the vmas are already in the kernel and we can use them. Short term plan & Future Direction: - We seem to need this interface only for shmfs/tmpfs files in the short term. We have to add hooks into the filesystem for correctness and completeness. This is what this patch does. - In the future, plan is to support both fs and mmap apis also. This also involves (other) filesystem specific functions to be implemented. - Current patch doesn't support VM_NONLINEAR - which can be addressed in the future. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrea Arcangeli <andrea@suse.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Hans Reiser	d7339071f6	[PATCH] reiser4: vfs: add truncate_inode_pages_range() This patch makes truncate_inode_pages_range from truncate_inode_pages. truncate_inode_pages became a one-liner call to truncate_inode_pages_range. Reiser4 needs truncate_inode_pages_ranges because it tries to keep correspondence between existences of metadata pointing to data pages and pages to which those metadata point to. So, when metadata of certain part of file is removed from filesystem tree, only pages of corresponding range are to be truncated. (Needed by the madvise(MADV_REMOVE) patch) Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:22 -08:00
Andy Whitcroft	5ac24eefd1	[PATCH] memhotplug: __add_section remove unused pgdat definition __add_section defines an unused pointer to the zones pgdat. Remove this definition. This fixes a compile warning. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:21 -08:00
Paul Jackson	47f3a867f6	[PATCH] mm: fix __alloc_pages cpuset ALLOC_* flags Two changes to the setting of the ALLOC_CPUSET flag in mm/page_alloc.c:__alloc_pages() - A bug fix - the "ignoring mins" case should not be honoring ALLOC_CPUSET. This case of all cases, since it is handling a request that will free up more memory than is asked for (exiting tasks, e.g.) should be allowed to escape cpuset constraints when memory is tight. - A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC (!wait) requests. With this, cpuset confinement applies to all requests except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch, I can remove the ALLOC_CPUSET flag entirely. Since I don't know any real reason this logic has to be either way, I am choosing the path of the simplest code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-06 08:33:21 -08:00
Zach Brown	994fc28c7b	[PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE readpage(), prepare_write(), and commit_write() callers are updated to understand the special return code AOP_TRUNCATED_PAGE in the style of writepage() and WRITEPAGE_ACTIVATE. AOP_TRUNCATED_PAGE tells the caller that the callee has unlocked the page and that the operation should be tried again with a new page. OCFS2 uses this to detect and work around a lock inversion in its aop methods. There should be no change in behaviour for methods that don't return AOP_TRUNCATED_PAGE. WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are made enums so that kerneldoc can be used to document their semantics. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2006-01-03 11:45:42 -08:00
Andi Kleen	8f493d797b	[PATCH] Make sure interleave masks have at least one node set Otherwise a bad mem policy system call can confuse the interleaving code into referencing undefined nodes. Originally reported by Doug Chapman I was told it's CVE-2005-3358 (one has to love these security people - they make everything sound important) Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-01-02 17:01:42 -08:00
Linus Torvalds	4d7672b462	Make sure we copy pages inserted with "vm_insert_page()" on fork The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-16 10:21:23 -08:00
Al Viro	78d9955bb0	[PATCH] missing prototype (mm/page_alloc.c) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-15 10:04:30 -08:00
Yasunori Goto	118c71bcac	[PATCH] Fix calculation of grow_pgdat_span() in mm/memory_hotplug.c The calculation for node_spanned_pages at grow_pgdat_span() is clearly wrong. This is patch for it. (Please see grow_zone_span() to compare. It is correct.) Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-13 21:18:16 -08:00
Linus Torvalds	1ff8038988	get_user_pages: don't try to follow PFNMAP pages Nick Piggin points out that a few drivers play games with VM_IO (why? who knows..) and thus a pfn-remapped area may not have that bit set even if remap_pfn_range() set it originally. So make it explicit in get_user_pages() that we don't follow VM_PFNMAP pages, since pretty much by definition they do not have a "struct page" associated with them. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-12 16:24:33 -08:00
Haren Myneni	66d43e98ea	[PATCH] fix in __alloc_bootmem_core() when there is no free page in first node's memory Hitting BUG_ON() in __alloc_bootmem_core() when there is no free page available in the first node's memory. For the case of kdump on PPC64 (Power 4 machine), the captured kernel is used two memory regions - memory for TCE tables (tce-base and tce-size at top of RAM and reserved) and captured kernel memory region (crashk_base and crashk_size). Since we reserve the memory for the first node, we should be returning from __alloc_bootmem_core() to search for the next node (pg_dat). Currently, find_next_zero_bit() is returning the n^th bit (eidx) when there is no free page. Then, test_bit() is failed since we set 0xff only for the actual size initially (init_bootmem_core()) even though rounded up to one page for bdata->node_bootmem_map. We are hitting the BUG_ON after failing to enter second "for" loop. Signed-off-by: Haren Myneni <haren@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-12 08:57:45 -08:00
Linus Torvalds	67121172f9	Allow arbitrary read-only shared pfn-remapping too The VM layer (for historical reasons) turns a read-only shared mmap into a private-like mapping with the VM_MAYWRITE bit clear. Thus checking just VM_SHARED isn't actually sufficient. So use a trivial helper function for the cases where we wanted to inquire if a mapping was COW-like or not. Moo! Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 20:38:17 -08:00
Linus Torvalds	7fc7e2eeec	Remove (at least temporarily) the "incomplete PFN mapping" support With the previous commit, we can handle arbitrary shared re-mappings even without this complexity, and since the only known private mappings are for strange users of /dev/mem (which never create an incomplete one), there seems to be no reason to support it. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 19:57:52 -08:00
Linus Torvalds	fb155c1619	Allow arbitrary shared PFNMAP's A shared mapping doesn't cause COW-pages, so we don't need to worry about the whole vm_pgoff logic to decide if a PFN-remapped page has gone through COW or not. This makes it possible to entirely avoid the special "partial remapping" logic for the common case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-11 19:46:02 -08:00
Linus Torvalds	e3c3374fbf	Make vm_insert_page() available to NVidia module It used to use remap_pfn_range(), which wasn't GPL-only either, and the new interface is actually simpler and does more checking, so we shouldn't unnecessarily discourage people from switching over. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-03 20:48:11 -08:00
Nick Piggin	0ceaacc978	[PATCH] Fix up per-cpu page batch sizes The code to clamp batch sizes to 2^n - 1 went missing and an extra check got added, which must have been a hunk of the "higer order pcp batch refills" work sneaking in. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-12-03 20:46:40 -08:00
Linus Torvalds	a145dd411e	VM: add "vm_insert_page()" function This is what a lot of drivers will actually want to use to insert individual pages into a user VMA. It doesn't have the old PageReserved restrictions of remap_pfn_range(), and it doesn't complain about partial remappings. The page you insert needs to be a nice clean kernel allocation, so you can't insert arbitrary page mappings with this, but that's not what people want. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-30 09:35:19 -08:00
Trond Myklebust	49c91fb01f	[PATCH] VM: Fix typos in get_locked_pte Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 17:29:57 -08:00
Hugh Dickins	325f04dbca	[PATCH] pfnmap: do_no_page BUG_ON again Use copy_user_highpage directly instead of cow_user_page in do_no_page: in the immediately following page_cache_release, and elsewhere, it is assuming that new_page is normal. If any VM_PFNMAP driver can get to do_no_page, it's just a BUG (but not in the case of do_anonymous_page). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:09:17 -08:00
Hugh Dickins	e5bbe4dfc8	[PATCH] pfnmap: remove src_page from do_wp_page Clean away do_wp_page's "src_page": cow_user_page makes it unnecessary. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:09:16 -08:00
Linus Torvalds	5d2a2dbbc1	cow_user_page: fix page alignment High Dickins points out that the user virtual address passed to the page fault handler isn't necessarily page-aligned. Also, add a comment on why the copy could fail for the user address case. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:07:55 -08:00
Linus Torvalds	c9cfcddfd6	VM: add common helper function to create the page tables This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 14:03:14 -08:00
Linus Torvalds	238f58d898	Support strange discontiguous PFN remappings These get created by some drivers that don't generally even want a pfn remapping at all, but would really mostly prefer to just map pages they've allocated individually instead. For now, create a helper function that turns such an incomplete PFN remapping call into a loop that does that explicit mapping. In the long run we almost certainly want to export a totally different interface for that, though. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 13:01:56 -08:00
Ben Collins	eca351336a	[PATCH] Fix missing pfn variables caused by vm changes I image this showed up because of "unused var..." when the changes occured, because flush_cache_page() is a noop in most places. This showed up for me on parisc however, where flush_cache_page() is a real function. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 12:57:17 -08:00
Nick Piggin	fa2a455b02	[PATCH] Fix vma argument in get_usr_pages() for gate areas The system call gate area handling called vm_normal_page() with the wrong vma (which was always NULL, and caused an oops). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-29 07:53:32 -08:00
Andrea Arcangeli	ea164d73a7	[PATCH] shrinker->nr = LONG_MAX means deadlock for icache With Andrew Morton <akpm@osdl.org> The slab scanning code tries to balance the scanning rate of slabs versus the scanning rate of LRU pages. To do this, it retains state concerning how many slabs have been scanned - if a particular slab shrinker didn't scan enough objects, we remember that for next time, and scan more objects on the next pass. The problem with this is that with (say) a huge number of GFP_NOIO direct-reclaim attempts, the number of objects which are to be scanned when we finally get a GFP_KERNEL request can be huge. Because some shrinker handlers just bail out if !__GFP_FS. So the patch clamps the number of objects-to-be-scanned to 2* the total number of objects in the slab cache. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:26 -08:00
Rik van Riel	f7b7fd8f3e	[PATCH] temporarily disable swap token on memory pressure Some users (hi Zwane) have seen a problem when running a workload that eats nearly all of physical memory - th system does an OOM kill, even when there is still a lot of swap free. The problem appears to be a very big task that is holding the swap token, and the VM has a very hard time finding any other page in the system that is swappable. Instead of ignoring the swap token when sc->priority reaches 0, we could simply take the swap token away from the memory hog and make sure we don't give it back to the memory hog for a few seconds. This patch resolves the problem Zwane ran into. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:25 -08:00
Nick Piggin	3148890bfa	[PATCH] mm: __alloc_pages cleanup fix I believe this patch is required to fix breakage in the asynch reclaim watermark logic introduced by this patch: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee Just some background of the watermark logic in case it isn't clear... Basically what we have is this: --- pages_high \| \| (a) \| --- pages_low \| \| (b) \| --- pages_min \| \| (c) \| --- 0 Now when pages_low is reached, we want to kick asynch reclaim, which gives us an interval of "b" before we must start synch reclaim, and gives kswapd an interval of "a" before it need go back to sleep. When pages_min is reached, normal allocators must enter synch reclaim, but PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive allocations, etc.) get access to varying amounts of the reserve "c". Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Seth, Rohit" <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:24 -08:00
Alan Stern	e0f39591cc	[PATCH] Workaround for gcc 2.96 (undefined references) LD .tmp_vmlinux1 mm/built-in.o(.text+0x100d6): In function `copy_page_range': : undefined reference to `__pud_alloc' mm/built-in.o(.text+0x1010b): In function `copy_page_range': : undefined reference to `__pmd_alloc' mm/built-in.o(.text+0x11ef4): In function `__handle_mm_fault': : undefined reference to `__pud_alloc' fs/built-in.o(.text+0xc930): In function `install_arg_page': : undefined reference to `__pud_alloc' make: *** [.tmp_vmlinux1] Error 1 Those missing references in mm/memory.c arise from this code in include/linux/mm.h, combined with the fact that __PGTABLE_PMD_FOLDED and __PGTABLE_PUD_FOLDED are both set and __ARCH_HAS_4LEVEL_HACK is not: /* * The following ifdef needed to get the 4level-fixup.h header to work. * Remove it when 4level-fixup.h has been removed. / #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK) static inline pud_t pud_alloc(struct mm_struct mm, pgd_t pgd, unsigned long address) { return (unlikely(pgd_none(pgd)) && __pud_alloc(mm, pgd, address))? NULL: pud_offset(pgd, address); } static inline pmd_t pmd_alloc(struct mm_struct mm, pud_t pud, unsigned long address) { return (unlikely(pud_none(pud)) && __pmd_alloc(mm, pud, address))? NULL: pmd_offset(pud, address); } #endif / CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */ With my configuration the pgd_none and pud_none routines are inlines returning a constant 0. Apparently the old compiler avoids generating calls to __pud_alloc and __pmd_alloc but still lists them as undefined references in the module's symbol table. I don't know which change caused this problem. I think it was added somewhere between 2.6.14 and 2.6.15-rc1, because I remember building several 2.6.14-rc kernels without difficulty. However I can't point to an individual culprit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:42:22 -08:00
Linus Torvalds	6aab341e0a	mm: re-architect the VM_UNPAGED logic This replaces the (in my opinion horrible) VM_UNMAPPED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-28 14:34:23 -08:00
Oleg Drokin	479ef592f3	[PATCH] 32bit integer overflow in invalidate_inode_pages2() Fix a 32 bit integer overflow in invalidate_inode_pages2_range. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-23 16:08:39 -08:00
Hugh Dickins	7b6ac9dffe	[PATCH] mm: update split ptlock Kconfig Closer attention to the arithmetic shows that neither ppc64 nor sparc really uses one page for multiple page tables: how on earth could they, while pte_alloc_one returns just a struct page pointer, with no offset? Well, arm26 manages it by returning a pte_t pointer cast to a struct page pointer, harumph, then compensating in its pmd_populate. But arm26 is never SMP, so it's not a problem for split ptlock either. And the PA-RISC situation has been recently improved: CONFIG_PA20 works without the 16-byte alignment which inflated its spinlock_t. But the current union of spinlock_t with private does make the 7xxx struct page significantly larger, even without debug, so disable its split ptlock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-23 16:08:38 -08:00
Eric Paris	0bd0f9fb19	[PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of nr_huge_pages If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously it is possible for the nr_huge_pages variable to become incorrect. There is no locking in the set_max_huge_pages function around alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers to alloc_fresh_huge_page could race against each other as could a call to alloc_fresh_huge_page and a call to update_and_free_page. This patch just expands the area covered by the hugetlb_lock to cover the call into alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section is performance critical where more specific locking would be needed. My reproducer was to run a couple copies of the following script simultaneously while [ true ]; do echo 1000 > /proc/sys/vm/nr_hugepages echo 500 > /proc/sys/vm/nr_hugepages echo 750 > /proc/sys/vm/nr_hugepages echo 100 > /proc/sys/vm/nr_hugepages echo 0 > /proc/sys/vm/nr_hugepages done and then watch /proc/meminfo and eventually you will see things like HugePages_Total: 100 HugePages_Free: 109 After applying the patch all seemed well. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:43 -08:00
Hugh Dickins	689bcebfda	[PATCH] unpaged: PG_reserved bad_page It used to be the case that PG_reserved pages were silently never freed, but in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should work through such cases as they appear, fixing the code; but for now it's safer to issue the message without freeing the page, leaving PG_reserved set. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	f57e88a8d8	[PATCH] unpaged: ZERO_PAGE in VM_UNPAGED It's strange enough to be looking out for anonymous pages in VM_UNPAGED areas, let's not insert the ZERO_PAGE there - though whether it would matter will depend on what we decide about ZERO_PAGE refcounting. But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED area, do_no_page should never be: just BUG_ON. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	ee498ed730	[PATCH] unpaged: anon in VM_UNPAGED copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area, zap_pte_range needs to free them, do_wp_page needs to COW them: just like ordinary pages, not like the unpaged. But recognizing them is a little subtle: because PageReserved is no longer a condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the distro permits, and whether it's advisable on this or that architecture, is another matter). So if we can see a PageAnon, it may not be ours to mess with (or may be ours from elsewhere in the address space). I suspect there's an entertaining insoluble self-referential problem here, but the page_is_anon function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will always be an odd choice. In updating the comment on page_address_in_vma, noticed a potential NULL dereference, in a path we don't actually take, but fixed it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	920fc356f5	[PATCH] unpaged: COW on VM_UNPAGED Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do Copy-On-Write without touching the VM_UNPAGED's page counts - but this is incomplete, because the anonymous page it inserts will itself need to be handled, here and in other functions - next patch. We still don't copy the page if the pfn is invalid, because the copy_user_highpage interface does not allow it. But that's not been a problem in the past: can be added in later if the need arises. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	101d2be764	[PATCH] unpaged: VM_NONLINEAR VM_RESERVED There's one peculiar use of VM_RESERVED which the previous patch left behind: because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout cursor, but should never meet VM_RESERVED vmas, it was a way of extending VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose. But that's an empty set - they don't have the populate function required. So just throw away those VM_RESERVED tests. But one more interesting in rmap.c has to go too: try_to_unmap_one will want to swap out an anonymous page from VM_RESERVED or VM_UNPAGED area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	0b14c179a4	[PATCH] unpaged: VM_UNPAGED Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	664beed019	[PATCH] unpaged: unifdefed PageCompound It looks like snd_xxx is not the only nopage to be using PageReserved as a way of holding a high-order page together: which no longer works, but is masked by our failure to free from VM_RESERVED areas. We cannot fix that bug without first substituting another way to hold the high-order page together, while farming out the 0-order pages from within it. That's just what PageCompound is designed for, but it's been kept under CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out- of-line put_page), doesn't slow down what most needs to be fast (already using hugetlb), and unifies the way we handle high-order pages. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	83e9b7e929	[PATCH] unpaged: private write VM_RESERVED The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed with -EACCES: because do_wp_page lacks the refinement to COW pages in those areas, nor do we expect to find anonymous pages in them; and it seemed just bloat to add code for handling such a peculiar case. But immediately it caused vbetool and ddcprobe (using lrmi) to fail. So revert the "deprecated" messages, letting mmap and mprotect succeed. But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've added the code to do it right: so this particular patch is only good if the app doesn't really need to write to that private area. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:42 -08:00
Hugh Dickins	ed5297a940	[PATCH] unpaged: get_user_pages VM_RESERVED The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the areas flagged VM_RESERVED in place of PageReserved. That is correct in theory - we ought not to interfere with struct pages in such a reserved area; but in practice it broke BTTV for one. So revert to prohibiting only on VM_IO: if someone gets into trouble with get_user_pages on VM_RESERVED, it'll just be a "don't do that". You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the first place, but now's not the time for breaking drivers without notice. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-22 09:13:41 -08:00
Kyle McMartin	2161558fa5	Merge branch 'master'	2005-11-18 16:39:20 -05:00
Matthew Wilcox	9ab8851549	[PARISC] Fix compile warning caused by conflicting types of expand_upwards() Fix compile warning caused by conflicting types of expand_upwards. IA64 requires it to not be static inline, as it's used outside mm/mmap.c Signed-off-by: Matthew Wilcox <willy@parisc-linux.org> Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>	2005-11-18 16:16:42 -05:00
Hans Reiser	58bb01a9cd	[PATCH] re-export clear_page_dirty_for_io() 2.6.14 has this exported, and reiser4 (at least) uses it. Put things back the way they were. Signed-off-by: Vladimir V. Saveliev <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-18 07:49:45 -08:00
Jens Axboe	6b1de9161e	[PATCH] VM: fix zone list restart in page allocatate We must reassign z before looping through the zones kicking kswapd, since it will be NULL if we hit an OOM condition and jump back to the beginning again. 'z' is initially assigned before the restart: label. So move the restart label up a little. Signed-off-by: Jens Axboe <axboe@suse.de>	2005-11-17 12:43:01 -08:00
Linus Torvalds	4060994c3e	Merge x86-64 update from Andi	2005-11-14 19:56:02 -08:00
Andi Kleen	07808b74e7	[PATCH] x86_64: Remove obsolete ARCH_HAS_ATOMIC_UNSIGNED and page_flags_t Has been introduced for x86-64 at some point to save memory in struct page, but has been obsolete for some time. Just remove it. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-14 19:55:14 -08:00
Andi Kleen	b0d4169321	[PATCH] x86_64: When cpu_up fails clean up page allocator properly Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-14 19:55:13 -08:00
Andi Kleen	a2f1b42490	[PATCH] x86_64: Add 4GB DMA32 zone Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones. As a bit of historical background: when the x86-64 port was originally designed we had some discussion if we should use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or both. Both was ruled out at this point because it was in early 2.4 when VM is still quite shakey and had bad troubles even dealing with one DMA zone. We settled on the 16MB DMA zone mainly because we worried about older soundcards and the floppy. But this has always caused problems since then because device drivers had trouble getting enough DMA able memory. These days the VM works much better and the wide use of NUMA has proven it can deal with many zones successfully. So this patch adds both zones. This helps drivers who need a lot of memory below 4GB because their hardware is not accessing more (graphic drivers - proprietary and free ones, video frame buffer drivers, sound drivers etc.). Previously they could only use IOMMU+16MB GFP_DMA, which was not enough memory. Another common problem is that hardware who has full memory addressing for >4GB misses it for some control structures in memory (like transmit rings or other metadata). They tended to allocate memory in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent, but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory (even on AMD systems the IOMMU tends to be quite small) especially if you have many devices. With the new zone pci_alloc_consistent can just put this stuff into memory below 4GB which works better. One argument was still if the zone should be 4GB or 2GB. The main motivation for 2GB would be an unnamed not so unpopular hardware raid controller (mostly found in older machines from a particular four letter company) who has a strange 2GB restriction in firmware. But that one works ok with swiotlb/IOMMU anyways, so it doesn't really need GFP_DMA32. I chose 4GB to be compatible with IA64 and because it seems to be the most common restriction. The new zone is so far added only for x86-64. For other architectures who don't set up this new zone nothing changes. Architectures can set a compatibility define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32 as GFP_DMA. Otherwise it's a nop because on 32bit architectures it's normally not needed because GFP_NORMAL (=0) is DMA able enough. One problem is still that GFP_DMA means different things on different architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA (trusting it to be 4GB) #elif __x86_64__ (use other hacks like the swiotlb because 16MB is not enough) ... . This was quite ugly and is now obsolete. These should be now converted to use GFP_DMA32 unconditionally. I haven't done this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent which will use GFP_DMA32 transparently. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-14 19:55:13 -08:00
Christoph Lameter	50c85a19e7	[PATCH] slab: remove alloc_pages() calls The slab allocator never uses alloc_pages since kmem_getpages() is always called with a valid nodeid. Remove the branch and the code from kmem_getpages() Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Pekka Enberg	065d41cb26	[PATCH] slab: convert cache to page mapping macros This patch converts object cache <-> page mapping macros to static inline functions to make the more explicit and readable. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Nick Piggin	669ed17521	[PATCH] mm: highmem watermarks The pages_high - pages_low and pages_low - pages_min deltas are the asynch reclaim watermarks. As such, the should be in the same ratios as any other zone for highmem zones. It is the pages_min - 0 delta which is the PF_MEMALLOC reserve, and this is the region that isn't very useful for highmem. This patch ensures highmem systems have similar characteristics as non highmem ones with the same amount of memory, and also that highmem zones get similar reclaim pressures to other zones. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Rohit Seth	7fb1d9fca5	[PATCH] mm: __alloc_pages cleanup Clean up of __alloc_pages. Restoration of previous behaviour, plus further cleanups by introducing an 'alloc_flags', removing the last of should_reclaim_zone. Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Robin Holt	51c6f666fc	[PATCH] mm: ZAP_BLOCK causes redundant work The address based work estimate for unmapping (for lockbreak) is and always was horribly inefficient for sparse mappings. The problem is most simply explained with an example: If we find a pgd is clear, we still have to call into unmap_page_range PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in order to progress the working address to the next pgd. The fundamental way to solve the problem is to keep track of the end address we've processed and pass it back to the higher layers. From: Nick Piggin <npiggin@suse.de> Modification to completely get away from address based work estimate and instead use an abstract count, with a very small cost for empty entries as opposed to present pages. On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB of virtual address space takes 1.69s; with the following patch applied, this operation can be done 1000 times in less than 0.01s From: Andrew Morton <akpm@osdl.org> With CONFIG_HUTETLB_PAGE=n: mm/memory.c: In function `unmap_vmas': mm/memory.c:779: warning: division by zero Due to zap_work -= (end - start) / (HPAGE_SIZE / PAGE_SIZE); So make the dummy HPAGE_SIZE non-zero Signed-off-by: Robin Holt <holt@sgi.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Kirill Korotaev	885036d32f	[PATCH] mm: __GFP_NOFAIL fix In __alloc_pages(): if ((p->flags & (PF_MEMALLOC \| PF_MEMDIE)) && !in_interrupt()) { /* go through the zonelist yet again, ignoring mins / for (i = 0; zones[i] != NULL; i++) { struct zone z = zones[i]; page = buffered_rmqueue(z, order, gfp_mask); if (page) { zone_statistics(zonelist, z); goto got_pg; } } goto nopage; <<<< HERE!!! FAIL... } kswapd (which has PF_MEMALLOC flag) can fail to allocate memory even when it allocates it with __GFP_NOFAIL flag. Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Denis Lunev <den@sw.ru> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-13 18:14:12 -08:00
Linus Torvalds	63f45b8094	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial	2005-11-11 16:29:22 -08:00
Dave Jones	6b482c6779	[PATCH] Don't print per-cpu vm stats for offline cpus. I just hit a page allocation error on a kernel configured to support 64 CPUs. It spewed 60 completely useless unnecessary lines of info. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-10 13:25:53 -08:00
Adrian Bunk	dc6f3f276e	mm/slab.c: fix a comment typo	2005-11-08 16:44:08 +01:00
Adrian Bunk	4936967374	[PATCH] mm/swap_state.c: unexport swapper_space I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:07 -08:00
Adrian Bunk	e2de225710	[PATCH] mm/swapfile.c: unexport total_swap_pages I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:07 -08:00
Adrian Bunk	1b09d16489	[PATCH] mm/swap.c: unexport vm_acct_memory I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:07 -08:00
Adrian Bunk	f8b8db77b0	[PATCH] unexport nr_swap_pages I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:07 -08:00
Adrian Bunk	e6a7e0e7ce	[PATCH] unexport clear_page_dirty_for_io I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:06 -08:00
Adrian Bunk	99697dc02d	[PATCH] unexport hugetlb_total_pages I didn't find any possible modular usage in the kernel. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:06 -08:00
Adrian Bunk	55be570c52	[PATCH] mm/{mmap,nommu}.c: several unexports I didn't find any possible modular usage in the kernel. This patch was already ACK'ed by Christoph Hellwig. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:54:06 -08:00
Randy Dunlap	d44e0780bc	[PATCH] kernel-doc: fix warnings in vmalloc.c Fix new kernel-doc errors in vmalloc.c. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:56 -08:00
Randy Dunlap	1e5d533142	[PATCH] more kernel-doc cleanups, additions Various core kernel-doc cleanups: - add missing function parameters in ipc, irq/manage, kernel/sys, kernel/sysctl, and mm/slab; - move description to just above function for kernel_restart() Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:55 -08:00
Andrew Morton	7361f4d8ca	[PATCH] readahead commentary Add a few comments surrounding the generic readahead API. Also convert some ulongs into pgoff_t: the identifier for PAGE_CACHE_SIZE offsets into pagecache. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:37 -08:00
Manfred Spraul	cd61ef6268	[PATCH] slab: Use same schedule timeout for all cpus in cache_reap Chen noticed that cache_reap uses REAPTIMEOUT_CPUC+smp_processor_id() as the timeout for rescheduling. The "+smp_processor_id()" part is wrong, the timeout should be identical for all cpus: start_cpu_timer already adds a cpu dependant offset to avoid any clustering. The attached patch removes smp_processor_id(). Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:24 -08:00
Pekka J Enberg	2109a2d1b1	[PATCH] mm: rename kmem_cache_s to kmem_cache This patch renames struct kmem_cache_s to kmem_cache so we can start using it instead of kmem_cache_t typedef. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:24 -08:00
Andrew Morton	4f12bb4f77	[PATCH] slab: don't BUG on duplicated cache slab presently goes BUG if someone tries to register an already-registered cache. But this can happen if the user accidentally loads a module which is already statically linked into the kernel. Nuking the kernel is rather a harsh reaction. Change it into a warning, and just fail the kmem_cache_alloc() attempt. If the module is well-behaved, the modprobe will fail and all is well. Notes: - Swaps the ranking of cache_chain_sem and lock_cpu_hotplug(). Doesn't seem important. Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:24 -08:00
Hugh Dickins	2d4b95f060	[PATCH] Suppress split ptlock on arches which may use one page for multiple page tables Suppress split ptlock on arches which may use one page for multiple page tables. Reconsider what better to do (particularly on ppc64) later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-07 07:53:23 -08:00
Benjamin Herrenschmidt	3c726f8dee	[PATCH] ppc64: support 64k pages Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel base page size to 64K. The resulting kernel still boots on any hardware. On current machines with 4K pages support only, the kernel will maintain 16 "subpages" for each 64K page transparently. Note that while real 64K capable HW has been tested, the current patch will not enable it yet as such hardware is not released yet, and I'm still verifying with the firmware architects the proper to get the information from the newer hypervisors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-11-06 16:56:47 -08:00
Steve French	7f28570185	Export __pagevec_release and pagevec_lookup_tag These are needed to implement cifs_writepages Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2005-11-01 10:22:55 -08:00
Matt Mackall	5d57bd39eb	[PATCH] Error checks omitted in init_tmpfs() in mm/tiny-shmem.c From: Hareesh Nagarajan <hnagar2@gmail.com> Signed-off-by: Hareesh Nagarajan <hnagar2@gmail.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:27 -08:00
Paul E. McKenney	a241ec65ae	[PATCH] RCU torture-testing kernel module This patch is a rewrite of the one submitted on October 1st, using modules (http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2). This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an intense torture test of the RCU infratructure. This is needed due to the continued changes to the RCU infrastructure to accommodate dynamic ticks, CPU hotplug, realtime, and so on. Most of the code is in a separate file that is compiled only if the CONFIG variable is set. Documentation on how to run the test and interpret the output is also included. This code has been tested on i386 and ppc64, and an earlier version of the code has received extensive testing on a number of architectures as part of the PREEMPT_RT patchset. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:27 -08:00
Tejun Heo	c7e9dd4dd0	[PATCH] vm: remove redundant assignment from __pagevec_release_nonlru() This patch removes redundant assignment from __pagevec_release_nonlru(). pages_to_free.cold is set to pvec->cold by pagevec_init() call right above the assignment. Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:22 -08:00
Tejun Heo	39e88ca2c9	[PATCH] fs: error case fix in __generic_file_aio_read When __generic_file_aio_read() hits an error during reading, it reports the error iff nothing has successfully been read yet. This is condition - when an error occurs, if nothing has been read/written, report the error code; otherwise, report the amount of bytes successfully transferred upto that point. This corner case can be exposed by performing readv(2) with the following iov. iov[0] = len0 @ ptr0 iov[1] = len1 @ NULL (or any other invalid pointer) iov[2] = len2 @ ptr2 When file size is enough, performing above readv(2) results in len0 bytes from file_pos @ ptr0 len2 bytes from file_pos + len0 @ ptr2 And the return value is len0 + len2. Test program is attached to this mail. This patch makes __generic_file_aio_read()'s error handling identical to other functions. #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sys/uio.h> #include <errno.h> #include <string.h> int main(int argc, char *argv) { const char path; struct stat stbuf; size_t len0, len1; void buf0, buf1; struct iovec iov[3]; int fd, i; ssize_t ret; if (argc < 2) { fprintf(stderr, "Usage: testreadv path (better be a " "small text file)\n"); return 1; } path = argv[1]; if (stat(path, &stbuf) < 0) { perror("stat"); return 1; } len0 = stbuf.st_size / 2; len1 = stbuf.st_size - len0; if (!len0 \|\| !len1) { fprintf(stderr, "Dude, file is too small\n"); return 1; } if ((fd = open(path, O_RDONLY)) < 0) { perror("open"); return 1; } if (!(buf0 = malloc(len0)) \|\| !(buf1 = malloc(len1))) { perror("malloc"); return 1; } memset(buf0, 0, len0); memset(buf1, 0, len1); iov[0].iov_base = buf0; iov[0].iov_len = len0; iov[1].iov_base = NULL; iov[1].iov_len = len1; iov[2].iov_base = buf1; iov[2].iov_len = len1; printf("vector "); for (i = 0; i < 3; i++) printf("%p:%zu ", iov[i].iov_base, iov[i].iov_len); printf("\n"); ret = readv(fd, iov, 3); if (ret < 0) perror("readv"); printf("readv returned %zd\nbuf0 = [%s]\nbuf1 = [%s]\n", ret, (char )buf0, (char )buf1); return 0; } Signed-off-by: Tejun Heo <htejun@gmail.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:22 -08:00
Paul Jackson	68860ec10b	[PATCH] cpusets: automatic numa mempolicy rebinding This patch automatically updates a tasks NUMA mempolicy when its cpuset memory placement changes. It does so within the context of the task, without any need to support low level external mempolicy manipulation. If a system is not using cpusets, or if running on a system with just the root (all-encompassing) cpuset, then this remap is a no-op. Only when a task is moved between cpusets, or a cpusets memory placement is changed does the following apply. Otherwise, the main routine below, rebind_policy() is not even called. When mixing cpusets, scheduler affinity, and NUMA mempolicies, the essential role of cpusets is to place jobs (several related tasks) on a set of CPUs and Memory Nodes, the essential role of sched_setaffinity is to manage a jobs processor placement within its allowed cpuset, and the essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs memory placement within its allowed cpuset. However, CPU affinity and NUMA memory placement are managed within the kernel using absolute system wide numbering, not cpuset relative numbering. This is ok until a job is migrated to a different cpuset, or what's the same, a jobs cpuset is moved to different CPUs and Memory Nodes. Then the CPU affinity and NUMA memory placement of the tasks in the job need to be updated, to preserve their cpuset-relative position. This can be done for CPU affinity using sched_setaffinity() from user code, as one task can modify anothers CPU affinity. This cannot be done from an external task for NUMA memory placement, as that can only be modified in the context of the task using it. However, it easy enough to remap a tasks NUMA mempolicy automatically when a task is migrated, using the existing cpuset mechanism to trigger a refresh of a tasks memory placement after its cpuset has changed. All that is needed is the old and new nodemask, and notice to the task that it needs to rebind its mempolicy. The tasks mems_allowed has the old mask, the tasks cpuset has the new mask, and the existing cpuset_update_current_mems_allowed() mechanism provides the notice. The bitmap/cpumask/nodemask remap operators provide the cpuset relative calculations. This patch leaves open a couple of issues: 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies: These mempolicies may reference nodes outside of those allowed to the current task by its cpuset. Tasks are migrated as part of jobs, which reside on what might be several cpusets in a subtree. When such a job is migrated, all NUMA memory policy references to nodes within that cpuset subtree should be translated, and references to any nodes outside that subtree should be left untouched. A future patch will provide the cpuset mechanism needed to mark such subtrees. With that patch, we will be able to correctly migrate these other memory policies across a job migration. 2) Updating cpuset, affinity and memory policies in user space: This is harder. Any placement state stored in user space using system-wide numbering will be invalidated across a migration. More work will be required to provide user code with a migration-safe means to manage its cpuset relative placement, while preserving the current API's that pass system wide numbers, not cpuset relative numbers across the kernel-user boundary. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:22 -08:00
Paul Jackson	28a42b9ea7	[PATCH] cpusets: confine pdflush to its cpuset This patch keeps pdflush daemons on the same cpuset as their parent, the kthread daemon. Some large NUMA configurations put as much as they can of kernel threads and other classic Unix load in what's called a bootcpuset, keeping the rest of the system free for dedicated jobs. This effort is thwarted by pdflush, which dynamically destroys and recreates pdflush daemons depending on load. It's easy enough to force the originally created pdflush deamons into the bootcpuset, at system boottime. But the pdflush threads created later were allowed to run freely across the system, due to the necessary line in their startup kthread(): set_cpus_allowed(current, CPU_MASK_ALL); By simply coding pdflush to start its threads with the cpus_allowed restrictions of its cpuset (inherited from kthread, its parent) we can ensure that dynamically created pdflush threads are also kept in the bootcpuset. On systems w/o cpusets, or w/o a bootcpuset implementation, the following will have no affect, leaving pdflush to run on any CPU, as before. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:21 -08:00
Jan Kara	aaa4059bc2	[PATCH] ext3: Fix unmapped buffers in transaction's lists Fix the problem (BUG 4964) with unmapped buffers in transaction's t_sync_data list. The problem is we need to call filesystem's own invalidatepage() from block_write_full_page(). block_write_full_page() must call filesystem's invalidatepage(). Otherwise following nasty race can happen: proc 1 proc 2 ------ ------ - write some new data to 'offset' => bh gets to the transactions data list - starts truncate => i_size set to new size - mpage_writepages() - ext3_ordered_writepage() to 'offset' - block_write_full_page() - page->index > end_index+1 - block_invalidatepage() - discard_buffer() - clear_buffer_mapped() - commit triggers and finds unmapped buffer - BOOM! Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-30 17:37:17 -08:00
Nikita Danilov	b1459461f1	[PATCH] mm/filemap.c:filemap_populate(): move export. move EXPORT_SYMBOL(filemap_populate) to the proper place: just after function itself: it's easy to miss that function is exported otherwise. Signed-off-by: Nikita Danilov <nikita@clusterfs.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:45 -07:00
John Hawkes	2f96996de0	[PATCH] mm: wider use of for_each_*cpu() In 'mm' change the explicit use of a for-loop using NR_CPUS into the general for_each_cpu() constructs. This widens the scope of potential future optimizations of the general constructs, as well as takes advantage of the existing optimizations of first_cpu() and next_cpu(), which is advantageous when the true CPU count is much smaller than NR_CPUS. Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:45 -07:00
Christoph Lameter	5fcbb23050	[PATCH] Remove policy contextualization from mbind Policy contextualization is only useful for task based policies and not for vma based policies. It may be useful to define allowed nodes that are not accessible from this thread because other threads may have access to these nodes. Without this patch strange memory policy situations may cause an application to fail with out of memory. Example: Let's say we have two threads A and B that share the same address space and a huge array computational array X. Thread A is restricted by its cpuset to nodes 0 and 1 and thread B is restricted by its cpuset to nodes 2 and 3. Thread A now wants to restrict allocations to the first node and thus applies a BIND policy on X to node 0 and 2. The cpuset limits this to node 0. Thus pages for X must be allocated on node 0 now. Thread B now touches a page that has never been used in X and faults in a page. According to the BIND policy of the vma for X the page must be allocated on page 0. However, the cpuset of B does not allow allocation on 0 and 1. Now the application fails in alloc_pages with out of memory. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:45 -07:00
Christoph Lameter	8bccd85ffb	[PATCH] Implement sys_* do_* layering in the memory policy layer. - Do a separation between do_xxx and sys_xxx functions. sys_xxx functions take variable sized bitmaps from user space as arguments. do_xxx functions take fixed sized nodemask_t as arguments and may be used from inside the kernel. Doing so simplifies the initialization code. There is no fs = kernel_ds assumption anymore. - Split up get_nodes into get_nodes (which gets the node list) and contextualize_policy which restricts the nodes to those accessible to the task and updates cpusets. - Add comments explaining limitations of bind policy Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:45 -07:00
Dave Hansen	61b13993a8	[PATCH] memory hotplug: call setup_per_zone_pages_min after hotplug From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> > I found the tests does not work well with Dave's patchset. > I've found the followings: > > - setup_per_zone_pages_min() calls should be added in > capture_page_range() and online_pages() > - lru_add_drain() should be called before try_to_migrate_pages() The following patch deals with the first item. Signed-off-by: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	0b0acbec1b	[PATCH] memory hotplug: move section_mem_map alloc to sparse.c This basically keeps up from having to extern __kmalloc_section_memmap(). The vaddr_in_vmalloc_area() helper could go in a vmalloc header, but that header gets hard to work with, because it needs some arch-specific macros. Just stick it in here for now, instead of creating another header. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Lion Vollnhals <webmaster@schiggl.de> Signed-off-by: Jiri Slaby <xslaby@fi.muni.cz> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	3947be1969	[PATCH] memory hotplug: sysfs and add/remove functions This adds generic memory add/remove and supporting functions for memory hotplug into a new file as well as a memory hotplug kernel config option. Individual architecture patches will follow. For now, disable memory hotplug when swsusp is enabled. There's a lot of churn there right now. We'll fix it up properly once it calms down. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	bdc8cb9845	[PATCH] memory hotplug locking: zone span seqlock See the "fixup bad_range()" patch for more information, but this actually creates a the lock to protect things making assumptions about a zone's size staying constant at runtime. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	208d54e551	[PATCH] memory hotplug locking: node_size_lock pgdat->node_size_lock is basically only neeeded in one place in the normal code: show_mem(), which is the arch-specific sysrq-m printing function. Strictly speaking, the architectures not doing memory hotplug do no need this locking in show_mem(). However, they are all included for completeness. This should also make any future consolidation of all of the implementations a little more straightforward. This lock is also held in the sparsemem code during a memory removal, as sections are invalidated. This is the place there pfn_valid() is made false for a memory area that's being removed. The lock is only required when doing pfn_valid() operations on memory which the user does not already have a reference on the page, such as in show_mem(). Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	c6a57e19e4	[PATCH] memory hotplug prep: fixup bad_range() When doing memory hotplug operations, the size of existing zones can obviously change. This means that zone->zone_{start_pfn,spanned_pages} can change. There are currently no locks that protect these structure members. However, they are rarely accessed at runtime. Outside of swsusp, the only place that I can find is bad_range(). So, split bad_range() up into two pieces: one that needs to be locked and anther that doesn't. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	4ca644d970	[PATCH] memory hotplug prep: __section_nr helper A little helper that we use in the hotplug code. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Dave Hansen	ed8ece2ec8	[PATCH] memory hotplug prep: break out zone initialization If a zone is empty at boot-time and then hot-added to later, it needs to run the same init code that would have been run on it at boot. This patch breaks out zone table and per-cpu-pages functions for use by the hotplug code. You can almost see all of the free_area_init_core() function on one page now. :) Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:44 -07:00
Andrea Arcangeli	1a44e14908	[PATCH] .text page fault SMP scalability optimization We had a problem on ppc64 where with more than 4 threads a large system wouldn't scale well while faulting in the .text (most of the time was spent in the kernel despite it was an userland compute intensive app). The reason is the useless overwrite of the same pte from all cpu. I fixed it this way (verified on an older kernel but the forward port is almost identical). This will benefit all archs not just ppc64. Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:43 -07:00
Adam Litke	4c88726597	[PATCH] hugetlb: demand fault handler Below is a patch to implement demand faulting for huge pages. The main motivation for changing from prefaulting to demand faulting is so that huge page memory areas can be allocated according to NUMA policy. Thanks to consolidated hugetlb code, switching the behavior requires changing only one fault handler. The bulk of the patch just moves the logic from hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page(). Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:43 -07:00
Hugh Dickins	b8072f099b	[PATCH] mm: update comments to pte lock Updated several references to page_table_lock in common code comments. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:42 -07:00
Hugh Dickins	f412ac08c9	[PATCH] mm: fix rss and mmlist locking A couple of oddities were guarded by page_table_lock, no longer properly guarded when that is split. The mm_counters of file_rss and anon_rss: make those an atomic_t, or an atomic64_t if the architecture supports it, in such a case. Definitions by courtesy of Christoph Lameter: who spent considerable effort on more scalable ways of counting, but found insufficient benefit in practice. And adding an mm with swap to the mmlist for swapoff: the list is well- guarded by its own lock, but the list_empty check now has to be repeated inside it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:42 -07:00
Hugh Dickins	4c21e2f244	[PATCH] mm: split page table lock Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:42 -07:00
Hugh Dickins	deceb6cd17	[PATCH] mm: follow_page with inner ptlock Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:41 -07:00
Hugh Dickins	c34d1b4d16	[PATCH] mm: kill check_user_page_readable check_user_page_readable is a problematic variant of follow_page. It's used only by oprofile's i386 and arm backtrace code, at interrupt time, to establish whether a userspace stackframe is currently readable. This is problematic, because we want to push the page_table_lock down inside follow_page, and later split it; whereas oprofile is doing a spin_trylock on it (in the i386 case, forgotten in the arm case), and needs that to pin perhaps two pages spanned by the stackframe (which might be covered by different locks when we split). I think oprofile is going about this in the wrong way: it doesn't need to know the area is readable (neither i386 nor arm uses read protection of user pages), it doesn't need to pin the memory, it should simply __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've not got around to devising the sparse __user annotations for this. Then we can eliminate check_user_page_readable, and return to a single follow_page without the __follow_page variants. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:41 -07:00
Hugh Dickins	c0718806cf	[PATCH] mm: rmap with inner ptlock rmap's page_check_address descend without page_table_lock. First just pte_offset_map in case there's no pte present worth locking for, then take page_table_lock for the full check, and pass ptl back to caller in the same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also. page_check_address reformatted to avoid progressive indentation. No use is made of its one error code, return NULL when it fails. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:41 -07:00
Hugh Dickins	67b02f119d	[PATCH] mm: xip_unmap ZERO_PAGE fix Small fix to the PageReserved patch: the mips ZERO_PAGE(address) depends on address, so __xip_unmap is wrong to initialize page with that before address is initialized; and in fact must re-evaluate it each iteration. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:41 -07:00
Hugh Dickins	508034a32b	[PATCH] mm: unmap_vmas with inner ptlock Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:41 -07:00
Hugh Dickins	8f4f8c164c	[PATCH] mm: unlink vma before pagetables In most places the descent from pgd to pud to pmd to pte holds mmap_sem (exclusively or not), which ensures that free_pgtables cannot be freeing page tables from any level at the same time. But truncation and reverse mapping descend without mmap_sem. No problem: just make sure that a vma is unlinked from its prio_tree (or nonlinear list) and from its anon_vma list, after zapping the vma, but before freeing its page tables. Then neither vmtruncate nor rmap can reach that vma whose page tables are now volatile (nor do they need to reach it, since all its page entries have been zapped by this stage). The i_mmap_lock and anon_vma->lock already serialize this correctly; but the locking hierarchy is such that we cannot take them while holding page_table_lock. Well, we're trying to push that down anyway. So in this patch, move anon_vma_unlink and unlink_file_vma into free_pgtables, at the same time as moving page_table_lock around calls to unmap_vmas. tlb_gather_mmu and tlb_finish_mmu then fall outside the page_table_lock, but we made them preempt_disable and preempt_enable earlier; and a long source audit of all the architectures has shown no problem with removing page_table_lock from them. free_pgtables doesn't need page_table_lock for itself, nor for what it calls; tlb->mm->nr_ptes is usually protected by page_table_lock, but partly by non-exclusive mmap_sem - here it's decremented with exclusive mmap_sem, or mm_users 0. update_hiwater_rss and vm_unacct_memory don't need page_table_lock either. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	705e87c0c3	[PATCH] mm: pte_offset_map_lock loops Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	8f4e2101fd	[PATCH] mm: page fault handler locking On the page fault path, the patch before last pushed acquiring the page_table_lock down to the head of handle_pte_fault (though it's also taken and dropped earlier when a new page table has to be allocated). Now delete that line, read "entry = *pte" without it, and go off to this or that page fault handler on the basis of this unlocked peek. Usually the handler can proceed without the lock, relying on the subsequent locked pte_same or pte_none test to back out when necessary; though do_wp_page needs the lock immediately, and do_file_page doesn't check (if there's a race, install_page just zaps the entry and reinstalls it). But on those architectures (notably i386 with PAE) whose pte is too big to be read atomically, if SMP or preemption is enabled, do_swap_page and do_file_page might cause irretrievable damage if passed a Frankenstein entry stitched together from unrelated parts. In those configs, "pte_unmap_same" has to take page_table_lock, validate orig_pte still the same, and drop page_table_lock before unmapping, before proceeding. Use pte_offset_map_lock and pte_unmap_unlock throughout the handlers; but lock avoidance leaves more lone maps and unmaps than elsewhere. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	c74df32c72	[PATCH] mm: ptd_alloc take ptlock Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	1bb3630e89	[PATCH] mm: ptd_alloc inline and out It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only calling out-of-line __pud_alloc __pmd_alloc if allocation needed, pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does add a little to kernel size, change them to macros testing inline, calling __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them as fastcalls, leave that to CONFIG_REGPARM or not. It also seems more natural for the out-of-line functions to leave the offset calculation and map to the inline, which has to do it anyway for the common case. At least mremap move wants __pte_alloc without _map. Macros rather than inline functions, certainly to avoid the header file issues which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any architectures I haven't built would have other such problems. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	872fec16d9	[PATCH] mm: init_mm without ptlock First step in pushing down the page_table_lock. init_mm.page_table_lock has been used throughout the architectures (usually for ioremap): not to serialize kernel address space allocation (that's usually vmlist_lock), but because pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it. Reverse that: don't lock or unlock init_mm.page_table_lock in any of the architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take and drop it when allocating a new one, to check lest a racing task already did. Similarly no page_table_lock in vmalloc's map_vm_area. Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle user mms, which are converted only by a later patch, for now they have to lock differently according to whether or not it's init_mm. If sources get muddled, there's a danger that an arch source taking init_mm.page_table_lock will be mixed with common source also taking it (or neither take it). So break the rules and make another change, which should break the build for such a mismatch: remove the redundant mm arg from pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13). Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64 used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64 map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free took page_table_lock for no good reason. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:40 -07:00
Hugh Dickins	46dea3d092	[PATCH] mm: ia64 use expand_upwards ia64 has expand_backing_store function for growing its Register Backing Store vma upwards. But more complete code for this purpose is found in the CONFIG_STACK_GROWSUP part of mm/mmap.c. Uglify its #ifdefs further to provide expand_upwards for ia64 as well as expand_stack for parisc. The Register Backing Store vma should be marked VM_ACCOUNT. Implement the intention of growing it only a page at a time, instead of passing an address outside of the vma to handle_mm_fault, with unknown consequences. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	365e9c87a9	[PATCH] mm: update_hiwaters just in time update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	861f2fb8e7	[PATCH] mm: zap_pte out of line There used to be just one call to zap_pte, but it shouldn't be inline now there are two. Check for the common case pte_none before calling, and move its rss accounting up into install_page or install_file_pte - which helps the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	d0de32d9b7	[PATCH] mm: do_mremap current mm Cleanup: relieve do_mremap from its surfeit of current->mms. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	9e9bef07ce	[PATCH] mm: do_swap_page race major Small adjustment: do_swap_page should report its !pte_same race as a major fault if it had to read into swap cache, because whatever raced with it will have found page already in cache and reported minor fault. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	86d912f41d	[PATCH] mm: zap_pte_range dec rss Small adjustment: zap_pte_range decrement its rss counts from 0 then finally add, avoiding negations - we don't have or need a sub_mm_rss. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	8c10376271	[PATCH] mm: copy_one_pte inc rss Small adjustment, following Nick's suggestion: it's more straightforward for copy_pte_range to let copy_one_pte do the rss incrementation, than use an index it passed back. Saves a #define, and 16 bytes of .text. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Nick Piggin	b5810039a5	[PATCH] core remove PageReserved Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:39 -07:00
Hugh Dickins	ae85976233	[PATCH] mm: batch updating mm_counters tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be worthwhile if the mm is contended, and would reduce atomic operations if the counts were atomic. Let zap_pte_range now batch its updates to file_rss and anon_rss, per page-table in case we drop the lock outside; and copy_pte_range batch them too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:38 -07:00
Hugh Dickins	4294621f41	[PATCH] mm: rss = file_rss + anon_rss I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:38 -07:00
Hugh Dickins	fc2acab31b	[PATCH] mm: tlb_finish_mmu forget rss zap_pte_range has been counting the pages it frees in tlb->freed, then tlb_finish_mmu has used that to update the mm's rss. That got stranger when I added anon_rss, yet updated it by a different route; and stranger when rss and anon_rss became mm_counters with special access macros. And it would no longer be viable if we're relying on page_table_lock to stabilize the mm_counter, but calling tlb_finish_mmu outside that lock. Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own business, just decrement the rss mm_counter in zap_pte_range (yes, there was some point to batching the update, and a subsequent patch restores that). And forget the anal paranoia of first reading the counter to avoid going negative - if rss does go negative, just fix that bug. Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use was being made of them. But arm26 alone was actually using the freed, in the way some others use need_flush: give it a need_flush. arm26 seems to prefer spaces to tabs here: respect that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	4d6ddfa924	[PATCH] mm: tlb_is_full_mm was obscure tlb_is_full_mm? What does that mean? The TLB is full? No, it means that the mm's last user has gone and the whole mm is being torn down. And it's an inline function because sparc64 uses a different (slightly better) "tlb_frozen" name for the flag others call "fullmm". And now the ptep_get_and_clear_full macro used in zap_pte_range refers directly to tlb->fullmm, which would be wrong for sparc64. Rather than correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change sparc64 to just use the same poor name as everyone else - is that okay? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	7be7a54699	[PATCH] mm: move_page_tables by extents Speeding up mremap's moving of ptes has never been a priority, but the locking will get more complicated shortly, and is already too baroque. Scrap the current one-by-one moving, do an extent at a time: curtailed by end of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided doesn't match this usage), and by latency considerations. One nice property of the old method is lost: it never allocated a page table unless absolutely necessary, so you could free empty page tables by mremapping to and fro. Whereas this way, it allocates a dst table wherever there was a src table. I keep diving in to reinstate the old behaviour, then come out preferring not to clutter how it now is. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	65500d234e	[PATCH] mm: page fault handlers tidyup Impose a little more consistency on the page fault handlers do_wp_page, do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their arguments in the same order, called the same names? break_cow is all very well, but what it did was inlined elsewhere: easier to compare if it's brought back into do_wp_page. do_file_page's fallback to do_no_page dates from a time when we were testing pte_file by using it wherever possible: currently it's peculiar to nonlinear vmas, so just check that. BUG_ON if not? Better not, it's probably page table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's use that for do_wp_page's invalid pfn too. Hah! Someone in the ppc64 world noticed pte_ERROR was unused so removed it: restored (and say "pud" not "pmd" in its pud_ERROR). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	7c1fd6b964	[PATCH] mm: exit_mmap need not reset exit_mmap resets various mm_struct fields, but the mm is well on its way out, and none of those fields matter by this point. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	a8fb5618da	[PATCH] mm: unlink_file_vma, remove_vma Divide remove_vm_struct into two parts: first anon_vma_unlink plus unlink_file_vma, to unlink the vma from the list and tree by which rmap or vmtruncate might find it; then remove_vma to close, fput and free. The intention here is to do the anon_vma_unlink and unlink_file_vma earlier, in free_pgtables before freeing any page tables: so we can be sure that any page tables traversed by rmap and vmtruncate are stable (and other, ordinary cases are stabilized by holding mmap_sem). This will be crucial to traversing pgd,pud,pmd without page_table_lock. But testing the split-out patch showed that lifting the page_table_lock is symbiotically necessary to make this change - the lock ordering is wrong to move those unlinks into free_pgtables while it's under ptlock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	2c0b381467	[PATCH] mm: remove_vma_list consolidation unmap_vma doesn't amount to much, let's put it inside unmap_vma_list. Except it doesn't unmap anything, unmap_region just did the unmapping: rename it to remove_vma_list. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	ab50b8ed81	[PATCH] mm: vm_stat_account unshackled The original vm_stat_account has fallen into disuse, with only one user, and only one user of vm_stat_unaccount. It's easier to keep track if we convert them all to __vm_stat_account, then free it from its __shackles. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:37 -07:00
Hugh Dickins	72866f6f27	[PATCH] mm: anon is already wrprotected do_anonymous_page's pte_wrprotect causes some confusion: in such a case, vm_page_prot must already be forcing COW, so must omit write permission, and so the pte_wrprotect is redundant. Replace it by a comment to that effect, and reword the comment on unuse_pte which also caused confusion. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Hugh Dickins	6237bcd948	[PATCH] mm: zap_pte_range dont dirty anon zap_pte_range already avoids wasting time to mark_page_accessed on anon pages: it can also skip anon set_page_dirty - the page only needs to be marked dirty if shared with another mm, but that will say pte_dirty too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Hugh Dickins	0c942a4539	[PATCH] mm: msync_pte_range progress Use latency breaking in msync_pte_range like that in copy_pte_range, instead of the ugly CONFIG_PREEMPT filemap_msync alternatives. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Hugh Dickins	e040f218bb	[PATCH] mm: copy_pte_range progress fix My latency breaking in copy_pte_range didn't work as intended: instead of checking at regularish intervals, after the first interval it checked every time around the loop, too impatient to be preempted. Fix that. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Christoph Lameter	09ad4bbc3a	[PATCH] slab: add additional debugging to detect slabs from the wrong node This patch adds some stack dumps if the slab logic is processing slab blocks from the wrong node. This is necessary in order to detect situations as encountered by Petr. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Lee Schermerhorn	c340010e4b	[PATCH] shrink_list(): skip anon pages if not may_swap Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the scan_control struct; and modified shrink_list() not to add anon pages to the swap cache if may_swap is not asserted. Ref: http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4 However, further down, if the page is mapped, shrink_list() calls try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon (). try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap cache. Martin says he never encountered this path in his testing, but agrees that it might happen. This patch modifies shrink_list() to skip anon pages that are not already in the swap cache when !may_swap, rather than just not adding them to the cache. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
OGAWA Hirofumi	b57b98d147	[PATCH] mm/msync.c cleanup This is not problem actually, but sync_page_range() is using for exported function to filesystems. The msync_xxx is more readable at least to me. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:36 -07:00
Andi Kleen	662f3a0b94	[PATCH] Remove near all BUGs in mm/mempolicy.c Most of them can never be triggered and were only for development. Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Andi Kleen	dfcd3c0dc4	[PATCH] Convert mempolicies to nodemask_t The NUMA policy code predated nodemask_t so it used open coded bitmaps. Convert everything to nodemask_t. Big patch, but shouldn't have any actual behaviour changes (except I removed one unnecessary check against node_online_map and one unnecessary BUG_ON) Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Seth, Rohit	e46a5e28c2	[PATCH] mm: set per-cpu-pages lower threshold to zero Set the low water mark for hot pages in pcp to zero. (akpm: for the life of me I cannot remember why we created pcp->low. Neither can Martin and the changelog is silent. Maybe it was just a brainfart, but I have this feeling that there was a reason. If not, we should remove the fields completely. We'll see.) Signed-off-by: Rohit Seth <rohit.seth@intel.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Seth, Rohit	ba56e91c94	[PATCH] mm: page_alloc: increase size of per-cpu-pages Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB. Over 100+ runs for a workload, the difference in mean is about 2%. The best results for both are almost same. Though the max variation in results with 1/2MB is only 2.2%, whereas with 1/4MB it is 12%. Signed-off-by: Rohit Seth <rohit.seth@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Rik Van Riel	fcdae29aa7	[PATCH] swaptoken tuning It turns out that the original swap token implementation, by Song Jiang, only enforced the swap token while the task holding the token is handling a page fault. This patch approximates that, without adding an additional flag to the mm_struct, by checking whether the mm->mmap_sem is held for reading, like the page fault code does. This patch has the effect of automatically, and gradually, disabling the enforcement of the swap token when there is little or no paging going on, and "turning up" the intensity of the swap token code the more the task holding the token is thrashing. Thanks to Song Jiang for pointing out this aspect of the token based thrashing control concept. The new code shows a slight degradation over the old swap token code, but still a big win over running without the swap token. 2.6.12+ swap token disabled $ for i in `seq 10` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 101.74user 23.13system 8:26.91elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (38597major+430315minor)pagefaults 0swaps 101.98user 24.91system 8:03.06elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (33939major+430457minor)pagefaults 0swaps 101.93user 22.12system 7:34.90elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (33166major+421267minor)pagefaults 0swaps 101.82user 22.38system 8:31.40elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (39338major+433262minor)pagefaults 0swaps 2.6.12+ swap token enabled, timeout 300 seconds $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 102.58user 16.08system 3:41.44elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19707major+285786minor)pagefaults 0swaps 102.07user 19.56system 4:00.64elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19012major+299259minor)pagefaults 0swaps 102.64user 18.25system 4:07.31elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (21990major+304831minor)pagefaults 0swaps 101.39user 19.41system 5:15.81elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (24850major+323321minor)pagefaults 0swaps 2.6.12+ with new swap token code, timeout 300 seconds $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done 101.87user 24.66system 5:53.20elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (26848major+363497minor)pagefaults 0swaps 102.83user 19.95system 4:17.25elapsed 47%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (19946major+305722minor)pagefaults 0swaps 102.09user 19.46system 5:12.57elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (25461major+334994minor)pagefaults 0swaps 101.67user 20.61system 4:52.97elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (22190major+329508minor)pagefaults 0swaps Signed-off-by: Rik Van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Christoph Lameter	930fc45a49	[PATCH] vmalloc_node This patch adds vmalloc_node(size, node) -> Allocate necessary memory on the specified node and get_vm_area_node(size, flags, node) and the other functions that it depends on. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-29 21:40:35 -07:00
Al Viro	260b23674f	[PATCH] gfp_t: the rest zone handling, mapping->flags handling Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-28 08:16:51 -07:00
Al Viro	6daa0e2862	[PATCH] gfp_t: mm/* (easy parts) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-28 08:16:47 -07:00
Al Viro	af4ca457ea	[PATCH] gfp_t: infrastructure Beginning of gfp_t annotations: - -Wbitwise added to CHECKFLAGS - old __bitwise renamed to __bitwise__ - __bitwise defined to either __bitwise__ or nothing, depending on __CHECK_ENDIAN__ being defined - gfp_t switched from __nocast to __bitwise__ - force cast to gfp_t added to __GFP_... constants - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts the result to int Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-28 08:16:46 -07:00
Magnus Damm	1c6fe94659	[PATCH] NUMA: broken per cpu pageset counters The NUMA counters in struct per_cpu_pageset (linux/mmzone.h) are never cleared today. This works ok for CPU 0 on NUMA machines because boot_pageset[] is already zero, but for other CPU:s this results in uninitialized counters. Signed-off-by: Magnus Damm <magnus@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-26 10:39:43 -07:00
Hugh Dickins	ac9b9c667c	[PATCH] Fix handling spurious page fault for hugetlb region This reverts commit `3359b54c8c` and replaces it with a cleaner version that is purely based on page table operations, so that the synchronization between inode size and hugetlb mappings becomes moot. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-20 09:02:07 -07:00
Yasunori Goto	281dd25cdc	[PATCH] swiotlb: make sure initial DMA allocations really are in DMA memory This introduces a limit parameter to the core bootmem allocator; The new parameter indicates that physical memory allocated by the bootmem allocator should be within the requested limit. We also introduce alloc_bootmem_low_pages_limit, alloc_bootmem_node_limit, alloc_bootmem_low_pages_node_limit apis, but alloc_bootmem_low_pages_limit is the only api used for swiotlb. The existing alloc_bootmem_low_pages() api could instead have been changed and made to pass right limit to the core allocator. But that would make the patch more intrusive for 2.6.14, as other arches use alloc_bootmem_low_pages(). We may be done that post 2.6.14 as a cleanup. With this, swiotlb gets memory within 4G for both x86_64 and ia64 arches. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Ravikiran G Thirumalai <kiran@scalex86.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 23:11:33 -07:00
Hugh Dickins	1c59827d1d	[PATCH] mm: hugetlb truncation fixes hugetlbfs allows truncation of its files (should it?), but hugetlb.c often forgets that: crashes and misaccounting ensue. copy_hugetlb_page_range better grab the src page_table_lock since we don't want to guess what happens if concurrently truncated. unmap_hugepage_range rss accounting must not assume the full range was mapped. follow_hugetlb_page must guard with page_table_lock and be prepared to exit early. Restyle copy_hugetlb_page_range with a for loop like the others there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 23:04:30 -07:00
Seth, Rohit	3359b54c8c	[PATCH] Handle spurious page fault for hugetlb region The hugetlb pages are currently pre-faulted. At the time of mmap of hugepages, we populate the new PTEs. It is possible that HW has already cached some of the unused PTEs internally. These stale entries never get a chance to be purged in existing control flow. This patch extends the check in page fault code for hugepages. Check if a faulted address falls with in size for the hugetlb file backing it. We return VM_FAULT_MINOR for these cases (assuming that the arch specific page-faulting code purges the stale entry for the archs that need it). Signed-off-by: Rohit Seth <rohit.seth@intel.com> [ This is apparently arguably an ia64 port bug. But the code won't hurt, and for now it fixes a real problem on some ia64 machines ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-19 13:56:27 -07:00
Linus Torvalds	3d80636a0d	Fix memory ordering bug in page reclaim As noticed by Nick Piggin, we need to make sure that we check the page count before we check for PageDirty, since the dirty check is only valid if the count implies that we're the only possible ones holding the page. We always did do this, but the code needs a read-memory-barrier to make sure that the orderign is also honored by the CPU. (The writer side is ordered due to the atomic decrement and test on the page count, see the discussion on linux-kernel) Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-16 17:36:06 -07:00
Hugh Dickins	f5154a98a1	[PATCH] Don't map the same page too much Refuse to install a page into a mapping if the mapping count is already ridiculously large. You probably cannot trigger this on 32-bit architectures, but on a 64-bit setup we should protect against it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-11 12:03:47 -07:00
Suzuki	1bef400329	[PATCH] madvise: Avoid returning error code -EBADF for anonymous mappings Revert this recent correctness change: Douglas Crosher <dcrosher@scieneer.com> reported that it broke an existing application, and that madvise() works without error on anonymous mappings on Solaris. This means that madvise() will remain non-standards-compliant: we should return -EBADF for all requests against non-file-backed vma's, but Linux only does this for MADV_WILLNEED requests. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-11 09:46:54 -07:00
Al Viro	dd0fc66fb3	[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-10-08 15:00:57 -07:00
Linus Torvalds	6e3254c4e2	Revert "x86-64: Reverse order of bootmem lists" As requested by Thomas Gleixner <tglx@linutronix.de>: "5d3d0f7704ed0bc7eaca0501eeae3e5da1ea6c87 breaks a couple of ARM boards, which depend on the historical bootmem allocation order. There is a cleaner solution around to remove the pgdat list completely, but this is a topic for post 2.6.14 Andi signalled ACK already." Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-30 12:38:27 -07:00
Alok N Kataria	5c38230087	[PATCH] kmalloc_node IRQ safety fix In kmalloc_node we are checking if the allocation is for the same node when interrupts are "on". This may lead to an allocation on another node than intended. This patch just shifts the check for the current node in __cache_alloc_node when interrupts are disabled. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-28 07:46:42 -07:00
Nick Piggin	8b1f312461	[PATCH] mm: move_pte to remap ZERO_PAGE Move the ZERO_PAGE remapping complexity to the move_pte macro in asm-generic, have it conditionally depend on __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS. For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes a noop. From: Hugh Dickins <hugh@veritas.com> Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch. The "pte" at that point may be a swap entry or a pte_file entry: we must check pte_present before perhaps corrupting such an entry. Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's mm/mremap.c, and more dangerous there since it's affecting all arches: I think the safest course is to send Nick's patch and Yoichi's build fix and this fix (build tested) on to Linus - so only MIPS can be affected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-28 07:46:40 -07:00
Andrew Morton	dbdb904500	[PATCH] revert oversized kmalloc check As davem points out, this wasn't such a great idea. There may be some code which does: size = 1024*1024; while (kmalloc(size, ...) == 0) size /= 2; which will now explode. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-23 13:35:37 -07:00
Rob Landley	f7b3a4359b	[PATCH] Fix bd_claim() error code. Problem: In some circumstances, bd_claim() is returning the wrong error code. If we try to swapon an unused block device that isn't swap formatted, we get -EINVAL. But if that same block device is already mounted, we instead get -EBUSY, even though it still isn't a valid swap device. This issue came up on the busybox list trying to get the error message from "swapon -a" right. If a swap device is already enabled, we get -EBUSY, and we shouldn't report this as an error. But we can't distinguish the two -EBUSY conditions, which are very different errors. In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy means "somebody other than sys_swapon has already claimed this", and _that_ means this block device can't be a valid swap device. So return -EINVAL there. Signed-off-by: Rob Landley <rob@landley.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:37 -07:00
Christoph Lameter	eafb42707b	[PATCH] __kmalloc: Generate BUG if size requested is too large. I had an issue on ia64 where I got a bug in kernel/workqueue because kzalloc returned a NULL pointer due to the task structure getting too big for the slab allocator. Usually these cases are caught by the kmalloc macro in include/linux/slab.h. Compilation will fail if a too big value is passed to kmalloc. However, kzalloc uses __kmalloc which has no check for that. This patch makes __kmalloc bug if a too large entity is requested. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:36 -07:00
Christoph Lameter	ff69416e63	[PATCH] slab: fix handling of pages from foreign NUMA nodes The numa slab allocator may allocate pages from foreign nodes onto the lists for a particular node if a node runs out of memory. Inspecting the slab->nodeid field will not reflect that the page is now in use for the slabs of another node. This patch fixes that issue by adding a node field to free_block so that the caller can indicate which node currently uses a slab. Also removes the check for the current node from kmalloc_cache_node since the process may shift later to another node which may lead to an allocation on another node than intended. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:35 -07:00
Ivan Kokshaysky	7243cc05ba	[PATCH] slab: alpha inlining fix It is essential that index_of() be inlined. But alpha undoes the gcc inlining hackery and index_of() ends up out-of-line. So fiddle with things to make that function inline again. Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-22 22:17:34 -07:00
Paolo 'Blaisorblade' Giarrusso	7e2cff42cf	[PATCH] mm: add a note about partially hardcoded VM_* flags Hugh made me note this line for permission checking in mprotect(): if ((newflags & ~(newflags >> 4)) & 0xf) { after figuring out what's that about, I decided it's nasty enough. Btw Hugh itself didn't like the 0xf. We can safely change it to VM_READ\|VM_WRITE\|VM_EXEC because we never change VM_SHARED, so no need to check that. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-21 10:11:55 -07:00
Paolo 'Blaisorblade' Giarrusso	f10df68604	[PATCH] fix locking comment in unmap_region() That comment is plain wrong (we even take the pagetable lock inside unmap_region()). Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-21 10:11:55 -07:00
Dave Hansen	f3519f9194	[PATCH] fix mm/Kconfig spelling Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-17 11:50:01 -07:00
Alok Kataria	c7e43c78ae	[PATCH] Fix slab BUG_ON() triggered by change in array cache size With the new changes that we made in the initialization of the slab allocator, we first setup the cache from which array caches are allocated, and then the cache, from which kmem_list3's are allocated. Now if the array cache comes from a cache in which objsize > 32, (in this instance size-64) then, first size-64 cache will be allocated and then the size-128 (if this is the cache from which kmem_list3's are going to be allocated). So with these new changes, we are not guaranteed that we will be initializing the malloc_sizes array in a serialized order. Thus there is a bug in __find_general_cachep, as we are checking whether the first cache_sizes ptr is NULL. This is replaced by checking whether the array-cache cache is initialized. Attached is a patch which does that. Boots fine on a x86-64, with DEBUG_SPIN, DEBUG_SLAB, and preempt. Attached is a patch which does that. Boots fine on a x86-64, with DEBUG_SPIN, DEBUG_SLAB, and preempt.Thanks & Regards, Alok Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhitdayal.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-14 12:31:45 -07:00
Hugh Dickins	2fd4ef85e0	[PATCH] error path in setup_arg_pages() misses vm_unacct_memory() Pavel Emelianov and Kirill Korotaev observe that fs and arch users of security_vm_enough_memory tend to forget to vm_unacct_memory when a failure occurs further down (typically in setup_arg_pages variants). These are all users of insert_vm_struct, and that reservation will only be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't. So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into Committed_AS each time they're run. But don't add VM_ACCOUNT to them, it's inappropriate to reserve against the very unlikely case that gdb be used to COW a vDSO page - we ought to do something about that in do_wp_page, but there are yet other inconsistencies to be resolved. The safe and economical way to fix this is to let insert_vm_struct do the security_vm_enough_memory check when it finds VM_ACCOUNT is set. And the MIPS irix_brk has been calling security_vm_enough_memory before calling do_brk which repeats it, doubly accounting and so also leaking. Remove that, and all the fs and arch calls to security_vm_enough_memory: give it a less misleading name later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-14 11:18:13 -07:00
Randy Dunlap	9f1583339a	[PATCH] use add_taint() for setting tainted bit flags Use the add_taint() interface for setting tainted bit flags instead of doing it manually. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:29 -07:00
Andi Kleen	5b952b3c14	[PATCH] Fix MPOL_F_VERIFY There was a pretty bad bug in there that the code would always check the full VMA, not the range the user requested. When the VMA to be checked was merged with the previous VMA this could lead to spurious failures. Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:28 -07:00
Con Kolivas	8d0986e289	[PATCH] vm: kswapd cleanup: use pgdat Use the pgdat pointer we've already defined in wakeup_kswapd Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-13 08:22:28 -07:00
Andi Kleen	5d3d0f7704	[PATCH] x86-64: Reverse order of bootmem lists This leads to bootmem allocating first from node 0 instead of from the last node. This avoids swiotlb allocating on the last node, which doesn't really work on a machine with >4GB. Note: there is a better patch around from someone else that gets rid of the pgdat list completely. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-12 10:49:56 -07:00
Greg Ungerer	66aa2b4b1c	[PATCH] uclinux: add NULL check, 0 end valid check and some more exports to nommu.c Move call to get_mm_counter() in update_mem_hiwater() to be inside the check for tsk->mm being null. Otherwise you can be following a null pointer here. This patch submitted by Javier Herrero <jherrero@hvsistemas.es>. Modify the end check for munmap regions to allow for the legacy behavior of 0 being valid. Pretty much all current uClinux system libc malloc's pass in 0 as the end point. A hard check will fail on these, so change the check so that if it is non-zero it must be valid otherwise it fails. A passed in value will always succeed (as it used too). Also export a few more mm system functions - to be consistent with the VM code exports. Signed-off-by: Greg Ungerer <gerg@uclinux.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-11 20:43:47 -07:00
Nishanth Aravamudan	13e4b57f6a	[PATCH] mm: fix-up schedule_timeout() usage Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:37 -07:00
Renaud Lienhart	207f36eec9	[PATCH] remove invalid comment in mm/page_alloc.c free_pages_bulk() doesn't free the entire list if count == 0. Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:31 -07:00
Victor Fusco	9de75d110c	[PATCH] mm/swap_state: Fix "nocast type" warnings Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:28 -07:00
Victor Fusco	b2d550736f	[PATCH] mm/slab: fix sparse warnings Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:26 -07:00
Adrian Bunk	5ce7852cdf	[PATCH] mm/filemap.c: make two functions static With Nick Piggin <npiggin@suse.de> Give some things static scope. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-10 10:06:25 -07:00
Ingo Molnar	8d06afab73	[PATCH] timer initialization cleanup: DEFINE_TIMER Clean up timer initialization by introducing DEFINE_TIMER a'la DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been been in the -RT tree for some time. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 14:03:48 -07:00
Pekka Enberg	80e93effce	[PATCH] update kfree, vfree, and vunmap kerneldoc This patch clarifies NULL handling of kfree() and vfree(). I addition, wording of calling context restriction for vfree() and vunmap() are changed from "may not" to "must not." Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 14:03:43 -07:00
Christoph Lameter	e498be7daf	[PATCH] Numa-aware slab allocator V5 The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Shai Fultheim <Shai@Scalex86.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:48 -07:00
Stephen Smalley	570bc1c2e5	[PATCH] tmpfs: Enable atomic inode security labeling This patch modifies tmpfs to call the inode_init_security LSM hook to set up the incore inode security state for new inodes before the inode becomes accessible via the dcache. As there is no underlying storage of security xattrs in this case, it is not necessary for the hook to return the (name, value, len) triple to the tmpfs code, so this patch also modifies the SELinux hook function to correctly handle the case where the (name, value, len) pointers are NULL. The hook call is needed in tmpfs in order to support proper security labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With this change in place, we should then be able to remove the security_inode_post_create/mkdir/... hooks safely. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:28 -07:00
Mark Fasheh	fef266580e	[PATCH] update filesystems for new delete_inode behavior Update the file systems in fs/ implementing a delete_inode() callback to call truncate_inode_pages(). One implementation note: In developing this patch I put the calls to truncate_inode_pages() at the very top of those filesystems delete_inode() callbacks in order to retain the previous behavior. I'm guessing that some of those could probably be optimized. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-09 13:57:27 -07:00
Andi Kleen	d42c69972b	[PATCH] PCI: Run PCI driver initialization on local node Run PCI driver initialization on local node Instead of adding messy kmalloc_node()s everywhere run the PCI driver probe on the node local to the device. This would not have helped for IDE, but should for other more clean drivers that do more initialization in probe(). It won't help for drivers that do most of the work on first open (like many network drivers) Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2005-09-08 14:57:23 -07:00
Pekka J Enberg	dd3927105b	[PATCH] introduce and use kzalloc This patch introduces a kzalloc wrapper and converts kernel/ to use it. It saves a little program text. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:45 -07:00
Paul Jackson	ef08e3b498	[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:40 -07:00
Paul Jackson	9bf2229f88	[PATCH] cpusets: formalize intermediate GFP_KERNEL containment This patch makes use of the previously underutilized cpuset flag 'mem_exclusive' to provide what amounts to another layer of memory placement resolution. With this patch, there are now the following four layers of memory placement available: 1) The whole system (interrupt and GFP_ATOMIC allocations can use this), 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use), 3) The current tasks cpuset (GFP_USER allocations constrained to here), and 4) Specific node placement, using mbind and set_mempolicy. These nest - each layer is a subset (same or within) of the previous. Layer (2) above is new, with this patch. The call used to check whether a zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is extended to take a gfp_mask argument, and its logic is extended, in the case that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if placement is allowed. The definition of GFP_USER, which used to be identical to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous cpuset_gfp_hardwall_flag patch. GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks cpuset, so long as any node therein is not too tight on memory, but will escape to the larger layer, if need be. The intended use is to allow something like a batch manager to handle several jobs, each job in its own cpuset, but using common kernel memory for caches and such. Swapper and oom_kill activity is also constrained to Layer (2). A task in or below one mem_exclusive cpuset should not cause swapping on nodes in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a task in another such cpuset. Heavy use of kernel memory for i/o caching and such by one job should not impact the memory available to jobs in other non-overlapping mem_exclusive cpusets. This patch enables providing hardwall, inescapable cpusets for memory allocations of each job, while sharing kernel memory allocations between several jobs, in an enclosing mem_exclusive cpuset. Like Dinakar's patch earlier to enable administering sched domains using the cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag that had previously done nothing much useful other than restrict what cpuset configurations were allowed. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:40 -07:00
Paul Jackson	a49335ccea	[PATCH] cpusets: oom_kill tweaks This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:39 -07:00
Ravikiran G Thirumalai	6c231b7bab	[PATCH] Additions to .data.read_mostly section Mark variables which are usually accessed for reads with __readmostly. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:33 -07:00
Steven Pratt	3b30bbd963	[PATCH] readahead: reset cache_hit earlier We don't reset the cache hit count until after readahead does a successful readahead. This seems to leave a corner case open where we miss in cache, but don't restart the readhead right away. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:25 -07:00
Christoph Hellwig	cdb3826b99	[PATCH] remove misleading comment above sys_brk Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:23 -07:00
Christoph Lameter	c3d8c14145	[PATCH] More __read_mostly variables Move some more frequently read variables that showed up during some of our performance tests as sometimes ending up in hot cachelines to the read_mostly section. Fix: Move the __read_mostly from before hpet_usec_quotient to follow the variable like the other uses of __read_mostly. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-07 16:57:18 -07:00
Stephen Smalley	f549d6c18c	[PATCH] Generic VFS fallback for security xattrs This patch modifies the VFS setxattr, getxattr, and listxattr code to fall back to the security module for security xattrs if the filesystem does not support xattrs natively. This allows security modules to export the incore inode security label information to userspace even if the filesystem does not provide xattr storage, and eliminates the need to individually patch various pseudo filesystem types to provide such access. The patch removes the existing xattr code from devpts and tmpfs as it is then no longer needed. The patch restructures the code flow slightly to reduce duplication between the normal path and the fallback path, but this should only have one user-visible side effect - a program may get -EACCES rather than -EOPNOTSUPP if policy denied access but the filesystem didn't support the operation anyway. Note that the post_setxattr hook call is not needed in the fallback case, as the inode_setsecurity hook call handles the incore inode security state update directly. In contrast, we do call fsnotify in both cases. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:52 -07:00
Martin Hicks	c07e02db76	[PATCH] VM: add page_state info to per-node meminfo Add page_state info to the per-node meminfo file in sysfs. This is mostly just for informational purposes. The lack of this information was brought up recently during a discussion regarding pagecache clearing, and I put this patch together to test out one of the suggestions. It seems like interesting info to have, so I'm submitting the patch. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:49 -07:00
Manfred Spraul	00e145b6d5	[PATCH] slab: removes local_irq_save()/local_irq_restore() pair Proposed by and based on a patch from Eric Dumazet <dada1@cosmosbay.com>: This patch removes unnecessary critical section in ksize() function, as cli/sti are rather expensive on modern CPUS. It additionally adds a docbook entry for ksize() and further simplifies the code. Signed-Off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:49 -07:00
Eric Dumazet	34342e863c	[PATCH] mm/slab.c: prefetchw the start of new allocated objects Mostobjects returned by __cache_alloc() will be written by the caller, (but not all callers want to write all the object, but just at the begining) prefetchw() tells the modern CPU to think about the future writes, ie start some memory transactions in advance. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Zachary Amsden	a600388d28	[PATCH] x86: ptep_clear optimization Add a new accessor for PTEs, which passes the full hint from the mmu_gather struct; this allows architectures with hardware pagetables to optimize away atomic PTE operations when destroying an address space. Removing the locked operation should allow better pipelining of memory access in this loop. I measured an average savings of 30-35 cycles per zap_pte_range on the first 500 destructions on Pentium-M, but I believe the optimization would win more on older processors which still assert the bus lock on xchg for an exclusive cacheline. Update: I made some new measurements, and this saves exactly 26 cycles over ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180 cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a full address space destruction. pte_clear_full is not yet used, but is provided for future optimizations (in particular, when running inside of a hypervisor that queues page table updates, the full hint allows us to avoid queueing unnecessary page table update for an address space in the process of being destroyed. This is not a huge win, but it does help a bit, and sets the stage for further hypervisor optimization of the mm layer on all architectures. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Kyle Moffett	fa5b08d5f8	[PATCH] sab: consolidate kmem_bufctl_t This is used only in slab.c and each architecture gets to define whcih underlying type is to be used. Seems a bit silly - move it to slab.c and use the same type for all architectures: unsigned int. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:48 -07:00
Adam Litke	7bf07f3d4b	[PATCH] hugetlb: move stale pte check into huge_pte_alloc() Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:46 -07:00
Deepak Saxena	fd195c49fb	[PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDER Version 6 of the ARM architecture introduces the concept of 16MB pages (supersections) and 36-bit (40-bit actually, but nobody uses this) physical addresses. 36-bit addressed memory and I/O and ARMv6 can only be mapped using supersections and the requirement on these is that both virtual and physical addresses be 16MB aligned. In trying to add support for ioremap() of 36-bit I/O, we run into the issue that get_vm_area() allows for a maximum of 512K alignment via the IOREMAP_MAX_ORDER constant. To work around this, we can: - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER)) and then align the pointer ourselves, but this ends up with 512K of wasted VM per ioremap(). - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit on top of this. I did this and it works but I don't like the idea adding another VM API just for this one case. - My preferred solution which is to allow the architecture to override the IOREMAP_MAX_ORDER constant with it's own version. Signed-off-by: Deepak Saxena <dsaxena@plexity.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:46 -07:00
Paolo 'Blaisorblade' Giarrusso	4944e76d81	[PATCH] mm: remove implied vm_ops check If !vma->vm-ops we already BUG above, so retesting it is useless. The compiler cannot optimize this because BUG is a macro and is not thus marked noreturn; that should possibly be fixed. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Paolo 'Blaisorblade' Giarrusso	d44ed4f868	[PATCH] shmem_populate: avoid an useless check, and some comments Either shmem_getpage returns a failure, or it found a page, or it was told it couldn't do any I/O. So it's useless to check nonblock in the else branch. We could add a BUG() there but I preferred to comment the offending function. This was taken out from one Ingo Molnar's old patch I'm resurrecting. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Martin Hicks	0abf40c1ac	[PATCH] vm: slab.c spelling correction Fix a small spelling mistake. subtile->subtle Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:45 -07:00
Hugh Dickins	836d5ffd34	[PATCH] mm: fix madvise vma merging Better late than never, I've at last reviewed the madvise vma merging going into 2.6.13. Remove a pointless check and fix two little bugs - a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed both mismerges in practice: though being madvise, neither was disastrous. 1. Correct placement of the success label in madvise_behavior: as in mprotect_fixup and mlock_fixup, it is necessary to update vm_flags when vma_merge succeeds (to handle the exceptional Case 8 noted in the comments above vma_merge itself). 2. Correct initial value of prev when starting part way into a vma: as in sys_mprotect and do_mlock, it needs to be set to vma in this case (vma_merge handles only that minimum of cases shown in its comments). 3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next, so it's pointless to make that same assignment again in sys_madvise. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Martin Hicks	53e9a6159f	[PATCH] VM: zone reclaim atomic ops cleanup Christoph Lameter and Marcelo Tosatti asked to get rid of the atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Martin Hicks	bce5f6ba34	[PATCH] VM: add capabilites check to set_zone_reclaim Add a capability check to sys_set_zone_reclaim(). This syscall is not something that should be available to a user. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	242e546862	[PATCH] mm: remove atomic This bitop does not need to be atomic because it is performed when there will be no references to the page (ie. the page is being freed). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	9a61c349b2	[PATCH] mm: remap ZERO_PAGE mappings filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it has no data page to map there: then if the hole in the file is later filled, __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that offset by the newly allocated file page, so that established mappings will see the newly written data. However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs, chosen for coloring by user virtual address; and if mremap has meanwhile been used to move a mapping containing a ZERO_PAGE, it will generally not match the ZERO_PAGE(address) __xip_unmap is looking for. To maintain XIP's established mappings correctly on MIPS, we need Nick's fix to mremap's move_one_page (originally presented as an optimization), to replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE appropriate to the new address. (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get corrupted, very bad. Now I think it's the other way round, that the established mappings will fail to see the newly written data: incorrect, but not corrupting everything else. Whether filemap_xip's technique is generally safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried to do that in tmpfs.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:44 -07:00
Nick Piggin	4d7670e0f6	[PATCH] mm: cleanup rmap Thanks to Bill Irwin for pointing this out. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Nick Piggin	2822c1aa57	[PATCH] mm: micro-optimise rmap Microoptimise page_add_anon_rmap. Although these expressions are used only in the taken branch of the if() statement, the compiler can't reorder them inside because atomic_inc_and_test is a barrier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Nick Piggin	c3dce2d89c	[PATCH] mm: comment rmap Just be clear that VM_RESERVED pages here are a bug, and the test is not there because they are expected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Christoph Lameter	6e21c8f145	[PATCH] /proc/<pid>/numa_maps to show on which nodes pages reside This patch was recently discussed on linux-mm: http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2 I inherited a large code base from Ray for page migration. There was a small patch in there that I find to be very useful since it allows the display of the locality of the pages in use by a process. I reworked that patch and came up with a /proc/<pid>/numa_maps that gives more information about the vma's of a process. numa_maps is indexes by the start address found in /proc/<pid>/maps. F.e. with this patch you can see the page use of the "getty" process: margin:/proc/12008 # cat maps 00000000-00004000 r--p 00000000 00:00 0 2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so 2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so 2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0 2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0 2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE 2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache 2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0 4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty 6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty 6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap] 60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0 60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack] a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso] cat numa_maps 2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2 2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16 2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3 2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3 2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2 2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1 4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2 6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 getty uses ld.so. The first vma is the code segment which is used by 43 other processes and the pages are evenly distributed over the 4 nodes. The second vma is the process specific data portion for ld.so. This is only one page. The display format is: <startaddress> Links to information in /proc/<pid>/map <memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}" MaxRef= <maximum reference to a page in this vma> Pages= <Nr of pages in use> Mapped= <Nr of pages with mapcount > Anon= <nr of anonymous pages> Nx= <Nr of pages on Node x> The content of the proc-file is self-evident. If this would be tied into the sparsemem system then the contents of this file would not be too useful. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:43 -07:00
Hugh Dickins	839b9685e8	[PATCH] rmap: don't test rss Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a time when testing rss was important to avoid a particular race between dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo- optimization, made even more obscure by the get_mm_counter macro. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	3279ffd97f	[PATCH] delete from_swap_cache BUG_ONs Three of the four BUG_ONs in delete_from_swap_cache are immediately repeated in __delete_from_swap_cache: delete those and add the one. But perhaps mm/ is altogether overprovisioned with historic BUGs? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	5d337b9194	[PATCH] swap: swap_lock replace list+device The idea of a swap_device_lock per device, and a swap_list_lock over them all, is appealing; but in practice almost every holder of swap_device_lock must already hold swap_list_lock, which defeats the purpose of the split. The only exceptions have been swap_duplicate, valid_swaphandles and an untrodden path in try_to_unuse (plus a few places added in this series). valid_swaphandles doesn't show up high in profiles, but swap_duplicate does demand attention. However, with the hold time in get_swap_pages so much reduced, I've not yet found a load and set of swap device priorities to show even swap_duplicate benefitting from the split. Certainly the split is mere overhead in the common case of a single swap device. So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock (generally we seem to prefer an _ in the name, and not hide in a macro). If someone can show a regression in swap_duplicate, then probably we should add a hashlock for the swap_map entries alone (shorts being anatomic), so as to help the case of the single swap device too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:42 -07:00
Hugh Dickins	048c27fd72	[PATCH] swap: scan_swap_map latency breaks The get_swap_page/scan_swap_map latency can be so bad that even those without preemption configured deserve relief: periodically cond_resched. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	52b7efdbe5	[PATCH] swap: scan_swap_map drop swap_device_lock get_swap_page has often shown up on latency traces, doing lengthy scans while holding two spinlocks. swap_list_lock is already dropped, now scan_swap_map drop swap_device_lock before scanning the swap_map. While scanning for an empty cluster, don't worry that racing tasks may allocate what was free and free what was allocated; but when allocating an entry, check it's still free after retaking the lock. Avoid dropping the lock in the expected common path. No barriers beyond the locks, just let the cookie crumble; highest_bit limit is volatile, but benign. Guard against swapoff: must check SWP_WRITEOK before allocating, must raise SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to fall - just use schedule_timeout, we don't want to burden scan_swap_map itself, and it's very unlikely that anyone can really still be in scan_swap_map once swapoff gets this far. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	7dfad4183b	[PATCH] swap: scan_swap_map restyled Rewrite scan_swap_map to allocate in just the same way as before (taking the next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly empty cluster, falling back to lowest entry if none), but with a view towards dropping the lock in the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	fb4f88dcab	[PATCH] swap: get_swap_page drop swap_list_lock Rewrite get_swap_page to allocate in just the same sequence as before, but without holding swap_list_lock across its scan_swap_map. Decrement nr_swap_pages and update swap_list.next in advance, while still holding swap_list_lock. Skip full devices by testing highest_bit. Swapoff hold swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. Reduces lock contention when there are parallel swap devices of the same priority. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	89d09a2c80	[PATCH] swap: freeing update swap_list.next This makes negligible difference in practice: but swap_list.next should not be updated to a higher prio in the general helper swap_info_get, but rather in swap_entry_free; and then only in the case when entry is actually freed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	6eb396dc4a	[PATCH] swap: swap unsigned int consistency The swap header's unsigned int last_page determines the range of swap pages, but swap_info has been using int or unsigned long in some cases: use unsigned int throughout (except, in several places a local unsigned long is useful to avoid overflows when adding). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:41 -07:00
Hugh Dickins	53092a7402	[PATCH] swap: show span of swap extents The "Adding %dk swap" message shows the number of swap extents, as a guide to how fragmented the swapfile may be. But a useful further guide is what total extent they span across (sometimes scarily large). And there's no need to keep nr_extents in swap_info: it's unused after the initial message, so save a little space by keeping it on stack. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	11d31886db	[PATCH] swap: swap extent list is ordered There are several comments that swap's extent_list.prev points to the lowest extent: that's not so, it's extent_list.next which points to it, as you'd expect. And a couple of loops in add_swap_extent which go all the way through the list, when they should just add to the other end. Fix those up, and let map_swap_page search the list forwards: profiles shows it to be twice as quick that way - because prefetch works better on how the structs are typically kmalloc'ed? or because usually more is written to than read from swap, and swap is allocated ascendingly? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	4cd3bb10ff	[PATCH] swap: move destroy_swap_extents calls sys_swapon's call to destroy_swap_extents on failure is made after the final swap_list_unlock, which is faintly unsafe: another sys_swapon might already be setting up that swap_info_struct. Calling it earlier, before taking swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe, but likewise move it earlier, before taking the locks (once try_to_unuse has completed, nothing can be needing the swap extents). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00
Hugh Dickins	e2244ec2ef	[PATCH] swap: correct swapfile nr_good_pages If a regular swapfile lies on a filesystem whose blocksize is less than PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap pages; but sys_swapon's nr_good_pages was not expecting that. Also, setup_swap_extents takes no account of badpages listed in the swap header: not worth doing so, but ensure nr_badpages is 0 for a regular swapfile. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2005-09-05 00:05:40 -07:00

... 4 5 6 7 8 ...

676 Commits