Commit Graph

27 Commits

Author SHA1 Message Date
Tejun Heo
a70c691376 sparc64: implement page mapping percpu first chunk allocator
Implement page mapping percpu first chunk allocator as a fallback to
the embedding allocator.  The next patch will make the embedding
allocator check distances between units to determine whether it fits
within the vmalloc area so that this fallback can be used on such
cases.

sparc64 currently has relatively small vmalloc area which makes it
impossible to create any dynamic chunks on certain configurations
leading to percpu allocation failures.  This and the next patch should
allow those configurations to keep working until proper solution is
found.

While at it, mark pcpu_cpu_distance() with __init.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
2009-09-29 09:17:57 +09:00
Tejun Heo
bcb2107fdb sparc64: use embedding percpu first chunk allocator
sparc64 currently allocates a large page for each cpu and partially
remap them into vmalloc area much like what lpage first chunk
allocator did.  As a 4M page is used for each cpu, this results in
very large unit size and also adds TLB pressure due to the double
mapping of pages in the first chunk.

This patch converts sparc64 to use the embedding percpu first chunk
allocator which now knows how to handle NUMA configurations.  This
simplifies the code a lot, doesn't incur any extra TLB pressure and
results in better utilization of address space.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
2009-08-14 15:00:53 +09:00
Tejun Heo
fb435d5233 percpu: add pcpu_unit_offsets[]
Currently units are mapped sequentially into address space.  This
patch adds pcpu_unit_offsets[] which allows units to be mapped to
arbitrary offsets from the chunk base address.  This is necessary to
allow sparse embedding which might would need to allocate address
ranges and memory areas which aren't aligned to unit size but
allocation atom size (page or large page size).  This also simplifies
things a bit by removing the need to calculate offset from unit
number.

With this change, there's no need for the arch code to know
pcpu_unit_size.  Update pcpu_setup_first_chunk() and first chunk
allocators to return regular 0 or -errno return code instead of unit
size or -errno.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
2009-08-14 15:00:51 +09:00
Tejun Heo
fd1e8a1fe2 percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array.  For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem.  Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.

This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus.  pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.

pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.

* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
  help dealing with pcpu_alloc_info.

* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
  generalized and renamed to pcpu_build_alloc_info().
  @cpu_distance_fn may be NULL indicating that all cpus are of
  LOCAL_DISTANCE.

* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
  generalized and renamed to pcpu_dump_alloc_info().  It now also
  prints which group each alloc unit belongs to.

* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
  separate parameters.  All first chunk allocators are updated to use
  pcpu_build_alloc_info() to build alloc_info and call
  pcpu_setup_first_chunk() with it.  This has the side effect of
  packing units for sparse possible cpus.  ie. if cpus 0, 2 and 4 are
  possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.

* x86 setup_pcpu_lpage() is updated to deal with alloc_info.

* sparc64 setup_per_cpu_areas() is updated to build alloc_info.

Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus.  It mostly changes how information is passed among
initialization functions and makes room for more flexibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 15:00:51 +09:00
Tejun Heo
384be2b18a Merge branch 'percpu-for-linus' into percpu-for-next
Conflicts:
	arch/sparc/kernel/smp_64.c
	arch/x86/kernel/cpu/perf_counter.c
	arch/x86/kernel/setup_percpu.c
	drivers/cpufreq/cpufreq_ondemand.c
	mm/percpu.c

Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids.  As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-08-14 14:45:31 +09:00
Tejun Heo
74d46d6b2d percpu, sparc64: fix sparse possible cpu map handling
percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
is incorrect if cpu_possible_map contains holes.  This causes percpu
code to access beyond allocated memories and vmalloc areas.  On a
sparc64 machine with cpus 0 and 2 (u60), this triggers the following
warning or fails boot.

 WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
 Modules linked in:
 Call Trace:
  [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
  [00000000004b1840] map_vm_area+0x20/0x60
  [00000000004b1950] __vmalloc_area_node+0xd0/0x160
  [0000000000593434] deflate_init+0x14/0xe0
  [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
  [00000000005844f0] crypto_alloc_base+0x50/0xa0
  [000000000058b898] alg_test_comp+0x18/0x80
  [000000000058dad4] alg_test+0x54/0x180
  [000000000058af00] cryptomgr_test+0x40/0x60
  [0000000000473098] kthread+0x58/0x80
  [000000000042b590] kernel_thread+0x30/0x60
  [0000000000472fd0] kthreadd+0xf0/0x160
 ---[ end trace 429b268a213317ba ]---

This patch fixes generic percpu functions and sparc64
setup_per_cpu_areas() so that they handle sparse cpu_possible_map
properly.

Please note that on x86, cpu_possible_map() doesn't contain holes and
thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
2009-08-14 13:20:53 +09:00
Tejun Heo
2f39e637ea percpu: allow non-linear / sparse cpu -> unit mapping
Currently cpu and unit are always identity mapped.  To allow more
efficient large page support on NUMA and lazy allocation for possible
but offline cpus, cpu -> unit mapping needs to be non-linear and/or
sparse.  This can be easily implemented by adding a cpu -> unit
mapping array and using it whenever looking up the matching unit for a
cpu.

The only unusal conversion is in pcpu_chunk_addr_search().  The passed
in address is unit0 based and unit0 might not be in use so it needs to
be converted to address of an in-use unit.  This is easily done by
adding the unit offset for the current processor.

[ Impact: allows non-linear/sparse cpu -> unit mapping, no visible change yet ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-07-04 08:11:00 +09:00
Tejun Heo
ce3141a277 percpu: drop pcpu_chunk->page[]
percpu core doesn't need to tack all the allocated pages.  It needs to
know whether certain pages are populated and a way to reverse map
address to page when freeing.  This patch drops pcpu_chunk->page[] and
use populated bitmap and vmalloc_to_page() lookup instead.  Using
vmalloc_to_page() exclusively is also possible but complicates first
chunk handling, inflates cache footprint and prevents non-standard
memory allocation for percpu memory.

pcpu_chunk->page[] was used to track each page's allocation and
allowed asymmetric population which happens during failure path;
however, with single bitmap for all units, this is no longer possible.
Bite the bullet and rewrite (de)populate functions so that things are
done in clearly separated steps such that asymmetric population
doesn't happen.  This makes the (de)population process much more
modular and will also ease implementing non-standard memory usage in
the future (e.g. large pages).

This makes @get_page_fn parameter to pcpu_setup_first_chunk()
unnecessary.  The parameter is dropped and all first chunk helpers are
updated accordingly.  Please note that despite the volume most changes
to first chunk helpers are symbol renames for variables which don't
need to be referenced outside of the helper anymore.

This change reduces memory usage and cache footprint of pcpu_chunk.
Now only #unit_pages bits are necessary per chunk.

[ Impact: reduced memory usage and cache footprint for bookkeeping ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-07-04 08:11:00 +09:00
Tejun Heo
38a6be5254 percpu: simplify pcpu_setup_first_chunk()
Now that all first chunk allocator helpers allocate and map the first
chunk themselves, there's no need to have optional default alloc/map
in pcpu_setup_first_chunk().  Drop @populate_pte_fn and only leave
@dyn_size optional and make all other params mandatory.

This makes it much easier to follow what pcpu_setup_first_chunk() is
doing and what actual differences tweaking each parameter results in.

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
2009-07-04 08:10:59 +09:00
Stephen Rothwell
6ac5c61082 sparc: replace uses of CPU_MASK_ALL_PTR
CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:

	#define CPU_MASK_ALL (cpumask_t) { { ... } }

Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:

	#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)

Which formalizes this practice.  One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).

[Description by Rusty Russell]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:55 -07:00
Hong H. Pham
280ff97494 sparc64: fix and optimize irq distribution
irq_choose_cpu() should compare the affinity mask against cpu_online_map
rather than CPU_MASK_ALL, since irq_select_affinity() sets the interrupt's
affinity mask to cpu_online_map "and" CPU_MASK_ALL (which ends up being
just cpu_online_map).  The mask comparison in irq_choose_cpu() will always
fail since the two masks are not the same.  So the CPU chosen is the first CPU
in the intersection of cpu_online_map and CPU_MASK_ALL, which is always CPU0.
That means all interrupts are reassigned to CPU0...

Distributing interrupts to CPUs in a linearly increasing round robin fashion
is not optimal for the UltraSPARC T1/T2.  Also, the irq_rover in
irq_choose_cpu() causes an interrupt to be assigned to a different
processor each time the interrupt is allocated and released.  This may lead
to an unbalanced distribution over time.

A static mapping of interrupts to processors is done to optimize and balance
interrupt distribution.  For the T1/T2, interrupts are spread to different
cores first, and then to strands within a core.

The following is some benchmarks showing the effects of interrupt
distribution on a T2.  The test was done with iperf using a pair of T5220
boxes, each with a 10GBe NIU (XAUI) connected back to back.

  TCP     | Stock       Linear RR IRQ  Optimized IRQ
  Streams | 2.6.30-rc5  Distribution   Distribution
          | GBits/sec   GBits/sec      GBits/sec
  --------+-----------------------------------------
    1       0.839       0.862          0.868
    8       1.16        4.96           5.88
   16       1.15        6.40           8.04
  100       1.09        7.28           8.68

Signed-off-by: Hong H. Pham <hong.pham@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:28 -07:00
David S. Miller
4fd78a5f1e sparc64: Use new dynamic per-cpu allocator.
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:27 -07:00
David S. Miller
0c243ad81f sparc64: Only allocate per-cpu areas for possible cpus.
This gets us real close to the generic implementation of
setup_per_cpu_areas() except:

1) We store the per-cpu offset into the trap_block[], whereas
   the generic code has it's own static array.

2) We have to initialize the %g5 register to hold the boot cpu's
   per-cpu area offset.

3) The OBP/MDESC cpu info scan is performed at the end.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:25 -07:00
David S. Miller
73fffc037e sparc64: Get rid of real_setup_per_cpu_areas().
Now that we defer the cpu_data() initializations to the end of per-cpu
setup, we can get rid of this local hack we had to setup the per-cpu
areas eary.

This is a necessary step in order to support HAVE_DYNAMIC_PER_CPU_AREA
since the per-cpu setup must run when page structs are available.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:23 -07:00
David S. Miller
b696fdc259 sparc64: Defer cpu_data() setup until end of per-cpu data initialization.
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:22 -07:00
David S. Miller
5a5488d3bb sparc64: Store per-cpu offset in trap_block[]
Surprisingly this actually makes LOAD_PER_CPU_BASE() a little
more efficient.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:11 -07:00
David S. Miller
557fe0e884 sparc64: Reclaim trap_block[]->hdesc
This really isn't necessary at all, a local variable suits the
job just fine.

This frees up 8 bytes in the trap_block[] that we can use later
to store the per-cpu base addresses.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-16 04:56:08 -07:00
David S. Miller
8e255baa44 sparc64: Fix smp_callin() locking.
Interrupts must be disabled when taking the IPI lock.

Caught by lockdep.

Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-04-14 17:08:56 -07:00
David S. Miller
ed223129a3 Merge branch 'master' of ssh://master.kernel.org/home/ftp/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask-for-sparc
Conflicts:
	arch/sparc/kernel/smp_64.c
2009-03-29 15:44:22 -07:00
David S. Miller
f9384d41c0 sparc64: Fix MM refcount check in smp_flush_tlb_pending().
As explained by Benjamin Herrenschmidt:

> CPU 0 is running the context, task->mm == task->active_mm == your
> context. The CPU is in userspace happily churning things.
>
> CPU 1 used to run it, not anymore, it's now running fancyfsd which
> is a kernel thread, but current->active_mm still points to that
> same context.
>
> Because there's only one "real" user, mm_users is 1 (but mm_count is
> elevated, it's just that the presence on CPU 1 as active_mm has no
> effect on mm_count().
>
> At this point, fancyfsd decides to invalidate a mapping currently mapped
> by that context, for example because a networked file has changed
> remotely or something like that, using unmap_mapping_ranges().
>
> So CPU 1 goes into the zapping code, which eventually ends up calling
> flush_tlb_pending(). Your test will succeed, as current->active_mm is
> indeed the target mm for the flush, and mm_users is indeed 1. So you
> will -not- send an IPI to the other CPU, and CPU 0 will continue happily
> accessing the pages that should have been unmapped.

To fix this problem, check ->mm instead of ->active_mm, and this
means:

> So if you test current->mm, you effectively account for mm_users == 1,
> so the only way the mm can be active on another processor is as a lazy
> mm for a kernel thread. So your test should work properly as long
> as you don't have a HW that will do speculative TLB reloads into the
> TLB on that other CPU (and even if you do, you flush-on-switch-in should
> get rid of any crap here).

And therefore we should be OK.

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-27 01:09:17 -07:00
Rusty Russell
81f1adf012 cpumask: use mm_cpumask() wrapper: sparc
Makes code futureproof against the impending change to mm->cpu_vm_mask.

It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-16 14:40:39 +10:30
Rusty Russell
f46df02a57 cpumask: arch_send_call_function_ipi_mask: sparc
We're weaning the core code off handing cpumask's around on-stack.
This introduces arch_send_call_function_ipi_mask(), and by defining
it, the old arch_send_call_function_ipi is defined by the core code.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2009-03-16 14:40:22 +10:30
Rusty Russell
fd8e18e9f4 cpumask: Use smp_call_function_many(): sparc64
Impact: Use new API

Change smp_call_function_mask() callers to smp_call_function_many().

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
2009-03-16 14:40:22 +10:30
Sam Ravnborg
9018113649 sparc64: Use unsigned long long for u64.
Andrew Morton wrote:

    People keep on doing

            printk("%llu", some_u64);

    testing it only on x86_64 and this generates a warning storm on
    powerpc, sparc64, etc.  Because they use `long', not `long long'.

    Quite a few 64-bit architectures are using `long' for their
    s64/u64 types.  We should convert them all to `long long'.

Update types.h so we use unsigned long long for u64 and
fix all warnings in sparc64 code.
Tested with an allnoconfig, defconfig and allmodconfig builds.

This patch introduces additional warnings in several drivers.
These will be dealt with in separate patches.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-06 13:19:28 -08:00
Linus Torvalds
b840d79631 Merge branch 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
  x86: export vector_used_by_percpu_irq
  x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
  sched: nominate preferred wakeup cpu, fix
  x86: fix lguest used_vectors breakage, -v2
  x86: fix warning in arch/x86/kernel/io_apic.c
  sched: fix warning in kernel/sched.c
  sched: move test_sd_parent() to an SMP section of sched.h
  sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
  sched: activate active load balancing in new idle cpus
  sched: bias task wakeups to preferred semi-idle packages
  sched: nominate preferred wakeup cpu
  sched: favour lower logical cpu number for sched_mc balance
  sched: framework for sched_mc/smt_power_savings=N
  sched: convert BALANCE_FOR_xx_POWER to inline functions
  x86: use possible_cpus=NUM to extend the possible cpus allowed
  x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
  x86: update io_apic.c to the new cpumask code
  x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
  x86: xen: use smp_call_function_many()
  x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
  ...

Fixed up trivial conflict in kernel/time/tick-sched.c manually
2009-01-02 11:44:09 -08:00
Rusty Russell
8e757281de sparc: replace for_each_cpu_mask_nr with for_each_cpu
Simple replacement, now the _nr is redundant.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-08 01:10:08 -08:00
Sam Ravnborg
a88b5ba8bd sparc,sparc64: unify kernel/
o Move all files from sparc64/kernel/ to sparc/kernel
  - rename as appropriate
o Update sparc/Makefile to the changes
o Update sparc/kernel/Makefile to include the sparc64 files

NOTE: This commit changes link order on sparc64!

Link order had to change for either of sparc32 and sparc64.
And assuming sparc64 see more testing than sparc32 change link
order on sparc64 where issues will be caught faster.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-04 09:17:21 -08:00