Commit Graph

8028 Commits

Author SHA1 Message Date
Andrew Morton
8f286c33f1 stop using DMA_xxBIT_MASK
Now that we have DMA_BIT_MASK(), these macros are pointless.

Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:21 -07:00
Borislav Petkov
34c6538413 unify DMA_..BIT_MASK definitions: v3.1
Remove redundant DMA_..BIT_MASK definitions across two drivers.  The
computation of the majority of the bitmasks is done by the compiler.  The
initial split of the patch touching each a different file got removed due
to possible git bisect breakage.

Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Reviewed-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:21 -07:00
Tony Breeds
2c62214831 Fix discrepancy between VDSO based gettimeofday() and sys_gettimeofday().
On platforms that copy sys_tz into the vdso (currently only x86_64, soon to
include powerpc), it is possible for the vdso to get out of sync if a user
calls (admittedly unusual) settimeofday(NULL, ptr).

This patch adds a hook for architectures that set
CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also
updatee their copy in the vdso.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:20 -07:00
Alexey Dobriyan
6212e3a388 Remove struct task_struct::io_wait
Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
during 2.6.9 development.  Commit introduced io_wait field which remained
write-only than and still remains write-only.

Also garbage collect macros which "use" io_wait.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:20 -07:00
Rafael J. Wysocki
c7e0831d38 Hibernation: Check if ACPI is enabled during restore in the right place
The following scenario leads to total confusion of the platform firmware on
some boxes (eg. HPC nx6325):
* Hibernate with ACPI enabled
* Resume passing "acpi=off" to the boot kernel

To prevent this from happening it's necessary to check if ACPI is enabled (and
enable it if that's not the case) _right_ _after_ control has been transfered
from the boot kernel to the image kernel, before device_power_up() is called
(ie.  with interrupts disabled).   Enabling ACPI after calling
device_power_up() turns out to be insufficient.

For this reason, introduce new hibernation callback ->leave() that will be
executed before device_power_up() by the restored image kernel.   To make it
work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c
(it's name is changed to "create_image", which is more up to the point).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:20 -07:00
Andres Salomon
8f4ce8c32f serial: turn serial console suspend a boot rather than compile time option
Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop
the serial console from being suspended when the rest of the machine goes
to sleep.  This is incredibly useful for debugging power management-related
things; however, having it as a compile-time option has proved to be
incredibly inconvenient for us (OLPC).  There are plenty of times that we
want serial console to not suspend, but for the most part we'd like serial
console to be suspended.

This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel
boot parameter (no_console_suspend).  By default, the serial console will
be suspended along with the rest of the system; by passing
'no_console_suspend' to the kernel during boot, serial console will remain
alive during suspend.

For now, this is pretty serial console specific; further fixes could be
applied to make this work for things like netconsole.

Signed-off-by: Andres Salomon <dilinger@debian.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:19 -07:00
Rafael J. Wysocki
e42837bcd3 freezer: introduce freezer-friendly waiting macros
Introduce freezer-friendly wrappers around wait_event_interruptible() and
wait_event_interruptible_timeout(), originally defined in <linux/wait.h>, to
be used in freezable kernel threads.  Make some of the freezable kernel
threads use them.

This is necessary for the freezer to stop sending signals to kernel threads,
which is implemented in the next patch.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:19 -07:00
Rafael J. Wysocki
b3dac3b304 PM: Rename hibernation_ops to platform_hibernation_ops
Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in
analogy with 'struct platform_suspend_ops'.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Rafael J. Wysocki
74f270af0c PM: Rework struct hibernation_ops
During hibernation we also need to tell the ACPI core that we're going to put
the system into the S4 sleep state.  For this reason, an additional method in
'struct hibernation_ops' is needed, playing the role of set_target() in
'struct platform_suspend_operations'.  Moreover, the role of the .prepare()
method is now different, so it's better to introduce another method, that in
general may be different from .prepare(), that will be used to prepare the
platform for creating the hibernation image (.prepare() is used anyway to
notify the platform that we're going to enter the low power state after the
image has been saved).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Rafael J. Wysocki
f242d9196f PM: Make suspend_ops static
The variable suspend_ops representing the set of global platform-specific
suspend-related operations, used by the PM core, need not be exported outside
of kernel/power/main.c .   Make it static.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Rafael J. Wysocki
e6c5eb9541 PM: Rework struct platform_suspend_ops
There is no reason why the .prepare() and .finish() methods in 'struct
platform_suspend_ops' should take any arguments, since architectures don't use
these methods' argument in any practically meaningful way (ie.  either the
target system sleep state is conveyed to the platform by .set_target(), or
there is only one suspend state supported and it is indicated to the PM core
by .valid(), or .prepare() and .finish() aren't defined at all).   There also
is no reason why .finish() should return any result.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Rafael J. Wysocki
26398a70ea PM: Rename struct pm_ops and related things
The name of 'struct pm_ops' suggests that it is related to the power
management in general, but in fact it is only related to suspend.   Moreover,
its name should indicate what this structure is used for, so it seems
reasonable to change it to 'struct platform_suspend_ops'.   In that case, the
name of the global variable of this type used by the PM core and the names of
related functions should be changed accordingly.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Rafael J. Wysocki
95d9ffbe01 PM: Move definition of struct pm_ops to suspend.h
Move the definition of 'struct pm_ops' and related functions from <linux/pm.h>
to <linux/suspend.h> .

There are, at least, the following reasons to do that:
* 'struct pm_ops' is specifically related to suspend and not to the power
  management in general.
* As long as 'struct pm_ops' is defined in <linux/pm.h>, any modification of it
  causes the entire kernel to be recompiled, which is unnecessary and annoying.
* Some suspend-related features are already defined in <linux/suspend.h>, so it
  is logical to move the definition of 'struct pm_ops' into there.
* 'struct hibernation_ops', being the hibernation-related counterpart of
  'struct pm_ops', is defined in <linux/suspend.h> .

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:18 -07:00
Aneesh Kumar K.V
d8dd0b4543 ext4: Convert ext4_extent_idx.ei_leaf to ext4_extent_idx.ei_leaf_lo
Convert ext4_extent_idx.ei_leaf  ext4_extent_idx.ei_leaf_lo
This helps in finding BUGs due to direct partial access of
these split 48 bit values.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:03 -04:00
Aneesh Kumar K.V
b377611d11 ext4: Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
Convert ext4_extent.ee_start to ext4_extent.ee_start_lo
This helps in finding BUGs due to direct partial access of
these split 48 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:03 -04:00
Aneesh Kumar K.V
308ba3ece7 ext4: Convert s_r_blocks_count and s_free_blocks_count
Convert s_r_blocks_count and s_free_blocks_count to
s_r_blocks_count_lo and s_free_blocks_count_lo

This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
6bc9feff14 ext4: Convert s_blocks_count to s_blocks_count_lo
Convert s_blocks_count to s_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
5272f83727 ext4: Convert bg_inode_bitmap and bg_inode_table
Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
and bg_inode_table_lo.  This helps in finding BUGs due to
direct partial access of these split 64 bit values

Also fix one direct partial access

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
3a14589cce ext4: Convert bg_block_bitmap to bg_block_bitmap_lo
Convert bg_block_bitmap to bg_block_bitmap_lo
This helps in catching some BUGS due to direct
partial access of these split fields.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:01 -04:00
Jose R. Santos
ce42158179 ext4: FLEX_BG Kernel support v2.
This feature relaxes check restrictions on where each block groups meta
data is located within the storage media.  This allows for the allocation
of bitmaps or inode tables outside the block group boundaries in cases
where bad blocks forces us to look for new blocks which the owning block
group can not satisfy.  This will also allow for new meta-data allocation
schemes to improve performance and scalability.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Aneesh Kumar K.V
c1bddad949 ext4: Fix sparse warnings
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Andreas Dilger
717d50e497 Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use.  This is this the most time consuming part
of the filesystem check.  The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a a high water mark of used inodes for each block
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem.  In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users.  With performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */

bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.

bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:00 -04:00
Eric Sandeen
4074fe3736 ext4: remove #ifdef CONFIG_EXT4_INDEX
CONFIG_EXT4_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext4_fs.h.  tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code.  Remove
it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:00 -04:00
Coly Li
f077d0d7ea ext4: Remove (partial, never completed) fragment support
Fragment support in ext2/3/4 was never implemented, and it probably will
never be implemented.   So remove it from ext4.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:59 -04:00
Mingming Cao
cd02ff0b14 jbd2: JBD_XXX to JBD2_XXX naming cleanup
change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:58 -04:00
Mingming Cao
2d917969bc JBD2: replace jbd_kmalloc with kmalloc directly.
This patch cleans up jbd_kmalloc and replace it with kmalloc directly

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:57 -04:00
Mingming Cao
a5005da204 JBD: replace jbd_kmalloc with kmalloc directly
This patch cleans up jbd_kmalloc and replace it with kmalloc directly

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:57 -04:00
Mingming Cao
af1e76d6b3 JBD2: jbd2 slab allocation cleanups
JBD2: Replace slab allocations with page allocations

JBD2 allocate memory for committed_data and frozen_data from slab. However
JBD2 should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:56 -04:00
Mingming Cao
c089d490df JBD: JBD slab allocation cleanups
JBD: Replace slab allocations with page allocations

JBD allocate memory for committed_data and frozen_data from slab. However
JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2007-10-17 18:49:56 -04:00
Pierre Ossman
727c26ed78 net: libertas sdio driver
Add driver for Marvell's Libertas 8385 and 8686 wifi chips.

Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Acked-by: Dan Williams <dcbw@redhat.com>
2007-10-17 22:51:13 +02:00
Linus Torvalds
e6d5a11dad Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: fix new task startup crash
  sched: fix !SYSFS build breakage
  sched: fix improper load balance across sched domain
  sched: more robust sd-sysctl entry freeing
2007-10-17 09:11:18 -07:00
Linus Torvalds
c548f08a4f Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (24 commits)
  [POWERPC] Fix vmemmap warning in init_64.c
  [POWERPC] Fix 64 bits vDSO DWARF info for CR register
  [POWERPC] Add 1TB workaround for PA6T
  [POWERPC] Enable NO_HZ and high res timers for pseries and ppc64 configs
  [POWERPC] Quieten cache information at boot
  [POWERPC] Quieten clockevent printk
  [POWERPC] Enable SLUB in *_defconfig
  [POWERPC] Fix 1TB segment detection
  [POWERPC] Fix iSeries_hpte_insert prototype
  [POWERPC] Fix copyright symbol
  [POWERPC] ibmebus: Move to of_device and of_platform_driver, match eHCA and eHEA drivers
  [POWERPC] ibmebus: Add device creation and bus probing based on of_device
  [POWERPC] ibmebus: Remove bus match/probe/remove functions
  [POWERPC] Move of_device allocation into of_device.[ch]
  [POWERPC] mpc52xx: device tree changes for FEC and MDIO
  [POWERPC] bestcomm: GenBD task support
  [POWERPC] bestcomm: FEC task support
  [POWERPC] bestcomm: ATA task support
  [POWERPC] bestcomm: core bestcomm support for Freescale MPC5200
  [POWERPC] mpc52xx: Update mpc52xx_psc structure with B revision changes
  ...
2007-10-17 09:05:55 -07:00
Linus Torvalds
5c8e191e84 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-x86setup:
  Remove magic macros for screen_info structure members
  [x86] remove uses of magic macros for boot_params access
2007-10-17 09:00:30 -07:00
Adrian Bunk
cbfee34520 security/ cleanups
This patch contains the following cleanups that are now possible:
- remove the unused security_operations->inode_xattr_getsuffix
- remove the no longer used security_operations->unregister_security
- remove some no longer required exit code
- remove a bunch of no longer used exports

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Serge E. Hallyn
b53767719b Implement file posix capabilities
Implement file posix capabilities.  This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php.  For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
	Nov 27:
	Incorporate fixes from Andrew Morton
	(security-introduce-file-caps-tweaks and
	security-introduce-file-caps-warning-fix)
	Fix Kconfig dependency.
	Fix change signaling behavior when file caps are not compiled in.

	Nov 13:
	Integrate comments from Alexey: Remove CONFIG_ ifdef from
	capability.h, and use %zd for printing a size_t.

	Nov 13:
	Fix endianness warnings by sparse as suggested by Alexey
	Dobriyan.

	Nov 09:
	Address warnings of unused variables at cap_bprm_set_security
	when file capabilities are disabled, and simultaneously clean
	up the code a little, by pulling the new code into a helper
	function.

	Nov 08:
	For pointers to required userspace tools and how to use
	them, see http://www.friedhoff.org/fscaps.html.

	Nov 07:
	Fix the calculation of the highest bit checked in
	check_cap_sanity().

	Nov 07:
	Allow file caps to be enabled without CONFIG_SECURITY, since
	capabilities are the default.
	Hook cap_task_setscheduler when !CONFIG_SECURITY.
	Move capable(TASK_KILL) to end of cap_task_kill to reduce
	audit messages.

	Nov 05:
	Add secondary calls in selinux/hooks.c to task_setioprio and
	task_setscheduler so that selinux and capabilities with file
	cap support can be stacked.

	Sep 05:
	As Seth Arnold points out, uid checks are out of place
	for capability code.

	Sep 01:
	Define task_setscheduler, task_setioprio, cap_task_kill, and
	task_setnice to make sure a user cannot affect a process in which
	they called a program with some fscaps.

	One remaining question is the note under task_setscheduler: are we
	ok with CAP_SYS_NICE being sufficient to confine a process to a
	cpuset?

	It is a semantic change, as without fsccaps, attach_task doesn't
	allow CAP_SYS_NICE to override the uid equivalence check.  But since
	it uses security_task_setscheduler, which elsewhere is used where
	CAP_SYS_NICE can be used to override the uid equivalence check,
	fixing it might be tough.

	     task_setscheduler
		 note: this also controls cpuset:attach_task.  Are we ok with
		     CAP_SYS_NICE being used to confine to a cpuset?
	     task_setioprio
	     task_setnice
		 sys_setpriority uses this (through set_one_prio) for another
		 process.  Need same checks as setrlimit

	Aug 21:
	Updated secureexec implementation to reflect the fact that
	euid and uid might be the same and nonzero, but the process
	might still have elevated caps.

	Aug 15:
	Handle endianness of xattrs.
	Enforce capability version match between kernel and disk.
	Enforce that no bits beyond the known max capability are
	set, else return -EPERM.
	With this extra processing, it may be worth reconsidering
	doing all the work at bprm_set_security rather than
	d_instantiate.

	Aug 10:
	Always call getxattr at bprm_set_security, rather than
	caching it at d_instantiate.

[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Alexey Dobriyan
57c521ce61 ifdef struct task_struct::security
For those who don't care about CONFIG_SECURITY.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
James Morris
20510f2f4e security: Convert LSM into a static interface
Convert LSM into a static interface, as the ability to unload a security
module is not required by in-tree users and potentially complicates the
overall security architecture.

Needlessly exported LSM symbols have been unexported, to help reduce API
abuse.

Parameters for the capability and root_plug modules are now specified
at boot.

The SECURITY_FRAMEWORK_VERSION macro has also been removed.

In a nutshell, there is no safe way to unload an LSM.  The modular interface
is thus unecessary and broken infrastructure.  It is used only by out-of-tree
modules, which are often binary-only, illegal, abusive of the API and
dangerous, e.g.  silently re-vectoring SELinux.

[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: USB Kconfig fix]
[randy.dunlap@oracle.com: fix LSM kernel-doc]
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:07 -07:00
Dave Hansen
ce8d2cdf3d r/o bind mounts: filesystem helpers for custom 'struct file's
Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.  In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of filesystems
writable.  It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems.  This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated.  I've been using the following script to test that the feature is
working as desired.  It takes a directory and makes a regular bind and a r/o
bind mount of it.  It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Bjorn Helgaas
402b310cb6 PNP: remove null pointer checks
Remove some null pointer checks.  Null pointers in these areas indicate
programming errors, and I think it's better to oops immediately rather than
return an error that is easily ignored.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:04 -07:00
Adrian Bunk
5ebf2c1260 bitmap.h: remove dead artifacts
bitmap_active() no longer exists and BITMAP_ACTIVE is no longer used.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:03 -07:00
Martin J. Bligh
a686cd898b ext2 reservations
Val's cross-port of the ext3 reservations code into ext2.

[mbligh@mbligh.org: Small type error for printk
[akpm@linux-foundation.org: fix types, sync with ext3]
[mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3]
[akpm@linux-foundation.org: kill noisy printk]
[akpm@linux-foundation.org: remember to dirty the gdp's block]
[akpm@linux-foundation.org: cross-port the missed 5dea5176e5]
[akpm@linux-foundation.org: cross-port e6022603b9]
[akpm@linux-foundation.org: Port the omitted 08fb306fe6]
[akpm@linux-foundation.org: Backport the missed 20acaa18d0]
[akpm@linux-foundation.org: fixes]
[cmm@us.ibm.com: fix reservation extension]
[bunk@stusta.de: make ext2_get_blocks() static]
[hugh@veritas.com: fix hang]
[hugh@veritas.com: ext2_new_blocks should reset the reservation window size]
[hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end]
[hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such]
[hugh@veritas.com: rbtree usage cleanup]
[pbadari@us.ibm.com: Fix for ext2 reservation]
[bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()]
[hugh@veritas.com: ext2 balloc: use io_error label]
Cc: "Martin J. Bligh" <mbligh@mbligh.org>
Cc: Valerie Henson <val_henson@linux.intel.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Joern Engel
1c0eeaf569 introduce I_SYNC
I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect.  One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel <joern@wohnheim.fh-wedel.de>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu
2e6883bdf4 writeback: introduce writeback_control.more_io to indicate more io
After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays.  But sometimes the following happens instead:

	- after 30s:    ~4M
	- after 5s:     ~4M
	- after 5s:     all remaining 92M

Some analyze shows that the internal io dispatch queues goes like this:

		s_io            s_more_io
		-------------------------
	1)	100M,1K         0
	2)	1K              96M
	3)	0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)

nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out.  The big dirty file is actually still sitting in s_more_io.  We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty.  It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).

We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
not necessarily mean that "all data are written".  This patch introduces a
flag writeback_control.more_io to indicate this situation.  With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.

Cc: David Chinner <dgc@sgi.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Fengguang Wu
08d8e9749e writeback: fix ntfs with sb_has_dirty_inodes()
NTFS's if-condition on dirty inodes is not complete.  Fix it with
sb_has_dirty_inodes().

Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Ken Chen
0e0f4fc22e writeback: fix periodic superblock dirty inode flushing
Current -mm tree has bucketful of bug fixes in periodic writeback path.
However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and system will accumulate large amount of
dirty pages beyond what dirty_expire_interval is designed for.

The problem is __sync_single_inode() will move an inode to sb->s_dirty list
even when there are more pending dirty pages on that inode.  If there is
another inode with a small number of dirty pages, we hit a case where the loop
iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
Thus leaving the inode that has large amount of dirty pages behind and it has
to wait for another dirty_writeback_interval before we flush it again.  We
effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
If the rate of dirtying is sufficiently high, the system will start
accumulate a large number of dirty pages.

So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:02 -07:00
Ingo Molnar
4749252776 printk: add KERN_CONT annotation
printk: add the KERN_CONT annotation (which is empty string but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).
This useful for multiple calls of hand-crafted printk output done by
early debug code or similar.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Ulrich Drepper
22d2b35b20 F_DUPFD_CLOEXEC implementation
One more small change to extend the availability of creation of file
descriptors with FD_CLOEXEC set.  Adding a new command to fcntl() requires
no new system call and the overall impact on code size if minimal.

If this patch gets accepted we will also add this change to the next
revision of the POSIX spec.

To test the patch, use the following little program.  Adjust the value of
F_DUPFD_CLOEXEC appropriately.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_DUPFD_CLOEXEC
# define F_DUPFD_CLOEXEC 12
#endif

int
main (int argc, char *argv[])
{
  if  (argc > 1)
    {
      if (fcntl (3, F_GETFD) == 0)
	{
	  puts ("descriptor not closed");
	  exit (1);
	}
      if (errno != EBADF)
	{
	  puts ("error not EBADF");
	  exit (1);
	}

      exit (0);
    }
  int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
  if (fd == -1 && errno == EINVAL)
    {
      puts ("F_DUPFD_CLOEXEC not supported");
      return 0;
    }
  if (fd != 3)
    {
      puts ("program called with descriptors other than 0,1,2");
      return 1;
    }

  execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
  puts ("execl failed");
  return 1;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Alexey Dobriyan
18796aa002 task_struct: move ->fpu_counter and ->oomkilladj
There is nice 2 byte hole after struct task_struct::ioprio field
into which we can put two 1-byte fields: ->fpu_counter and ->oomkilladj.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Davide Libenzi
96358de6bc rename signalfd_siginfo fields
For Michael Kerrisk request, the following patch renames signalfd_siginfo
fields in order to keep them consistent with the siginfo_t ones.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00
Eric Sandeen
059590f495 ext3: remove #ifdef CONFIG_EXT3_INDEX
CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
unconditionally defined in ext3_fs.h.  tune2fs is already able to turn off
dir indexing, so at this point it's just cluttering up the code.  Remove
it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:43:01 -07:00