In ia64 kernel, the O_LARGEFILE flag is forced when opening a file. This
is problematic for execution of 32 bit processes, which are not largefile
aware, either by SW emulation or by HW execution.
For such processes, the problem is two-fold:
1) When trying to open a file that is larger than 4G
the operation should fail, but it's not
2) Writing to offset larger than 4G should fail, but
it's not
The proposed patch takes advantage of the way 32 bit processes are
identified in ia64 systems. Such processes have PER_LINUX32 for their
personality. With the patch, the ia64 kernel will not enforce the
O_LARGEFILE flag if the current process has PER_LINUX32 set. The behavior
for all other architectures remains unchanged.
Signed-off-by: Yoav Zach <yoav.zach@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes O(n^2) super block loops in sync_inodes(),
sync_filesystems() etc. in favour of using __put_super_and_need_restart()
which I introduced earlier. We faced a noticably long freezes on sb
syncing when there are thousands of super blocks in the system.
Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In a duplicate of lookup_create in the af_unix code Al commented what's
going on nicely, so let's bring that over to lookup_create before the copy
is going away (I'll send a patch soon)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Henrik Grubbstrom noted:
The 2.6.10 ext3_get_parent attempts to use ext3_find_entry to look up the
entry "..", which fails for dx directories since ".." is not present in the
directory hash table. The patch below solves this by looking up the dotdot
entry in the dx_root block.
Typical symptoms of the above bug are intermittent claims by nfsd that
files or directories are missing on exported ext3 filesystems.
cf https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=3D150759 and
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=3D144556
ext3_get_parent() is IMHO the wrong place to fix this bug as it introduces
a lot of internals from htree into that function. Instead, I think this
should be fixed in ext3_find_entry() as in the below patch. This has the
added advantage that it works for any callers of ext3_find_entry() and not
just ext3_lookup_parent().
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Henrik Grubbstrom <grubba@grubba.org>
Cc: <ext2-devel@lists.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new `suid_dumpable' sysctl:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is intended
for system debugging situations only. Ptrace is unchecked.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
readable by root only. This allows the end user to remove such a dump but
not access it directly. For security reasons core dumps in this mode will
not overwrite one another or other files. This mode is appropriate when
adminstrators are attempting to debug problems in a normal environment.
(akpm:
> > +EXPORT_SYMBOL(suid_dumpable);
>
> EXPORT_SYMBOL_GPL?
No problem to me.
> > if (current->euid == current->uid && current->egid == current->gid)
> > current->mm->dumpable = 1;
>
> Should this be SUID_DUMP_USER?
Actually the feedback I had from last time was that the SUID_ defines
should go because its clearer to follow the numbers. They can go
everywhere (and there are lots of places where dumpable is tested/used
as a bool in untouched code)
> Maybe this should be renamed to `dump_policy' or something. Doing that
> would help us catch any code which isn't using the #defines, too.
Fair comment. The patch was designed to be easy to maintain for Red Hat
rather than for merging. Changing that field would create a gigantic
diff because it is used all over the place.
)
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use lookup_one_len instead of opencoding a simplified lookup using
lookup_hash with a fake hash.
Also there's no need anymore for the d_invalidate as we have a completely
valid dentry now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move some code duplicated in both callers into vfs_quota_on_mount
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Various filesystem drivers have grown a get_dentry() function that's a
duplicate of lookup_one_len, except that it doesn't take a maximum length
argument and doesn't check for \0 or / in the passed in filename.
Switch all these places to use lookup_one_len.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Greg KH <greg@kroah.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Patch to add check to get_chrdev_list and get_blkdev_list to prevent reads
of /proc/devices from spilling over the provided page if more than 4096
bytes of string data are generated from all the registered character and
block devices in a system
Signed-off-by: Neil Horman <nhorman@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on analysis and a patch from Russ Weight <rweight@us.ibm.com>
There is a race condition that can occur if an inode is allocated and then
released (using iput) during the ->fill_super functions. The race
condition is between kswapd and mount.
For most filesystems this can only happen in an error path when kswapd is
running concurrently. For isofs, however, the error can occur in a more
common code path (which is how the bug was found).
The logic here is "we want final iput() to free inode *now* instead of
letting it sit in cache if fs is going down or had not quite come up". The
problem is with kswapd seeing such inodes in the middle of being killed and
happily taking over.
The clean solution would be to tell kswapd to leave those inodes alone and
let our final iput deal with them. I.e. add a new flag
(I_FORCED_FREEING), set it before write_inode_now() there and make
prune_icache() leave those alone.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Request RDATTR_ERROR as an attribute in readdir to distinguish between a
directory being within an absent filesystem or one (or more) of its entries.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the lock blocks, the server may send us a GRANTED message that
races with the reply to our LOCK request. Make sure that we catch
the GRANTED by queueing up our request on the nlm_blocked list
before we send off the first LOCK rpc call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Basically copies the VFS's method for tracking writebacks and applies
it to the struct nfs_page.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Unless we're doing O_APPEND writes, we really don't care about revalidating
the file length. Just make sure that we catch any page cache invalidations.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of looking at whether or not the file is open for writes before
we accept to update the length using the server value, we should rather
be looking at whether or not we are currently caching any writes.
Failure to do so means in particular that we're not updating the file
length correctly after obtaining a POSIX or BSD lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we do not hold a valid stateid that is open for writes, there is little
point in doing an extra open of the file, as the RFC does not appear to
mandate this...
Make setattr use the correct stateid if we're holding mandatory byte
range locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv3 currently returns the unsigned 64-bit cookie directly to
userspace. The following patch causes the kernel to generate
loff_t offsets for the benefit of userland.
The current server-generated READDIR cookie is cached in the
nfs_open_context instead of in filp->f_pos, so we still end up work
correctly under directory insertions/deletion.
Signed-off-by: Olivier Galibert <galibert@pobox.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The changeset "trond.myklebust@fys.uio.no|ChangeSet|20050322152404|16979"
(RPC: Ensure XDR iovec length is initialized correctly in call_header)
causes the NFSv4 callback code to BUG() due to an incorrectly initialized
scratch buffer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Older gcc's don't like this.
fs/nfs/nfs4proc.c:2194: field `data' has incomplete type
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The Coverity checker noticed that such a simplification was possible.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Pointer arithmetic bug: p is in word units. This fixes a memory
corruption with big acls.
* Initialize pg_class to prevent a NULL pointer access.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attach acls to inodes in the icache to avoid unnecessary GETACL RPC
round-trips. As long as the client doesn't retrieve any acls itself, only the
default acls of exiting directories and the default and access acls of new
directories will end up in the cache, which preserves some memory compared to
always caching the access and default acl of all files.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv3 has no concept of a umask on the server side: The client applies
the umask locally, and sends the effective permissions to the server.
This behavior is wrong when files are created in a directory that has a
default ACL. In this case, the umask is supposed to be ignored, and
only the default ACL determines the file's effective permissions.
Usually its the server's task to conditionally apply the umask. But
since the server knows nothing about the umask, we have to do it on the
client side. This patch tries to fetch the parent directory's default
ACL before creating a new file, computes the appropriate create mode to
send to the server, and finally sets the new file's access and default
acl appropriately.
Many thanks to Buck Huppmann <buchk@pobox.com> for sending the initial
version of this patch, as well as for arguing why we need this change.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This adds acl support fo nfs clients via the NFSACL protocol extension, by
implementing the getxattr, listxattr, setxattr, and removexattr iops for the
system.posix_acl_access and system.posix_acl_default attributes. This patch
implements a dumb version that uses no caching (and thus adds some overhead).
(Another patch in this patchset adds caching as well.)
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This adds functions for encoding and decoding POSIX ACLs for the NFSACL
protocol extension, and the GETACL and SETACL RPCs. The implementation is
compatible with NFSACL in Solaris.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add the missing NFS3ERR_NOTSUPP error code (defined in NFSv3) to the
system-to-protocol-error table in nfsd. The nfsacl extension uses this error
code.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently we return -ENOMEM for every single failure to create a new auth.
This is actually accurate for auth_null and auth_unix, but for auth_gss it's a
bit confusing.
Allow rpcauth_create (and the ->create methods) to return errors. With this
patch, the user may sometimes see an EINVAL instead. Whee.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add nfs4_acl field to the nfs_inode, and use it to cache acls. Only cache
acls of size up to a page. Also prepare for up to a page of acl data even
when the user doesn't pass in a buffer, as when they want to get the acl
length to decide what size buffer to allocate.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side write support for NFSv4 ACLs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 acls: xdr encoding and decoding routines for
writing acls
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 ACLs. Exports the raw xdr code via the
system.nfs4_acl extended attribute. It is up to userspace to decode the acl
(and to provide correctly xdr'd acls on setxattr), and to convert to/from
POSIX ACLs if desired.
This patch provides only the read support.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 acls: xdr encoding and decoding routines for
reading acls
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make nfs4 fattr size calculations more explicit, revising them downward a
bit in the process.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add {get,set,list}xattr methods for nfs4. The new methods are no-ops, to be
used by subsequent ACL patch.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
ACL support will require supporting additional inode operations in v4
(getxattr, setxattr, listxattr). This patch allows different protocol versions
to support different inode operations by adding a file_inode_ops to the
nfs_rpc_ops (to match the existing dir_inode_ops).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we fix up the missing fields in the nfs_mount_data with
sane defaults for older versions of mount, and return errors in the
cases where we cannot.
Convert a bunch of annoying warnings into dprintks()
Return -EPROTONOSUPPORT rather than EIO if mount() tries to set NFSv3
without it actually being compiled in.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we don't create an RPC client without checking that the server
does indeed support the RPC program + version that we are trying to set up.
This enables us to immediately return an error to "mount" if it turns out
that the server is only supporting NFSv2, when we requested NFSv3 or NFSv4.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The current isofs treatment of hidden files is flawed in two ways. First,
it does not provide sufficient granularity; it hides both 'hidden' files
and 'associated' files (resource fork for Mac files). Second, the default
behavior to completely strip hidden files, while an admirable
implementation of the spec, is a poor choice given the real world use of
hidden files as a poor mans copy protection scheme for MSDOS and Windows
based systems. A longer description of this is available here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.3/0267.html
This patch was originally built after a few private conversations with Alan
Cox; I shamefully failed to persist in seeing it go forward, I hope to make
amends now.
This patch introduces granularity by allowing explicit control for both
hidden and associated files. It also reverses the default so that by
default, hidden files are treated as regular files on the iso9660 file
system.
This allow Wine to process Windows CDs, including those that are hybrid
Mac/Windows CDs properly and completely, without our having to go muck up
peoples fstabs as we do now. (I have tested this with such a hybrid +
hidden CD and have verified that this patch works as claimed).
Signed-off-by: Jeremy White <jwhite@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle the case where the variable-sized part of a rock-ridge directory entry
overhangs the end of the buffer which we allocated for it.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The silly thing does:
struct foo { ... };
...
#define foo 42
so you can no longer refer to `struct foo' in C code.
Rename the structures.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The bug in rock.c is that it's totally trusting of the contents of the
directories. If the directory says there's a continuation 10000 bytes into
this 4k block then we cheerily poke around in memory we don't own and oops.
So change rock_continue() to apply various sanity checks, at least ensuring
that the offset+length remain within the bounds for the header part of a
struct rock_ridge directory entry.
Note that the kernel can still overindex the buffer due to the variable size
of the rock-ridge directory entries. We cannot check that in rock_continue()
unless we go parse the directory entry's signature and work out its size.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
So we have a couple of rock-ridge bugs. First up, rotoroot the poor thing
into something which it is possible to work on.
Feed rock.h through Lindent, tidy a couple of things by hand.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- remove the MAYBE_CONTINUE macro
- kfree(NULL) is OK.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Remove the SETUP_ROCK_RIDGE macro.
- In rock_ridge_symlink_readpage(), rename raw_inode to raw_de. It points
at a directory entry, not an inode.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix stuff which Lindent got wrong, rework a few deeply-nested blocks.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trying to turn rock.c into something which humans can read so we can fix some
bugs.
Start out by feeding it through scripts/Lindent.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For browsable autofs maps, a mount request that arrives at the same time an
expire is happening can fail to perform the needed mount.
This happens becuase the directory exists and so the revalidate succeeds when
we need it to fail so that lookup is called on the same dentry to do the
mount. Instead lookup is called on the next path component which should be
whithin the mount, but the parent isn't mounted.
The solution is to allow the revalidate to continue and perform the mount as
no directory creation (at mount time) is needed for browsable mount entries.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At the tail end of an expire it's possible for a process to enter
autofs4_wait, with a waitq type of NFY_NONE but find that the expire is
finished. In this cause autofs4_wait will try to create a new wait but not
notify the daemon leading to a hang. As the wait type is meant to delay mount
requests from revalidate or lookup during an expire and the expire is done all
we need to do is check if the dentry is a mountpoint. If it's not then we're
done.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While this is not a solution to bind and move mounts on autofs owned
directories it is necessary to fix the trady error handling.
At least it avoids the kernel panic I observed checking out bug #4589.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a memory leak during mount when CONFIG_SECURITY is enabled and
mount options are specified.
Signed-off-by: Gerald Schaefer <geraldsc@de.ibm.com>
Acked-by: James Morris <jmorris@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0. The following patch removes the argument and updates all the
calls to try_to_free_pages.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.
The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.
The problem is twofold:
1) the free_area_cache is used to continue a search for memory where
the last search ended. Before the change new areas were always
searched from the base address on.
So now new small areas are cluttering holes of all sizes
throughout the whole mmap-able region whereas before small holes
tended to close holes near the base leaving holes far from the base
large and available for larger requests.
2) the free_area_cache also is set to the location of the last
munmap-ed area so in scenarios where we allocate e.g. five regions of
1K each, then free regions 4 2 3 in this order the next request for 1K
will be placed in the position of the old region 3, whereas before we
appended it to the still active region 1, placing it at the location
of the old region 2. Before we had 1 free region of 2K, now we only
get two free regions of 1K -> fragmentation.
The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache. If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.
The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.
Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.
Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.
Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add /proc/zoneinfo file to display information about memory zones. Useful
to analyze VM behaviour.
Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.
The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.
Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.
In the new code, there are two externally visible symbols:
- smp_processor_id(): debug variant.
- raw_smp_processor_id(): nondebug variant. Replaces all existing
uses of _smp_processor_id() and __smp_processor_id(). Defined
by every SMP architecture in include/asm-*/smp.h.
There is one new internal symbol, dependent on DEBUG_PREEMPT:
- debug_smp_processor_id(): internal debug variant, mapped to
smp_processor_id().
Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file. All related comments got updated and/or
clarified.
I have build/boot tested the following 8 .config combinations on x86:
{SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
architectures are untested, but should work just fine.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's a much smaller patch to simply disable devfs from the build. If
this goes well, and there are no complaints for a few weeks, I'll resend
my big "devfs-die-die-die" series of patches that rip the whole thing
out of the kernel tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
direct I/O locking changes it is just a wrapper around
VOP_FLUSHINVAL_PAGES, so it's not nessecary anymore. Keep a simplified
version for kernels < 2.4.22, as these don't have the changed direct I/O
locking.
SGI-PV: 938064
SGI-Modid: xfs-linux:xfs-kern:194420a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Without this change I can't set an attribute exactly PAGE_SIZE in
length. There is no need for zero termination because the interface
uses lengths.
From: Jon Smirl <jonsmirl@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
o Following patch sets the attributes for newly allocated inodes for sysfs
objects. If the object has non-default attributes, inode attributes are
set as saved in sysfs_dirent->s_iattr, pointer to struct iattr.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
o This adds ->i_op->setattr VFS method for sysfs inodes. The changed
attribues are saved in the persistent sysfs_dirent structure as a pointer
to struct iattr. The struct iattr is allocated only for those sysfs_dirent's
for which default attributes are getting changed. Thanks to Jon Smirl for
this suggestion.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
o The following patch makes sure to attach sysfs_dirent to the dentry before
allocation a new inode through sysfs_create(). This change is done as
preparatory work for implementing ->i_op->setattr() functionality for
sysfs objects.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Based on the discussion about spufs attributes, this is my suggestion
for a more generic attribute file support that can be used by both
debugfs and spufs.
Simple attribute files behave similarly to sequential files from
a kernel programmers perspective in that a standard set of file
operations is provided and only an open operation needs to
be written that registers file specific get() and set() functions.
These operations are defined as
void foo_set(void *data, u64 val); and
u64 foo_get(void *data);
where data is the inode->u.generic_ip pointer of the file and the
operations just need to make send of that pointer. The infrastructure
makes sure this works correctly with concurrent access and partial
read calls.
A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify
using the attributes.
This patch already contains the changes for debugfs to use attributes
for its internal file operations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: if attribute does not implement show or store method
read/write should return -EIO instead of 0 or -EINVAL.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The ELF core dump code has one use of off_t when writing out segments.
Some of the segments may be passed the 2GB limit of an off_t, even on a
32-bit system, so it's important to use loff_t instead. This fixes a
corrupted core dump in the bigcore test in GDB's testsuite.
Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes a data corruption error for mail delivery applications that
expect to be able to do posix locking and then append writes on NFS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
and add_to_page_cache fails.
Thanks to Shaggy for pointing out the fix.
Signed-off-by: Steve French (sfrench@us.ibm.com)
Signed-off-by: Shaggy (shaggy@us.ibm.com)
We should never apply a lookup intent to anything other than the last
path component in an open(), create() or access() call.
Introduce the helper nfs_lookup_check_intent() which always returns
zero if LOOKUP_CONTINUE or LOOKUP_PARENT are set, and returns the
intent flags if we're on the last component of the lookup.
By doing so, we fix a bug in open(O_EXCL), where we may end up
optimizing away a real lookup of the parent directory.
Problem noticed by Linda Dunaphant <linda.dunaphant@ccur.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure that binfmt_flat passes the correct flags into do_mmap(). nommu's
validate_mmap_request() will simple return -EINVAL if we try and pass it a
flags value of zero.
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__do_follow_link() passes potentially worng vfsmount to touch_atime(). It
matters only in (currently impossible) case of symlink mounted on something,
but it's trivial to fix and that actually makes more sense.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conditional mntput() moved into __do_follow_link(). There it collapses with
unconditional mntget() on the same sucker, closing another too-early-mntput()
race.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Getting rid of sloppy logics:
a) in do_follow_link() we have the wrong vfsmount dropped if our symlink
had been mounted on something. Currently it worls only because we never
get such situation (modulo filesystem playing dirty tricks on us). And
it obfuscates already convoluted logics...
b) same goes for open_namei().
c) in __link_path_walk() we have another "it should never happen" sloppiness -
out_dput: there does double-free on underlying vfsmount and leaks the covering
one if we hit it just after crossing a mountpoint. Again, wrong vfsmount
getting dropped.
d) another too-early-mntput() race - in do_follow_mount() we need to postpone
conditional mntput(path->mnt) until after dput(path->dentry). Again, this one
happens only in it-currently-never-happens-unless-some-fs-plays-dirty
scenario...
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
shifted conditional mntput() into do_follow_link() - all callers were doing
the same thing.
Obviously equivalent transformation.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In open_namei() exit_dput: we have mntput() done in the wrong order -
if nd->mnt != path.mnt we end up doing
mntput(nd->mnt);
nd->mnt = path.mnt;
dput(nd->dentry);
mntput(nd->mnt);
which drops nd->dentry too late. Fixed by having path.mnt go first.
That allows to switch O_NOFOLLOW under if (__follow_mount(...)) back
to exit_dput, while we are at it.
Fix for early-mntput() race + equivalent transformation.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In open_namei() we take mntput(nd->mnt);nd->mnt=path.mnt; out of the if
(__follow_mount(...)), making it conditional on nd->mnt != path.mnt instead.
Then we shift the result downstream.
Equivalent transformations.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In open_namei(), __follow_down() loop turned into __follow_mount().
Instead of
if we are on a mountpoint dentry
if O_NOFOLLOW checks fail
drop path.dentry
drop nd
return
do equivalent of follow_mount(&path.mnt, &path.dentry)
nd->mnt = path.mnt
we do
if __follow_mount(path) had, indeed, traversed mountpoint
/* now both nd->mnt and path.mnt are pinned down */
if O_NOFOLLOW checks fail
drop path.dentry
drop path.mnt
drop nd
return
mntput(nd->mnt)
nd->mnt = path.mnt
Now __follow_down() can be folded into follow_down() - no other callers left.
We need to reorder dput()/mntput() there - same problem as in follow_mount().
Equivalent transformation + fix for a bug in O_NOFOLLOW handling - we used to
get -ELOOP if we had the same fs mounted on /foo and /bar, had something bound
on /bar/baz and tried to open /foo/baz with O_NOFOLLOW. And fix of
too-early-mntput() race in follow_down()
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
New helper: __follow_mount(struct path *path). Same as follow_mount(), except
that we do *not* do mntput() after the first lookup_mnt().
IOW, original path->mnt stays pinned down. We also take care to do dput()
before mntput() in the loop body (follow_mount() also needs that reordering,
but that will be done later in the series).
The following are equivalent, assuming that path.mnt == x:
(1)
follow_mount(&path.mnt, &path.dentry)
(2)
__follow_mount(&path);
if (path->mnt != x)
mntput(x);
(3)
if (__follow_mount(&path))
mntput(x);
Callers of follow_mount() in __link_path_walk() converted to (2).
Equivalent transformation + fix for too-late-mntput() race in __follow_mount()
loop.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In open_namei() we never use path.mnt or path.dentry after exit: or ok:.
Assignment of path.dentry in case of LAST_BIND is dead code and only
obfuscates already convoluted function; assignment of path.mnt after
__do_follow_link() can be moved down to the place where we set path.dentry.
Obviously equivalent transformations, just to clean the air a bit in that
region.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The first argument of __do_follow_link() switched to struct path *
(__do_follow_link(path->dentry, ...) -> __do_follow_link(path, ...)).
All callers have the same calls of mntget() right before and dput()/mntput()
right after __do_follow_link(); these calls have been moved inside.
Obviously equivalent transformations.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mntget(path->mnt) in do_follow_link() moved down to right before the
__do_follow_link() call and rigth after loop: resp.
dput()+mntput() on non-ELOOP branch moved up to right after __do_follow_link()
call.
resulting
loop:
mntget(path->mnt);
path_release(nd);
dput(path->mnt);
mntput(path->mnt);
replaced with equivalent
dput(path->mnt);
path_release(nd);
Equivalent transformations - the reason why we have that mntget() is that
__do_follow_link() can drop a reference to nd->mnt and that's what holds
path->mnt. So that call can happen at any point prior to __do_follow_link()
touching nd->mnt. The rest is obvious.
NOTE: current tree relies on symlinks *never* being mounted on anything. It's
not hard to get rid of that assumption (actually, that will come for free
later in the series). For now we are just not making the situation worse than
it is.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fix for too early mntput() in open_namei() - we pin path.mnt down for the
duration of __do_follow_link(). Otherwise we could get the fs where our
symlink lived unmounted while we were in __do_follow_link(). That would end
up with dentry of symlink staying pinned down through the fs shutdown.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
path.mnt in open_namei() set to mirror nd->mnt.
nd->mnt is set in 3 places in that function - path_lookup() in the beginning,
__follow_down() loop after do_last: and __do_follow_link() call after
do_link:.
We set path.mnt to nd->mnt after path_lookup() and __do_follow_link(). In
__follow_down() loop we use &path.mnt instead of &nd->mnt and set nd->mnt to
path.mnt immediately after that loop.
Obviously equivalent transformation.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replaced struct dentry *dentry in namei with struct path path. All uses of
dentry replaced with path.dentry there.
Obviously equivalent transformation.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All callers of do_follow_link() do mntget() right before it and
dput()+mntput() right after. These calls are moved inside do_follow_link()
now.
Obviously equivalent transformation.
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OK, here comes a patch series that hopefully should close all
too-early-mntput() races in fs/namei.c. Entire area is convoluted as hell, so
I'm splitting that series into _very_ small chunks.
Patches alread in the tree close only (very wide) races in following symlinks
(see "busy inodes after umount" thread some time ago). Unfortunately, quite a
few narrower races of the same nature were not closed. Hopefully this should
take care of all of them.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When fsync() runs wait_on_page_writeback_range() it only inspects pages which
are actually under I/O (PAGECACHE_TAG_WRITEBACK). If a page completed I/O
prior to wait_on_page_writeback_range() looking at it, it is supposed to have
recorded its I/O error state in the address_space.
But mpage_mpage_end_io_write() forgot to set the address_space error flag in
this case.
Signed-off-by: Qu Fuping <fs@ercist.iscas.ac.cn>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fs/jfs/jfs_logmgr.c: In function `jfs_flush_journal':
fs/jfs/jfs_logmgr.c:1632: warning: unused variable `mp'
Some debug code in jfs_flush_journal does nothing when CONFIG_JFS_DEBUG
is not defined. Place the whole code segment within an ifdef to avoid
unnecessary code to be compiled and the warning to be issued.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Fix a bug in list scanning that can cause us to skip the last buffer on the
checkpoint list (and hence fail to do any progress under some rather
unfavorable conditions).
The problem is we first do jh=next_jh and then test
} while (jh!=last_jh);
Hence we skip the last buffer on the list (if it was not the only buffer on
the list). As we already do jh=next_jh; in the beginning of the loop we
are safe to just remove the assignment in the end. It can happen that 'jh'
will be freed at the point we test jh != last_jh but that does not matter
as we never *dereference* the pointer.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix possible false assertion failure in log_do_checkpoint(). We might fail
to detect that we actually made a progress when cleaning up the checkpoint
lists if we don't retry after writing something to disk. The patch was
confirmed to fix observed assertion failures for several users.
When we flushed some buffers we need to retry scanning the list.
Otherwise we can fail to detect our progress.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
add_missing_indices() must set tlck->type to tlckBTROOT when modifying
a root btree root to avoid a trap in txRelease()
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
This cleans up the /proc/device-tree representation of the Open Firmware
device-tree on ppc and ppc64. It does the following things:
- Workaround an issue in some Apple device-trees where a property may
exist with the same name as a child node of the parent. We now
simply "drop" the property instead of creating duplicate entries in
/proc with random result...
- Do not try to chop off the "@0" at the end of a node name whose unit
address is 0. This is not useful, inconsistent, and the code was
buggy and didn't always work anyway.
- Do not create symlinks for the short name and unit address parts of a
node. These were never really used, bloated the memory footprint of
the device-tree with useless struct proc_dir_entry and their matching
dentry and inode cache bloat.
This results in smaller code, smaller memory footprint, and a more
accurate view of the tree presented to userland.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
in fs/udf/udftime.c the global array '__mon_yday' is not static, and it
conflicts with the glibc one when the kernel is compiled as user mode.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>