Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().
Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This one was pointed out on the MOKB site:
http://kernelfun.blogspot.com/2006/11/mokb-09-11-2006-linux-26x-ext2checkpage.html
If a directory's i_size is corrupted, ext2_find_entry() will keep
processing pages until the i_size is reached, even if there are no more
blocks associated with the directory inode. This patch puts in some
minimal sanity-checking so that we don't keep checking pages (and issuing
errors) if we know there can be no more data to read, based on the block
count of the directory inode.
This is somewhat similar in approach to the ext3 patch I sent earlier this
year.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When igrab() is calling __iget() on an inode it should check if
clear_inode() has been called on the inode already. Otherwise there is a
race window between clear_inode() and destroy_inode() where igrab() calls
__iget() which leads to already free inodes on the inode lists.
Signed-off-by: Vandana Rungta <vandana@novell.com>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I added IS_NOATIME(inode) macro definition in include/linux/fs.h, true if
the inode superblock is marked readonly or noatime.
This new macro is then used in touch_atime() instead of separatly testing
MS_RDONLY and MS_NOATIME
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As pointed out by Hugh, ramfs would also benefit from using the new
set_page_dirty aop method for memory backed file systems.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Values are available via ZVC sums.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some partitioning systems create special partitions that
span the entire disk. One example are Sun partitions, and
this whole-disk partition exists to tell the firmware the
extent of the entire device so it can load the boot block
and do other things.
Such partitions should not be treated as normal partitions,
because all the other partitions overlap this whole-disk one.
So we'd see multiple instances of the same UUID etc. which
we do not want. udev and friends can thus search for this
'whole_disk' attribute and use it to decide to ignore the
partition.
Signed-off-by: Fabio Massimo Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kmap() is inefficient and does not scale well. kmap_atomic() is a better
choice. Use the generic wrapper function instead of open coding the
kmap-memset-dcache flush-kunmap stuff.
SGI-PV: 960904
SGI-Modid: xfs-linux-melb:xfs-kern:28041a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960897
SGI-Modid: xfs-linux-melb:xfs-kern:28038a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
It makes it incrementally clearer to read the code when the top of a macro
spaghetti-pile only receives the 3 arguments it uses, rather than 2 extra
ones which are not used. Also when you start pulling this thread out of
the sweater (i.e. remove unused args from XFS_BTREE_*_ADDR), a couple
other third arms etc fall off too. If they're not used in the macro, then
they sometimes don't need to be passed to the function calling the macro
either, etc....
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960197
SGI-Modid: xfs-linux-melb:xfs-kern:28037a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
xfs_mac.h and xfs_cap.h provide definitions and macros that aren't used
anywhere in XFS at all. They are left-overs from "to be implement at some
point in the future" functionality that Irix XFS has. If this
functionality ever goes into Linux, it will be provided at a different
layer, most likely through the security hooks in the kernel so we will
never need this functionality in XFS.
Patch provided by Eric Sandeen (sandeen@sandeen.net).
SGI-PV: 960895
SGI-Modid: xfs-linux-melb:xfs-kern:28036a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Fixes a few small issues (mostly cosmetic) that were picked up during the
review cycle for the last set of freeze path changes.
SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28035a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The firstblock argument to xfs_bmap_finish is not used by that function.
Remove it and cleanup the code a bit.
Patch provided by Eric Sandeen.
SGI-PV: 960196
SGI-Modid: xfs-linux-melb:xfs-kern:28034a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Use the the generic VFS attr flags where appropriate instead of open
coding them to the same values.
Patch provided by Eric Sandeen.
SGI-PV: 960868
SGI-Modid: xfs-linux-melb:xfs-kern:28033a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
wake_up's implementation does an implicit memory barrier so the explicit
memory barrier is not needed in vfs_sync_worker.
Patch provided by Ralf Baechle.
SGI-PV: 960867
SGI-Modid: xfs-linux-melb:xfs-kern:28032a
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Removes unneeded sysctl insert at head behaviour. Cleans up sysctl
definitions to use C99 initialisers. Patch provided by Eric W. Biederman.
SGI-PV: 960192
SGI-Modid: xfs-linux-melb:xfs-kern:28031a
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The problem is the two callers of xfs_iozero() are rounding out the range
to be zeroed to the end of a fsb and in some cases this extends past the
new eof. The call to commit_write() in xfs_iozero() will cause the Linux
inode's file size to be set too high.
SGI-PV: 960788
SGI-Modid: xfs-linux-melb:xfs-kern:28013a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
record.
The current Linux XFS freeze code is a mess. We flush the metadata buffers
out while we are still allowing new transactions to start and then fail to
flush the dirty buffers back out before writing the unmount and dummy
records to the log.
This leads to problems when the frozen filesystem is used for snapshots -
we do log recovery on a readonly image and often it appears that the log
image in the snapshot is not correct. Hence we end up with hangs, oops and
mount failures when trying to mount a snapshot image that has been created
when the filesystem has not been correctly frozen.
To fix this, we need to move th metadata flush to after we wait for all
current transactions to complete in teh second stage of the freeze. This
means that when we write the final log records, the log should be clean
and recovery should never occur on a snapshot image created from a frozen
filesystem.
SGI-PV: 959267
SGI-Modid: xfs-linux-melb:xfs-kern:28010a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
When writing less than a filesystem block of data into an unwritten extent
via buffered I/O, __xfs_get_blocks fails to set the buffer new flag. As a
result, the generic code will not zero either edge of the block resulting
in garbage being written to disk either side of the real data. Set the
buffer new state on bufferd writes to unwritten extents to ensure that
zeroing occurs.
SGI-PV: 960328
SGI-Modid: xfs-linux-melb:xfs-kern:28000a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
After filesystem recovery the superblock is re-read to bring in any
changes. If the per-cpu superblock counters are not re-initialized from
the superblock then the next time the per-cpu counters are disabled they
might overwrite the global counter with a bogus value.
SGI-PV: 957348
SGI-Modid: xfs-linux-melb:xfs-kern:27999a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27940a
Signed-off-by: Kevin Jamieson <kjamieson@bycast.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The block reservation mechanism has been broken since the per-cpu
superblock counters were introduced. Make the block reservation code work
with the per-cpu counters by syncing the counters, snapshotting the amount
of available space and then doing a modifcation of the counter state
according to the result. Continue in a loop until we either have no space
available or we reserve some space.
SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27895a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The free block modification code has a 32bit interface, limiting the size
the filesystem can be grown even on 64 bit machines. On 32 bit machines,
there are other 32bit variables in transaction structures and interfaces
that need to be expanded to allow this to work.
SGI-PV: 959978
SGI-Modid: xfs-linux-melb:xfs-kern:27894a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
functions, but they
a) ignore the flags parameter completely, and b) are never called
directly, only via the flag-less defines anyway
So, drop the #define indirection, and rename mraccessf to mraccess, etc.
SGI-PV: 959138
SGI-Modid: xfs-linux-melb:xfs-kern:27711a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The existing per-cpu superblock counter code uses the global superblock
spin lock when we approach ENOSPC for global synchronisation. On larger
machines than this code was originally tested on this can still get
catastrophic spinlock contention due increasing rebalance frequency near
ENOSPC.
By introducing a sleeping lock that is used to serialise balances and
modifications near ENOSPC we prevent contention from needlessly from
wasting the CPU time of potentially hundreds of CPUs.
To reduce the number of balances occuring, we separate the need rebalance
case from the slow allocate case. Now, a counter running dry will trigger
a rebalance during which counters are disabled. Any thread that sees a
disabled counter enters a different path where it waits on the new mutex.
When it gets the new mutex, it checks if the counter is disabled. If the
counter is disabled, then we _know_ that we have to use the global counter
and lock and it is safe to do so immediately. Otherwise, we drop the mutex
and go back to trying the per-cpu counters which we know were re-enabled.
SGI-PV: 952227
SGI-Modid: xfs-linux-melb:xfs-kern:27612a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
gcc-4.1 and more recent aggressively inline static functions which
increases XFS stack usage by ~15% in critical paths. Prevent this from
occurring by adding noinline to the STATIC definition.
Also uninline some functions that are too large to be inlined and were
causing problems with CONFIG_FORCED_INLINING=y.
Finally, clean up all the different users of inline, __inline and
__inline__ and put them under one STATIC_INLINE macro. For debug kernels
the STATIC_INLINE macro uninlines those functions.
SGI-PV: 957159
SGI-Modid: xfs-linux-melb:xfs-kern:27585a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The {test,set,clear}_bit() operations take a bit index for the bit to
operate on. The XBT_* flags are defined as bit fields which is incorrect,
not to mention the way the bit fields are enumerated is broken too. This
was only working by chance.
Fix the definitions of the flags and make the code using them use the
{test,set,clear}_bit() operations correctly.
SGI-PV: 958639
SGI-Modid: xfs-linux-melb:xfs-kern:27565a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The message buffer used by cmn_err() is only 256 bytes and some CXFS
messages were exceeding this length. Since we were using vsprintf() and
not checking for buffer overruns we were clobbering memory beyond the
buffer. The size of the buffer has been increased to 1024 bytes so we can
capture these larger messages and we are now using vsnprintf() to prevent
overrunning the buffer size.
SGI-PV: 958599
SGI-Modid: xfs-linux-melb:xfs-kern:27561a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
At the last stage of a freeze, we flush the buftarg synchronously over and
over again until it succeeds twice without skipping any buffers.
The delwri list flush skips pinned buffers, but tries to flush all others.
It removes the buffers from the delwri list, then tries to lock them one
at a time as it traverses the list to issue the I/O. It holds them locked
until we issue all of the I/O and then unlocks them once we've waited for
it to complete.
The problem is that during a freeze, the filesystem may still be doing
stuff - like flushing delalloc data buffers - in the background and hence
we can be trying to lock buffers that were on the delwri list at the same
time. Hence we can get ABBA deadlocks between threads doing allocation and
the buftarg flush (freeze) thread.
Fix it by skipping locked (and pinned) buffers as we traverse the delwri
buffer list.
SGI-PV: 957195
SGI-Modid: xfs-linux-melb:xfs-kern:27535a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
The XFS quiet mount logic was inverted making quiet mounts noisy and vice
versa. Fix it.
SGI-PV: 958469
SGI-Modid: xfs-linux-melb:xfs-kern:27520a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
remap() the region we get from mmap() to mark the fact that we are
using all of the available slack space. Any slack space is used
to form a simple brk region, and potentially more stack space than
requested at load time.
Any searches of the vma chain may well fail looking for
stack (and especially arg) addresses if the remaping is not done.
The simplest example is /proc/<pid>/cmdline, since the args
are pretty much always at the top of the data/bss/stack region.
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a double free of "dfid" introduced by commit
da977b2c7e and spotted by the Coverity
checker.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__unmap_hugepage_range() is buggy that it does not preserve dirty state of
huge_pte when unmapping hugepage range. It causes data corruption in the
event of dop_caches being used by sys admin. For example, an application
creates a hugetlb file, modify pages, then unmap it. While leaving the
hugetlb file alive, comes along sys admin doing a "echo 3 >
/proc/sys/vm/drop_caches".
drop_pagecache_sb() will happily free all pages that aren't marked dirty if
there are no active mapping. Later when application remaps the hugetlb
file back and all data are gone, triggering catastrophic flip over on
application.
Not only that, the internal resv_huge_pages count will also get all messed
up. Fix it up by marking page dirty appropriately.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a fix of regression, which triggered by ~2.6.16.
Patch with name ufs-directory-and-page-cache-from-blocks-to-pages.patch: in
additional to conversation from block to page cache mechanism added new
checks of directory integrity, one of them that directory entry do not
across directory chunks.
But some kinds of UFS: OpenStep UFS and Apple UFS (looks like these are the
same filesystems) have different directory chunk size, then common
UFSes(BSD and Solaris UFS).
So this patch adds ability to works with variable size of directory chunks,
and set it for ufstype=openstep to right size.
Tested on darwin ufs.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nowadays MTD supports an MTD_OOB_AUTO option which allows users
to access free bytes in NAND's OOB as a contiguous buffer, although
it may be highly discontinuous.
This patch teaches JFFS2 to use this nice feature instead of the
old MTD_OOB_PLACE option. This for example caused problems with
OneNAND. Now JFFS2 does not care how are the free bytes situated.
This may change position of the clean marker on some flashes,
but this is not a problem. JFFS2 will just re-erase the empty
eraseblocks and write the new (correct) clean marker.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
If jffs2_sum_init() fails, c->blocks is not freed neither in
jffs2_do_mount_fs() nor in jffs2_do_fill_super().
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/mfasheh/ocfs2: (22 commits)
configfs: Zero terminate data in configfs attribute writes.
[PATCH] ocfs2 heartbeat: clean up bio submission code
ocfs2: introduce sc->sc_send_lock to protect outbound outbound messages
[PATCH] ocfs2: drop INET from Kconfig, not needed
ocfs2_dlm: Add timeout to dlm join domain
ocfs2_dlm: Silence some messages during join domain
ocfs2_dlm: disallow a domain join if node maps mismatch
ocfs2_dlm: Ensure correct ordering of set/clear refmap bit on lockres
ocfs2: Binds listener to the configured ip address
ocfs2_dlm: Calling post handler function in assert master handler
ocfs2: Added post handler callable function in o2net message handler
ocfs2_dlm: Cookies in locks not being printed correctly in error messages
ocfs2_dlm: Silence a failed convert
ocfs2_dlm: wake up sleepers on the lockres waitqueue
ocfs2_dlm: Dlm dispatch was stopping too early
ocfs2_dlm: Drop inflight refmap even if no locks found on the lockres
ocfs2_dlm: Flush dlm workqueue before starting to migrate
ocfs2_dlm: Fix migrate lockres handler queue scanning
ocfs2_dlm: Make dlmunlock() wait for migration to complete
ocfs2_dlm: Fixes race between migrate and dirty
...
Attributes in configfs are text files. As such, most handlers expect to be
able to call functions like simple_strtoul() without checking the bounds
of the buffer. Change the call to zero terminate the buffer before calling
the client's ->store() method. This does reduce the attribute size from
PAGE_SIZE to PAGE_SIZE-1.
Also, change get_zeroed_page() to alloc_page(), as we are handling the
termination.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
As was already pointed out Mathieu Avila on Thu, 07 Sep 2006 03:15:25 -0700
that OCFS2 is expecting bio_add_page() to add pages to BIOs in an easily
predictable manner.
That is not true, especially for devices with own merge_bvec_fn().
Therefore OCFS2's heartbeat code is very likely to fail on such devices.
Move the bio_put() call into the bio's bi_end_io() function. This makes the
whole idea of trying to predict the behaviour of bio_add_page() unnecessary.
Removed compute_max_sectors() and o2hb_compute_request_limits().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
When there is a lot of multithreaded I/O usage, two threads can collide
while sending out a message to the other nodes. This is due to the lack of
locking between threads while sending out the messages.
When a connected TCP send(), sendto(), or sendmsg() arrives in the Linux
kernel, it eventually comes through tcp_sendmsg(). tcp_sendmsg() protects
itself by acquiring a lock at invocation by calling lock_sock().
tcp_sendmsg() then loops over the buffers in the iovec, allocating
associated sk_buff's and cache pages for use in the actual send. As it does
so, it pushes the data out to tcp for actual transmission. However, if one
of those allocation fails (because a large number of large sends is being
processed, for example), it must wait for memory to become available. It
does so by jumping to wait_for_sndbuf or wait_for_memory, both of which
eventually cause a call to sk_stream_wait_memory(). sk_stream_wait_memory()
contains a code path that calls sk_wait_event(). Finally, sk_wait_event()
contains the call to release_sock().
The following patch adds a lock to the socket container in order to
properly serialize outbound requests.
From: Zhen Wei <zwei@novell.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
OCFS2: drop 'depends on INET' since local mounts are now allowed.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Currently the ocfs2 dlm has no timeout during dlm join domain. While this is
not a problem in normal operation, this does become an issue if, say, the
other node is refusing to let the node join the domain because of a stuck
recovery. This patch adds a 90 sec timeout.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
These messages can easily be activated using the mlog infrastructure
and don't need to be enabled by default.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
There is a small window where a joining node may not see the node(s) that
just died but are still part of the domain. To fix this, we must disallow
join requests if the joining node has a different node map.
A new field node_map is added to dlm_query_join_request to send the current
nodes nodemap along with join request. On the receiving end the nodes that
are part of the cluster verifies if this new node sees all the nodes that
are still part of the cluster. They disallow the join if the maps mismatch.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Eventhough the set refmap bit message is sent before the clear refmap
message, currently there is no guarentee that the set message will be
handled before the clear. This patch prevents the clear refmap to be
processed while the node is sending assert master messages to other
nodes. (The set refmap message is sent as a response to the assert
master request).
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch binds the o2net listener to the configured ip address
instead of INADDR_ANY for security. Fixes oss.oracle.com bugzilla#814.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch prevents the dlm from sending the clear refmap message
before the set refmap. We use the newly created post function handler
routine to accomplish the task.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Currently o2net allows one handler function per message type. This
patch adds the ability to call another function to be called after
the handler has returned the message to the other node.
Handlers are now given the option of returning a context (in the form of a
void **) which will be passed back into the post message handler function.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The dlm encodes the node number and a sequence number in the lock cookie.
It also stores the cookie in the lockres in the big endian format to avoid
swapping 8 bytes on each lock request. The bug here was that it was assuming
the cookie to be in the cpu format when decoding it for printing the error
message. This patch swaps the bytes before the print.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
When the lockres is in migrate or recovery state, all convert requests
are denied with the appropriate error status that is handled on the
requester node. This patch silences the erroneous error message printed
on the master node.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The dlm was not waking up threads waiting on the lockres wait queue,
waiting for the lockres to be no longer be in the DLM_LOCK_RES_IN_PROGRESS
and the DLM_LOCK_RES_MIGRATING states.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
dlm_dispatch_work was not processing the queued up tasks at
the first sign of the node leaving the domain leading to not
only incompleted tasks but also a mismatch in the dlm refcnt.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This is to prevent the condition in which a previously queued
up assert master asserts after we start the migration. Now
migration ensures the workqueue is flushed before proceeding
with migrating the lock to another node. This condition is
typically encountered during parallel umounts.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The migrate lockres handler was only searching for its lock on
migrated lockres on the expected queue. This could be problematic
as the new master could have also issued a convert request
during the migration and thus moved the lock to the convert queue.
We now search for the lock on all three queues.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <Sunil.Mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
dlmunlock() was not waiting for migration to complete before releasing locks
on locally mastered locks.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <Sunil.Mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
dlmthread was removing lockres' from the dirty list
and resetting the dirty flag before shuffling the list.
This patch retains the dirty state flag until the lists
are shuffled.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Sunil Mushran <Sunil.Mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch makes some needlessly global functions static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This was previously broken and migration of some locks had to be temporarily
disabled. We use a new (and backward-incompatible) set of network messages
to account for all references to a lock resources held across the cluster.
once these are all freed, the master node may then free the lock resource
memory once its local references are dropped.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*.
What I want is a separate /sys/class/net directory in sysfs for each
network namespace, and I want to name each of them /sys/class/net.
I looked and the VFS actually allows that. All that is needed is
for /sys/class/net to implement a follow link method to redirect
lookups to the real directory you want.
Implementing a follow link method that is sensitive to the current
network namespace turns out to be 3 lines of code so it looks like a
clean approach. Modifying sysfs so it doesn't get in my was is a bit
trickier.
I am calling the concept of multiple directories all at the same path
in the filesystem shadow directories. With the directory entry really
at that location the shadow master.
The following patch modifies sysfs so it can handle a directory
structure slightly different from the kobject tree so I can implement
the shadow directories for handling /sys/class/net/.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
if a driver returns an error in fill_read_buffer(), the buffer will be
marked as filled. Subsequent reads will return eof. But there is
no data because of an error, not because it has been read.
Not marking the buffer filled is the obvious fix.
Signed-off-by: Oliver Neukum <oliver@neukum.name>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch prevents a race between IO and removing a file from sysfs.
It introduces a list of sysfs_buffers associated with a file at the inode.
Upon removal of a file the list is walked and the buffers marked orphaned.
IO to orphaned buffers fails with -ENODEV. The driver can safely free
associated data structures or be unloaded.
Signed-off-by: Oliver Neukum <oliver@neukum.name>
Acked-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If we allow NULL as the new parent in device_move(), we need to make sure
that the device is placed into the same place as it would if it was
newly registered:
- Consider the device virtual tree. In order to be able to reuse code,
setup_parent() has been tweaked a bit.
- kobject_move() can fall back to the kset's kobject.
- sysfs_move_dir() uses the sysfs root dir as fallback.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
JFS: Remove incorrect kgdb define
JFS: call io_schedule() instead of schedule() to avoid deadlock
JFS: Add lockdep annotations
JFS: Avoid BUG() on a damaged file system
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (57 commits)
[GFS2] make gfs2_writepages() static
[GFS2] Unlock page on prepare_write try lock failure
[GFS2] nfsd readdirplus assertion failure
[DLM] fix softlockup in dlm_recv
[DLM] zero new user lvbs
[DLM/GFS2] indent help text
[GFS2] Fix unlink deadlocks
[GFS2] Put back semaphore to avoid umount problem
[GFS2] more CURRENT_TIME_SEC
[GFS2/DLM] fix GFS2 circular dependency
[GFS2/DLM] use sysfs
[GFS2] make lock_dlm drop_count tunable in sysfs
[GFS2] increase default lock limit
[GFS2] Fix list corruption in lops.c
[GFS2] Fix recursive locking attempt with NFS
[DLM] can miss clearing resend flag
[DLM] saved dlm message can be dropped
[DLM] Make sock_sem into a mutex
[GFS2] Fix typo in glock.c
[GFS2] use CURRENT_TIME_SEC instead of get_seconds in gfs2
...
On Mon, Jan 29, 2007 at 08:45:28PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc6-mm2:
>...
> git-gfs2-nmw.patch
>...
> git trees
>...
This patch makes the needlessly global gfs2_writepages() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the try lock of the glock failed in prepare_write we were
incorrectly exiting this function with the page still locked.
This was resulting in further I/O to this page hanging.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Minor cleanup
[CIFS] Missing free in error path
[CIFS] Reduce cifs stack space usage
[CIFS] lseek polling returned stale EOF
Glock assertion failure found in '07 NFS connectathon. One of the NFSDs
is doing a "readdirplus" procedure call. It passes the logic into
gfs2_readdir() where it obtains its directory inode glock. This is then
followed by filehandle construction that invokes lookup code. It hits
the assertion failure while trying to obtain the inode glock again
inside gfs2_drevalidate().
This patch bypasses the recursive glock call if caller already holds the
lock.
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch stops the dlm_recv workqueue from busy-waiting when a node
disconnects. This can cause soft lockup errors on debug systems and bad
performance generally.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A new lvb for a userland lock wasn't being initialized to zero.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Move the glock acquisition to outside of the transactions.
Lock odering must be preserved in order to prevent ABBA
deadlocks. The current gfs2_change_nlink code would tries
to grab the glock after having started a transaction and thus is holding
the log lock. This is inconsistent with other code paths in
gfs that grab the resource group glock prior to staring
a tranactions.
One problem with this fix is that the resource group
lock is always grabbed now even if the inode still has
ref count and can not be marked for unlink.
Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Dave Teigland fixed this bug a while back, but I managed to mistakenly
remove the semaphore during later development. It is required to avoid
the list of inodes changing during an invalidate_inodes call. I have
made it an rwsem since the read side will be taken frequently during
normal filesystem operation. The write site will only happen during
umount of the file system.
Also the bug only triggers when using the DLM lock manager and only then
under certain conditions as its timing related.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Whoops, quilt user error, missed this one in the previous patch.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
On Sun, Jan 28, 2007 at 11:08:18AM +0100, Jiri Slaby wrote:
> Andrew Morton napsal(a):
> >Temporarily at
> >
> > http://userweb.kernel.org/~akpm/2.6.20-rc6-mm1/
>
> Unable to select IPV6. Menuconfig doesn't offer it when INET is selected.
> When it's not it appears in the menu, but after state change it gets away.
> The same behaviour in xconfig, gconfig.
>
> $ mkdir ../a/tst
> $ make O=../a/tst menuconfig
> HOSTCC scripts/basic/fixdep
> [...]
> HOSTLD scripts/kconfig/mconf
> scripts/kconfig/mconf arch/i386/Kconfig
> Warning! Found recursive dependency: INET GFS2_FS_LOCKING_DLM SYSFS
> OCFS2_FS INET
>
> Maybe this is the problem?
Yes, patch below.
> regards,
cu
Adrian
<-- snip -->
This patch fixes a circular dependency by letting GFS2_FS_LOCKING_DLM
and DLM depend on instead of select SYSFS.
Since SYSFS depends on EMBEDDED this change shouldn't cause any problems
for users.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With CONFIG_DLM=m, CONFIG_PROC_FS=n, and CONFIG_SYSFS=n, kernel build
fails with:
WARNING: "kernel_subsys" [fs/gfs2/locking/dlm/lock_dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/dlm/dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/configfs/configfs.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
Since fs/dlm/lockspace.c and fs/gfs2/locking/dlm/sysfs.c use
kernel_subsys, they should either DEPEND on it or SELECT it.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We want to be able to change or disable the default drop_count (number at
which the dlm asks gfs to limit the the number of locks it's holding).
Add it to the collection of sysfs tunables for an fs.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Increase the number of locks at which point the dlm begins asking gfs to
reduce its lock usage. The default value is largely arbitrary, but the
current value of 50,000 ends up limiting performance unnecessarily for too
many users.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The patch below appears to fix the list corruption that we are seeing on
occasion. Although the transaction structure is private to a single
thread, when the queued structures are dismantled during an in-core
commit, its possible for a different thread to be trying to add the same
structure to another, new, transaction at the same time.
To avoid this, this patch takes the log spinlock during this operation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In certain cases, its possible for NFS to call the lookup code while
holding the glock (when doing a readdirplus operation) so we need to
check for that and not try and lock the glock twice. This also fixes a
typo in a previous NFS related GFS2 patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A long, complicated sequence of events, beginning with the RESEND flag not
being cleared on an lkb, can result in an unlock never completing.
- lkb on waiters list for remote lookup
- the remote node is both the dir node and the master node, so
it optimizes the lookup into a request and sends a request
reply back
- the request reply is saved on the requestqueue to be processed
after recovery
- recovery runs dlm_recover_waiters_pre() which sets RESEND flag
so the lookup will be resent after recovery
- end of recovery: process_requestqueue takes saved request reply
which removes the lkb off the waitesr list, _without_ clearing
the RESEND flag
- end of recovery: dlm_recover_waiters_post() doesn't do anything
with the now completed lookup lkb (would usually clear RESEND)
- later, the node unmounts, unlocks this lkb that still has RESEND
flag set
- the lkb is on the waiters list again, now for unlock, when recovery
occurs, dlm_recover_waiters_pre() shows the lkb for unlock with RESEND
set, doesn't do anything since the master still exists
- end of recovery: dlm_recover_waiters_post() takes this lkb off
the waiters list because it has the RESEND flag set, then reports
an error because unlocks are never supposed to be handled in
recover_waiters_post().
- later, the unlock reply is received, doesn't find the lkb on
the waiters list because recover_waiters_post() has wrongly
removed it.
- the unlock operation has been lost, and we're left with a
stray granted lock
- unmount spins waiting for the unlock to complete
The visible evidence of this problem will be a node where gfs umount is
spinning, the dlm waiters list will be empty, and the dlm locks list will
show a granted lock.
The fix is simply to clear the RESEND flag when taking an lkb off the
waiters list.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
dlm_receive_message() returns 0 instead of returning 'error'. What would
happen is that process_requestqueue would take a saved message off the
requestqueue and call receive_message on it. receive_message would then
see that recovery had been aborted, set error to EINTR, and 'goto out',
expecting that the error would be returned. Instead, 0 was always
returned, so process_requestqueue would think that the message had been
processed and delete it instead of saving it to process next time. This
means the message (usually an unlock in my tests) would be lost.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Now that there can be multiple dlm_recv threads running we need to prevent two
recvs running for the same connection - it's unlikely but it can happen and it
causes message corruption.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I was looking something else up and came across this...
I don't honestly have a good reason to change it other than to make it
like every other Linux filesystem in this regard. ;-) It doesn't
functionally change anything, but makes some lines shorter. :)
I'm also curious; why does gfs2 have 64-bits of on-disk timestamps, but
not in timespec_t format, and only stores second resolutions? Seems like
you're halfway to sub-second resolutions already.
I suppose if that gets implemented then all of the below should
instead be CURRENT_TIME not CURRENT_TIME_SEC.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This function is not longer required since we do not do recursive
locking in the glock layer. As a result all its callers can be
replaceed with list_empty() calls.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a bug whereby data on a newly accepted connection would be
ignored if it arrived soon after the accept.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch doesn't make any changes to the ordering of the various
operations related to glocking, but it does tidy up the calls to the
glops.c functions to make the structure more obvious.
The two functions: gfs2_glock_xmote_th() and gfs2_glock_drop_th() can be
made static within glock.c since they are called by every set of glock
operations. The xmote_th and drop_th glock operations are then made
conditional upon those two routines existing and called from the
previously mentioned functions in glock.c respectively.
Also it can be seen that the go_sync operation isn't needed since it can
easily be replaced by calls to xmote_bh and drop_bh respectively. This
results in no longer (confusingly) calling back into routines in glock.c
from glops.c and also reducing the glock operations by one member.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes some redundant fields from the connection structure and adds
some lockdep annotation to remove spurious warnings.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is a patch for GFS2 to remove the local exclusive flag. In
the places it was used, mutex's are always held earlier in the
call path, so it appears redundant in the LM_ST_SHARED case.
Also, the GFS2 holders were setting local exclusive in any case where
the requested lock was LM_ST_EXCLUSIVE. So the other places in the glock
code where the flag was tested have been replaced with tests for the
lock state being LM_ST_EXCLUSIVE in order to ensure the logic is the
same as before (i.e. LM_ST_EXCLUSIVE is always locally exclusive as well
as globally exclusive).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The "greedy" code was an attempt to retain glocks for a minimum length
of time when they relate to mmap()ed files. The current implementation
of this feature is not, however, ideal in that it required allocating
memory in order to do this and its overly complicated.
It also misses the mark by ignoring the other I/O operations which are
just as likely to suffer from the same problem. So the plan is to remove
this now and then add the functionality back as part of the glock state
machine at a later date (and thus take into account all the possible
users of this feature)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is something I spotted (while looking for something entirely
different) the other day.
Rather than using a completion in each and every struct gfs2_holder,
this removes it in favour of hashed wait queues, thus saving a
considerable amount of memory both on the stack (where a number of
gfs2_holder structures are allocated) and in particular in the
gfs2_inode which has 8 gfs2_holder structures embedded within it.
As a result on x86_64 the gfs2_inode shrinks from 2488 bytes to
1912 bytes, a saving of 576 bytes per inode (no thats not a typo!).
In actual practice we get a much better result than that since
now that a gfs2_inode is under the 2048 byte barrier, we get two
per 4k slab page effectively halving the amount of memory required
to store gfs2_inodes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes the extra filldir callback which gfs2 was using to
enclose an attempt at readahead for inodes during readdir. The
code was too complicated and also hurts performance badly in the
case that the getdents64/readdir call isn't being followed by
stat() and it wasn't even getting it right all the time when it
was.
As a result, on my test box an "ls" of a directory containing 250000
files fell from about 7mins (freshly mounted, so nothing cached) to
between about 15 to 25 seconds. When the directory content was cached,
the time taken fell from about 3mins to about 4 or 5 seconds.
Interestingly in the cached case, running "ls -l" once reduced the time
taken for subsequent runs of "ls" to about 6 secs even without this
patch. Now it turns out that there was a special case of glocks being
used for prefetching the metadata, but because of the timeouts for these
locks (set to 10 secs) the metadata was being timed out before it was
being used and this the prefetch code was constantly trying to prefetch
the same data over and over.
Calling "ls -l" meant that the inodes were brought into memory and once
the inodes are cached, the glocks are not disposed of until the inodes
are pushed out of the cache, thus extending the lifetime of the glocks,
and thus bringing down the time for subsequent runs of "ls"
considerably.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It occurred to me that although a gfs2 specific writepages for ordered
writes and journaled data would be tricky, by hooking writepages only
for "data=writeback" mounts we could take advantage of not needing
buffer heads (we don't use them on the read side, nor have we for some
time) and create much larger I/Os for the block layer.
Using blktrace both before and after, its possible to see that for large
I/Os, most of the requests generated through writepages are now 1024
sectors after this patch is applied as opposed to 8 sectors before.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If master recovery happens on an rsb in one recovery sequence, then that
sequence is aborted before lock recovery happens, then in the next
sequence, we rely on the previous master recovery (which may now be
invalid due to another node ignoring a lookup result) and go on do to the
lock recovery where we get stuck due to an invalid master value.
recovery cycle begins: master of rsb X has left
nodes A and B send node C an rcom lookup for X to find the new master
C gets lookup from B first, sets B as new master, and sends reply back to B
C gets lookup from A next, and sends reply back to A saying B is master
A gets lookup reply from C and sets B as the new master in the rsb
recovery cycle on A, B and C is aborted to start a new recovery
B gets lookup reply from C and ignores it since there's a new recovery
recovery cycle begins: some other node has joined
B doesn't think it's the master of X so it doesn't rebuild it in the directory
C looks up the master of X, no one is master, so it becomes new master
B looks up the master of X, finds it's C
A believes that B is the master of X, so it sends its lock to B
B sends an error back to A
A resends
this repeats forever, the incorrect master value on A is never corrected
The fix is to do master recovery on an rsb that still has the NEW_MASTER
flag set from an earlier recovery sequence, and therefore didn't complete
lock recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a user process exits, we clear all the locks it holds. There is a
problem, though, with locks that the process had begun unlocking before it
exited. We couldn't find the lkb's that were in the process of being
unlocked remotely, to flag that they are DEAD. To solve this, we move
lkb's being unlocked onto a new list in the per-process structure that
tracks what locks the process is holding. We can then go through this
list to flag the necessary lkb's when clearing locks for a process when it
exits.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch converts the DLM TCP lowcomms to use workqueues rather than using its
own daemon functions. Simultaneously removing a lot of code and making it more
scalable on multi-processor machines.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc3-mm1:
>...
> git-gfs2-nmw.patch
>...
> git trees
>...
This patch makes the needlessly globlal gfs2_change_nlink_i() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is for Red Hat bugzilla bug bz #222302:
Moving a virtual IP from node to node between two NFS-over-GFS2
servers was causing one of the GFS2 servers to become confused and
reference a deleted inode. The problem was due to vfs dentries that did
not reference the gfs2_dops and therefore didn't call the gfs2 revalidate
code to revalidate a dentry after a directory had been deleted & recreated.
This patch is a crosswrite from a RHEL4 bug found in GFS1 as
bz #190756 and it is against the latest -nmw git tree.
Signed-off-by: Robert Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Make the dlm_config_info values readable and writeable via configfs
entries.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a new dlm_config_info field to enable log_debug output and change
log_debug() to use it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a "ci_" prefix to the fields in the dlm_config_info struct so that we
can use macros to add configfs functions to access them (in a later
patch). No functional changes in this patch, just naming changes.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some common, non-error messages should use log_debug instead of log_error
so they can be turned off.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Second round of gfs2_rename lock re-ordering to allow Anaconda adding
root partition on top of gfs2. Previous to this patch the recursive
lock detector in glock.c can be triggered due to attempting to lock
the rgrp twice. This fixes it by checking to see whether the rgrp
is already locked.
This fixes Red Hat bugzilla #221237
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the quilt header comments to match the
code changes.
Change gfs2_lookup_simple to return an error in the case
of a NULL inode.
The callers of gfs2_lookup_simple do not check for NULL
in the no entry case and such would end up dereferencing a NULL ptr.
This fixes:
http://projects.info-pull.com/mokb/MOKB-15-11-2006.html
Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In case of unlinked files with dirty pages GFS2 wasn't clearing
the pages in quite the right order. This patch clears the pages
earlier (before the qlock_dq) to avoid the situation that the
release of the glock results in attempting to write back data that
has already been deallocated.
This fixes Red Hat bugzilla: #220117
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I just noticed this message when testing some other changes I'd made to
lowcomms (to use workqueues) but the problem seems to be in the current
git trees too. I'm amazed no-one has seen it.
BUG: spinlock already unlocked on CPU#1, dlm_recoverd/16868
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I was a little over-enthusiastic turning schedule() calls int cond_sched() when fixing the DLM for Andrew Morton.
These four should really be calls to schedule() or the dlm can busy-wait.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Bugzilla 215088
Fix deadlock in gfs2_change_nlink() while installing RHEL5 into GFS2
partition. The gfs2_rename() apparently needs block allocation for the
new name (into the directory) where it requires rg locks. At the same
time, while updating the nlink count for the replaced file,
gfs2_change_nlink() tries to return the inode meta-data back to resource
group where it needs rg locks too. Our logic doesn't allow process to
acquire these locks recursively by the same process (RHEL installer)
that results a BUG call. This only happens within rename code path and
only if the destination file exists before the rename operation.
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is partially derrived from a patch written by Russell Cattelan.
It fixes a bug where there is a race between readpages and truncate
by ignoring readpages for stuffed files. This is ok because a stuffed
file will never be more than one block (minus sizeof(struct gfs2_dinode))
in size and block size is always less than page size, so we do not lose
anything efficiency-wise by not doing readahead for stuffed files. They
will have already been "read ahead" by the action of reading the inode
in, in the first place.
This is the remaining part of the fix for Red Hat bugzilla #218966
which had not yet made it upstream.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Russell Cattelan <cattelan@redhat.com>
This patch fixes Red Hat bugzilla #212627 in which a deadlock occurs
due to trying to take the i_mutex while holding a glock. The correct
locking order is defined as i_mutex -> glock in all cases.
I've left dealing with allocating writes. I know that we need to do
that, but for now this should do the trick. We don't need to take the
i_mutex on write, because the VFS has already taken it for us. On read
we don't need it since the glock is enough protection. The reason that
I've made some of the checks into a separate function is that we'll need
to do the checks again in the allocating write case eventually, so this
is partly in preparation for this. Likewise the return value test of !=
1 might look a bit odd and thats because we'll need a third return value
in case of requiring an allocation.
I've made the change to deferred mode on the glock to ensure flushing
read caches on other nodes. I notice that (using blktrace to look at
whats going on) we appear to do a better job of large I/Os than ext3
after this patch (in terms of not splitting up the I/Os).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Wendy Cheng <wcheng@redhat.com>
Remove the following unused functions:
- lowcomms_send_message()
- lowcomms_max_buffer_size()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the dlm fakes an unlock/cancel reply from a failed node using a stub
message struct, it wasn't setting the flags in the stub message. So, in
the process of receiving the fake message the lkb flags would be updated
and cleared from the zero flags in the message. The problem observed in
tests was the loss of the USER flag which caused the dlm to think a user
lock was a kernel lock and subsequently fail an assertion checking the
validity of the ast/callback field.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
LVB's are not sent as part of new requests, but the code receiving the
request was copying data into the lvb anyway. The space in the message
where it mistakenly thought the lvb lived actually contained the resource
name, so it wound up incorrectly copying this name data into the lvb. Fix
is to just create the lvb, not copy junk into it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The send_args() function is used to copy parameters into a message for a
number different message types. Only some of those types are set up
beforehand (in create_message) to include space for sending lvb data.
send_args was wrongly copying the lvb for all message types as long as the
lock had an lvb. This means that the lvb data was being written past the
end of the message into unknown space.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Check if we receive a message from another lockspace member running a
version of the dlm with an incompatible inter-node message protocol.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A reply to a recovery message will often be received after the relevant
recovery sequence has aborted and the next recovery sequence has begun.
We need to ignore replies to these old messages from the previous
recovery. There's already a way to do this for synchronous recovery
requests using the rc_id number, but not for async.
Each recovery sequence already has a locally unique sequence number
associated with it. This patch adds a field to the rcom (recovery
message) structure where this recovery sequence number can be placed,
rc_seq. When a node sends a reply to a recovery request, it copies the
rc_seq number it received into rc_seq_reply. When the first node receives
the reply to its recovery message, it will check whether rc_seq_reply
matches the current recovery sequence number, ls_recover_seq, and if not
then it ignores the old reply.
An old, inadequate approach to filtering out old replies (checking if the
current stage of recovery has moved back to the start) has been removed
from two spots.
The protocol version number is changed to reflect the different rcom
structures.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There's a chance the new master of resource hasn't learned it's the new
master before another node sends it a lock during recovery. The node
sending the lock needs to resend if this happens.
- A sends a master lookup for resource R to C
- B sends a master lookup for resource R to C
- C receives A's lookup, assigns A to be master of R and
sends a reply back to A
- C receives B's lookup and sends a reply back to B saying
that A is the master
- B receives lookup reply from C and sends its lock for R to A
- A receives lock from B, doesn't think it's the master of R
and sends an error back to B
- A receives lookup reply from C and becomes master of R
- B gets error back from A and resends its lock back to A
(this resending is what this patch does)
- A receives lock from B, it now sees it's the master of R
and takes the lock
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If an fs has already been shut down, a lockfs callback should do nothing.
An fs that's been shut down can't acquire locks or do anything with
respect to the cluster.
Also, remove FIXME comment in withdraw function. The missing bits of the
withdraw procedure are now all done by user space.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The tk_pid field is an unsigned short. The proper print format specifier for
that type is %5u, not %4d.
Also clean up some miscellaneous print formatting nits.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Addresses the regression noted in
http://bugzilla.linux-nfs.org/show_bug.cgi?id=134
Also mark a couple of other regressions as requiring fixing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_lookup_revalidate and friends are not serialised, so it is currently
quite possible for the dentry to be revalidated, and then have the
updated verifier replaced with an older value by another process.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the fileid of the cached dentry fails to match that returned by
the readdir call, then we should also d_drop. Try to take into account the
fact that on NFSv4, readdir may return the "mounted_on_fileid" by looking
for submounts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make sure that nfs_readdir_lookup() handles negative dentries correctly.
If d_lookup() returns a negative dentry, then we need to d_drop() that
since readdir shows that it should be positive.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When a file is being scheduled for deletion by means of the sillyrename
mechanism, it makes sense to start out writeback of the dirty data as
soon as possible in order to ensure that the delete can occur. Examples of
cases where this is an issue include "rm -rf", which will busy-wait until
the file is closed, and the sillyrename completes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The filehandle that is passed into nfs4_create_referral_server is
not initialised. The expectation is that nfs4_create_referral_server will
initialise it, and return it to the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
rpc_call_async() will always call rpc_release_calldata(), so it is an
error for __nlm_async_call() to do so as well.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Andrew Vasquez is reporting as-iosched oopses and a 65% throughput
slowdown due to the recent special-casing of direct-io against
blockdevs. We don't know why either of these things are occurring.
The patch minimally reverts us back to the 2.6.19 code for a 2.6.20
release.
Cc: Andrew Vasquez <andrew.vasquez@qlogic.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An AIO bug was reported that sleeping function is being called in softirq
context:
BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
Call Trace:
[<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
[<a000000100577ba0>] mutex_lock+0x20/0x40
[<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
[<a00000010018c0c0>] __put_ioctx+0xc0/0x240
[<a00000010018d470>] aio_complete+0x2f0/0x420
[<a00000010019cc80>] finished_one_bio+0x200/0x2a0
[<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
[<a00000010019d260>] dio_bio_end_aio+0x60/0x80
[<a00000010014acd0>] bio_endio+0x110/0x1c0
[<a0000001002770e0>] __end_that_request_first+0x180/0xba0
[<a000000100277b90>] end_that_request_chunk+0x30/0x60
[<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
[<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
[<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
[<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
[<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
[<a000000100277d20>] blk_done_softirq+0x160/0x1c0
[<a000000100083e00>] __do_softirq+0x200/0x240
[<a000000100083eb0>] do_softirq+0x70/0xc0
See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2
flush_workqueue() is not allowed to be called in the softirq context.
However, aio_complete() called from I/O interrupt can potentially call
put_ioctx with last ref count on ioctx and triggers bug. It is simply
incorrect to perform ioctx freeing from aio_complete.
The bug is trigger-able from a race between io_destroy() and aio_complete().
A possible scenario:
cpu0 cpu1
io_destroy aio_complete
wait_for_all_aios { __aio_put_req
... ctx->reqs_active--;
if (!ctx->reqs_active)
return;
}
...
put_ioctx(ioctx)
put_ioctx(ctx);
__put_ioctx
bam! Bug trigger!
The real problem is that the condition check of ctx->reqs_active in
wait_for_all_aios() is incorrect that access to reqs_active is not
being properly protected by spin lock.
This patch adds that protective spin lock, and at the same time removes
all duplicate ref counting for each kiocb as reqs_active is already used
as a ref count for each active ioctx. This also ensures that buggy call
to flush_workqueue() in softirq context is eliminated.
Signed-off-by: "Ken Chen" <kenchen@google.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: <stable@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The two cifs functions that used the most stack according
to "make checkstack" have been changed to use less stack.
Thanks to jra and Shaggy for helpful ideas
Signed-off-by: Steve French <sfrench@us.ibm.com>
cc: jra@samba.org
cc: shaggy@us.ibm.com
Commit 592282cf2e fixed some missing directory
c/mtime updates in part by introducing a dinode update in ocfs2_add_entry().
Unfortunately, ocfs2_link() (which didn't update the directory inode before)
is now missing a single journal credit. Fix this by doubling the number of
inode updates expected during hard link creation.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Fixes Samba bug 4362
Discovered by Jeremy Allison
Clipper database polls on EOF via lseek and can get stale EOF
when file is open on different client
Signed-off-by: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The KM_BIO_SRC_IRQ kmap slot requires local irq protection.
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
But keep it as a dprintk
The message can be generated in a quite normal situation:
If a 'lock' request is interrupted, then the lock client needs to
record that the server has the lock, incase it does.
When we come the unlock, the server might say it doesn't, even
though we think it does (or might) and this generates the message.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In blocks reallocation function sometimes does not update some of
buffer_head::b_blocknr, which may and cause data damage.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During ufs_trunc_direct which is subroutine of ufs::truncate, we try the first
of all free parts of block and then whole blocks. But we calculate size of
block's part to free in the wrong way.
This may cause bad update of used blocks and fragments statistic, and you can
got report that you have free 32T on 1Gb partition.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These series of patches result of UFS1 write support stress testing, like
running fsx-linux, untar and build linux kernel etc
We pass from ufs::get_block_t to levels below: pointer to the current page, to
make possible things like reallocation of blocks on the fly, and we also uses
this pointer for indication, what actually we allocate data block or meta data
block, but currently we make decision about what we allocate on the wrong
level, this may and cause oops if we allocate blocks in some special order.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The BUG in fuse_ctl_add_dentry() could be triggered if the control
filesystem was unmounted and mounted again while one or more fuse
filesystems were present.
The fix is to reset the dentry counter in fuse_ctl_kill_sb().
Bug reported by Florent Mertens.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also remove {NFSD,RPC}_PARANOIA as having the defines doesn't really add
anything.
The printks covered by RPC_PARANOIA were triggered by badly formatted
packets and so should be ratelimited.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds missing newlines to dprintk's.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix UML hostfs mknod(): userspace has differernt dev_t size and encoding
than kernel, so extract major/minor and reencode using glibc makedev()
macro.
Signed-off-by: Johannes Stezenbach <js@linuxtv.org>
Acked-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix commit ecdfc9787f
Not to put too fine a point on it, but in a nutshell...
__set_page_dirty_buffers() | try_to_free_buffers()
---------------------------+---------------------------
| spin_lock(private_lock);
| drop_bufers()
| spin_unlock(private_lock);
spin_lock(private_lock) |
!page_has_buffers() |
spin_unlock(private_lock) |
SetPageDirty() |
| cancel_dirty_page()
oops!
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug which was introduced when I synced up ocfs2_fs.h with ocfs2-tools.
We can't do u64/u32 in kernel.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Proposed patch to fix#5 in
http://www.isec.pl/vulnerabilities/isec-0017-binfmt_elf.txt
aka
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2004-1073
To reproduce, do
* grab poc at the end of advisory.
* add line "eph.p_memsz = 4096;" after "eph.p_filesz = 4096;"
where first "4096" is something equal to or greater than 4096.
* ./poc /usr/bin/sudo && ls -l
Here I get with 2.6.20-rc5:
-rw------- 1 ad ad 102400 2007-01-15 19:17 core
---s--x--x 2 root root 101820 2007-01-15 19:15 /usr/bin/sudo
Check for MAY_READ like binfmt_misc.c does.
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nfsd defines a type 'encode_dent_fn' which is much like 'filldir_t' except
that the first pointer is 'struct readdir_cd *' rather than 'void *'. It
then casts encode_dent_fn points to 'filldir_t' as needed. This hides any
other type mismatches between the two such as the fact that the 'ino' arg
recently changed from ino_t to u64.
So: get rid of 'encode_dent_fn', get rid of the cast of the function type,
change the first arg of various functions from 'struct readdir_cd *' to
'void *', and live with the fact that we have a little less type checking
on the calling of these functions now. Less internal (to nfsd) checking
offset by more external checking, which is more important.
Thanks to Gabriel Paubert <paubert@iram.es> for discovering this and
providing an initial patch.
Signed-off-by: Gabriel Paubert <paubert@iram.es>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We weren't properly NULL terminating protocol error strings for our debug
printk resulting in garbage being included in the output when debug was
enabled.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running dbench multithreaded exposed a race condition where fid structures
were removed while in use. This patch adds semaphores to meta-data operations
to protect the fid structure. Some cleanup of error-case handling in the
inode operations is also included.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9p doesn't handle renames between directories -- however, we were returning
EPERM instead of EXDEV when we detected this case.
Signed-off-by: Eric Van Hensbergren <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a simple logic error in init_v9fs - the return code checks are
reversed. This patch fixes the return code and adds some messages to prevent
module initialization from failing silently.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFS V3 (and V4) support exclusive create by passing a 'cookie' which can get
stored with the file. If the file exists but has exactly the right cookie
stored, then we assume this is a retransmit and the exclusive create was
successful.
The cookie is 64bits and is traditionally stored in the mtime and atime
fields. This causes a problem with Solaris7 as negative mtime or atime
confuse it. So we moved two bits into the mode word instead.
But inherited ACLs sometimes overwrite the mode word on create, so this is a
problem.
So we give up and just store 62 of the 64 bits and assume that is close
enough.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSd assumes that largest number of pages that will be needed for a
request+response is 2+N where N pages is the size of the largest permitted
read/write request. The '2' are 1 for the non-data part of the request, and 1
for the non-data part of the reply.
However, when a read request is not page-aligned, and we choose to use
->sendfile to send it directly from the page cache, we may need N+1 pages to
hold the whole reply. This can overflow and array and cause an Oops.
This patch increases size of the array for holding pages by one and makes sure
that entry is NULL when it is not in use.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to silly typos, if the nfs versions are explicitly set, no NFSACL versions
get enabled.
Also improve an error message that would have made this bug a little easier to
find.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes core dumps to include the vDSO vma, which is left out now.
It removes the special-case core writing macros, which were not doing the
right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the
vma; there is no need for the fixmap page to be installed. It handles the
CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from
get_gate_vma after real vmas in the same way the /proc/PID/maps code does.
This changes core dumps so they no longer include the non-PT_LOAD phdrs from
the vDSO. I made the change to add them in the first place, but in turned out
that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner
to leave them out, and just let the phdrs inside the vDSO image speak for
themselves.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the VM_ALWAYSDUMP flag for vm_flags in vm_area_struct. This
provides a clean explicit way to have a vma always included in core dumps, as
is needed for vDSO's.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __writeback_single_inode(), when we find a locked inode and we're not
doing a data-integrity sync, we used to just skip writing entirely,
since we didn't want to wait for the inode to unlock.
However, there's really no reason to skip writing the data pages, which
are likely to be the the bulk of the dirty state anyway (and the main
reason why writeback was started for the non-data-integrity case, of
course!)
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@osdl.org>,
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not pretty, but it appears that ext3 with data=journal will clean
pages without ever actually telling the VM that they are clean. This,
in turn, will result in the VM (and balance_dirty_pages() in particular)
to never realize that the pages got cleaned, and wait forever for an
event that already happened.
Technically, this seems to be a problem with ext3 itself, but it used to
be hidden by 'try_to_free_buffers()' noticing this situation on its own,
and just working around the filesystem problem.
This commit re-instates that hack, in order to avoid a regression for
the 2.6.20 release. This fixes bugzilla 7844:
http://bugzilla.kernel.org/show_bug.cgi?id=7844
Peter Zijlstra points out that we should probably retain the debugging
code that this removes from cancel_dirty_page(), and I agree, but for
the imminent release we might as well just silence the warning too
(since it's not a new bug: anything that triggers that warning has been
around forever).
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jfs_debug.h uses an incorrect CONFIG_KERNEL_ASSERT ifdef to redefine the
assert macro for kgdb use. I believe the code worked a long time ago, but
today it's not a valid config option. Since I'm not aware of anybody
interested in debugging jfs with kgdb, it should just be removed.
Thanks to Robert P. J. Day for reporting this.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Prevent the call to invalidate_inode_pages2() from racing with file writes
by taking the inode->i_mutex across the page cache flush and invalidate.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix oops when Windows server sent bad domain name null terminator
[CIFS] cifs sprintf fix
[CIFS] Remove 2 unneeded kzalloc casts
[CIFS] Update CIFS version number
This patch fixes a confusion reiserfs has for a long time.
On release file operation reiserfs used to try to pack file data stored in
last incomplete page of some files into metadata blocks. After packing the
page got cleared with clear_page_dirty. It did not take into account that
the page may be mmaped into other process's address space. Recent
replacement for clear_page_dirty cancel_dirty_page found the confusion with
sanity check that page has to be not mapped.
The patch fixes the confusion by making reiserfs avoid tail packing if an
inode was ever mmapped. reiserfs_mmap and reiserfs_file_release are
serialized with mutex in reiserfs specific inode. reiserfs_mmap locks the
mutex and sets a bit in reiserfs specific inode flags.
reiserfs_file_release checks the bit having the mutex locked. If bit is
set - tail packing is avoided. This eliminates a possibility that mmapped
page gets cancel_page_dirty-ed.
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For large size DIO that needs multiple bio, one full page worth of data was
lost at the boundary of bio's maximum sector or segment limits. After a
bio is full and got submitted. The outer while (nbytes) { ... } loop will
allocate a new bio and just march on to index into next page. It just
forgets about the page that bio_add_page() rejected when previous bio is
full. Fix it by put the rejected page back to pvec so we pick it up again
for the next bio.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes RedHat bug 211672
Windows sends one byte (instead of two) of null to terminate final Unicode
string (domain name) in session setup response in some cases - this caused
cifs to misalign some informational strings (making it hard to convert
from UCS16 to UTF8).
Thanks to Shaggy for his help and Akemi Yagi for debugging/testing
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Get rid of some error prints in the ocfs2_iget() path from
ocfs2_get_dentry(). NFSD can easily cause us to read stale inodes.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2 wasn't updating c/mtime on directories during dirent
creation/deletion. Fix ocfs2_unlink(), ocfs2_rename() and
__ocfs2_add_entry() by adding the proper code to update the struct inode and
push the change out to disk.
This helps rename/unlink on nfs exported file systems in particular as those
clients compare directory time values to avoid a full re-reading a directory
which hasn't changed.
ocfs2_rename() loses some superfluous error handling as a result of this
patch.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We shouldn't print errors returned from vfs_follow_link(). This was causing
spurious errors to show up in the logs.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
- Fix deadlock in fs/ntfs/inode.c::ntfs_put_inode(). Thanks to Sergey
Vlasov for the report and detailed analysis of the deadlock. The fix
involved getting rid of ntfs_put_inode() altogether and hence NTFS no
longer has a ->put_inode super operation.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
The introduction of Jens Axboe's explicit i/o plugging patches introduced a
deadlock in jfs. This was caused by the process initiating I/O not
unplugging the queue before waiting on the commit thread. The commit
thread itself was waiting for that I/O to complete. Calling io_schedule()
rather than schedule() unplugs the I/O queue avoiding the deadlock, and it
appears to be the right function to call in any case.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Revert bd_mount_mutex back to a semaphore so that xfs_freeze -f /mnt/newtest;
xfs_freeze -u /mnt/newtest works safely and doesn't produce lockdep warnings.
(XFS unlocks the semaphore from a different task, by design. The mutex
code warns about this)
Signed-off-by: Dave Chinner <dgc@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NFS: Fix race in nfs_release_page()
invalidate_inode_pages2() may find the dirty bit has been set on a page
owing to the fact that the page may still be mapped after it was locked.
Only after the call to unmap_mapping_range() are we sure that the page
can no longer be dirtied.
In order to fix this, NFS has hooked the releasepage() method and tries
to write the page out between the call to unmap_mapping_range() and the
call to remove_mapping(). This, however leads to deadlocks in the page
reclaim code, where the page may be locked without holding a reference
to the inode or dentry.
Fix is to add a new address_space_operation, launder_page(), which will
attempt to write out a dirty page without releasing the page lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also, the bare SetPageDirty() can skew all sort of accounting leading to
other nasties.
[akpm@osdl.org: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Revert previous attempts at messing with the linux banner string and
simply use a separate format string for proc.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Jean Delvare <khali@linux-fr.org>
Cc: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't use ref->flash_offset directly in debugging code, use the ref_offset macro instead.
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Artem Bityutskiy <dedekind@infradead.org>
On Mon, 2006-12-18 at 19:51 +0100, Eric Sesterhenn wrote:
> hi,
>
> while playing around with fsfuzzer, i got the following oops with jfs:
>
> [ 851.804875] BUG at fs/jfs/jfs_xtree.c:760
> assert(!BT_STACK_FULL(btstack))
> [ 851.805179] ------------[ cut here ]------------
> [ 851.805238] kernel BUG at fs/jfs/jfs_xtree.c:760!
JFS should mark the superblock dirty and return an error rather than
calling BUG().
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
This reverts commit 59287c0913.
Hugh Dickins reports that it causes random failures on x86 with SuSE
10.2, and points out
"Isn't that randomization, anywhere from 0x10000 to ELF_ET_DYN_BASE,
sure to place the ET_DYN from time to time just where the comment
says it's trying to avoid? I assume that somehow results in the error
reported."
(where the comment in question is the existing comment in the source
code about mmap/brk clashes).
Suggested-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Marcus Meissner <meissner@suse.de>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Looks like this is the problem, which point Al Viro some time ago:
ufs's get_block callback allocates 16k of disk at a time, and links that
entire 16k into the file's metadata. But because get_block is called for only
a single buffer_head (a 2k buffer_head in this case?) we are only able to tell
the VFS that this 2k is buffer_new().
So when ufs_getfrag_block() is later called to map some more data in the file,
and when that data resides within the remaining 14k of this fragment,
ufs_getfrag_block() will incorrectly return a !buffer_new() buffer_head.
I don't see _right_ way to do nullification of whole block, if use inode
page cache, some pages may be outside of inode limits (inode size), and
will be lost; if use blockdev page cache it is possible to zero real data,
if later inode page cache will be used.
The simpliest way, as can I see usage of block device page cache, but not only
mark dirty, but also sync it during "nullification". I use my simple tests
collection, which I used for check that create,open,write,read,close works on
ufs, and I see that this patch makes ufs code 18% slower then before.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CVE-2006-5753 is for a case where an inode can be marked bad, switching
the ops to bad_inode_ops, which are all connected as:
static int return_EIO(void)
{
return -EIO;
}
#define EIO_ERROR ((void *) (return_EIO))
static struct inode_operations bad_inode_ops =
{
.create = bad_inode_create
...etc...
The problem here is that the void cast causes return types to not be
promoted, and for ops such as listxattr which expect more than 32 bits of
return value, the 32-bit -EIO is interpreted as a large positive 64-bit
number, i.e. 0x00000000fffffffa instead of 0xfffffffa.
This goes particularly badly when the return value is taken as a number of
bytes to copy into, say, a user's buffer for example...
I originally had coded up the fix by creating a return_EIO_<TYPE> macro
for each return type, like this:
static int return_EIO_int(void)
{
return -EIO;
}
#define EIO_ERROR_INT ((void *) (return_EIO_int))
static struct inode_operations bad_inode_ops =
{
.create = EIO_ERROR_INT,
...etc...
but Al felt that it was probably better to create an EIO-returner for each
actual op signature. Since so few ops share a signature, I just went ahead
& created an EIO function for each individual file & inode op that returns
a value.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix filenames on adfs discs being terminated at the first character greater
than 128 (adfs filenames are Latin 1). I saw this problem when using a
loopback adfs image on a 2.6.17-rc5 x86_64 machine, and the patch fixed it
there.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: export heartbeat thread pid via configfs
ocfs2: always unmap in ocfs2_data_convert_worker()
ocfs2: ignore NULL vfsmnt in ocfs2_should_update_atime()
ocfs2: Allow direct I/O read past end of file
ocfs2: don't print error in ocfs2_permission()
ramfs doesn't provide the .set_dirty_page a_op, and when the BLOCK layer is
not configured in, 'set_page_dirty' makes a call via a NULL pointer.
Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
lockdep found a AB BC CA lock inversion in retry-based AIO:
1) The task struct's alloc_lock (A) is acquired in process context with
interrupts enabled. An interrupt might arrive and call wake_up() which
grabs the wait queue's q->lock (B).
2) When performing retry-based AIO the AIO core registers
aio_wake_function() as the wake funtion for iocb->ki_wait. It is called
with the wait queue's q->lock (B) held and then tries to add the iocb to
the run list after acquiring the ctx_lock (C).
3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
alloc_lock (A) via lock_task() and unuse_mm(). Lockdep emits a warning
saying that we're trying to connect the irq-safe q->lock to the
irq-unsafe alloc_lock via ctx_lock.
This fixes the inversion by calling unuse_mm() in the AIO kick handing path
after we've released the ctx_lock. As Ben LaHaise pointed out __put_ioctx
could set ctx->mm to NULL, so we must only access ctx->mm while we have the
lock.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch allows the ocfs2 heartbeat thread to prioritize I/O which may
help cut down on spurious fencing. Most of this will be in the tools -
we can have a pid configfs attribute and let userspace (ocfs2_hb_ctl)
calls the ioprio_set syscall after starting heartbeat, but only cfq
scheduler supports I/O priorities now.
Signed-off-by: Zhen Wei <zwei@novell.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Mmap-heavy clustered workloads were sometimes finding stale data on mmap
reads. The solution is to call unmap_mapping_range() on any down convert of
a data lock.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
ocfs2_direct_IO_get_blocks() was incorrectly returning -EIO for a direct I/O
read whose start block was past the end of the file allocation tree. Fix
things so that we return a hole instead. do_direct_IO() will then notice
that the range start is past eof and return a short read.
While there, remove the unused vbo_max variable.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This also adds he required page "writeback" flag handling, that cifs
hasn't been doing and that the page dirty flag changes made obvious.
Acked-by: Steve French <smfltc@us.ibm.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks to Len Brown for testing this fix, since while they have in the
past, none of my machines run reiserfs at the moment.
Cc: Vladimir V. Saveliev <vs@namesys.com>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the current jbd code, if a buffer on BJ_SyncData list is dirty and not
locked, the buffer is refiled to BJ_Locked list, submitted to the IO and
waited for IO completion.
But the fsstress test showed the case that when a buffer was already
submitted to the IO just before the buffer_dirty(bh) check, the buffer was
not waited for IO completion.
Following patch solves this problem. If it is assumed that a buffer is
submitted to the IO before the buffer_dirty(bh) check and still being
written to disk, this buffer is refiled to BJ_Locked list.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Jan Kara <jack@ucw.cz>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel. The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage. Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark JFFS as broken and provide a warning to users that it is deprecated
and scheduled for removal in 2.6.21
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trevor found a file size problem in eCryptfs in recent kernels, and he
tracked it down to an fsstack change.
This was the eCryptfs copy_attr_all:
> -void ecryptfs_copy_attr_all(struct inode *dest, const struct inode *src)
> -{
> - dest->i_mode = src->i_mode;
> - dest->i_nlink = src->i_nlink;
> - dest->i_uid = src->i_uid;
> - dest->i_gid = src->i_gid;
> - dest->i_rdev = src->i_rdev;
> - dest->i_atime = src->i_atime;
> - dest->i_mtime = src->i_mtime;
> - dest->i_ctime = src->i_ctime;
> - dest->i_blkbits = src->i_blkbits;
> - dest->i_flags = src->i_flags;
> -}
This is the fsstack copy_attr_all:
> +void fsstack_copy_attr_all(struct inode *dest, const struct inode *src,
> + int (*get_nlinks)(struct inode *))
> +{
> + if (!get_nlinks)
> + dest->i_nlink = src->i_nlink;
> + else
> + dest->i_nlink = (*get_nlinks)(dest);
> +
> + dest->i_mode = src->i_mode;
> + dest->i_uid = src->i_uid;
> + dest->i_gid = src->i_gid;
> + dest->i_rdev = src->i_rdev;
> + dest->i_atime = src->i_atime;
> + dest->i_mtime = src->i_mtime;
> + dest->i_ctime = src->i_ctime;
> + dest->i_blkbits = src->i_blkbits;
> + dest->i_flags = src->i_flags;
> +
> + fsstack_copy_inode_size(dest, src);
> +}
The addition of copy_inode_size breaks eCryptfs, since eCryptfs needs to
interpolate the file sizes (eCryptfs has extra space in the lower file for
the header). The setting of the upper inode size occurs elsewhere in
eCryptfs, and the new copy_attr_all now undoes what eCryptfs was doing
right beforehand.
I see three ways of going forward from here. (1) Something like this patch
needs to go in (assuming it jives with Unionfs), (2) we need to make a
change to the fsstack API for more fine-grained control over copying
attributes (e.g., by also including a callback function for calculating the
right file size, which will require some more work on both eCryptfs and
Unionfs), or (3) the fsstack patch on eCryptfs (commit
0cc72dc7f0 made on Fri Dec 8 02:36:31 2006
-0800) needs to be yanked in 2.6.20.
I think the simplest solution, from eCryptfs' perspective, is to just
remove the inode size copy.
Remove inode size copy in general fsstack attr copy code. Stacked
filesystems may need to interpolate the inode size, since the file
size in the lower file may be different than the file size in the
stacked layer.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add proper prototypes for sysv_{init,destroy}_icache() in sysv.h
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
XFS appears to call clear_page_dirty to get the mapping tree dirty tag
set correctly at the same time the page dirty flag is cleared. I note
that this can be done by set_page_writeback() if we clear the dirty flag
on the page first when we are writing back the entire page.
Hence it seems to me that the XFS call to clear_page_dirty() could
easily be substituted by clear_page_dirty_for_io() followed by a call to
set_page_writeback() to get the mapping tree tags set correctly after
the page has been marked clean.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The use by FUSE was just a remnant of an optimization from the time
when writable mappings were supported.
Now FUSE never actually allows the creation of dirty pages, so this
invocation of clear_page_dirty() is effectively a no-op.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes some questionable code that attempted to make a
no-longer-used page easier to reclaim.
Calling metapage_writepage against such a page will not result in any
I/O being performed, so removing this code shouldn't be a big deal.
[ It's likely that we could have just replaced the "clear_page_dirty()"
call with a call to "cancel_dirty_page()" instead, but in the
meantime this is cleaner and simpler anyway, so unless there is some
overriding reason (and Dave implies there isn't) I'll just use this
patch as-is. - Linus ]
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They were horribly easy to mis-use because of their tempting naming, and
they also did way more than any users of them generally wanted them to
do.
A dirty page can become clean under two circumstances:
(a) when we write it out. We have "clear_page_dirty_for_io()" for
this, and that function remains unchanged.
In the "for IO" case it is not sufficient to just clear the dirty
bit, you also have to mark the page as being under writeback etc.
(b) when we actually remove a page due to it becoming inaccessible to
users, notably because it was truncate()'d away or the file (or
metadata) no longer exists, and we thus want to cancel any
outstanding dirty state.
For the (b) case, we now introduce "cancel_dirty_page()", which only
touches the page state itself, and verifies that the page is not mapped
(since cancelling writes on a mapped page would be actively wrong as it
is still accessible to users).
Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
ReiserFS, XFS all use the old confusing functions, and will be fixed
separately in subsequent commits (with some of them just removing the
offending logic, and others using clear_page_dirty_for_io()).
This was confirmed by Martin Michlmayr to fix the apt database
corruption on ARM.
Cc: Martin Michlmayr <tbm@cyrius.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrei Popa <andrei.popa@i-neo.ro>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Gordon Farquharson <gordonfarquharson@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>