commit 6321e3ed2a caused
the full bmv_count's worth of getbmapx structures to get
allocated; telling it to do MAXEXTNUM was a bit insane,
resulting in ENOMEM every time.
Chop it down to something reasonable, the number of slots
in the caller's input buffer. If this is too large the
caller may get ENOMEM but the reason should not be a
mystery, and they can try again with something smaller.
We add 1 to the value because in the normal getbmap
world, bmv_count includes the header and xfs_getbmap does:
nex = bmv->bmv_count - 1;
if (nex <= 0)
return XFS_ERROR(EINVAL);
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: be more polite in the async caching threads
Btrfs: preserve commit_root for async caching
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: remove dcache entries for remote deleted inodes
GFS2: Fix incorrent statfs consistency check
GFS2: Don't put unlikely reclaim candidates on the reclaim list.
GFS2: Don't try and dealloc own inode
GFS2: Fix panic in glock memory shrinker
GFS2: keep statfs info in sync on grows
GFS2: Shrink the shrinker
ocfs2_quota_write needs to release the lock if it fails to
read quota block. So use "goto out" instead of "return err".
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Commit d01730d74d didn't completely fix
the problem since we still take dqio_mutex and i_mutex in the wrong
order. Move taking of i_mutex further down (luckily it's needed only
for updating inode flags) below where dqio_mutex is taken.
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
VAT inode is located in the last block recorded block of the medium. When the
drive errorneously reports number of recorded blocks, we failed to load the VAT
inode and thus mount the medium. This patch makes kernel try to read VAT inode
from the last block of the device if it is different from the last recorded
block.
Signed-off-by: Jan Kara <jack@suse.cz>
The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall. This
releases the semaphore more often when a transaction commit is
in progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.
During a commit, we have a loop where we update the extent allocation
tree root. We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.
Right now the commit_root pointer is changed inside this loop. It
is more correct to change the commit_root pointer only after all the
looping is done.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since both linked and unlinked inodes are counted by rgd->rd_dinodes, It
makes no sense to count them with the used data blocks (first check that
I changed), it makes sense to count them with the linked inodes (second
check), and it makes no sense to care if there are more unlinked inodes
than linked ones. This fixes these errors.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 was placing far too many glocks on the reclaim list that were not good
candidates for freeing up from cache. These locks would sit there and
repeatedly get scanned to see if they could be reclaimed, wasting a lot
of time when there was memory pressure. This fix does more checks on the
locks to see if they are actually likely to be removable from cache.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When searching for unlinked, but still allocated inodes during block
allocation, avoid the block relating to the inode that is doing the
allocation. This fixes a hang caused when an unlinked, but still
open, inode tries to allocate some more blocks and lands up
finding itself during the search for deallocatable inodes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It is possible for gfs2_shrink_glock_memory() to check a glock for
demotion
that's in the process of being freed by gfs2_glock_put(). In this case,
gfs2_shrink_glock_memory() will acquire a new reference to this glock,
and
then try to free the glock itself when it drops the refernce. To solve
this, gfs2_shrink_glock_memory() just needs to check if the glock is in
the process of being freed, and if so skip it without ever unlocking the
lru_lock.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 wasn't syncing its statfs info on grows. This causes a problem
when you grow the filesystem on multiple nodes. GFS2 would calculate
the new space based on the resource groups (which are always current),
and then assume that the filesystem had grown the from the existing
statfs size. If you grew the filesystem on two different nodes in a
short time, the second node wouldn't see the statfs size change from the
first node, and would assume that it was grown by a larger amount than
it was. When all these changes were synced out, the total fileystem
size would be incorrect (the first grow would be counted twice).
This patch syncs makes GFS2 read in the statfs changes from disk before
a grow, and write them out after the grow, while the master statfs inode
is locked.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes some of the special cases that the shrinker
was trying to deal with. As a result we leave fewer items on
the list and none at all which cannot be demoted. This makes
the list scanning more efficient and solves some issues seen
with large numbers of inodes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This file makes use of various macros defined in files like asm/current.h
or asm-generic/resource.h. All these files can be included via sched.h.
The building of the !MMU ARM kernel (with additional patches) fails
without this change.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
driver core: documentation: make it clear that sysfs is optional
driver core: sysdev: do not send KOBJ_ADD uevent if kobject_init_and_add fails
Dynamic debug: fix typo: -/->
driver core: firmware_class:fix memory leak of page pointers array
sysfs: fix hardlink count on device_move
Create bdgrab(). This function copies an existing reference to a
block_device. It is safe to call from any context.
Hibernation code wishes to copy a reference to the active swap device.
Right now it calls bdget() under a spinlock, but this is wrong because
bdget() can sleep. It doesn't need a full bdget() because we already
hold a reference to active swap devices (and the spinlock protects
against swapoff).
Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827
Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits)
Btrfs: Fix async caching interaction with unmount
Btrfs: change how we unpin extents
Btrfs: Correct redundant test in add_inode_ref
Btrfs: find smallest available device extent during chunk allocation
Btrfs: clear all space_info->full after removing a block group
Btrfs: make flushoncommit mount option correctly wait on ordered_extents
Btrfs: Avoid delayed reference update looping
Btrfs: Fix ordering of key field checks in btrfs_previous_item
Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
Btrfs: Remove code duplication in comp_keys
Btrfs: async block group caching
Btrfs: use hybrid extents+bitmap rb tree for free space
Btrfs: Fix crash on read failures at mount
Btrfs: remove of redundant btrfs_header_level
Btrfs: adjust NULL test
Btrfs: Remove broken sanity check from btrfs_rmap_block()
Btrfs: convert nested spin_lock_irqsave to spin_lock
Btrfs: make sure all dirty blocks are written at commit time
Btrfs: fix locking issue in btrfs_find_next_key
Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
...
The parse_tag_3_packet function does not check if the tag 3 packet contains a
encrypted key size larger than ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES.
Signed-off-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
[tyhicks@linux.vnet.ibm.com: Added printk newline and changed goto to out_free]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: stable@kernel.org (2.6.27 and 30)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tag 11 packets are stored in the metadata section of an eCryptfs file to
store the key signature(s) used to encrypt the file encryption key.
After extracting the packet length field to determine the key signature
length, a check is not performed to see if the length would exceed the
key signature buffer size that was passed into parse_tag_11_packet().
Thanks to Ramon de Carvalho Valle for finding this bug using fsfuzzer.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: stable@kernel.org (2.6.27 and 30)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update directory hardlink count when moving kobjects to a new parent.
Fixes the following problem which occurs when several devices are
moved to the same parent and then unregistered:
> ls -laF /sys/devices/css0/defunct/
> total 0
> drwxr-xr-x 4294967295 root root 0 2009-07-14 17:02 ./
> drwxr-xr-x 114 root root 0 2009-07-14 17:02 ../
> drwxr-xr-x 2 root root 0 2009-07-14 17:01 power/
> -rw-r--r-- 1 root root 4096 2009-07-14 17:01 uevent
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
- don't stop the caching thread until btrfs_commit_super return.
- if caching is interrupted by umount, set last to (u64)-1.
otherwise the un-scanned range of block group will be considered
as free extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the referral is malformed or the hostname can't be resolved, then
the current code generates an oops. Fix it to handle these errors
gracefully.
Reported-by: Sandro Mathys <sm@sandro-mathys.ch>
Acked-by: Igor Mammedov <niallain@gmail.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
inotify: use GFP_NOFS under potential memory pressure
fsnotify: fix inotify tail drop check with path entries
inotify: check filename before dropping repeat events
fsnotify: use def_bool in kconfig instead of letting the user choose
inotify: fix error paths in inotify_update_watch
inotify: do not leak inode marks in inotify_add_watch
inotify: drop user watch count when a watch is removed
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: fix race between write_metadata_buffer and get_write_access
ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
jbd: Fix a race between checkpointing code and journal_get_write_access()
ext3: Fix truncation of symlinks after failed write
jbd: Fail to load a journal if it is too short
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] fix sparse warning
cifs: fix sb->s_maxbytes so that it casts properly to a signed value
cifs: disable serverino if server doesn't support it
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
dir has already been tested. It seems that this test should be on the
recently returned value inode.
A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allocating new block group is easy when the disk has plenty of space.
But things get difficult as the disk fills up, especially if
the FS has been run through btrfs-vol -b. The balance operation
is likely to make the total bytes available on the device greater
than the largest extent we'll actually be able to allocate.
But the device extent allocation code incorrectly assumes that a device
with 5G free will be able to allocate a 5G extent. It isn't normally a
problem because device extents don't get freed unless btrfs-vol -b
is run.
This fixes the device extent allocator to remember the largest free
extent it can find, and then uses that value as a fallback.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs allocates individual extents from block groups, and each
block group has a specific type. It may hold metadata, data
mirrored or striped etc.
When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups. Once a block group is freed, the space it was
using on the device may be available for use by new block groups.
btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.
This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The commit_transaction call to wait_ordered_extents when snap_pending
passes nocow_only=1 to process only NOCOW or PREALLOC extents. This isn't
correct for the 'flushoncommit' mode, as it skips extents we just started
IO on in start_delalloc_inodes.
So, in the flushoncommit case, wait on all ordered extents. Otherwise,
only pass the nocow_only flag to wait_ordered_extents if snap_pending.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.
There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_dev_extent does not properly handle the case where
the device is not complete free, and there is a free extent
at the beginning of the device.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Numbers of needed credits for some quota operations were written
as raw numbers. Create appropriate defines instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
syncjiff is just a converted value of syncms. Some places which
are updating syncms forgot to update syncjiff as well. Since the
conversion is just a simple division / multiplication and it does
not happen frequently, just remove the syncjiff field to avoid
forgotten conversions.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We just set blockcheck stats to zeros but we should also
properly initialize the spinlock there.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Padding fields of on-disk dquot structure were not zeroed. Zero them
so that it's easier to use them later.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When we extend local quota file, we should initialize data
in newly allocated block. Firstly because on recovery we could
parse bogus data, secondly so that block checksums are properly
computed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In a code path extending local quota files we marked new header
buffer uptodate only after calling ocfs2_journal_access_dq() which
triggers a bug. Fix it and also call ocfs2 variant of the function
marking buffer uptodate.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Change i_size of global quota files so that it always remains aligned to block
size. This is mainly because the end of quota block may contain checksum (if
checksumming is enabled) and it's a bit awkward for it to be "outside" of quota
file (and it makes life harder for ocfs2-tools).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_adjust_adjacent_records, we will adjust adjacent records
according to the extent_list in the lower level. But actually
the lower level tree will either be a leaf or a branch. If we only
use ocfs2_is_empty_extent we will meet with some problem if the lower
tree is a branch (tree_depth > 1). So use !ocfs2_rec_clusters instead.
And actually only the leaf record can have holes. So add a BUG_ON
for non-leaf branch.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
BugLink: http://bugs.launchpad.net/ubuntu/+bug/396780
Commit 073aaa1b14 "helpers for acl
caching + switch to those" introduced new helper functions for
acl handling but seems to have introduced a regression for jfs as
the acl is released before returning it to the caller, instead of
leaving this for the caller to do.
This causes the acl object to be used after freeing it, leading
to kernel panics in completely different places.
Thanks to Christophe Dumez for reporting and bisecting into this.
Reported-by: Christophe Dumez <dchris@gmail.com>
Tested-by: Christophe Dumez <dchris@gmail.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
* 'lockdep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
lockdep: Fix lockdep annotation for pipe_double_lock()
This off-by-one bug causes sendfile() to not work properly. When a task
calls sendfile() on a file on a CIFS filesystem, the syscall returns -1
and sets errno to EOVERFLOW.
do_sendfile uses s_maxbytes to verify the returned offset of the file.
The problem there is that this value is cast to a signed value (loff_t).
When this is done on the s_maxbytes value that cifs uses, it becomes
negative and the comparisons against it fail.
Even though s_maxbytes is an unsigned value, it seems that it's not OK
to set it in such a way that it'll end up negative when it's cast to a
signed value. These casts happen in other codepaths besides sendfile
too, but the VFS is a little hard to follow in this area and I can't
be sure if there are other bugs that this will fix.
It's not clear to me why s_maxbytes isn't just declared as loff_t in the
first place, but either way we still need to fix these values to make
sendfile work properly. This is also an opportunity to replace the magic
bit-shift values here with the standard #defines for this.
This fixes the reproducer program I have that does a sendfile and
will probably also fix the situation where apache is serving from a
CIFS share.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
A recent regression when dealing with older servers. This bug was
introduced when we made serverino the default...
When the server can't provide inode numbers, disable it for the mount.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If the tree roots hit read errors during mount, btrfs is not properly
erroring out. We need to check the uptodate bits after
reading in the tree root node.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This removes the continues call's of btrfs_header_level. One call of
btrfs_header_level(c) its enough.
Signed-off-by Daniel Cadete <danielncadete10@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Move the call to BUG_ON to before the dereference of the tested value.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was never actually doing anything anyway (see the loop condition),
and it would be difficult to make it work for RAID[56].
Even if it was actually working, it's checking for the wrong thing
anyway. Instead of checking whether we list a block which _doesn't_ land
at the relevant physical location, it should be checking that we _have_
listed all the logical blocks which refer to the required physical
location on all devices.
This function is only called from remove_sb_from_cache() to ensure that
we reserve the logical blocks which would reside at the same physical
location as the superblock copies. So listing more blocks than we need
is actually OK.
With RAID[56] we're going to throw away an entire stripe for each block
we have to ignore, so we _are_ going to list blocks other than the
ones which actually contain the superblock.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If spin_lock_irqsave is called twice in a row with the same second
argument, the interrupt state at the point of the second call overwrites
the value saved by the first call. Indeed, the second call does not need
to save the interrupt state, so it is changed to a simple spin_lock.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The presumed use of the pipe_double_lock() routine is to lock 2 locks in
a deadlock free way by ordering the locks by their address. However it
fails to keep the specified lock classes in order and explicitly
annotates a deadlock.
Rectify this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
LKML-Reference: <1248163763.15751.11098.camel@twins>
Write dirty block groups may allocate new block, and so may add new delayed
back ref. btrfs_run_delayed_refs may make some block groups dirty.
commit_cowonly_roots does not handle the recursion properly, and some dirty
blocks can be left unwritten at commit time. This patch moves
btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
the code not break out of the loop until there are no dirty block groups or
delayed back refs.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When walking up the tree, btrfs_find_next_key assumes the upper level tree
block is properly locked. This isn't always true even path->keep_locks is 1.
This is because btrfs_find_next_key may advance path->slots[] several times
instead of only once.
When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
we can't guarantee the original value of 'path->slots[level]' is
'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
'level + 1' isn't locked.
This patch fixes the issue by explicitly checking the locking state,
re-searching the tree if it's not locked.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
if 1 is returned by btrfs_search_slot, the path already points to the
first item with 'key > searching key'. So increasing path->slots[0] by
one is superfluous in that case.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Change 'goto done' to 'break' for the case of all device extents have
been freed, so that the code updates space information will be execute.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
use __le64 instead of u64 in on-disk structure definition.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We just had a case in which a buggy server occasionally returns the wrong
attributes during an OPEN call. While the client does catch this sort of
condition in nfs4_open_done(), and causes the nfs4_atomic_open() to return
-EISDIR, the logic in nfs_atomic_lookup() is broken, since it causes a
fallback to an ordinary lookup instead of just returning the error.
When the buggy server then returns a regular file for the fallback lookup,
the VFS allows the open, and bad things start to happen, since the open
file doesn't have any associated NFSv4 state.
The fix is firstly to return the EISDIR/ENOTDIR errors immediately, and
secondly to ensure that we are always careful when dereferencing the
nfs_open_context state pointer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In commit ea455f8ab6, we moved the dentry lock
put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
can drop references to inodes while they are being invalidated by
invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
ending in an infinite loop, "Busy inodes after umount" messages etc.).
We fix the problem by stopping ocfs2_wq from doing any further releasing of
inode references on the superblock being unmounted, wait until it finishes
the current round of releasing and finally cleaning up all the references in
dentry_lock_list from ocfs2_put_super().
The issue was tracked down by Tao Ma <tao.ma@oracle.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In normal tree rotation left process, we will never touch the tree
branch above subtree_index and ocfs2_extend_rotate_transaction doesn't
reserve the credits for them either.
But when we want to delete the rightmost extent block, we have to update
the rightmost records for all the rightmost branch(See
ocfs2_update_edge_lengths), so we have to allocate extra credits for them.
What's more, we have to access them also.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Commit 008f55d0e0 (nfs41: recover lease in
_nfs4_lookup_root) forces the state manager to always run on mount. This is
a bug in the case of NFSv4.0, which doesn't require us to send a
setclientid until we want to grab file state.
In any case, this is completely the wrong place to be doing state
management. Moving that code into nfs4_init_session...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The oops http://www.kerneloops.org/raw.php?rawid=537858&msgid= appears to
be due to the nfs4_lock_state->ls_state field being uninitialised. This
happens if the call to nfs4_free_lock_state() is triggered at the end of
nfs4_get_lock_state().
The fix is to move the initialisation of ls_state into the allocator.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
inotify can have a watchs removed under filesystem reclaim.
=================================
[ INFO: inconsistent lock state ]
2.6.31-rc2 #16
---------------------------------
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
khubd/217 [HC0[0]:SC0[0]:HE1:SE1] takes:
(iprune_mutex){+.+.?.}, at: [<c10ba899>] invalidate_inodes+0x20/0xe3
{IN-RECLAIM_FS-W} state was registered at:
[<c10536ab>] __lock_acquire+0x2c9/0xac4
[<c1053f45>] lock_acquire+0x9f/0xc2
[<c1308872>] __mutex_lock_common+0x2d/0x323
[<c1308c00>] mutex_lock_nested+0x2e/0x36
[<c10ba6ff>] shrink_icache_memory+0x38/0x1b2
[<c108bfb6>] shrink_slab+0xe2/0x13c
[<c108c3e1>] kswapd+0x3d1/0x55d
[<c10449b5>] kthread+0x66/0x6b
[<c1003fdf>] kernel_thread_helper+0x7/0x10
[<ffffffff>] 0xffffffff
Two things are needed to fix this. First we need a method to tell
fsnotify_create_event() to use GFP_NOFS and second we need to stop using
one global IN_IGNORED event and allocate them one at a time. This solves
current issues with multiple IN_IGNORED on a queue having tail drop
problems and simplifies the allocations since we don't have to worry about
two tasks opperating on the IGNORED event concurrently.
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify drops new events when they are the same as the tail event on the
queue to be sent to userspace. The problem is that if the event comes with
a path we forget to break out of the switch statement and fall into the
code path which matches on events that do not have any type of file backed
information (things like IN_UNMOUNT and IN_Q_OVERFLOW). The problem is
that this code thinks all such events should be dropped. Fix is to add a
break.
Signed-off-by: Eric Paris <eparis@redhat.com>
inotify drops events if the last event on the queue is the same as the
current event. But it does 2 things wrong. First it is comparing old->inode
with new->inode. But after an event if put on the queue the ->inode is no
longer allowed to be used. It's possible between the last event and this new
event the inode could be reused and we would falsely match the inode's memory
address between two differing events.
The second problem is that when a file is removed fsnotify is passed the
negative dentry for the removed object rather than the postive dentry from
immediately before the removal. This mean the (broken) inotify tail drop code
was matching the NULL ->inode of differing events.
The fix is to check the file name which is stored with events when doing the
tail drop instead of wrongly checking the address of the stored ->inode.
Reported-by: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify doens't give the user anything. If someone chooses inotify or
dnotify it should build fsnotify, if they don't select one it shouldn't be
built. This patch changes fsnotify to be a def_bool=n and makes everything
else select it. Also fixes the issue people complained about on lwn where
gdm hung because they didn't have inotify and they didn't get the inotify
build option.....
Signed-off-by: Eric Paris <eparis@redhat.com>
inotify_update_watch could leave things in a horrid state on a number of
error paths. We could try to remove idr entries that didn't exist, we
could send an IN_IGNORED to userspace for watches that don't exist, and a
bit of other stupidity. Clean these up by doing the idr addition before we
put the mark on the inode since we can clean that up on error and getting
off the inode's mark list is hard.
Signed-off-by: Eric Paris <eparis@redhat.com>
inotify_add_watch had a couple of problems. The biggest being that if
inotify_add_watch was called on the same inode twice (to update or change the
event mask) a refence was taken on the original inode mark by
fsnotify_find_mark_entry but was not being dropped at the end of the
inotify_add_watch call. Thus if inotify_rm_watch was called although the mark
was removed from the inode, the refcnt wouldn't hit zero and we would leak
memory.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The inotify rewrite forgot to drop the inotify watch use cound when a watch
was removed. This means that a single inotify fd can only ever register a
maximum of /proc/sys/fs/max_user_watches even if some of those had been
freed.
Signed-off-by: Eric Paris <eparis@redhat.com>
The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
too early; this could potentially allow another thread to call get_write_access
on the buffer head, modify the data, and dirty it, and allowing the wrong data
to be written into the journal. Fortunately, if we lose this race, the only
time this will actually cause filesystem corruption is if there is a system
crash or other unclean shutdown of the system before the next commit can take
place.
Signed-off-by: dingdinghua <dingdinghua85@gmail.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
ocfs2_get_block() does no allocation. Hole filling for writes should
have happened farther up in the call chain. We detect this case and
print an error, but we then continue with the function. We should be
exiting immediately.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
A typo caused ocfs2_write_cluster() to return 0 in some error cases.
Fix it.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: Fix incorrect parameters to v9fs_file_readn.
9p: Possible regression in p9_client_stat
9p: default 9p transport module fix
gcc 4.4.1 generates the following build warning on i386:
CC [M] fs/ocfs2/xattr.o
fs/ocfs2/xattr.c: In function ‘ocfs2_xattr_block_get’:
fs/ocfs2/xattr.c:1055: warning: ‘block_off’ may be used uninitialized in this function
The following fix is based on a similar approach by David Howells
few days back: http://lkml.org/lkml/2009/7/9/109,
Signed-off-by: Subrata Modak<subrata@linux.vnet.ibm.com>,
Signed-off-by: Joel Becker <joel.becker@oracle.com>
generic_write_checks() expects count to be initialized to the size of
the write. Writes to files open with O_DIRECT|O_LARGEFILE write 0 bytes
because count is uninitialized.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
...otherwise, we'll leak this memory if we have to reconnect (e.g. after
network failure).
Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
be a relict from some old days and setting disksize in this function does not
make much sence. Currently it was set only by ext3_getblk(). Since the
parameter has some effect only if create == 1, it is easy to check that the
three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.
Signed-off-by: Jan Kara <jack@suse.cz>
The following race can happen:
CPU1 CPU2
checkpointing code checks the buffer, adds
it to an array for writeback
do_get_write_access()
...
lock_buffer()
unlock_buffer()
flush_batch() submits the buffer for IO
__jbd_journal_file_buffer()
So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment. The similar problem can happen with
the journal_get_create_access() path.
We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.
Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.
Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Contents of long symlinks is written via standard write methods. So when the
write fails, we add inode to orphan list. But symlinks don't have .truncate
method defined so nobody properly removes them from the orphan list (both on
disk and in memory).
Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
(which is saner anyway since we don't need anything vmtruncate() does except
from calling .truncate in these paths). We also add inode to orphan list only
if ext3_can_truncate() is true (currently, it can be false for symlinks when
there are no blocks allocated) - otherwise orphan list processing will complain
and ext3_truncate() will not remove inode from on-disk orphan list.
Signed-off-by: Jan Kara <jack@suse.cz>
Due to on disk corruption, it can happen that journal is too short. Fail
to load it in such case so that we don't oops somewhere later.
Reported-by: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing/function-profiler: do not free per cpu variable stat
tracing/events: Move TRACE_SYSTEM outside of include guard
Fix v9fs_vfs_readpage. The offset and size parameters to v9fs_file_readn
were interchanged and hence passed incorrectly.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
In the tcp_connect_to_sock() error exit path, the socket
allocated at the top of the function was not being freed.
Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
fs/Kconfig file was split into individual fs/*/Kconfig files before
nilfs was merged. I've found the current config entry of nilfs is
tainting the work. Sorry, I didn't notice. This fixes the violation.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix race between write_metadata_buffer and get_write_access
ext4: Fix ext4_mb_initialize_context() to initialize all fields
ext4: fix null handler of ioctls in no journal mode
ext4: Fix buffer head reference leak in no-journal mode
ext4: Move __ext4_journalled_writepage() to avoid forward declaration
ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
ext4: Don't look at buffer_heads outside i_size.
ext4: Fix goal inum check in the inode allocator
ext4: fix no journal corruption with locale-gen
ext4: Calculate required journal credits for inserting an extent properly
ext4: Fix truncation of symlinks after failed write
jbd2: Fix a race between checkpointing code and journal_get_write_access()
ext4: Use rcu_barrier() on module unload.
ext4: naturally align struct ext4_allocation_request
ext4: mark several more functions in mballoc.c as noinline
ext4: Fix potential reclaim deadlock when truncating partial block
jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
ext4: Fix type warning on 64-bit platforms in tracing events header
The function jbd2_journal_write_metadata_buffer() calls
jbd_unlock_bh_state(bh_in) too early; this could potentially allow
another thread to call get_write_access on the buffer head, modify the
data, and dirty it, and allowing the wrong data to be written into the
journal. Fortunately, if we lose this race, the only time this will
actually cause filesystem corruption is if there is a system crash or
other unclean shutdown of the system before the next commit can take
place.
Signed-off-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>