The end_bio routines are changed to take a pointer to the extent state
struct, and the state tree is walked in order to set/clear appropriate
bits as IO completes. This greatly reduces the number of rbtree searches
done by the end_bio handlers, and reduces lock contention.
The extent_io releasepage function is changed to avoid expensive searches
for locked state.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.
The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.
A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.
lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
test_range_bit doesn't properly handle the case: there's a hole at the
end of the range and there's no other extent_state after the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_free_objectid may return a used objectid due to arithmetic
underflow. This bug may happen when parameter 'root' is tree root, so
it may cause serious problems when creating snapshot or sub-volume.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using ilookup5 during data=ordered writeback could deadlock on I_LOCK. This
saves a pointer to the inode instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds readonly inode flag support. A file with this flag
can't be modified, but can be deleted.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While shrinking the FS, the allocation functions need to make sure
they don't try to allocate bytes past the end of the FS.
nodatacow needed an extra check to force cows when the existing extents are
past the end of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is very difficult to create a consistent snapshot of the btree when
other writers may update the btree before the commit is done.
This changes the snapshot creation to happen during the commit, while
no other updates are possible.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The shrinking code used btrfs_next_leaf to find the next item, but
this does not cow the blocks it touches. This fix calls search_slot after
finding the next item to do appropriate cow and balancing.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The patch fixes the overlapping extent issue in shrink_extent_tree.
It checks whether there is an overlapping extent by using
find_previous_extent. If there is an overlapping extent, it setups
key.objectid and cur_byte properly.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is intended to prevent accidentally filling the drive. A determined
user can still make things oops.
It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed the offset into the extent was incorrectly being added to the
extent start before trying to find it in the extent allocation tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.
In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
One of my old patches introduces a new bug to
btrfs_drop_extents(changeset 275). Inline extents are not truncated
properly when "extent_end == end", it can trigger the BUG_ON at
file.c:600. I hope I don't introduce new bug this time.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Hello everybody,
compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.
Signed-off-by: Christian Hesse <mail@earthworm.de>
--
Regards,
Chris
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/x-diff; charset="iso-8859-1";
name="btrfs-section_mismatches.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="btrfs-section_mismatches.patch"
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The codes that fixup the right leaf and the codes that dirty the
extnet buffer use the variable 'right_nritems' , both of them expect
'right_nritems' is the number of items in right leaf after the push.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There was a slight problem with ACL's returning EINVAL when you tried to set an
ACL. This isn't correct, we should be returning EOPNOTSUPP, so I did a very
ugly thing and just commented everybody out and made them return EOPNOTSUPP.
This is only temporary, I'm going back to implement ACL's, but Chris wants to
push out a release so this will suffice for now.
Also Yan suggested setting reada to -1 in the delete case to enable backwards
readahead, and in the listxattr case I moved path->reada = 2; to after the if
(!path) check so we can avoid a possible null dereference. Thank you,
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds a new parameter 'full_scan' to 'find_search_start',
thereby 'find_search_start' can know whether 'find_free_extent' is in
full scan phrase. I feel that 'find_search_start' should skip calling
'btrfs_find_block_group' when 'find_free_extent' is in full scan
phrase. In my test on a 2GB volume, Oops occurs when space usage is
about 76%. After apply the patch, Oops occurs when space usage is
near 100%.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group', the last parameter of the function indicates
pin vs unpin.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When 'item_end' is equal to 'inode->i_size', 'found_type' is updated
and current item is skipped. This behavior is correct for extent item,
but incorrect for csum item. For example, there is a csum item with
'offset == 0'. When deleting the inode, 'inode->i_size' is set to 0,
so the csum item isn't deleted.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't set hint_byte to EXTENT_MAP_INLINE when 'end == extent_end' or
'start == key.offset' . The inline extent will be truncated in these
cases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When calculating the size of inline extent, inode->i_size should also
be take into consideration, otherwise sys_write may drop some data
silently. You can test this bug by:
#dd if=/dev/zero bs=4k count=1 of=test_file
#dd if=/dev/zero bs=2k count=1 of=test_file conv=notrunc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When pin_down_bytes decides not to pin a block because it was from the
current transaction, make sure the in memory cache of free extents is updated
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a 'finish_wait', but no 'prepare_to_wait' . So I think that
the 'prepare_to_wait' is missing. The second change is according to
the name of variable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fixes do a number of things:
1) Most btrfs_drop_extent callers will try to leave the inline extents in
place. It can truncate bytes off the beginning of the inline extent if
required.
2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.
3) btrfs_truncate_in_transaction truncates inline extents
4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Execution should goto label 'insert' when 'btrfs_next_leaf' return a
non-zero value, otherwise the parameter 'slot' for
'btrfs_item_key_to_cpu' may be out of bounds. The original codes jump
to label 'insert' only when 'btrfs_next_leaf' return a negative
value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1. Reorder kmap and the test for 'page != NULL'
2. Zero-fill rest area of a block when inline extent isn't big enough.
3. Do not insert extent_map into the map tree when page == NULL.
(If insert the extent_map into the map tree, subsequent read requests
will find it in the map tree directly and the corresponding inline
extent data aren't copied into page by the the get_extent function.
extent_read_full_page can't handle that case)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ENOTEMPTY check in btrfs_rmdir isn't reliable. It's possible that
the backward search finds . or .. at first, then some other directory
entry. In that case, btrfs_rmdir delete . or .. improperly. The
patch also fixes a fs_mutex unlock issue in btrfs_rmdir.
--
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1) Forced defrag wasn't working properly (btrfsctl -d) because some
cache only checks were incorrect.
2) Defrag only the leaves unless in forced defrag mode.
3) Don't use complex logic to figure out if a leaf is needs defrag
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This modifies inline extent size calculation, so that
insert_inline_extent can handle the case that parameter 'offset' is
not zero; it also a few codes to zero uninitialized area in inline
extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When making room for a new item, it is ok to create an empty leaf, but
when making room to extend an item, split_leaf needs to make sure it
keeps the item we're extending in the path and make sure we don't end up
with an empty leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reduces the number of calls to btrfs_extend_item and greatly lowers
the cpu usage while writing large files.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just use kobject_set_name(), that works in all kernels (I think...).
Kernels newer than 2.6.23 currently fail with:
/home/axboe/git/btrfs/btrfs-unstable/sysfs.c:188: error: unknown field
'name' specified in initializer
Signed-off-by: Chris Mason <chris.mason@oracle.com>
endio handling is typically called with interrupts disabled, but can
also be called with it enabled. So save interrupts before using KM_IRQ0
to be completely safe.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It now returns void and it is never called for partial completions, so
the bio->bi_size check must go.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to defrag huge directories, but skip the expensive defrag
case in more common usage, where it does not help as much.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I think check whether extent is a hole before update 'inode->i_blocks'
is unconditional required. (original codes check it only when
del_item isn't equal to 0)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
found_type has already been decreased by codes above the change, I
think decrease it by one again doesn't make sense.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_extent would fail to wrap around to the start of the drive because
it was doing the enospc case checking twice in some cases, causing it
to return -ENOSPC early.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_btree_balance_dirty is changed to pass the number of pages dirtied
for more accurate dirty throttling. This lets the VM make better decisions
about when to force some writeback.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Cache block group was overly complex and missed free blocks at the very start
of the group. This patch simplifies things significantly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a helper per ioctl function to make the code more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
No reason to grab the BKL before calling into the btrfs ioctl code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Dead roots are trees left over after a crash, and they were either in the
process of being removed or were waiting to be removed when the box crashed.
Before, a search of the entire tree of root pointers was done on mount
looking for dead roots. Now, the search is done the first time we load
a root.
This makes mount faster when there are a large number of snapshots, and it
enables the block accounting code to properly update the block counts on
the latest root as old versions of the root are reaped after a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
XFS updates the ondisk inode size only after the data I/O has finished,
so it needs a hook when the writepage end_bio handler has finished.
Might not be worth applying as-is as the per-page callback is very
ineffcient. What XFS really wants is a callback when writeout of a
whole extent has completed. This delayed i_size updates scheme might
be worthwile for btrfs aswell, btw.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The writepage_io is not mandatory, e.g. my port of xfs to the extent_map
code does not have one for now. So handle a NULL pointer gracefully
here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
generic_bmap is completely trivial, while the extent to bh mapping in
btrfs is rather complex. So provide a extent_bmap instead that takes
a get_extent callback and can be used by filesystem using the extent_map
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bio completion handlers can be run in any context, e.g. when using
the old ide driver they run in hardirq context with irqs disabled so
lockdep rightfully warns about using write_lock_irq useage in these
handlers.
This patch switches clear_extent_bit and set_extent_bit to
write_lock_irqsave to fix this problem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that set_extent_bit was exiting too early when there
was a hole in the map. The fix is to reorder the tests to check for the
hole first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
File data checksums are only done during writepage, so we have to make sure
all pages are written when the snapshot is taken. This also adds some
locking so that new writes don't race in and add new dirty pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Modified form of original patch from Christoph Hellwig to make btrfs
mount into the default subvolume by default.
mount /dev/somedevice:subvolumename to get other subvolumes or
mount /dev/somedevice:. to get the root
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows the tree walking code to defrag only the newly allocated
buffers, it seems to be a good balance between perfect defragging and the
performance hit of repeatedly reallocating blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.
File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, snapshot deletion was a single atomic unit. This caused considerable
lock contention and required an unbounded amount of space. Now,
the drop_progress field in the root item is used to indicate how far along
snapshot deletion is, and to resume where it left off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Almost none of the files including module.h need to do so,
remove them.
Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The super block written during commit was not consistent with the state of
the trees. This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Attaching below is some of the code cleanups that i came across while
reading the code.
a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
the in memory copy ext4_inode as the disk copy
Signed-off-by: Chris Mason <chris.mason@oracle.com>