Adding a spare to a raid10 doesn't cause recovery to start.
This is due to an silly type in
commit 6c2fce2ef6
and so is a bug in 2.6.27 and .28-rc.
Thanks to Thomas Backlund for bisecting to find this.
Cc: Thomas Backlund <tmb@mandriva.org>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.
This makes moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.
For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.
So remove the test on chunk_size from common can, and add it in just
the places where it is needed: raid10 and raid4/5/6.
Signed-off-by: NeilBrown <neilb@suse.de>
Since all bio_split calls refer the same single bio_split_pool, the bio_split
function can use bio_split_pool directly instead of the mempool_t parameter;
then the mempool_t parameter can be removed from bio_split param list, and
bio_split_pool is only referred in fs/bio.c file, can be marked static.
Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...
* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().
* {disk|all}_stat_*() are gone.
* part_round_stats() is updated similary. It handles part0 stats
automatically and disk_round_stats() is killed.
* part_{inc|dec}_in_fligh() is implemented which automatically updates
part0 stats for parts other than part0.
* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.
* Separate stats show code paths for disk are collapsed into part
stats show code paths.
* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters. It's unclear
whether the underbarred ones assume that preemtion is disabled on
entry as some callers don't do that.
This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access). diskstats access
should always be enclosed between the two functions. As such, there's
no need for the versions which disables preemption. They're removed
and double underbars ones are renamed to drop the underbars. As an
extra argument is added, there's no danger of using the old version
unconverted.
disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now has @cpu
argument to help RT.
This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others. Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The raid10 resync/recovery code currently limits the amount of
in-flight resync IO to 2Meg. This was copied from raid1 where
it seems quite adequate. However for raid10, some layouts require
a bit of seeking to perform a resync, and allowing a larger buffer
size means that the seeking can be significantly reduced.
There is probably no real need to limit the amount of in-flight
IO at all. Any shortage of memory will naturally reduce the
amount of buffer space available down to a set minimum, and any
concurrent normal IO will quickly cause resync IO to back off.
The only problem would be that normal IO has to wait for all resync IO
to finish, so a very large amount of resync IO could cause unpleasant
latency when normal IO starts up.
So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
available) which seems to be a good amount. Also reduce the amount
of memory reserved as there is no need to keep 2Meg just for resync if
memory is tight.
Thanks to Keld for the suggestion.
Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md:
md: raid10: wake up frozen array
md: do not count blocked devices as spares
md: do not progress the resync process if the stripe was blocked
md: delay notification of 'active_idle' to the recovery thread
md: fix merge error
md: move async_tx_issue_pending_all outside spin_lock_irq
When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect. This causes the array to
remain frozen for eternity. We add a wake_up
to allow the array to de-freeze. This code is
nearly identical to the raid1 code, which has
this fix already.
Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-linus' of git://neil.brown.name/md: (52 commits)
md: Protect access to mddev->disks list using RCU
md: only count actual openers as access which prevent a 'stop'
md: linear: Make array_size sector-based and rename it to array_sectors.
md: Make mddev->array_size sector-based.
md: Make super_type->rdev_size_change() take sector-based sizes.
md: Fix check for overlapping devices.
md: Tidy up rdev_size_store a bit:
md: Remove some unused macros.
md: Turn rdev->sb_offset into a sector-based quantity.
md: Make calc_dev_sboffset() return a sector count.
md: Replace calc_dev_size() by calc_num_sectors().
md: Make update_size() take the number of sectors.
md: Better control of when do_md_stop is allowed to stop the array.
md: get_disk_info(): Don't convert between signed and unsigned and back.
md: Simplify restart_array().
md: alloc_disk_sb(): Return proper error value.
md: Simplify sb_equal().
md: Simplify uuid_equal().
md: sb_equal(): Fix misleading printk.
md: Fix a typo in the comment to cmd_match().
...
This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.
The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()
This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.
This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.
So convert all to return 0 for success or -errno on failure
and fix call sites to match.
Signed-off-by: Neil Brown <neilb@suse.de>
i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill an vacant slot in an and md array.
Signed-off-by: Neil Brown <neilb@suse.de>
If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.
Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keep track of how in sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recovery all blocks.
This fix is already in place for raid1, but not raid5/6 or raid10.
So without this fix, a raid1 ir raid4/5/6 array with version 1.x
metadata and a write intent bitmaps, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.
If you might have an array like that, issueing
echo repair > /sys/block/mdXX/md/sync_action
will make sure recovery completes properly.
Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.
For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q->__queue_lock. So always initialise that lock when allocated.
With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.
Thanks to Dan Williams for review and fixing the silly bugs.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allows a userspace metadata handler to take action upon detecting a device
failure.
Based on an original patch by Neil Brown.
Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MD drivers use one printk() call to print 2 log messages and the second line
may be prefixed by a TAB character. It may also output a trailing space
before newline. klogd (I think) turns the TAB character into the 2 characters
'^I' when logging to a file. This looks ugly.
Instead of a leading TAB to indicate continuation, prefix both output lines
with 'raid:' or similar. Also remove any trailing space in the vicinity of
the affected code and consistently end the sentences with a period.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using new scsi fault injection framework.
Abstract:
When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.
This case, the raid array has already been broken and it may not matter. But
I think stall is not preferable. If it occurs, even shutdown or reboot will
fail because of resource busy.
The deadlock mechanism:
The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering. The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finish a BIO in
end_sync_write().
If building a BIO fails for some reasons in sync_request(), the "remaining"
should be decremented if it has already been incremented. I found a case
where this decrement is forgotten. This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.
For example, this problem would be reproduced in the following case:
Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec
This case, sdf1 is recovering, sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec
2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]
The resync thread keeps working, but actually it is deadlocked.
Patch:
By this patch, the remaining counter will be decremented if needed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.
If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.
This patch fixes the problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.
Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.
Some advantages:
The fastest part which is the outer sectors of the disks involved will be
used. The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.
Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.
Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.
If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.
Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.
This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).
However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.
So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.
Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.
Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently an md array with a write-intent bitmap does not updated that bitmap
to reflect successful partial resync. Rather the entire bitmap is updated
when the resync completes.
This is because there is no guarentee that resync requests will complete in
order, and tracking each request individually is unnecessarily burdensome.
However there is value in regularly updating the bitmap, so add code to
periodically pause while all pending sync requests complete, then update the
bitmap. Doing this only every few seconds (the same as the bitmap update
time) does not notciably affect resync performance.
[snitzer@gmail.com: export bitmap_cond_end_sync]
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.
Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.
Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.
While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When writing to a broken array, raid10 currently happily emits empty bio
lists. IOW, the master bio will never be completed, sending writers to
UNINTERRUPTIBLE_SLEEP forever.
Signed-off-by: Arne Redlich <agr@powerkom-dd.de>
Acked-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of read errors raid10d tries to print a nice error message,
unfortunately using data from an already put bio.
Signed-off-by: Maik Hampel <m.hampel@gmx.de>
Acked-By: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.
write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.
When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.
bitmap_update_sb returns an error that is never checked.
So make these 'void' and be consistent about checking the bit.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
garbage can get synced on top of good data.
2/ We round the wrong way in part of the device size calculation, which
can cause confusion.
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two errors that can lead to recovery problems with raid10
when used in 'far' more (not the default).
Due to a '>' instead of '>=' the wrong block is located which would result in
garbage being written to some random location, quite possible outside the
range of the device, causing the newly reconstructed device to fail.
The device size calculation had some rounding errors (it didn't round when it
should) and so recovery would go a few blocks too far which would again cause
a write to a random block address and probably a device error.
The code for working with device sizes was fairly confused and spread out, so
this has been tided up a bit.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.
Fixing this in raid1 and raid10 seems to be straightforward enough.
For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Lars Ellenberg <lars@linbit.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks Jens for alerting me to this.
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
drivers/md/raid1.c:1479: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid10.c:1475: warning: long long unsigned int format, long unsigned int arg (arg 4)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Two less-used md personalities have bugs in the calculation of ->degraded (the
extent to which the array is degraded).
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
raid1, raid10 and multipath don't report their 'congested' status through
bdi_*_congested, but should.
This patch adds the appropriate functions which just check the 'congested'
status of all active members (with appropriate locking).
raid1 read_balance should be modified to prefer devices where
bdi_read_congested returns false. Then we could use the '&' branch rather
than the '|' branch. However that should would need some benchmarking first
to make sure it is actually a good idea.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The error handling routines don't use proper locking, and so two concurrent
errors could trigger a problem.
So:
- use test-and-set and test-and-clear to synchonise
the In_sync bits with the ->degraded count
- use the spinlock to protect updates to the
degraded count (could use an atomic_t but that
would be a bigger change in code, and isn't
really justified)
- remove un-necessary locking in raid5
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It isn't needed as mddev->degraded contains equivalent info.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
Some device state has changed requiring superblock update
on all devices.
MD_CHANGE_CLEAN
The array has transitions from 'clean' to 'dirty' or back,
requiring a superblock update on active devices, but possibly
not on spares
MD_CHANGE_PENDING
A superblock update is underway.
We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.
Stop exporting md_update_sb - isn't needed.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
raid10d has toooo many nested block, so take the fix_read_error functionality
out into a separate function.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is generally useful, but particularly helps see if it is the same sector
that always needs correcting, or different ones.
[akpm@osdl.org: fix printk warnings]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The size calculation made assumtion which the new offset mode didn't
follow. This gets the size right in all cases.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The "industry standard" DDF format allows for a stripe/offset layout where
data is duplicated on different stripes. e.g.
A B C D
D A B C
E F G H
H E F G
(columns are drives, rows are stripes, LETTERS are chunks of data).
This is similar to raid10's 'far' mode, but not quite the same. So enhance
'far' mode with a 'far/offset' option which follows the layout of DDFs
stripe/offset.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>