android_kernel_xiaomi_sm8350

Author	SHA1	Message	Date
Chandra Seetharaman	bab7cfc733	[SCSI] scsi_dh: Add a single threaded workqueue for initializing paths Before this patch set (SCSI hardware handlers), initialization of a path was done asynchronously. Doing that requires a workqueue in each device/hardware handler module and leads to unneccessary complication in the device handler code, making it difficult to read the code and follow the state diagram. Moving that workqueue to this level makes the device handler code simpler. Hence, the workqueue is moved to dm level. A new workqueue is added instead of adding it to the existing workqueue (kmpathd) for the following reasons: 1. Device activation has to happen faster, stacking them along with the other workqueue might lead to unnecessary delay in the activation of the path. 2. The effect could be felt the other way too. i.e the current events that are handled by the existing workqueue might get a delayed response. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
Chandra Seetharaman	cfae5c9bb6	[SCSI] scsi_dh: Use SCSI device handler in dm-multipath This patch converts dm-mpath to use scsi device handlers instead of dm's hardware handlers. This patch does not add any new functionality. Old behaviors remain and userspace tools work as is except that arguments supplied with hardware handler are ignored. One behavioral exception is: Activation of a path is synchronous in this patch, opposed to the older behavior of being asynchronous (changed in patch 07: scsi_dh: Add a single threaded workqueue for initializing a path) Note: There is no need to get a reference for the device handler module (as it was done in the dm hardware handler case) here as the reference is held when the device was first found. Instead we check and make sure that support for the specified device is present at table load time. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
NeilBrown	dfc7064500	md: restart recovery cleanly after device failure. When we get any IO error during a recovery (rebuilding a spare), we abort the recovery and restart it. For RAID6 (and multi-drive RAID1) it may not be best to restart at the beginning: when multiple failures can be tolerated, the recovery may be able to continue and re-doing all that has already been done doesn't make sense. We already have the infrastructure to record where a recovery is up to and restart from there, but it is not being used properly. This is because: - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR, which causes the recovery not be be checkpointed. - We remove spares and then re-added them which loses important state information. The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't needed. If there is an error, the relevant drive will be marked as Faulty, and that is enough to ensure correct handling of the error. So we first remove MD_RECOVERY_ERR, changing some of the uses of it to MD_RECOVERY_INTR. Then we cause the attempt to remove a non-faulty device from an array to fail (unless recovery is impossible as the array is too degraded). Then when remove_and_add_spares attempts to remove the devices on which recovery can continue, it will fail, they will remain in place, and recovery will continue on them as desired. Issue: If we are halfway through rebuilding a spare and another drive fails, and a new spare is immediately available, do we want to: 1/ complete the current rebuild, then go back and rebuild the new spare or 2/ restart the rebuild from the start and rebuild both devices in parallel. Both options can be argued for. The code currently takes option 2 as a/ this requires least code change b/ this results in a minimally-degraded array in minimal time. Cc: "Eivind Sarto" <ivan@kasenna.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Bernd Schubert	90b08710e4	md: allow parallel resync of md-devices. In some configurations, a raid6 resync can be limited by CPU speed (Calculating P and Q and moving data) rather than by device speed. In these cases there is nothing to be gained byt serialising resync of arrays that share a device, and doing the resync in parallel can provide benefit. So add a sysfs tunable to flag an array as being allowed to resync in parallel with other arrays that use (a different part of) the same device. Signed-off-by: Bernd Schubert <bs@q-leap.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Dan Williams	4f54b0e948	md: notify userspace on 'stop' events This additional notification to 'array_state' is needed to allow the monitor application to learn about stop events via sysfs. The sysfs_notify("sync_action") call that comes at the end of do_md_stop() (via md_new_event) is insufficient since the 'sync_action' attribute has been removed by this point. (Seems like a sysfs-notify-on-removal patch is a better fix. Currently removal updates the event count but does not wake up waiters) Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
NeilBrown	09a44cc150	md: notify userspace on 'write-pending' changes to array_state When an array enters write pending, 'array_state' changes, so we must be sure to sysfs_notify. Also, when waiting for user-space to acknowledge 'write-pending' by marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to be cleared as that might not happen. So explicity test for the bits that we are really interested in. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
NeilBrown	698b18c1e8	md: raid1: Fix restoration of bio between failed read and write. When performing a "recovery" or "check" pass on a RAID1 array, we read from each device and possible, if there is a difference or a read error, write back to some devices. We use the same 'bio' for both read and write, resetting various fields between the two operations. We forgot to reset bv_offset and bv_len however. These are often left unchanged, but in the case where there is an IO error one or two sectors into a page, they are changed. This results in correctable errors not being corrected properly. It does not result in any data corruption. Cc: "Fairbanks, David" <David.Fairbanks@stratus.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Bernd Schubert	6be9d49401	md: md: raid5 rate limit error printk Last night we had scsi problems and a hardware raid unit was offlined during heavy i/o. While this happened we got for about 3 minutes a huge number messages like these Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2). I guess the high error rate is responsible for not scheduling other events - during this time the system was not pingable and in the end also other devices run into scsi command timeouts causing problems on these unrelated devices as well. Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Christoph Hellwig	6bcfd60186	md: kill file_path wrapper Kill the trivial and rather pointless file_path wrapper around d_path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:09 -07:00
NeilBrown	84255d1018	md: fix possible oops when removing a bitmap from an active array It is possible to add a write-intent bitmap to an active array, or remove the bitmap that is there. When we do with the 'quiesce' the array, which causes make_request to block in "wait_barrier()". However we are sampling the value of "mddev->bitmap" before the wait_barrier call, and using it afterwards. This can result in using a bitmap structure that has been freed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:09 -07:00
Neil Brown	e7e72bf641	Remove blkdev warning triggered by using md As setting and clearing queue flags now requires that we hold a spinlock on the queue, and as blk_queue_stack_limits is called without that lock, get the lock inside blk_queue_stack_limits. For blk_queue_stack_limits to be able to find the right lock, each md personality needs to set q->queue_lock to point to the appropriate lock. Those personalities which didn't previously use a spin_lock, us q->__queue_lock. So always initialise that lock when allocated. With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no longer cause warnings as it will be clear that the proper lock is held. Thanks to Dan Williams for review and fixing the silly bugs. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alistair John Strachan <alistair@devzero.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Jacek Luczak <difrost.kernel@gmail.com> Cc: Prakash Punnoor <prakash@punnoor.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-14 19:11:15 -07:00
Dan Williams	c8894419ac	md: fix raid5 'repair' operations commit `bd2ab67030` "md: close a livelock window in handle_parity_checks5" introduced a bug in handling 'repair' operations. After a repair operation completes we clear the state bits tracking this operation. However, they are cleared too early and this results in the code deciding to re-run the parity check operation. Since we have done the repair in memory the second check does not find a mismatch and thus does not do a writeback. Test results: $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 51072 $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 0 (also fix incorrect indentation) Cc: <stable@kernel.org> Tested-by: George Spelvin <linux@horizon.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-13 08:02:24 -07:00
Harvey Harrison	cb6969e8cd	misc: fix integer as NULL pointer warnings drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-08 10:46:55 -07:00
Dan Williams	6bfe0b4990	md: support blocking writes to an array on device failure Allows a userspace metadata handler to take action upon detecting a device failure. Based on an original patch by Neil Brown. Changes: -added blocked_wait waitqueue to rdev -don't qualify Blocked with Faulty always let userspace block writes -added md_wait_for_blocked_rdev to wait for the block device to be clear, if userspace misses the notification another one is sent every 5 seconds -set MD_RECOVERY_NEEDED after clearing "blocked" -kill DoBlock flag, just test mddev->external Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Dan Williams	11e2ede022	md: prevent duplicates in bind_rdev_to_array Found when trying to reassemble an active externally managed array. Without this check we hit the more noisy "sysfs duplicate" warning in the later call to kobject_add. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Dan Williams	242b363e22	md: remove a stray command from a copy and paste error in resync_start_store Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
NeilBrown	648b629ed4	md: fix up switching md arrays between read-only and read-write When setting an array to 'readonly' or to 'active' via sysfs, we must make the appropriate set_disk_ro call too. Also when switching to "read_auto" (which is like readonly, but blocks on the first write so that metadata can be marked 'dirty') we need to be more careful about what state we are changing from. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	31a59e3425	md: fix 'safemode' handling for external metadata. 'safemode' relates to marking an array as 'clean' if there has been no write traffic for a while (a couple of seconds), to reduce the chance of the array being found dirty on reboot. ->safemode is set to '1' when there have been no write for a while, and it gets set to '0' when the superblock is updates with the 'clean' flag set. This requires a few fixes for 'external' metadata: - When an array is set to 'clean' via sysfs, 'safemode' must be cleared. - when we write to an array that has 'safemode' set (there must have been some delay in updating the metadata), we need to clear safemode. - Don't try to update external metadata in md_check_recovery for safemode transitions - it won't work. Also, don't try to support "immediate safe mode" (safemode==2) for external metadata, it cannot really work (the safemode timeout can be set very low if this is really needed). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	d897dbf914	md: reinitialise more mddev fields in do_md_stop. I keep finding problems where an mddev gets reused and some fields has a value from a previous usage that confuses the new usage. So clear all fields that could possible need clearing when calling do_md_stop. Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	8377bc8080	md: skip all metadata update processing when using external metadata. All the metadata update processing for external metadata is on in user-space or through the sysfs interfaces, so make "md_update_sb" a no-op in that case. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
Dan Williams	6a51830e14	md: fix use after free when removing rdev via sysfs rdev->mddev is no longer valid upon return from entry->store() when the 'remove' command is given. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
Jens Axboe	c9a3f6d6f5	dm: use unlocked variants of queue flag check/set dm.c already provides mutual exclusion through ->map_lock. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 10:21:12 -07:00
Linus Torvalds	bd5d435a96	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Skip I/O merges when disabled block: add large command support block: replace sizeof(rq->cmd) with BLK_MAX_CDB ide: use blk_rq_init() to initialize the request block: use blk_rq_init() to initialize the request block: rename and export rq_init() block: no need to initialize rq->cmd with blk_get_request block: no need to initialize rq->cmd in prepare_flush_fn hook block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline block/elevator.c:elv_rq_merge_ok() mustn't be inline block: make queue flags non-atomic block: add dma alignment and padding support to blk_rq_map_kern unexport blk_max_pfn ps3disk: Remove superfluous cast block: make rq_init() do a full memset() relay: fix splice problem	2008-04-29 08:18:03 -07:00
Denis V. Lunev	c7705f3449	drivers: use non-racy method for proc entries creation (2) Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Neil Brown <neilb@suse.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:22 -07:00
FUJITA Tomonori	992b5bceee	block: no need to initialize rq->cmd with blk_get_request blk_get_request initializes rq->cmd (rq_init does) so the users don't need to do that. The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd, as a preparation for large command support, which changes rq->cmd from the static array to a pointer. sizeof(rq->cmd) will not make sense and &rq->cmd won't work. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:55 +02:00
Nick Piggin	75ad23bc0f	block: make queue flags non-atomic We can save some atomic ops in the IO path, if we clearly define the rules of how to modify the queue flags. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:33 +02:00
Julia Lawall	62b0559aad	drivers/md: use time_before, time_before_eq, etc The functions time_before, time_before_eq, time_after, and time_after_eq are more robust for comparing jiffies against other values. A simplified version of the semantic patch making this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @ change_compare_np @ expression E; @@ ( - jiffies <= E + time_before_eq(jiffies,E) \| - jiffies >= E + time_after_eq(jiffies,E) \| - jiffies < E + time_before(jiffies,E) \| - jiffies > E + time_after(jiffies,E) ) @ include depends on change_compare_np @ @@ #include <linux/jiffies.h> @ no_include depends on !include && change_compare_np @ @@ #include <linux/...> + #include <linux/jiffies.h> // </smpl> [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Nick Andrew	d7a420c947	raid: remove leading TAB on printk messages MD drivers use one printk() call to print 2 log messages and the second line may be prefixed by a TAB character. It may also output a trailing space before newline. klogd (I think) turns the TAB character into the 2 characters '^I' when logging to a file. This looks ugly. Instead of a leading TAB to indicate continuation, prefix both output lines with 'raid:' or similar. Also remove any trailing space in the vicinity of the affected code and consistently end the sentences with a period. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Dan Williams	4ef197d87a	md: raid5.c convert simple_strtoul to strict_strtoul strict_strtoul handles the open-coded sanity checks in raid5_store_stripe_cache_size and raid5_store_preread_threshold Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Dan Williams	8b3e6cdc53	md: introduce get_priority_stripe() to improve raid456 write performance Improve write performance by preventing the delayed_list from dumping all its stripes onto the handle_list in one shot. Delayed stripes are now further delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when: * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list' * 'handle_list' is empty and i/o is being done to satisfy full stripe-width write requests * 'bypass_count' is less than 'bypass_threshold'. By default the threshold is 1, i.e. every other stripe handled is a preread stripe provided the top two conditions are false. Benchmark data: System: 2x Xeon 5150, 4x SATA, mem=1GB Baseline: 2.6.24-rc7 Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512) Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS) * patched: +13% * patched + preread_bypass_threshold = 0: +37% Changes since v1: * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and make it configurable. This defaults to fairness and modest performance gains out of the box. Changes since v2: * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not necessary, the important change was clearing STRIPE_DELAYED in add_stripe_bio and this has been moved out to make_request for the hang fix. * [neilb@suse.de]: simplify get_priority_stripe * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is sampled empty (+11%) * [dan.j.williams@intel.com]: decrement the bypass_count at the detection of stripes being naturally promoted off of hold_list +2%. Note, resetting bypass_count instead of decrementing on these events yields +4% but that is probably too aggressive. Changes since v3: * cosmetic fixups Tested-by: James W. Laferriere <babydr@baby-dragons.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Harvey Harrison	e46b272b66	md: replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Harvey Harrison	9a7b2b0f36	md: fix integer as NULL pointer warnings in md.c drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer Add some braces to match the else-block as well. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Frederik Deweerdt	cf13ab8e02	dm: remove md argument from specific_minor The small patch below: - Removes the unused md argument from both specific_minor() and next_free_minor() - Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev() This has been compile tested on x86. Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:27:02 +01:00
Adrian Bunk	4fdfe401e9	dm table: remove unused dm_create_error_table dm_create_error_table() was added in kernel 2.6.18 and never used... Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:27:00 +01:00
Adrian Bunk	e8488d0858	dm table: drop void suspend_targets return void returning functions returned the return value of another void returning function... Spotted by sparse. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:59 +01:00
Mikulas Patocka	7ff14a3615	dm: unplug queues in threads Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O. It is specified that any submitted bio without BIO_RW_SYNC flag may plug the queue (i.e. block the requests from being dispatched to the physical device). The queue is unplugged when the caller calls blk_unplug() function. Usually, the sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs the queue and waits (to be possibly joined with other adjacent bios). Then, when the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to the disk. This was happenning: When doing O_SYNC writes, function fsync_buffers_list() submits a list of bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is woken up. fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but there are no bios on the device queue as they are still in the dm_raid1 queue. wait_on_buffer() starts waiting until the IO is finished. kmirrord is scheduled, kmirrord takes bios and submits them to the devices. The submitted bio plugs the harddisk queue but there is no one to unplug it. (The process that called wait_on_buffer() is already sleeping.) So there is a 3ms timeout, after which the queues on the harddisks are unplugged and requests are processed. This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes), dm-raid1 is 10 times slower than md raid1. Every time we submit something asynchronously via dm_io, we must unplug the queue actually to send the request to the device. This patch adds an unplug call to kmirrord - while processing requests, it keeps the queue plugged (so that adjacent bios can be merged); when it finishes processing all the bios, it unplugs the queue to submit the bios. It also fixes kcopyd which has the same potential problem. All kcopyd requests are submitted with BIO_RW_SYNC. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-25 13:26:57 +01:00
Mikulas Patocka	a2aebe03be	dm raid1: use timer This patch replaces the schedule() in the main kmirrord thread with a timer. The schedule() could introduce an unwanted delay when work is ready to be processed. The code instead calls wake() when there's work to be done immediately, and delayed_wake() after a failure to give a short delay before retrying. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:56 +01:00
Alasdair G Kergon	a765e20eeb	dm: move include files Publish the dm-io, dm-log and dm-kcopyd headers in include/linux. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:55 +01:00
Alasdair G Kergon	2d1e580afe	dm kcopyd: rename Rename kcopyd.[ch] to dm-kcopyd.[ch]. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:54 +01:00
Alasdair G Kergon	0da336e5fa	dm: expose macros Make dm.h macros and inlines available in include/linux/device-mapper.h Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:53 +01:00
Mikulas Patocka	945fa4d283	dm kcopyd: remove redundant client counting Remove client counting code that is no longer needed. Initialization and destruction is made globally from dm_init and dm_exit and is not based on client counts. Initialization allocates only one empty slab cache, so there is no negative impact from performing the initialization always, regardless of whether some client uses kcopyd or not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:52 +01:00
Mikulas Patocka	08d8757a4d	dm kcopyd: private mempool Change the global mempool in kcopyd into a per-device mempool to avoid deadlock possibilities. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:50 +01:00
Mikulas Patocka	8c0cbc2f79	dm kcopyd: per device Make one kcopyd thread per device. The original shared kcopyd could deadlock. Configuration:	2008-04-25 13:26:49 +01:00
Jonathan Brassow	2a23aa1ddb	dm log: make module use tracking internal Remove internal module reference fields from the interface. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:48 +01:00
Alasdair G Kergon	b8206bc3de	dm log: move register functions Reorder a couple of functions in the file so the next patch is readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:47 +01:00
Heinz Mauelshagen	416cd17b19	dm log: clean interface Clean up the dm-log interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:46 +01:00
Heinz Mauelshagen	eb69aca5d3	dm kcopyd: clean interface Clean up the kcopyd interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:44 +01:00
Heinz Mauelshagen	22a1ceb1e6	dm io: clean interface Clean up the dm-io interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:43 +01:00
Alasdair G Kergon	e01fd7eeb0	dm io: rename error to error_bits Rename 'error' to 'error_bits' for clarity. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:41 +01:00
Mikulas Patocka	72727bad54	dm snapshot: store pointer to target instance Save pointer to dm_target in dm_snapshot structure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:40 +01:00
Heinz Mauelshagen	769aef30f0	dm log: move dirty region log code into separate module Move the dirty region log code into a separate module so other targets can share the code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:39 +01:00
Heinz Mauelshagen	b7fd54a70f	dm log: generalise name in messages Change dm-log.c messages from "mirror log" to "dirty region log" as a new dm target wants to share this code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:38 +01:00
Robert P. J. Day	c12bfc923e	dm raid1: use list_split_init Use shorter list_splice_init() for brevity. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:36 +01:00
Milan Broz	8ee2767a59	dm snapshot: reduce default memory allocation Limit the amount of memory allocated per snapshot on systems with a large page size. (The larger default chunk size on these systems compensates for the smaller number of pages reserved.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:35 +01:00
Mikulas Patocka	924362629b	dm snapshot: fix chunksize sector conversion If a snapshot has a smaller chunksize than the page size the conversion to pages currently returns 0 instead of 1, causing: kernel BUG in mempool_resize. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-04-25 13:26:34 +01:00
Nick Andrew	fdefa4d87e	RAID: remove trailing space from printk line drivers/md/*.[ch] contains only one more printk line with a trailing space. Remove it. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>	2008-04-21 22:42:58 +00:00
Dan Williams	bd2ab67030	md: close a livelock window in handle_parity_checks5 If a failure is detected after a parity check operation has been initiated, but before it completes handle_parity_checks5 will never quiesce operations on the stripe. Explicitly handle this case by "canceling" the parity check, i.e. clear the STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh any non-uptodate blocks. Kernel versions >= 2.6.23 are susceptible. Cc: <stable@kernel.org> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-11 08:06:44 -07:00
Alasdair G Kergon	4cdc1d1fa5	dm io: write error bits form long not int write_err is an unsigned long used with set_bit() so should not be passed around as unsigned int. http://bugzilla.kernel.org/show_bug.cgi?id=10271 Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:23 -07:00
Milan Broz	3f1e9070f6	dm crypt: fix ctx pending Fix regression in dm-crypt introduced in commit `3a7f6c990a` ("dm crypt: use async crypto"). If write requests need to be split into pieces, the code must not process them in parallel because the crypto context cannot be shared. So there can be parallel crypto operations on one part of the write, but only one write bio can be processed at a time. This is not optimal and the workqueue code needs to be optimized for parallel processing, but for now it solves the problem without affecting the performance of synchronous crypto operation (most of current dm-crypt users). http://bugzilla.kernel.org/show_bug.cgi?id=10242 http://bugzilla.kernel.org/show_bug.cgi?id=10207 Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:22 -07:00
Andrew Morton	9ea85ebae1	drivers/md/raid5.c: fix printk warnings gcc-3.4.5 on sparc64: drivers/md/raid5.c: In function `raid5_end_read_request': drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4) drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3) drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3) sector_t is u64, and we don't know what type the architecture uses to implement u64 (on some it is unsigned long). Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-19 18:53:37 -07:00
NeilBrown	0e82989d95	md: remove the 'super' sysfs attribute from devices in an 'md' array Exposing the binary blob which is the md 'super-block' via sysfs doesn't really fit with the whole sysfs model, and ever since commit `8118a859dc` ("sysfs: fix off-by-one error in fill_read_buffer()") it doesn't actually work at all (as the size of the blob is often one page). (akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG) So just remove it altogether. It isn't really useful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-19 18:53:35 -07:00
NeilBrown	7be3dfec47	md: reduce CPU wastage on idle md array with a write-intent bitmap Recent patch titled Reduce CPU wastage on idle md array with a write-intent bitmap. would sometimes leave the array with dirty bitmap bits that stay dirty. A subsequent write would sort things out so it isn't a big problem, but should be fixed nonetheless. We need to make sure that when the bitmap becomes not "allclean", the daemon_sleep really does get set to a sensible value. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:19 -07:00
NeilBrown	52720ae77d	md: fix formatting error in /proc/mdstat If an md array is "auto-read-only", then this appears in /proc/mdstat as /dev/md0: active(auto-read-only) whereas if it is truely readonly, it appears as /dev/md0: active (read-only) The difference being a space. One program known to parse this file expects the space and gets badly confused. It will be fixed, but it would be best if what the kernel generates is more consistent too. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:19 -07:00
K.Tanaka	a07e6ab41b	md: the md RAID10 resync thread could cause a md RAID10 array deadlock This message describes another issue about md RAID10 found by testing the 2.6.24 md RAID10 using new scsi fault injection framework. Abstract: When a scsi error results in disabling a disk during RAID10 recovery, the resync threads of md RAID10 could stall. This case, the raid array has already been broken and it may not matter. But I think stall is not preferable. If it occurs, even shutdown or reboot will fail because of resource busy. The deadlock mechanism: The r10bio_s structure has a "remaining" member to keep track of BIOs yet to be handled when recovering. The "remaining" counter is incremented when building a BIO in sync_request() and is decremented when finish a BIO in end_sync_write(). If building a BIO fails for some reasons in sync_request(), the "remaining" should be decremented if it has already been incremented. I found a case where this decrement is forgotten. This causes a md_do_sync() deadlock because md_do_sync() waits for md_done_sync() called by end_sync_write(), but end_sync_write() never calls md_done_sync() because of the "remaining" counter mismatch. For example, this problem would be reproduced in the following case: Personalities : [raid10] md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F) 3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_] [>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec This case, sdf1 is recovering, sdb1 and sde1 are disabled. An additional error with detaching sdd will cause a deadlock. md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F) 3919616 blocks 64K chunks 2 near-copies [4/1] [_U__] [=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec 2739 ? S< 0:17 [md0_raid10] 28608 ? D< 0:00 [md0_resync] 28629 pts/1 Ss 0:00 bash 28830 pts/1 R+ 0:00 ps ax 31819 ? D< 0:00 [kjournald] The resync thread keeps working, but actually it is deadlocked. Patch: By this patch, the remaining counter will be decremented if needed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	1c830532f6	md: fix possible raid1/raid10 deadlock on read error during resync Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for another possible deadlock in raid1/raid10 error handing. If a read request returns an error while a resync is happening and a resync request is pending, the attempt to fix the error will block until the resync progresses, and the resync will block until the read request completes. Thus a deadlock. This patch fixes the problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
Keld Simonsen	8ed3a19563	md: don't attempt read-balancing for raid10 'far' layouts This patch changes the disk to be read for layout "far > 1" to always be the disk with the lowest block address. Thus the chunks to be read will always be (for a fully functioning array) from the first band of stripes, and the raid will then work as a raid0 consisting of the first band of stripes. Some advantages: The fastest part which is the outer sectors of the disks involved will be used. The outer blocks of a disk may be as much as 100 % faster than the inner blocks. Average seek time will be smaller, as seeks will always be confined to the first part of the disks. Mixed disks with different performance characteristics will work better, as they will work as raid0, the sequential read rate will be number of disks involved times the IO rate of the slowest disk. If a disk is malfunctioning, the first disk which is working, and has the lowest block address for the logical block will be used. Signed-off-by: Keld Simonsen <keld@dkuug.dk> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	27c529bb8e	md: lock access to rdev attributes properly When we access attributes of an rdev (component device on an md array) through sysfs, we really need to lock the array against concurrent changes. We currently do that when we change an attribute, but not when we read an attribute. We need to lock when reading as well else rdev->mddev could become NULL while we are accessing it. So add appropriate locking (mddev_lock) to rdev_attr_show. rdev_size_store requires some extra care as well as it needs to unlock the mddev while scanning other mddevs for overlapping regions. We currently assume that rdev->mddev will still be unchanged after the scan, but that cannot be certain. So take a copy of rdev->mddev for use at the end of the function. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	2515619823	md: make sure a reshape is started when device switches to read-write A resync/reshape/recovery thread will refuse to progress when the array is marked read-only. So whenever it mark it not read-only, it is important to wake up thread resync thread. There is one place we didn't do this. The problem manifests if the start_ro module parameters is set, and a raid5 array that is in the middle of a reshape (restripe) is started. The array will initially be semi-read-only (meaning it acts like it is readonly until the first write). So the reshape will not proceed. On the first write, the array will become read-write, but the reshape will not be started, and there is no event which will ever restart that thread. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	d0fae18f1b	md: clean up irregularity with raid autodetect When a raid1 array is stopped, all components currently get added to the list for auto-detection. However we should really only add components that were found by autodetection in the first place. So add a flag to record that information, and use it. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	a1801f858e	md: guard against possible bad array geometry in v1 metadata Make sure the data doesn't start before the end of the superblock when the superblock is at the start of the device. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:17 -08:00
NeilBrown	8311c29d40	md: reduce CPU wastage on idle md array with a write-intent bitmap On an md array with a write-intent bitmap, a thread wakes up every few seconds and scans the bitmap looking for work to do. If the array is idle, there will be no work to do, but a lot of scanning is done to discover this. So cache the fact that the bitmap is completely clean, and avoid scanning the whole bitmap when the cache is known to be clean. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:17 -08:00
NeilBrown	a35e63efa1	md: fix deadlock in md/raid1 and md/raid10 when handling a read error When handling a read error, we freeze the array to stop any other IO while attempting to over-write with correct data. This is done in the raid1d(raid10d) thread and must wait for all submitted IO to complete (except for requests that failed and are sitting in the retry queue - these are counted in ->nr_queue and will stay there during a freeze). However write requests need attention from raid1d as bitmap updates might be required. This can cause a deadlock as raid1 is waiting for requests to finish that themselves need attention from raid1d. So we create a new function 'flush_pending_writes' to give that attention, and call it in freeze_array to be sure that we aren't waiting on raid1d. Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:17 -08:00
Adrian Bunk	e03f1a8422	dm-raid1.c: fix NULL dereferences This patch fixes two NULL dereferences introduced by commit `06386bbfd2` and spotted by the Coverity checker. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-19 15:52:27 -08:00
Jan Blunck	cf28b4863f	d_path: Make d_path() use a struct path d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to reflect this. [akpm@linux-foundation.org: fix build in mm/memory.c] Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Bryan Wu <bryan.wu@analog.com> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:17:09 -08:00
Jan Blunck	c32c2f63a9	d_path: Make seq_path() use a struct path argument seq_path() is always called with a dentry and a vfsmount from a struct path. Make seq_path() take it directly as an argument. Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:17:08 -08:00
Jan Blunck	1d957f9bf8	Introduce path_put() * Add path_put() functions for releasing a reference to the dentry and vfsmount of a struct path in the right order * Switch from path_release(nd) to path_put(&nd->path) * Rename dput_path() to path_put_conditional() [akpm@linux-foundation.org: fix cifs] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: <linux-fsdevel@vger.kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:33 -08:00
Jan Blunck	4ac9137858	Embed a struct path into struct nameidata instead of nd->{dentry,mnt} This is the central patch of a cleanup series. In most cases there is no good reason why someone would want to use a dentry for itself. This series reflects that fact and embeds a struct path into nameidata. Together with the other patches of this series - it enforced the correct order of getting/releasing the reference count on <dentry,vfsmount> pairs - it prepares the VFS for stacking support since it is essential to have a struct path in every place where the stack can be traversed - it reduces the overall code size: without patch series: text data bss dec hex filename 5321639 858418 715768 6895825 6938d1 vmlinux with patch series: text data bss dec hex filename 5320026 858418 715768 `6894212` 693284 vmlinux This patch: Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix cifs] [akpm@linux-foundation.org: fix smack] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:33 -08:00
Al Viro	39ed7adb17	dm-raid1 breakage on 64bit test_and_set_bit() on address of uint32_t is a Bad Idea(tm)... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-13 08:16:34 -08:00
Jonathan Brassow	af195ac82e	dm raid1: report fault status This patch adds extra information to the mirror status output, so that it can be determined which device(s) have failed. For each mirror device, a character is printed indicating the most severe error encountered. The characters are: * A => Alive - No failures * D => Dead - A write failure occurred leaving mirror out-of-sync * S => Sync - A sychronization failure occurred, mirror out-of-sync * R => Read - A read failure occurred, mirror data unaffected This allows userspace to properly reconfigure the mirror set. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:39 +00:00
Jonathan Brassow	06386bbfd2	dm raid1: handle read failures This patch gives the ability to respond-to/record device failures that happen during read operations. It also adds the ability to read from mirror devices that are not the primary if they are in-sync. There are essentially two read paths in mirroring; the direct path and the queued path. When a read request is mapped, if the region is 'in-sync' the direct path is taken; otherwise the queued path is taken. If the direct path is taken, we must record bio information so that if the read fails we can retry it. We then discover the status of a direct read through mirror_end_io. If the read has failed, we will mark the device from which the read was attempted as failed (so we don't try to read from it again), restore the bio and try again. If the queued path is taken, we discover the results of the read from 'read_callback'. If the device failed, we will mark the device as failed and attempt the read again if there is another device where this region is known to be 'in-sync'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:37 +00:00
Jonathan Brassow	b80aa7a0c2	dm raid1: fix EIO after log failure This patch adds the ability to requeue write I/O to core device-mapper when there is a log device failure. If a write to the log produces and error, the pending writes are put on the "failures" list. Since the log is marked as failed, they will stay on the failures list until a suspend happens. Suspends come in two phases, presuspend and postsuspend. We must make sure that all the writes on the failures list are requeued in the presuspend phase (a requirement of dm core). This means that recovery must be complete (because writes may be delayed behind it) and the failures list must be requeued before we return from presuspend. The mechanisms to ensure recovery is complete (or stopped) was already in place, but needed to be moved from postsuspend to presuspend. We rely on 'flush_workqueue' to ensure that the mirror thread is complete and therefore, has requeued all writes in the failures list. Because we are using flush_workqueue, we must ensure that no additional 'queue_work' calls will produce additional I/O that we need to requeue (because once we return from presuspend, we are unable to do anything about it). 'queue_work' is called in response to the following functions: - complete_resync_work = NA, recovery is stopped - rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it is ready to recover the region (recovery is stopped) or it needs to clear the region in the log* this doesn't get called while suspending - rh_recovery_end = NA, recovery is stopped - rh_recovery_start = NA, recovery is stopped - write_callback = 1) Writes w/o failures simply call bio_endio -> mirror_end_io -> rh_dec (see rh_dec above) 2) Writes with failures are put on the failures list and queue_work is called write_callbacks don't happen during suspend ** - do_failures = NA, 'queue_work' not called if suspending - add_mirror (initialization) = NA, only done on mirror creation - queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue is called. 2) No more I/Os are being issued. 3) Re-attempted READs can still be handled. (Write completions are handled through rh_dec/ write_callback - mention above - and do not use queue_bio.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:35 +00:00
Jonathan Brassow	8f0205b798	dm raid1: handle recovery failures This patch adds the calls to 'fail_mirror' if an error occurs during mirror recovery (aka resynchronization). 'fail_mirror' is responsible for recording the type of error by mirror device and ensuring an event gets raised for the purpose of notifying userspace. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:32 +00:00
Jonathan Brassow	72f4b31410	dm raid1: handle write failures This patch gives mirror the ability to handle device failures during normal write operations. The 'write_callback' function is called when a write completes. If all the writes failed or succeeded, we report failure or success respectively. If some of the writes failed, we call fail_mirror; which increments the error count for the device, notes the type of error encountered (DM_RAID1_WRITE_ERROR), and selects a new primary (if necessary). Note that the primary device can never change while the mirror is not in-sync (IOW, while recovery is happening.) This means that the scenario where a failed write changes the primary and gives recovery_complete a chance to misread the primary never happens. The fact that the primary can change has necessitated the change to the default_mirror field. We need to protect against reading garbage while the primary changes. We then add the bio to a new list in the mirror set, 'failures'. For every bio in the 'failures' list, we call a new function, '__bio_mark_nosync', where we mark the region 'not-in-sync' in the log and properly set the region state as, RH_NOSYNC. Userspace must also be notified of the failure. This is done by 'raising an event' (dm_table_event()). If fail_mirror is called in process context the event can be raised right away. If in interrupt context, the event is deferred to the kmirrord thread - which raises the event if 'event_waiting' is set. Backwards compatibility is maintained by ignoring errors if the DM_FEATURES_HANDLE_ERRORS flag is not present. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:29 +00:00
Milan Broz	d74f81f8ad	dm snapshot: combine consecutive exceptions in memory Provided sector_t is 64 bits, reduce the in-memory footprint of the snapshot exception table by the simple method of using unused bits of the chunk number to combine consecutive entries. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:27 +00:00
Brian Wood	4f7f5c675f	dm: stripe enhanced status return This patch adds additional information to the status line. It is added at the end of the returned text so it will not interfere with existing implementations using this data. The addition of this information will allow for a common return interface to match that returned with the dm-raid1.c status line (with Jonathan Brassow's patches). Here is a sample of what is returned with a mirror "status" call: isw_eeaaabgfg_mirror: 0 488390920 mirror 2 8:16 8:32 3727/3727 1 AA 1 core Here's what's returned with this patch for a stripe "status" call: isw_dheeijjdej_stripe: 0 976783872 striped 2 8:16 8:32 1 AA Signed-off-by: Brian Wood <brian.j.wood@intel.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:24 +00:00
Brian Wood	a25eb9446a	dm: stripe trigger event on failure This patch adds the stripe_end_io function to process errors that might occur after an IO operation. As part of this there are a number of enhancements made to record and trigger events: - New atomic variable in struct stripe to record the number of errors each stripe volume device has experienced (could be used later with uevents to report back directly to userspace) - New workqueue/work struct setup to process the trigger_event function - New end_io function. It is here that testing for BIO error conditions take place. It determines the exact stripe that cause the error, records this in the new atomic variable, and calls the queue_work() function - New trigger_event function to process failure events. This calls dm_table_event() Signed-off-by: Brian Wood <brian.j.wood@intel.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:22 +00:00
Jonathan Brassow	fb8b284806	dm log: auto load modules If the log type is not recognised, attempt to load the module 'dm-log-<type>.ko'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:19 +00:00
Milan Broz	304f3f6a58	dm: move deferred bio flushing to workqueue Add a single-thread workqueue for each mapped device and move flushing of the lists of pushback and deferred bios to this new workqueue. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:17 +00:00
Milan Broz	3a7f6c990a	dm crypt: use async crypto dm-crypt: Use crypto ablkcipher interface Move encrypt/decrypt core to async crypto call. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:14 +00:00
Milan Broz	95497a9600	dm crypt: prepare async callback fn dm-crypt: Use crypto ablkcipher interface Prepare callback function for async crypto operation. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:12 +00:00
Milan Broz	43d6903482	dm crypt: add completion for async dm-crypt: Use crypto ablkcipher interface Prepare completion for async crypto request. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:09 +00:00
Milan Broz	ddd42edfd8	dm crypt: add async request mempool dm-crypt: Use crypto ablkcipher interface Introduce mempool for async crypto requests. cc->req is used mainly during synchronous operations (to prevent allocation and deallocation of the same object). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:07 +00:00
Milan Broz	01482b7671	dm crypt: extract scatterlist processing dm-crypt: Use crypto ablkcipher interface Move scatterlists to separate dm_crypt_struct and pick out block processing from crypt_convert. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:04 +00:00
Milan Broz	899c95d36c	dm crypt: tidy io ref counting Make io reference counting more obvious. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:11:02 +00:00
Milan Broz	84131db689	dm crypt: introduce crypt_write_io_loop Introduce crypt_write_io_loop(). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:59 +00:00
Milan Broz	dec1cedf9d	dm crypt: abstract crypt_write_done Process write request in separate function and queue final bio through io workqueue. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:57 +00:00
Milan Broz	0c395b0f8d	dm crypt: store sector mapping in dm_crypt_io Add sector into dm_crypt_io instead of using local variable. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:54 +00:00
Alasdair G Kergon	395b167ca0	dm crypt: move queue functions Reorder kcryptd functions for clarity. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:52 +00:00
Milan Broz	4e4eef64e2	dm crypt: adjust io processing functions Rename functions to follow calling convention. Prepare write io error processing function skeleton. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:49 +00:00
Milan Broz	ee7a491e62	dm crypt: tidy crypt_endio Simplify crypt_endio function. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:46 +00:00
Milan Broz	5742fd7775	dm crypt: move error setting outside crypt_dec_pending Move error code setting outside of crypt_dec_pending function. Use -EIO if crypt_convert_scatterlist() fails. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:43 +00:00
Milan Broz	fcd369daa3	dm crypt: remove unnecessary crypt_context write parm Remove write attribute from convert_context and use bio flag instead. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:41 +00:00
Milan Broz	53017030e2	dm crypt: move convert_context inside dm_crypt_io Move convert_context inside dm_crypt_io. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:38 +00:00
Alasdair G Kergon	009cd09042	dm mpath: add missing static A static declaration missing. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:35 +00:00
Alasdair G Kergon	0149e57fed	dm: targets no longer experimental Drop the EXPERIMENTAL tag from well-established device-mapper targets, so the newer ones stand out better. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:32 +00:00
Milan Broz	46125c1c90	dm: refactor dm_suspend completion wait Move completion wait to separate function Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:30 +00:00
Milan Broz	94d6351e14	dm: split dm_suspend io_lock hold into two Change io_locking to allow processing flush in separate thread. Because we have DMF_BLOCK_IO already set, any possible new ios are queued in dm_requests now. In the case of interrupting previous wait there can be more ios queued (we unlocked io_lock for a while) but this is safe. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:27 +00:00
Milan Broz	73d410c013	dm: tidy dm_suspend Tidy dm_suspend function - change return value logic in dm_suspend - use atomic_read only once. - move DMF_BLOCK_IO clearing into one place Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:25 +00:00
Milan Broz	6d6f10df89	dm: refactor deferred bio_list processing Refactor deferred bio_list processing. - use separate _merge_pushback_list function - move deferred bio list pick up to flush function - use bio_list_pop instead of bio_list_get - simplify noflush flag use No real functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:22 +00:00
Milan Broz	6ed7ade896	dm: tidy alloc_dev labels Tidy labels in alloc_dev to make later patches more clear. No functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:19 +00:00
Andrew Morton	a26ffd4aa9	dm ioctl: use uninitialized_var drivers/md/dm-ioctl.c:1405: warning: 'param' may be used uninitialized in this function Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:16 +00:00
Andrew Morton	69a2ce72a4	dm: table use uninitialized_var drivers/md/dm-table.c: In function 'dm_get_device': drivers/md/dm-table.c:478: warning: 'dev' may be used uninitialized in this function Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:14 +00:00
Andrew Morton	e48b9db251	dm snapshot: use uninitialized_var drivers/md/dm-exception-store.c: In function 'persistent_read_metadata': drivers/md/dm-exception-store.c:452: warning: 'new_snapshot' may be used uninitialized in this function Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:11 +00:00
Daniel Walker	e61290a4a2	dm: convert suspend_lock semaphore to mutex Replace semaphore with mutex. Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:08 +00:00
Robert P. J. Day	8defd83084	dm snapshot: use rounddown_pow_of_two Since the source file already includes the log2.h header file, it seems pointless to re-invent the necessary routine. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:06 +00:00
Jun'ichi Nomura	82d601dc07	dm: table remove unused total "total = 0" does nothing. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:10:04 +00:00
Paul Jimenez	afb24528f9	dm: table use list_for_each This patch is some minor janitorish cleanup, using some macros from linux/list.h (already #included via dm.h) to improve readability. Signed-off-by: Paul Jimenez <pj@place.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:59 +00:00
Milan Broz	76c072b48e	dm ioctl: move compat code Move compat_ioctl handling into dm-ioctl.c. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:56 +00:00
Alasdair G Kergon	27238b2bea	dm ioctl: remove lock_kernel Remove lock_kernel() from the device-mapper ioctls - there should be sufficient internal locking already where required. Also remove some superfluous casts. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:53 +00:00
Alasdair G Kergon	b9249e5568	dm: mark function lists static Add a couple of statics. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:51 +00:00
Milan Broz	7e5c1e830b	dm: add missing memory barrier to dm_suspend Add memory barrier to fix atomic_read of pending value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-02-08 02:09:49 +00:00
NeilBrown	6ed3003c19	md: fix an occasional deadlock in raid5 raid5's 'make_request' function calls generic_make_request on underlying devices and if we run out of stripe heads, it could end up waiting for one of those requests to complete. This is bad as recursive calls to generic_make_request go on a queue and are not even attempted until make_request completes. So: don't make any generic_make_request calls in raid5 make_request until all waiting has been done. We do this by simply setting STRIPE_HANDLE instead of calling handle_stripe(). If we need more stripe_heads, raid5d will get called to process the pending stripe_heads which will call generic_make_request from a This change by itself causes a performance hit. So add a change so that raid5_activate_delayed is only called at unplug time, never in raid5. This seems to bring back the performance numbers. Calling it in raid5d was sometimes too soon... Neil said: How about we queue it for 2.6.25-rc1 and then about when -rc2 comes out, we queue it for 2.6.24.y? Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Tested-by: dean gaudet <dean@arctic.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	73c34431c7	md: change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING. Finish ITERATE_ to for_each conversion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	d089c6af10	md: change ITERATE_RDEV to rdev_for_each As this is more in line with common practice in the kernel. Also swap the args around to be more like list_for_each. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	29ac4aa3fc	md: change INTERATE_MDDEV to for_each_mddev As this is more consistent with kernel style. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	20a49ff679	md: change a few 'int' to 'size_t' in md As suggested by Andrew Morton. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	177a99b23e	md: fix use-after-free bug when dropping an rdev from an md array Due to possible deadlock issues we need to use a schedule work to kobject_del an 'rdev' object from a different thread. A recent change means that kobject_add no longer gets a refernce, and kobject_del doesn't put a reference. Consequently, we need to explicitly hold a reference to ensure that the last reference isn't dropped before the scheduled work get a chance to call kobject_del. Also, rename delayed_delete to md_delayed_delete to that it is more obvious in a stack trace which code is to blame. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	a17184a911	md: allow an md array to appear with 0 drives if it has external metadata Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:19 -08:00
NeilBrown	ca38805945	md: lock address when changing attributes of component devices Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c5d79adba7	md: allow devices to be shared between md arrays Currently, a given device is "claimed" by a particular array so that it cannot be used by other arrays. This is not ideal for DDF and other metadata schemes which have their own partitioning concept. So for externally managed metadata, just claim the device for md in general, require that "offset" and "size" are set properly for each device, and make sure that if a device is included in different arrays then the active sections do not overlap. This involves adding another flag to the rdev which makes it awkward to set "->flags = 0" to clear certain flags. So now clear flags explicitly by name when we want to clear things. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	1ec4a9398d	md: set and test the ->persistent flag for md devices more consistently If you try to start an array for which the number of raid disks is listed as zero, md will currently try to read metadata off any devices that have been given. This was done because the value of raid_disks is used to signal whether array details have been provided by userspace (raid_disks > 0) or must be read from the devices (raid_disks == 0). However for an array without persistent metadata (or with externally managed metadata) this is the wrong thing to do. So we add a test in do_md_run to give an error if raid_disks is zero for non-persistent arrays. This requires that mddev->persistent is set corrently at this point, which it currently isn't for in-kernel autodetected arrays. So set ->persistent for autodetect arrays, and remove the settign in super_*_validate which is now redundant. Also clear ->persistent when stopping an array so it is consistently zero when starting an array. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c620727779	md: allow a maximum extent to be set for resyncing This allows userspace to control resync/reshape progress and synchronise it with other activities, such as shared access in a SAN, or backing up critical sections during a tricky reshape. Writing a number of sectors (which must be a multiple of the chunk size if such is meaningful) causes a resync to pause when it gets to that point. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	c303da6d71	md: give userspace control over removing failed devices when external metdata in use When a device fails, we must not allow an further writes to the array until the device failure has been recorded in array metadata. When metadata is managed externally, this requires some synchronisation... Allow/require userspace to explicitly remove failed devices from active service in the array by writing 'none' to the 'slot' attribute. If this reduces the number of failed devices to 0, the write block will automatically be lowered. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	e691063a61	md: support 'external' metadata for md arrays - Add a state flag 'external' to indicate that the metadata is managed externally (by user-space) so important changes need to be left of user-space to handle. Alternates are non-persistant ('none') where there is no stable metadata - after the array is stopped there is no record of it's status - and internal which can be version 0.90 or version 1.x These are selected by writing to the 'metadata' attribute. - move the updating of superblocks (sync_sbs) to after we have checked if there are any superblocks or not. - New array state 'write_pending'. This means that the metadata records the array as 'clean', but a write has been requested, so the metadata has to be updated to record a 'dirty' array before the write can continue. This change is reported to md by writing 'active' to the array_state attribute. - tidy up marking of sb_dirty: - don't set sb_dirty when resync finishes as md_check_recovery calls md_update_sb when the sync thread finishes anyway. - Don't set sb_dirty in multipath_run as the array might not be dirty. - don't mark superblock dirty when switching to 'clean' if there is no internal superblock (if external, userspace can choose to update the superblock whenever it chooses to). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
NeilBrown	b47490c9bc	md: Update md bitmap during resync. Currently an md array with a write-intent bitmap does not updated that bitmap to reflect successful partial resync. Rather the entire bitmap is updated when the resync completes. This is because there is no guarentee that resync requests will complete in order, and tracking each request individually is unnecessarily burdensome. However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not notciably affect resync performance. [snitzer@gmail.com: export bitmap_cond_end_sync] Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Mike Snitzer" <snitzer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
H. Peter Anvin	66c811e993	md: raid6: clean up the style of raid6test/test.c Clean up the coding style in raid6test/test.c. Break it apart into subfunctions to make the code more readable. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
H. Peter Anvin	98ec302be5	md: raid6: Fix mktable.c Make both mktables.c and its output CodingStyle compliant. Update the copyright notice. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
Oliver Pinter	54212cf405	coding style cleanups for drivers/md/mktables.c Signed-off-by: Oliver Pinter <oliver.pntr@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-06 10:41:18 -08:00
Greg Kroah-Hartman	c10997f657	Kobject: convert drivers/* from kobject_unregister() to kobject_put() There is no need for kobject_unregister() anymore, thanks to Kay's kobject cleanup changes, so replace all instances of it with kobject_put(). Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:40 -08:00
Greg Kroah-Hartman	f9cb074bff	Kobject: rename kobject_init_ng() to kobject_init() Now that the old kobject_init() function is gone, rename kobject_init_ng() to kobject_init() to clean up the namespace. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman	b2d6db5878	Kobject: rename kobject_add_ng() to kobject_add() Now that the old kobject_add() function is gone, rename kobject_add_ng() to kobject_add() to clean up the namespace. Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:38 -08:00
Greg Kroah-Hartman	649316b25b	Kobject: convert drivers/md/md.c to use kobject_init/add_ng() This converts the code to use the new kobject functions, cleaning up the logic in doing so. Cc: Neil Brown <neilb@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:37 -08:00
Kay Sievers	edfaa7c365	Driver core: convert block from raw kobjects to core devices This moves the block devices to /sys/class/block. It will create a flat list of all block devices, with the disks and partitions in one directory. For compatibility /sys/block is created and contains symlinks to the disks. /sys/class/block \|-- sda -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda \|-- sda1 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda1 \|-- sda10 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda10 \|-- sda5 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda5 \|-- sda6 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda6 \|-- sda7 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda7 \|-- sda8 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda8 \|-- sda9 -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/sda9 `-- sr0 -> ../../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0 /sys/block/ \|-- sda -> ../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda `-- sr0 -> ../devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sr0 Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:36 -08:00
Greg Kroah-Hartman	3830c62fef	Kobject: change drivers/md/md.c to use kobject_init_and_add Stop using kobject_register, as this way we can control the sending of the uevent properly, after everything is properly initialized. Cc: Neil Brown <neilb@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-01-24 20:40:29 -08:00
Dan Williams	0f94e87cde	md: fix data corruption when a degraded raid5 array is reshaped We currently do not wait for the block from the missing device to be computed from parity before copying data to the new stripe layout. The change in the raid6 code is not techincally needed as we don't delay data block recovery in the same way for raid6 yet. But making the change now is safer long-term. This bug exists in 2.6.23 and 2.6.24-rc Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-08 16:10:35 -08:00
Milan Broz	91e1062592	dm crypt: use bio_add_page Fix possible max_phys_segments violation in cloned dm-crypt bio. In write operation dm-crypt needs to allocate new bio request and run crypto operation on this clone. Cloned request has always the same size, but number of physical segments can be increased and violate max_phys_segments restriction. This can lead to data corruption and serious hardware malfunction. This was observed when using XFS over dm-crypt and at least two HBA controller drivers (arcmsr, cciss) recently. Fix it by using bio_add_page() call (which tests for other restrictions too) instead of constructing own biovec. All versions of dm-crypt are affected by this bug. Cc: stable@kernel.org Cc: dm-crypt@saout.de Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:13 +00:00
Neil Brown	91212507f9	dm: merge max_hw_sector Make sure dm honours max_hw_sectors of underlying devices We still have no firm testing evidence in support of this patch but believe it may help to resolve some bug reports. - agk Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:12 +00:00
Alasdair G Kergon	69267a30be	dm: trigger change uevent on rename Insert a missing KOBJ_CHANGE notification when a device is renamed. Cc: Scott James Remnant <scott@ubuntu.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:11 +00:00
Milan Broz	adfe47702c	dm crypt: fix write endio Fix BIO_UPTODATE test for write io. Cc: stable@kernel.org Cc: dm-crypt@saout.de Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:10 +00:00
Paul Mundt	d1622e8909	dm mpath: hp requires scsi With CONFIG_SCSI=n __scsi_print_sense() is never linked in. drivers/built-in.o: In function `hp_sw_end_io': dm-mpath-hp-sw.c:(.text+0x914f8): undefined reference to `__scsi_print_sense' Caught with a randconfig on current git. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-12-20 17:32:09 +00:00
Jun'ichi Nomura	512875bd96	dm: table detect io beyond device This patch fixes a panic on shrinking a DM device if there is outstanding I/O to the part of the device that is being removed. (Normally this doesn't happen - a filesystem would be resized first, for example.) The bug is that __clone_and_map() assumes dm_table_find_target() always returns a valid pointer. It may fail if a bio arrives from the block layer but its target sector is no longer included in the DM btree. This patch appends an empty entry to table->targets[] which will be returned by a lookup beyond the end of the device. After calling dm_table_find_target(), __clone_and_map() and target_message() check for this condition using dm_target_is_valid(). Sample test script to trigger oops:	2007-12-20 17:32:08 +00:00
Dan Williams	6c55be8b96	raid5: fix unending write sequence <debug output from Joel's system> handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0 check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000 check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000 check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000 check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000 locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0 for sector 7629696, rmw=0 rcw=0 </debug> These blocks were prepared to be written out, but were never handled in ops_run_biodrain(), so they remain locked forever. The operations flags are all clear which means handle_stripe() thinks nothing else needs to be done. This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it should not have been. This patch cleans up cases where the code looks at sh->ops.pending when it should be looking at the consistent stack-based snapshot of the operations flags. Report from Joel: Resync done. Patch fix this bug. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Cc: <stable@kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:39 -08:00
Alan D. Brunelle	2ad8b1ef11	Add UNPLUG traces to all appropriate places Added blk_unplug interface, allowing all invocations of unplugs to result in a generated blktrace UNPLUG. Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-09 13:41:32 +01:00
Neil Brown	def6ae26a9	md: fix misapplied patch in raid5.c commit `4ae3f847e4` ("md: raid5: fix clearing of biofill operations") did not get applied correctly, presumably due to substantial similarities between handle_stripe5 and handle_stripe6. This patch moves the chunk of new code from handle_stripe6 (where it isn't needed (yet)) to handle_stripe5. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Dan Williams" <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-05 15:12:32 -08:00
Vasily Averin	5ec140e600	dm: bounce_pfn limit added Device mapper uses its own bounce_pfn that may differ from one on underlying device. In that way dm can build incorrect requests that contain sg elements greater than underlying device is able to handle. This is the cause of slab corruption in i2o layer, occurred on i386 arch when very long direct IO requests are addressed to dm-over-i2o device. Signed-off-by: Vasily Averin <vvs@sw.ru> Cc: <stable@kernel.org> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-02 08:47:25 +01:00
Al Viro	ca5cd877ae	x86 merge fallout: uml Don't undef __i386__/__x86_64__ in uml anymore, make sure that (few) places that required adjusting the ifdefs got those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-29 07:41:32 -07:00
Herbert Xu	68e3f5dd4d	[CRYPTO] users: Fix up scatterlist conversion errors This patch fixes the errors made in the users of the crypto layer during the sg_init_table conversion. It also adds a few conversions that were missing altogether. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-27 00:52:07 -07:00
Jens Axboe	642f149031	SG: Change sg_set_page() to take length and offset argument Most drivers need to set length and offset as well, so may as well fold those three lines into one. Add sg_assign_page() for those two locations that only needed to set the page, where the offset/length is set outside of the function context. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-24 11:20:47 +02:00
Dan Williams	4ae3f847e4	md: raid5: fix clearing of biofill operations ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the 'pending' and 'ack' bits. Since the test_and_ack_op() macro only checks against 'complete' it can get an inconsistent snapshot of pending work. Move the clearing of these bits to handle_stripe5(), under the lock. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Stable <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-23 08:32:06 -07:00
NeilBrown	85bfb4da8c	md: fix an unsigned compare to allow creation of bitmaps with v1.0 metadata As page->index is unsigned, this all becomes an unsigned comparison, which almost always returns an error. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Stable <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-23 08:32:06 -07:00
Jens Axboe	45711f1af6	[SG] Update drivers to use sg helpers Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-22 21:19:53 +02:00
Linus Torvalds	c00046c279	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial * git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (74 commits) fix do_sys_open() prototype sysfs: trivial: fix sysfs_create_file kerneldoc spelling mistake Documentation: Fix typo in SubmitChecklist. Typo: depricated -> deprecated Add missing profile=kvm option to Documentation/kernel-parameters.txt fix typo about TBI in e1000 comment proc.txt: Add /proc/stat field small documentation fixes Fix compiler warning in smount example program from sharedsubtree.txt docs/sysfs: add missing word to sysfs attribute explanation documentation/ext3: grammar fixes Documentation/java.txt: typo and grammar fixes Documentation/filesystems/vfs.txt: typo fix include/asm-*/system.h: remove unused set_rmb(), set_wmb() macros trivial copy_data_pages() tidy up Fix typo in arch/x86/kernel/tsc_32.c file link fix for Pegasus USB net driver help remove unused return within void return function Typo fixes retrun -> return x86 hpet.h: remove broken links ...	2007-10-19 20:36:17 -07:00
Milan Broz	80fd662683	dm crypt: tidy pending Add crypt prefix to dec_pending to avoid confusing it in backtraces with the dm core function of the same name. No functional change here. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:28 +01:00
Mike Anderson	b15546f942	dm mpath: send uevents This patch adds calls to dm_path_event for a failed path and a reinstated path. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:27 +01:00
Mike Anderson	7a8c3d3b92	dm: uevent generate events This patch adds support for the dm_path_event dm_send_event functions which create and send udev events. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:26 +01:00
Mike Anderson	51e5b2bd34	dm: add uevent to core This patch adds a uevent skeleton to device-mapper. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:24 +01:00
Mike Anderson	96a1f7dba6	dm: export name and uuid This patch adds a function to obtain a copy of a mapped device's name and uuid. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:23 +01:00
Jonathan Brassow	aa5617c553	dm raid1: add mirror_set to struct mirror Store a pointer to the owning mirror_set structure within each mirror structure for a subsequent patch to use. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:22 +01:00
Jonathan Brassow	6b3df0d7a5	dm log: split suspend There are now two phases to a suspend in device-mapper - presuspend and postsuspend. This patch removes the single 'suspend' in the logging API and replaces it with 'presuspend' and 'postsuspend' functions to align it better with core device-mapper. A subsequent patch will make use of 'presuspend'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:21 +01:00
Dave Wysochanski	fe97e2aa05	dm mpath: hp retry if not ready This patch adds retries to the hp hardware handler, and utilizes the MP_RETRY flag of dm-multipath. For now in the hp handler, if we get a pg_init completed with a check condition we just assume we can retry the pg_init command. We make this assumption because of incomplete data on specific check condition code of the HP hardware, and because testing has shown the HP path initialization command to be idempotent. The number of times we retry is settable via the "pg_init_retries" multipath map feature. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:20 +01:00
Dave Wysochanski	16ebbf3584	dm mpath: add hp handler This patch adds the most basic dm-multipath hardware support for the HP active/passive arrays. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:19 +01:00
Dave Wysochanski	c9e45581ad	dm mpath: add retry pg init This patch allows a failed path group initialisation command to be retried. It adds a generic MP_RETRY flag and a "pg_init_retries" feature to device-mapper multipath which limits the number of retries. 1. A hw handler sends a path initialization command to the storage and the command completes with an error code indicating the command should be retried. 2. The hardware handler calls dm_pg_init_complete() with MP_RETRY set in err_flags to ask the dm multipath core to retry. 3. If the retry limit has not been exceeded, pg_init() is retried. Otherwise fail_path() is called. If you are using the userspace multipath-tools or device-mapper-multipath package, you can set pg_init_retries in the 'device' section of your /etc/multipath.conf file. For example: features "2 pg_init_retries 7" The number of PG retries attempted is reported in the 'dmsetup status' output. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:18 +01:00
Milan Broz	636d5786c4	dm crypt: tidy labels Replace numbers with names in labels in error paths, to avoid confusion when new one get added between existing ones. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:17 +01:00
Milan Broz	d469f84197	dm crypt: tidy whitespace Clean up, convert some spaces to tabs. No functional change here. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:15 +01:00
Milan Broz	cabf08e4d3	dm crypt: add post processing queue Add post-processing queue (per crypt device) for read operations. Current implementation uses only one queue for all operations and this can lead to starvation caused by many requests waiting for memory allocation. But the needed memory-releasing operation is queued after these requests (in the same queue). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:14 +01:00
Milan Broz	9934a8bea2	dm crypt: use per device singlethread workqueues Use a separate single-threaded workqueue for each crypt device instead of one global workqueue. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:13 +01:00
Alasdair G Kergon	d336416ff1	dm mpath: emc fix an error message Correct an error message, reported by Michael Wood <michael@frogfoot.com>. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:12 +01:00
Alasdair G Kergon	051814c69f	dm: bio_list macro renaming Remove BIO_LIST and DEFINE_BIO_LIST macros that gain us nothing since contents are initialised to NULL. Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:11 +01:00
Jesper Juhl	bb56acf840	dm io:ctl remove vmalloc void cast In drivers/md/dm-ioctl.c::copy_params() there's a call to vmalloc() where we currently cast the return value, but that's pretty pointless given that vmalloc() returns "void *". Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:10 +01:00
Milan Broz	9e4e5f87eb	dm: tidy bio_io_error usage Use bio_io_error() in only two places and tidy the code, preparing for later patches. There is no functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:09 +01:00
Matthias Kaehlcke	def5b5b26e	kcopyd use mutex instead of semaphore Kcopyd uses a semaphore as mutex. Use the mutex API instead of the (binary) semaphore, Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2007-10-20 02:01:08 +01:00
Dmitry Monakhov	094262db9e	dm: use kzalloc Convert kmalloc() + memset() to kzalloc(). Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:07 +01:00
vignesh babu	6f3c3f0afa	dm: use is_power_of_2 Replacing n & (n - 1) for power of 2 check by is_power_of_2(n) Signed-off-by: vignesh babu <vignesh.babu@wipro.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:06 +01:00
Jun'ichi Nomura	ae9da83f6d	dm: fix thaw_bdev This patch fixes a bd_mount_sem counter corruption bug in device-mapper. thaw_bdev() should be called only when freeze_bdev() was called for the device. Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter. struct block_device with the corrupted semaphore may remain in slab cache and be reused later. Attached patch will fix it by calling unlock_fs() instead. unlock_fs() will determine whether it should call thaw_bdev() by checking the device is frozen or not. Easy reproducer is: #!/bin/sh while [ 1 ]; do dmsetup --notable create a dmsetup --nolockfs suspend a dmsetup remove a done It's not easy to see the effect of corrupted semaphore. So I have tested with putting printk below in bdev_alloc_inode(): if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1) printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n", atomic_read(&ei->bdev.bd_mount_sem.count), &ei->bdev); Without the patch, I saw something like: Incorrect semaphore count = 17 (f2ab91c0) With the patch, the message didn't appear. The bug was introduced in 2.6.16 with this bug fix: commit `d9dde59ba0` Date: Fri Feb 24 13:04:24 2006 -0800 [PATCH] dm: missing bdput/thaw_bdev at removal Need to unfreeze and release bdev otherwise the bdev inode with inconsistent state is reused later and cause problem. and backported to 2.6.15.5. It occurs only in free_dev(), which is called only when the dm device is removed. The buggy code is executed only if md->suspended_bdev is non-NULL and that can happen only when the device was suspended without noflush. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2007-10-20 02:01:05 +01:00
Milan Broz	79662d1ea3	dm delay: fix status Fix missing space in dm-delay target status output if separate read and write delay are configured. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:04 +01:00
Dmitry Monakhov	2e64a0f928	dm delay: fix ctr error paths Add missing 'dm_put_device' to dm-delay target constructor. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:02 +01:00
Dmitry Monakhov	a72cf737e0	dm raid1: fix leakage Add missing 'dm_io_client_destroy' to alloc_context error path. Reorganize mirror constructor error path in order to prevent workqueue leakage. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:01 +01:00
Dmitry Monakhov	815f9e3270	dm crypt: missing kfree in ctr error path Insert missing kfree() in crypt_iv_essiv_ctr() error path. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:01:01 +01:00
Dmitry Monakhov	55b42c5ae9	dm crypt: drop device ref in ctr error path Add a missing 'dm_put_device' in an error path in crypt target constructor. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:00:59 +01:00
Milan Broz	027d50f92e	dm io:ctl use constant struct size Make size of dm_ioctl struct always 312 bytes on all supported architectures. This change retains compatibility with already-compiled code because it uses an embedded offset to locate the payload that follows the structure. On 64-bit architectures there is no change at all; on 32-bit we are increasing the size of dm-ioctl from 308 to 312 bytes. Currently with 32-bit userspace / 64-bit kernel on x86_64 some ioctls (including rename, message) are incorrectly rejected by the comparison against 'param + 1'. This breaks userspace lvrename and multipath 'fail_if_no_path' changes, for example. (BTW Device-mapper uses its own versioning and ignores the ioctl size bits. Only the generic ioctl compat code on mixed arches checks them, and that will continue to accept both sizes for now, but we intend to list 308 as deprecated and eventually remove it.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Guido Guenther <agx@sigxcpu.org> Cc: Kevin Corry <kevcorry@us.ibm.com> Cc: stable@kernel.org	2007-10-20 02:00:58 +01:00
Bryn M. Reeves	c7ac86de6a	dm mpath: rdac fix init race Re-order the initialisation of dm-rdac to avoid registering the hw handler before the workqueue has been initialised. Closes a race that would potentially give an oops. Signed-off-by: Bryn M. Reeves <breeves@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2007-10-20 02:00:57 +01:00
Jan Engelhardt	96de0e252c	Convert files to UTF-8 and some cleanups * Convert files to UTF-8. * Also correct some people's names (one example is Eißfeldt, which was found in a source file. Given that the author used an ß at all in a source file indicates that the real name has in fact a 'ß' and not an 'ss', which is commonly used as a substitute for 'ß' when limited to 7bit.) * Correct town names (Goettingen -> Göttingen) * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313) Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:21:04 +02:00
Robert P. J. Day	3a4fa0a25d	Fix misspellings of "system", "controller", "interrupt" and "necessary". Fix the various misspellings of "system", controller", "interrupt" and "[un]necessary". Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:10:43 +02:00
Pavel Emelyanov	ba25f9dcc4	Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
NeilBrown	cf7a44168d	md: make sure read errors are auto-corrected during a 'check' resync in raid1 Whenever a read error is found, we should attempt to overwrite with correct data to 'fix' it. However when do a 'check' pass (which compares data blocks that are successfully read, but doesn't normally overwrite) we don't do that. We should. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Iustin Pop	d7f3d291a0	md: expose the degraded status of an assembled array through sysfs The 'degraded' attribute is useful to quickly determine if the array is degraded, instead of parsing 'mdadm -D' output or relying on the other techniques (number of working devices against number of defined devices, etc.). The md code already keeps track of this attribute, so it's useful to export it. Signed-off-by: Iustin Pop <iusty@k1024.org> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
NeilBrown	2b12ab6d33	md: 'sync_action' in sysfs returns wrong value for readonly arrays When an array is started read-only, MD_RECOVERY_NEEDED can be set but no recovery will be running. This causes 'sync_action' to report the wrong value. We could remove the test for MD_RECOVERY_NEEDED, but doing so would leave a small gap after requesting a sync action, where 'sync_action' would still report the old value. So make sure that for a read-only array, 'sync_action' always returns 'idle'. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
NeilBrown	8299d7f7c0	md: fix a bug in some never-used code. http://bugzilla.kernel.org/show_bug.cgi?id=3277 There is a seq_printf here that isn't being passed a 'seq'. Howeve as the code is inside #ifdef MD_DEBUG, nobody noticed. Also remove some extra spaces. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Michael J. Evans	4d936ec1fd	md: software Raid autodetect dev list not array In current release kernels the md module (Software RAID) uses a static array (dev_t[128]) to store partition/device info temporarily for autostart. I discovered this (and that the devices are added as disks/partitions are discovered at boot) while I was debugging why only one of my MD arrays would come up whole, while all the others were short a disk. I eventually discovered that it was enumerating through all of 9 of my 11 hds (2 had only 4 partitions apiece) while the other 9 have 15 partitions (I wanted 64 per drive...). The last partition of the 8th drive in my 9 drive raid 5 sets wasn't added, thus making the final md array short both a parity and data disk, and it was started later, elsewhere. This patch replaces that static array with a list. [akpm@linux-foundation.org: removed unused var] Signed-off-by: Michael J. Evans <mjevans1983@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Neil Brown	644bd2f048	Fix memory leak in dm-crypt dm-crypt used the ->bi_size member in the bio endio handling to free the appropriate pages, but it frees all of it from both call paths. With the ->bi_end_io() changes, ->bi_size was always 0 since we don't do partial completes. This caused dm-crypt to leak memory. Fix this by removing the size argument from crypt_free_buffer_pages(). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-16 13:48:46 +02:00
Jens Axboe	fd5d806266	block: convert blkdev_issue_flush() to use empty barriers Then we can get rid of ->issue_flush_fn() and all the driver private implementations of that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-16 11:05:02 +02:00
Geert Uytterhoeven	0dddfd46f0	dm: emc_endio returns void emc_endio returns void: linux/drivers/md/dm-emc.c: In function 'emc_endio': linux/drivers/md/dm-emc.c:58: warning: 'return' with a value, in function returning void Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-13 09:41:03 -07:00
Greg Kroah-Hartman	19c38de88a	kobjects: fix up improper use of the kobject name field A number of different drivers incorrect access the kobject name field directly. This is not correct as the name might not be in the array. Use the proper accessor function instead.	2007-10-12 14:51:02 -07:00
NeilBrown	6712ecf8f6	Drop 'size' argument from bio_endio and bi_end_io As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:57 +02:00
NeilBrown	66846572bf	Stop exporting blk_rq_bio_prep blk_rq_bio_prep is exported for use in exactly one place. That place can benefit from using the new blk_rq_append_bio instead. So - change dm-emc to call blk_rq_append_bio - stop exporting blk_rq_bio_prep, and - initialise rq_disk in blk_rq_bio_prep, as dm-emc needs it. Signed-off-by: Neil Brown <neilb@suse.de> diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:56 +02:00
Dan Williams	e4d84909dd	raid5: fix 2 bugs in ops_complete_biofill 1/ ops_complete_biofill tried to avoid calling handle_stripe since all the state necessary to return read completions is available. However the process of determining whether more read requests are pending requires locking the stripe (to block add_stripe_bio from updating dev->toead). ops_complete_biofill can run in tasklet context, so rather than upgrading all the stripe locks from spin_lock to spin_lock_bh this patch just unconditionally reschedules handle_stripe after completing the read request. 2/ ops_complete_biofill needlessly qualified processing R5_Wantfill with dev->toread. The result being that the 'biofill' pending bit is cleared before handling the pending read-completions on dev->read. R5_Wantfill can be unconditionally handled because the 'biofill' pending bit prevents new R5_Wantfill requests from being seen by ops_run_biofill and ops_complete_biofill. Found-by: Yuri Tikhonov <yur@emcraft.com> [neilb@suse.de: simpler fix for bug 1 than moving code] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-09-24 13:23:35 -07:00
aherrman@arcor.de	2123a09f3f	Fix kernel buuild with (CONFIG_COMPAT && ! CONFIG_BLOCK) Commit `02a5e0acb3` ("BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n") broke the kernel build for the CONFIG_COMPAT && !CONFIG_BLOCK case: CC fs/compat_ioctl.o In file included from include/linux/raid/md_k.h:19, from include/linux/raid/md.h:54, from fs/compat_ioctl.c:25: include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:40: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:48: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h:51: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:64: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_merge_: include/linux/raid/../../../drivers/md/dm-bio-list.h:78: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h: In bio_list_: include/linux/raid/../../../drivers/md/dm-bio-list.h:90: error: dereferencing pointer to incomplete type include/linux/raid/../../../drivers/md/dm-bio-list.h:94: error: dereferencing pointer to incomplete type make[1]: * [fs/compat_ioctl.o] Error 1 make: * [fs] Error 2 Signed-off-by: Andreas Herrmann <aherrman@arcor.de> Acked-By: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-14 13:56:47 -07:00
NeilBrown	a2e0855182	md: fix some bugs with growing raid5/raid6 arrays. The recent changed to raid5 to allow offload of parity calculation etc introduced some bugs in the code for growing (i.e. adding a disk to) raid5 and raid6. This fixes them Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-09-11 17:21:19 -07:00
Andrew Vasquez	f99ba18a96	dm-mpath-rdac: don't stomp on a requests transfer bit Without this, we get qla2xxx complaining about "ISP System Error". What's happening here is the firmware is detecting a Xfer-ready from the storage when in fact the data-direction for a mode-select should be a write (DATA_OUT). The following patch fixes the problem (typo). Verified by Brian, as well. Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com> Verified-by: Brian De Wolf <bldewolf@csupomona.edu> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-27 16:15:44 -07:00
Randy Dunlap	6e106b0d97	DM_MULTIPATH_RDAC: "scsi_normalize_sense" undefined DM_MULTIPATH_RDAC uses SCSI API(s) and is for a SCSI device, so add SCSI to its depends on to prevent build errors. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> [ Tested and Verified by Chandra Seetharaman ] Acked-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-24 16:10:39 -07:00
NeilBrown	a88aa7865b	md: correctly update sysfs when a raid1 is reshaped When a raid1 array is reshaped (number of drives changed), the list of devices is compacted, so that slots for missing devices are filled with working devices from later slots. This requires the "rd%d" symlinks in sysfs to be updated. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
NeilBrown	918f02383f	md: make sure a re-add after a restart honours bitmap when resyncing Commit `1757128438` was slightly bad. If an array has a write-intent bitmap, and you remove a drive, then readd it, only the changed parts should be resynced. However after the above commit, this only works if the array has not been shut down and restarted. This is because it sets 'fullsync' at little more often than it should. This patch is more careful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-08-22 19:52:46 -07:00
Alan D. Brunelle	c7149d6bce	Fix remap handling by blktrace This patch provides more information concerning REMAP operations on block IOs. The additional information provides clearer details at the user level, and supports post-processing analysis in btt. o Adds in partition remaps on the same device. o Fixed up the remap information in DM to be in the right order o Sent up mapped-from and mapped-to device information Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-08-11 22:34:48 +02:00
Arne Redlich	f6f953aa99	md: handle writes to broken raid10 arrays gracefully When writing to a broken array, raid10 currently happily emits empty bio lists. IOW, the master bio will never be completed, sending writers to UNINTERRUPTIBLE_SLEEP forever. Signed-off-by: Arne Redlich <agr@powerkom-dd.de> Acked-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-31 15:39:38 -07:00
Maik Hampel	14e713446a	md: raid10: fix use-after-free of bio In case of read errors raid10d tries to print a nice error message, unfortunately using data from an already put bio. Signed-off-by: Maik Hampel <m.hampel@gmx.de> Acked-By: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-31 15:39:38 -07:00
Jens Axboe	165125e1e4	[BLOCK] Get rid of request_queue_t typedef Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweet of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-07-24 09:28:11 +02:00
Milan Broz	80b16c192e	dm io: fix panic on large request Flush workqueue before releasing bioset and mopools in dm-crypt. There can be finished but not yet released request. Call chain causing oops: run workqueue dec_pending bio_endio(...); <remove device request - remove mempool> mempool_free(io, cc->io_pool); This usually happens when cryptsetup create temporary luks mapping in the beggining of crypt device activation. When dm-core calls destructor crypt_dtr, no new request are possible. Signed-off-by: Milan Broz <mbroz@redhat.com> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Alasdair G Kergon <agk@redhat.com> Cc: Christophe Saout <christophe@saout.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-21 17:49:14 -07:00
Dan Williams	eb0645a8b1	async_tx: fix kmap_atomic usage in async_memcpy Andrew Morton: [async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap slot for both src add dest and we get either corrupted data or a BUG. Evgeniy Polyakov: Btw, shouldn't it always be kmap_atomic() even if flag is not set. That pages are usual one returned by alloc_page(). So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC flags. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-20 08:44:19 -07:00
Paul Mundt	20c2df83d2	mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's `c59def9f22` change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2007-07-20 10:11:58 +09:00
Yoann Padioleau	dd00cc486a	some kmalloc/memset ->kzalloc (tree wide) Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc). Here is a short excerpt of the semantic patch performing this transformation: @@ type T2; expression x; identifier f,fld; expression E; expression E1,E2; expression e1,e2,e3,y; statement S; @@ x = - kmalloc + kzalloc (E1,E2) ... when != $x->fld=E;\\|y=f(...,x,...);\\|f(...,x,...);\\|x=E;\\|while(...) S\\|for(e1;e2;e3) S$ - memset((T2)x,0,E1); @@ expression E1,E2,E3; @@ - kzalloc(E1 * E2,E3) + kcalloc(E1,E2,E3) [akpm@linux-foundation.org: get kcalloc args the right way around] Signed-off-by: Yoann Padioleau <padator@wanadoo.fr> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Acked-by: Russell King <rmk@arm.linux.org.uk> Cc: Bryan Wu <bryan.wu@analog.com> Acked-by: Jiri Slaby <jirislaby@gmail.com> Cc: Dave Airlie <airlied@linux.ie> Acked-by: Roland Dreier <rolandd@cisco.com> Cc: Jiri Kosina <jkosina@suse.cz> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Acked-by: Pierre Ossman <drzeus-list@drzeus.cx> Cc: Jeff Garzik <jeff@garzik.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Greg KH <greg@kroah.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:50 -07:00
Jesper Juhl	851a8a7fd4	dm: fix memory leak in dm_create_persistent() when starting metadata update thread fails If, in dm_create_persistent(), the call to create_singlethread_workqueue() fails then we'll return without freeing the memory allocated to 'ps', thus leaking sizeof(struct pstore) bytes. This patch fixes the leak. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-18 08:38:22 -07:00
NeilBrown	4ad1366376	md: change bitmap_unplug and others to void functions bitmap_unplug only ever returns 0, so it may as well be void. Two callers try to print a message if it returns non-zero, but that message is already printed by bitmap_file_kick. write_page returns an error which is not consistently checked. It always causes BITMAP_WRITE_ERROR to be set on an error, and that can more conveniently be checked. When the return of write_page is checked, an error causes bitmap_file_kick to be called - so move that call into write_page - and protect against recursive calls into bitmap_file_kick. bitmap_update_sb returns an error that is never checked. So make these 'void' and be consistent about checking the bit. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	f0d76d70bc	md: check that internal bitmap does not overlap other data We current completely trust user-space to set up metadata describing an consistant array. In particlar, that the metadata, data, and bitmap do not overlap. But userspace can be buggy, and it is better to report an error than corrupt data. So put in some appropriate checks. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	713f6ab18b	md: improve the is_mddev_idle test fix Don't use 'unsigned' variable to track sync vs non-sync IO, as the only thing we want to do with them is a signed comparison, and fix up the comment which had become quite wrong. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
NeilBrown	df968c4e8d	md: improve message about invalid superblock during autodetect People try to use raid auto-detect with version-1 superblocks (which is not supported) and get confused when they are told they have an invalid superblock. So be more explicit, and say it it is not a valid v0.90 superblock. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
Jan Engelhardt	afd44034ac	Use menuconfig objects II - MD Change Kconfig objects from "menu, config" into "menuconfig" so that the user can disable the whole feature without having to enter the menu first. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:15 -07:00
Akinobu Mita	00d59405cf	unregister_blkdev() delete redundant messages in callers No need to warn unregister_blkdev() failure by the callers. (The previous patch makes unregister_blkdev() print error message in error case) Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:03 -07:00
Rafael J. Wysocki	8314418629	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Linus Torvalds	e030dbf91a	Merge branch 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop * 'ioat-md-accel-for-linus' of git://lost.foo-projects.org/~dwillia2/git/iop: (28 commits) ioatdma: add the unisys "i/oat" pci vendor/device id ARM: Add drivers/dma to arch/arm/Kconfig iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver iop13xx: surface the iop13xx adma units to the iop-adma driver dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines md: remove raid5 compute_block and compute_parity5 md: handle_stripe5 - request io processing in raid5_run_ops md: handle_stripe5 - add request/completion logic for async expand ops md: handle_stripe5 - add request/completion logic for async read ops md: handle_stripe5 - add request/completion logic for async check ops md: handle_stripe5 - add request/completion logic for async compute ops md: handle_stripe5 - add request/completion logic for async write ops md: common infrastructure for running operations with raid5_run_ops md: raid5_run_ops - run stripe operations outside sh->lock raid5: replace custom debug PRINTKs with standard pr_debug raid5: refactor handle_stripe5 and handle_stripe6 (v3) async_tx: add the async_tx api xor: make 'xor_blocks' a library routine for use with async_tx dmaengine: make clients responsible for managing channels dmaengine: refactor dmaengine around dma_async_tx_descriptor ...	2007-07-13 10:52:27 -07:00
Dan Williams	f6dff381af	md: remove raid5 compute_block and compute_parity5 replaced by raid5_run_ops Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:18 -07:00
Dan Williams	830ea01673	md: handle_stripe5 - request io processing in raid5_run_ops I/O submission requests were already handled outside of the stripe lock in handle_stripe. Now that handle_stripe is only tasked with finding work, this logic belongs in raid5_run_ops. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	f0a50d3754	md: handle_stripe5 - add request/completion logic for async expand ops When a stripe is being expanded bulk copying takes place to move the data from the old stripe to the new. Since raid5_run_ops only operates on one stripe at a time these bulk copies are handled in-line under the stripe lock. In the dma offload case we poll for the completion of the operation. After the data has been copied into the new stripe the parity needs to be recalculated across the new disks. We reuse the existing postxor functionality to carry out this calculation. By setting STRIPE_OP_POSTXOR without setting STRIPE_OP_BIODRAIN the completion path in handle stripe can differentiate expand operations from normal write operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	b5e98d65d3	md: handle_stripe5 - add request/completion logic for async read ops When a read bio is attached to the stripe and the corresponding block is marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy the data from the stripe cache to the bio buffer. handle_stripe flags the blocks to be operated on with the R5_Wantfill flag. If new read requests arrive while raid5_run_ops is running they will not be handled until handle_stripe is scheduled to run again. Changelog: * cleanup to_read and to_fill accounting * do not fail reads that have reached the cache Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	e89f89629b	md: handle_stripe5 - add request/completion logic for async check ops Check operations are scheduled when the array is being resynced or an explicit 'check/repair' command was sent to the array. Previously check operations would destroy the parity block in the cache such that even if parity turned out to be correct the parity block would be marked !R5_UPTODATE at the completion of the check. When the operation can be carried out by a dma engine the assumption is that it can check parity as a read-only operation. If raid5_run_ops notices that the check was handled by hardware it will preserve the R5_UPTODATE status of the parity disk. When a check operation determines that the parity needs to be repaired we reuse the existing compute block infrastructure to carry out the operation. Repair operations imply an immediate write back of the data, so to differentiate a repair from a normal compute operation the STRIPE_OP_MOD_REPAIR_PD flag is added. Changelog: * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	f38e12199a	md: handle_stripe5 - add request/completion logic for async compute ops handle_stripe will compute a block when a backing disk has failed, or when it determines it can save a disk read by computing the block from all the other up-to-date blocks. Previously a block would be computed under the lock and subsequent logic in handle_stripe could use the newly up-to-date block. With the raid5_run_ops implementation the compute operation is carried out a later time outside the lock. To preserve the old functionality we take advantage of the dependency chain feature of async_tx to flag the block as R5_Wantcompute and then let other parts of handle_stripe operate on the block as if it were up-to-date. raid5_run_ops guarantees that the block will be ready before it is used in another operation. However, this only works in cases where the compute and the dependent operation are scheduled at the same time. If a previous call to handle_stripe sets the R5_Wantcompute flag there is no facility to pass the async_tx dependency chain across successive calls to raid5_run_ops. The req_compute variable protects against this case. Changelog: * remove the req_compute BUG_ON Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:17 -07:00
Dan Williams	e33129d841	md: handle_stripe5 - add request/completion logic for async write ops After handle_stripe5 decides whether it wants to perform a read-modify-write, or a reconstruct write it calls handle_write_operations5. A read-modify-write operation will perform an xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the new data into the stripe (biodrain) and perform a postxor operation across all up-to-date blocks to generate the new parity. A reconstruct write is run when all blocks are already up-to-date in the cache so all that is needed is a biodrain and postxor. On the completion path STRIPE_OP_PREXOR will be set if the operation was a read-modify-write. The STRIPE_OP_BIODRAIN flag is used in the completion path to differentiate write-initiated postxor operations versus expansion-initiated postxor operations. Completion of a write triggers i/o to the drives. Changelog: * make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:16 -07:00
Dan Williams	d84e0f10d3	md: common infrastructure for running operations with raid5_run_ops All the handle_stripe operations that are to be transitioned to use raid5_run_ops need a method to coherently gather work under the stripe-lock and hand that work off to raid5_run_ops. The 'get_stripe_work' routine runs under the lock to read all the bits in sh->ops.pending that do not have the corresponding bit set in sh->ops.ack. This modified 'pending' bitmap is then passed to raid5_run_ops for processing. The transition from 'ack' to 'completion' does not need similar protection as the existing release_stripe infrastructure will guarantee that handle_stripe will run again after a completion bit is set, and handle_stripe can tolerate a sh->ops.completed bit being set while the lock is held. A call to async_tx_issue_pending_all() is added to raid5d to kick the offload engines once all pending stripe operations work has been submitted. This enables batching of the submission and completion of operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:16 -07:00
Dan Williams	91c0092484	md: raid5_run_ops - run stripe operations outside sh->lock When the raid acceleration work was proposed, Neil laid out the following attack plan: 1/ move the xor and copy operations outside spin_lock(&sh->lock) 2/ find/implement an asynchronous offload api The raid5_run_ops routine uses the asynchronous offload api (async_tx) and the stripe_operations member of a stripe_head to carry out xor+copy operations asynchronously, outside the lock. To perform operations outside the lock a new set of state flags is needed to track new requests, in-flight requests, and completed requests. In this new model handle_stripe is tasked with scanning the stripe_head for work, updating the stripe_operations structure, and finally dropping the lock and calling raid5_run_ops for processing. The following flags outline the requests that handle_stripe can make of raid5_run_ops: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate a missing block in the cache from the other blocks STRIPE_OP_PREXOR - subtract existing data as part of the read-modify-write process STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct STRIPE_OP_IO - submit i/o to the member disks (note this was already performed outside the stripe lock, but it made sense to add it as an operation type The flow is: 1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending 2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the operation to the async_tx api 3/ async_tx triggers the completion callback routine to set sh->ops.complete and release the stripe 4/ handle_stripe runs again to finish the operation and optionally submit new operations that were previously blocked Note this patch just defines raid5_run_ops, subsequent commits (one per major operation type) modify handle_stripe to take advantage of this routine. Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the raid5_run_ops workqueue * call bi_end_io for reads in ops_complete_biofill, saves a call to handle_stripe * explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown * fix race between async engines and bi_end_io call for reads, Neil Brown * remove unnecessary spin_lock from ops_complete_biofill * remove test_and_set/test_and_clear BUG_ONs, Neil Brown * remove explicit interrupt handling for channel switching, this feature was absorbed (i.e. it is now implicit) by the async_tx api * use return_io in ops_complete_biofill Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	45b4233caa	raid5: replace custom debug PRINTKs with standard pr_debug Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in favor of the global DEBUG definition. To get local debug messages just add '#define DEBUG' to the top of the file. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	a445685647	raid5: refactor handle_stripe5 and handle_stripe6 (v3) handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes: * v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path * v3: removed the unused 'dirty' field from struct stripe_head_state * v3: coalesced open coded bi_end_io routines into return_io() Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:15 -07:00
Dan Williams	9bc89cd82d	async_tx: add the async_tx api The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources. I imagine that any piece of ADMA hardware would register with the 'async_' subsystem, and a call to async_X would be routed as appropriate, or be run in-line. - Neil Brown async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format: struct dma_async_tx_descriptor async_<operation>(..., struct dma_async_tx_descriptor depend_tx, dma_async_tx_callback cb_fn, void cb_param) { struct dma_chan chan = async_tx_find_channel(depend_tx, <operation>); struct dma_device device = chan ? chan->device : NULL; int int_en = cb_fn ? 1 : 0; struct dma_async_tx_descriptor tx = device ? device->device_prep_dma_<operation>(chan, len, int_en) : NULL; if (tx) { / run <operation> asynchronously / ... tx->tx_set_dest(addr, tx, index); ... tx->tx_set_src(addr, tx, index); ... async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param); } else { / run <operation> synchronously / ... <operation> ... async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param); } return tx; } async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. A xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and a xor resource. Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines. On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On a x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s. The tiobench command line used for testing was: tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5 iop342 had 1GB of memory available Details: * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL * when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. if the the channel does not support interrupts then a live polling wait will be performed * the api is written as a dmaengine client that requests all available channels * In support of dependencies the api implicitly schedules channel-switch interrupts. The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel * Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take a an explicit destination address to mirror the hardware. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST * drop dma mapping methods, suggested by Chris Leech * printk warning fixups from Andrew Morton * don't use inline in C files, Adrian Bunk * select the API when MD is enabled * BUG_ON xor source counts <= 1 * implicitly handle hardware concerns like channel switching and interrupts, Neil Brown * remove the per operation type list, and distribute operation capabilities evenly amongst the available channels * simplify async_tx_find_channel to optimize the fast path * introduce the channel_table_initialized flag to prevent early calls to the api * reorganize the code to mimic crypto * include mm.h as not all archs include it in dma-mapping.h * make the Kconfig options non-user visible, Adrian Bunk * move async_tx under crypto since it is meant as 'core' functionality, and the two may share algorithms in the future * move large inline functions into c files * checkpatch.pl fixes * gpl v2 only correction Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>	2007-07-13 08:06:14 -07:00
Dan Williams	685784aaf3	xor: make 'xor_blocks' a library routine for use with async_tx The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <bunk@stusta.de> Cc: NeilBrown <neilb@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2007-07-13 08:06:14 -07:00
Chandra Seetharaman	dd172d72ad	dm mpath: rdac This patch supports LSI/Engenio devices in RDAC mode. Like dm-emc it requires userspace support. In your multipath.conf file you must have: path_checker rdac hardware_handler "1 rdac" prio_callout "/sbin/mpath_prio_tpc /dev/%n" And you also then must have a updated multipath tools release which has rdac support. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:23 -07:00
Jonathan Brassow	fc1ff9588a	dm raid1: handle log failure When writing to a mirror, the log must be updated first. Failure to update the log could result in the log not properly reflecting the state of the mirror if the machine should crash. We change the return type of the rh_flush function to give us the ability to check if a log write was successful. If the log write was unsuccessful, we fail the writes to avoid the case where the log does not properly reflect the state of the mirror. A follow-up patch - which is dependent on the ability to requeue I/O's to core device-mapper - will requeue the I/O's for retry (allowing the mirror to be reconfigured.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	f44db678ed	dm raid1: handle resync failures Device-mapper mirroring currently takes a best effort approach to recovery - failures during mirror synchronization are completely ignored. This means that regions are marked 'in-sync' and 'clean' and removed from the hash list. Future reads and writes that query the region will incorrectly interpret the region as in-sync. This patch handles failures during the recovery process. If a failure occurs, the region is marked as 'not-in-sync' (aka RH_NOSYNC) and added to a new list 'failed_recovered_regions'. Regions on the 'failed_recovered_regions' list are not marked as 'clean' upon removal from the list. Furthermore, if the DM_RAID1_HANDLE_ERRORS flag is set, the region is marked as 'not-in-sync'. This action prevents any future read-balancing from choosing an invalid device because of the 'not-in-sync' status. If "handle_errors" is not specified when creating a mirror (leaving the DM_RAID1_HANDLE_ERRORS flag unset), failures will be ignored exactly as they would be without this patch. This is to preserve backwards compatibility with user-space tools, such as 'pvmove'. However, since future read-balancing policies will rely on the correct sync status of a region, a user must choose "handle_errors" when using read-balancing. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	d0d444c7d4	dm: add ratelimit logging macros Add ratelimit extension to dm logging macros. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Stefan Bader	07a83c47cf	dm: disable barriers This patch causes device-mapper to reject any barrier requests. This is done since most of the targets won't handle this correctly anyway. So until the situation improves it is better to reject these requests at the first place. Since barrier requests won't get to the targets, the checks there can be removed. Cc: stable@kernel.org Signed-off-by: Stefan Bader <shbader@de.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Jonathan Brassow	943317efdb	dm raid1: clear region outside spinlock A clear_region function is permitted to block (in practice, rare) but gets called in rh_update_states() with a spinlock held. The bits being marked and cleared by the above functions are used to update the on-disk log, but are never read directly. We can perform these operations outside the spinlock since the bits are only changed within one thread viz. - mark_region in rh_inc() - clear_region in rh_update_states(). So, we grab the clean_regions list items via list_splice() within the spinlock and defer clear_region() until we iterate over the list for deletion - similar to how the recovered_regions list is already handled. We then move the flush() call down to ensure it encapsulates the changes which are done by the later calls to clear_region(). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	0764147b11	dm snapshot: permit invalid activation Allow invalid snapshots to be activated instead of failing. This allows userspace to reinstate any given snapshot state - for example after an unscheduled reboot - and clean up the invalid snapshot at its leisure. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00
Milan Broz	fcac03abd3	dm snapshot: fix invalidation deadlock Process persistent exception store metadata IOs in a separate thread. A snapshot may become invalid while inside generic_make_request(). A synchronous write is then needed to update the metadata while still inside that function. Since the introduction of md-dm-reduce-stack-usage-with-stacked-block-devices.patch this has to be performed by a separate thread to avoid deadlock. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-12 15:01:08 -07:00

... 3 4 5 6 7 ...

1045 Commits