Commit Graph

888 Commits

Author SHA1 Message Date
Tejun Heo
9934c8c045 block: implement and enforce request peek/start/fetch
Till now block layer allowed two separate modes of request execution.
A request is always acquired from the request queue via
elv_next_request().  After that, drivers are free to either dequeue it
or process it without dequeueing.  Dequeue allows elv_next_request()
to return the next request so that multiple requests can be in flight.

Executing requests without dequeueing has its merits mostly in
allowing drivers for simpler devices which can't do sg to deal with
segments only without considering request boundary.  However, the
benefit this brings is dubious and declining while the cost of the API
ambiguity is increasing.  Segment based drivers are usually for very
old or limited devices and as converting to dequeueing model isn't
difficult, it doesn't justify the API overhead it puts on block layer
and its more modern users.

Previous patches converted all block low level drivers to dequeueing
model.  This patch completes the API transition by...

* renaming elv_next_request() to blk_peek_request()

* renaming blkdev_dequeue_request() to blk_start_request()

* adding blk_fetch_request() which is combination of peek and start

* disallowing completion of queued (not started) requests

* applying new API to all LLDs

Renamings are for consistency and to break out of tree code so that
it's apparent that out of tree drivers need updating.

[ Impact: block request issue API cleanup, no functional change ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: unsik Kim <donari75@gmail.com>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Laurent Vivier <Laurent@lvivier.info>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:52:18 +02:00
Tejun Heo
a2dec7b363 block: hide request sector and data_len
Block low level drivers for some reason have been pretty good at
abusing block layer API.  Especially struct request's fields tend to
get violated in all possible ways.  Make it clear that low level
drivers MUST NOT access or manipulate rq->sector and rq->data_len
directly by prefixing them with double underscores.

This change is also necessary to break build of out-of-tree codes
which assume the previous block API where internal fields can be
manipulated and rq->data_len carries residual count on completion.

[ Impact: hide internal fields, block API change ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:55 +02:00
Tejun Heo
2e46e8b27a block: drop request->hard_* and *nr_sectors
struct request has had a few different ways to represent some
properties of a request.  ->hard_* represent block layer's view of the
request progress (completion cursor) and the ones without the prefix
are supposed to represent the issue cursor and allowed to be updated
as necessary by the low level drivers.  The thing is that as block
layer supports partial completion, the two cursors really aren't
necessary and only cause confusion.  In addition, manual management of
request detail from low level drivers is cumbersome and error-prone at
the very least.

Another interesting duplicate fields are rq->[hard_]nr_sectors and
rq->{hard_cur|current}_nr_sectors against rq->data_len and
rq->bio->bi_size.  This is more convoluted than the hard_ case.

rq->[hard_]nr_sectors are initialized for requests with bio but
blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
initialized for all request but blk_rq_bytes() uses it only for pc
requests.  This causes good amount of confusion throughout block layer
and its drivers and determining the request length has been a bit of
black magic which may or may not work depending on circumstances and
what the specific LLD is actually doing.

rq->{hard_cur|current}_nr_sectors represent the number of sectors in
the contiguous data area at the front.  This is mainly used by drivers
which transfers data by walking request segment-by-segment.  This
value always equals rq->bio->bi_size >> 9.  However, data length for
pc requests may not be multiple of 512 bytes and using this field
becomes a bit confusing.

In general, having multiple fields to represent the same property
leads only to confusion and subtle bugs.  With recent block low level
driver cleanups, no driver is accessing or manipulating these
duplicate fields directly.  Drop all the duplicates.  Now rq->sector
means the current sector, rq->data_len the current total length and
rq->bio->bi_size the current segment length.  Everything else is
defined in terms of these three and available only through accessors.

* blk_recalc_rq_sectors() is collapsed into blk_update_request() and
  now handles pc and fs requests equally other than rq->sector update.
  This means that now pc requests can use partial completion too (no
  in-kernel user yet tho).

* bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
  now uses byte count as the primary data length.

* blk_rq_pos() is now guranteed to be always correct.  In-block users
  converted.

* blk_rq_bytes() is now guaranteed to be always valid as is
  blk_rq_sectors().  In-block users converted.

* blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
  More convenient one is used.

* blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
  pointer to request.

[ Impact: API cleanup, single way to represent one property of a request ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:54 +02:00
Tejun Heo
83096ebf12 block: convert to pos and nr_sectors accessors
With recent cleanups, there is no place where low level driver
directly manipulates request fields.  This means that the 'hard'
request fields always equal the !hard fields.  Convert all
rq->sectors, nr_sectors and current_nr_sectors references to
accessors.

While at it, drop superflous blk_rq_pos() < 0 test in swim.c.

[ Impact: use pos and nr_sectors accessors ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Adrian McMenamin <adrian@mcmen.demon.co.uk>
Acked-by: Mike Miller <mike.miller@hp.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Dario Ballabio <ballabio_dario@emc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: unsik Kim <donari75@gmail.com>
Cc: Laurent Vivier <Laurent@lvivier.info>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:54 +02:00
Tejun Heo
5b93629b45 block: implement blk_rq_pos/[cur_]sectors() and convert obvious ones
Implement accessors - blk_rq_pos(), blk_rq_sectors() and
blk_rq_cur_sectors() which return rq->hard_sector, rq->hard_nr_sectors
and rq->hard_cur_sectors respectively and convert direct references of
the said fields to the accessors.

This is in preparation of request data length handling cleanup.

Geert	: suggested adding const to struct request * parameter to accessors
Sergei	: spotted error in patch description

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Ackec-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
Tejun Heo
c3a4d78c58 block: add rq->resid_len
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion.  This duality creates some
headaches.

First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing.  It could be
the total request length or it coulde be anything else one of the
lower layers is using to keep track of residual count.  This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.

Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred.  The residual count is
an exception not the norm; however, the driver should clear
rq->data_len to zero to signify the normal cases while leaving it
alone means no data transfer occurred at all.  This reverse default
behavior complicates code unnecessarily and renders block PC on some
drivers (ide-tape/floppy) unuseable.

This patch adds rq->resid_len which is used only for residual count.

While at it, remove now unnecessasry blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.

Boaz	: spotted missing conversion in osd
Sergei	: spotted too early conversion to blk_rq_bytes() in ide-tape

[ Impact: cleanup residual count handling, report 0 resid by default ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
Tejun Heo
4c94dece1b block: don't init rq fields unnecessarily
blk_get_request() always returns properly zeroed requests.  Don't set
fields to zero/NULL unnecessarily.

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 08:14:50 +02:00
Nikanth Karthikesan
9eb55b030c block: catch trying to use more bits than request->cmd_flags has
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:37 +02:00
Jens Axboe
c69d48540c block: include discard requests in IO accounting
We currently don't do merging on discard requests, but we potentially
could. If we do, then we need to include discard requests in the IO
accounting, or merging would end up decrementing in_flight IO counters
for an IO which never incremented them.

So enable accounting for discard requests.

Problem found by Nikanth Karthikesan <knikanth@suse.de>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:37 +02:00
Jens Axboe
c2553b5844 block: make blk_do_io_stat() do the full "is this rq accountable" checks
We currently check for file system requests outside of blk_do_io_stat(rq),
but we may as well just include it.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:37 +02:00
Tejun Heo
731ec497e5 block: kill rq->data
Now that all block request data transfer is done via bio, rq->data
isn't used.  Kill it.

While at it, make the roles of rq->special and buffer clear.

[ Impact: drop now unncessary field from struct request ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
2009-04-28 07:37:36 +02:00
Tejun Heo
40cbbb781d block: implement and use [__]blk_end_request_all()
There are many [__]blk_end_request() call sites which call it with
full request length and expect full completion.  Many of them ensure
that the request actually completes by doing BUG_ON() the return
value, which is awkward and error-prone.

This patch adds [__]blk_end_request_all() which takes @rq and @error
and fully completes the request.  BUG_ON() is added to to ensure that
this actually happens.

Most conversions are simple but there are a few noteworthy ones.

* cdrom/viocd: viocd_end_request() replaced with direct calls to
  __blk_end_request_all().

* s390/block/dasd: dasd_end_request() replaced with direct calls to
  __blk_end_request_all().

* s390/char/tape_block: tapeblock_end_request() replaced with direct
  calls to blk_end_request_all().

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-04-28 07:37:35 +02:00
Tejun Heo
b243ddcbe9 block: move rq->start_time initialization to blk_rq_init()
rq->start_time was initialized in init_request_from_bio() so special
requests didn't have start_time set.  This has been okay as start_time
has been used only for fs requests; however, there is no indication of
this actually is the case or not.  Set rq->start_time in blk_rq_init()
and guarantee that all initialized rq's have its start_time set.  This
improves consistency at virtually no cost and future changes will make
use of the timestamp for !bio requests.

[ Impact: rq->start_time is valid for all requests ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:35 +02:00
Tejun Heo
2e60e02297 block: clean up request completion API
Request completion has gone through several changes and became a bit
messy over the time.  Clean it up.

1. end_that_request_data() is a thin wrapper around
   end_that_request_data_first() which checks whether bio is NULL
   before doing anything and handles bidi completion.
   blk_update_request() is a thin wrapper around
   end_that_request_data() which clears nr_sectors on the last
   iteration but doesn't use the bidi completion.

   Clean it up by moving the initial bio NULL check and nr_sectors
   clearing on the last iteration into end_that_request_data() and
   renaming it to blk_update_request(), which makes blk_end_io() the
   only user of end_that_request_data().  Collapse
   end_that_request_data() into blk_end_io().

2. There are four visible completion variants - blk_end_request(),
   __blk_end_request(), blk_end_bidi_request() and end_request().
   blk_end_request() and blk_end_bidi_request() uses blk_end_request()
   as the backend but __blk_end_request() and end_request() use
   separate implementation in __blk_end_request() due to different
   locking rules.

   blk_end_bidi_request() is identical to blk_end_io().  Collapse
   blk_end_io() into blk_end_bidi_request(), separate out request
   update into internal helper blk_update_bidi_request() and add
   __blk_end_bidi_request().  Redefine [__]blk_end_request() as thin
   inline wrappers around [__]blk_end_bidi_request().

3. As the whole request issue/completion usages are about to be
   modified and audited, it's a good chance to convert completion
   functions return bool which better indicates the intended meaning
   of return values.

4. The function name end_that_request_last() is from the days when it
   was a public interface and slighly confusing.  Give it a proper
   internal name - blk_finish_request().

5. Add description explaning that blk_end_bidi_request() can be safely
   used for uni requests as suggested by Boaz Harrosh.

The only visible behavior change is from #1.  nr_sectors counts are
cleared after the final iteration no matter which function is used to
complete the request.  I couldn't find any place where the code
assumes those nr_sectors counters contain the values for the last
segment and this change is good as it makes the API much more
consistent as the end result is now same whether a request is
completed using [__]blk_end_request() alone or in combination with
blk_update_request().

API further cleaned up per Christoph's suggestion.

[ Impact: cleanup, rq->*nr_sectors always updated after req completion ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Christoph Hellwig <hch@infradead.org>
2009-04-28 07:37:35 +02:00
Tejun Heo
0b302d5aa7 block: kill blk_end_request_callback()
With recent IDE updates, blk_end_request_callback() doesn't have any
user now.  Kill it.

[ Impact: removal of unused convoluted interface ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:34 +02:00
Tejun Heo
158dbda006 block: reorganize request fetching functions
Impact: code reorganization

elv_next_request() and elv_dequeue_request() are public block layer
interface than actual elevator implementation.  They mostly deal with
how requests interact with block layer and low level drivers at the
beginning of rqeuest processing whereas __elv_next_request() is the
actual eleveator request fetching interface.

Move the two functions to blk-core.c.  This prepares for further
interface cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:34 +02:00
Tejun Heo
5efccd17ce block: reorder request completion functions
Reorder request completion functions such that

* All request completion functions are located together.

* Functions which are used by only one caller is put right above the
  caller.

* end_request() is put after other completion functions but before
  blk_update_request().

This change is for completion function cleanup which will follow.

[ Impact: cleanup, code reorganization ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:34 +02:00
Tejun Heo
2eef33e439 block: clean up misc stuff after block layer timeout conversion
* In blk_rq_timed_out_timer(), else { if } to else if

* In blk_add_timer(), simplify if/else block

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:34 +02:00
Tejun Heo
10732f5661 block: cleanup REQ_SOFTBARRIER usages
blk_insert_request() doesn't need to worry about REQ_SOFTBARRIER.
Don't set it.  Combined with recent ide updates, REQ_SOFTBARRIER is
now only used in elevator proper and for discard requests.

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:34 +02:00
Tejun Heo
e4025f6c21 block: don't set REQ_NOMERGE unnecessarily
RQ_NOMERGE_FLAGS already clears defines which REQ flags aren't
mergeable.  There is no reason to specify it superflously.  It only
adds to confusion.  Don't set REQ_NOMERGE for barriers and requests
with specific queueing directive.  REQ_NOMERGE is now exclusively used
by the merging code.

[ Impact: cleanup ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:33 +02:00
Tejun Heo
a7f5579234 block: kill blk_start_queueing()
blk_start_queueing() is identical to __blk_run_queue() except that it
doesn't check for recursion.  None of the current users depends on
blk_start_queueing() running request_fn directly.  Replace usages of
blk_start_queueing() with [__]blk_run_queue() and kill it.

[ Impact: removal of mostly duplicate interface function ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:33 +02:00
Tejun Heo
a538cd03be block: merge blk_invoke_request_fn() into __blk_run_queue()
__blk_run_queue wraps blk_invoke_request_fn() such that it
additionally removes plug and bails out early if the queue is empty.
Both extra operations have their own pending mechanisms and don't
cause any harm correctness-wise when they are done superflously.

The only user of blk_invoke_request_fn() being blk_start_queue(),
there isn't much reason to keep both functions around.  Merge
blk_invoke_request_fn() into __blk_run_queue() and make
blk_start_queue() use __blk_run_queue() instead.

[ Impact: merge two subtly different internal functions ]

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-04-28 07:37:33 +02:00
Bartlomiej Zolnierkiewicz
db29a6b496 block: enable by default support for large devices and files on 32-bit archs
Enable by default support for large devices and files (CONFIG_LBD):

- With 1TB disks being a commodity hardware it is quite easy to hit 2TB
  limitation while building RAIDs etc. and many distros have been using
  CONFIG_LBD=y by default already (at least Fedora 10 and openSUSE 11.1).

- This should also prevent a subtle ext4 filesystem compatibility issue:
  mke2fs.ext4 defaults to creating filesystems with huge_files feature
  enabled and such filesystems cannot be later mounted read-write on
  machines with CONFIG_LBD=n (it should be quite easy to hit this issue
  when trying to use filesystem created using distro kernel on system
  running the self-build kernel, think about USB disk enclosures & co.).

While at it:

- Clarify config option help text w.r.t. mounting ext4 filesystems
  (they can be mounted with CONFIG_LBD=n but in the read-only mode).

Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:33 +02:00
Tejun Heo
924cec7789 block: clear req->errors on bio completion only for fs requests
Impact: subtle behavior change

For fs requests, rq is only carrier of bios and rq error status as a
whole doesn't mean much.  This is the reason why rq->errors is being
cleared on each partial completion of a request as on each partial
completion the error status is transferred to the respective bios.

For pc requests, rq->errors is used to carry error status to the
issuer and thus __end_that_request_first() doesn't clear it on such
cases.

The condition was fine till now as only fs and pc requests have used
bio and thus the bio completion path.  However, future changes will
unify data accesses to bio and all non fs users care about rq error
status.  Clear rq->errors on bio completion only for fs requests.

In general, the implicit clearing is a bit too subtle especially as
the meaning of rq->errors is completely dependent on low level
drivers.  Unifying / cleaning up rq->errors usage and letting llds
manage it would be better.  TODO comment added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
2009-04-28 07:37:28 +02:00
Jens Axboe
f2d1f0ae78 cfq-iosched: cache prio_tree root in cfqq->p_root
Currently we look it up from ->ioprio, but ->ioprio can change if
either the process gets its IO priority changed explicitly, or if
cfq decides to temporarily boost it. So if we are unlucky, we can
end up attempting to remove a node from a different rbtree root than
where it was added.

Fix this by using ->org_ioprio as the prio_tree index, since that
will only change for explicit IO priority settings (not for a boost).
Additionally cache the rbtree root inside the cfqq, then we don't have
to add code to reinsert the cfqq in the prio_tree if IO priority changes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24 08:54:22 +02:00
Jens Axboe
3ac6c9f8a6 cfq-iosched: fix bug with aliased request and cooperation detection
cfq_prio_tree_lookup() should return the direct match, yet it always
returns zero. Fix that.

cfq_prio_tree_add() assumes that we don't get a direct match, while
it is very possible that we do. Using O_DIRECT, you can have different
cfqq with matching requests, since you don't have the page cache
to serialize things for you. Fix this bug by only adding the cfqq if
there isn't an existing match.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24 08:54:22 +02:00
Jens Axboe
26a2ac009c cfq-iosched: clear ->prio_trees[] on cfqd alloc
Not strictly needed, but we should make it clear that we init the
rbtree roots here.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24 08:54:22 +02:00
Hannes Reinecke
17d5c8ca75 block: fix intermittent dm timeout based oops
Very rarely under stress testing of dm, oopses are occuring as
something tampers with an old stack frame.  This has been traced back
to blk_abort_queue() leaving a timeout_list pointing to the stack.
The reason is that sometimes blk_abort_request() won't delete the
timer (if the request is marked as complete but before the timer has
been removed, a small race window).  Fix this by splicing back from
the ususally empty list to the q->timeout_list.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24 08:54:21 +02:00
Jerome Marchand
42dad7647a block: simplify I/O stat accounting
This simplifies I/O stat accounting switching code and separates it
completely from I/O scheduler switch code.

Requests are accounted according to the state of their request queue
at the time of the request allocation. There is no need anymore to
flush the request queue when switching I/O accounting state.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-24 08:54:21 +02:00
Jeff Moyer
04dc6e71a2 cfq-iosched: use the default seek distance when there aren't enough seek samples
If the cfq io context doesn't have enough samples yet to provide a mean
seek distance, then use the default threshold we have for seeky IO instead
of defaulting to 0.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:11 +02:00
Jeff Moyer
4d00aa47e2 cfq-iosched: make seek_mean converge more quickly
Right now, depending on the first sector to which a process issues I/O,
the seek time may start out way out of whack. So make sure we start
with 0 sectors in seek, instead of the offset of the first request
issued.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:11 +02:00
Jens Axboe
b759113499 block: make blk_abort_queue() ignore non-request based devices
There's nothing to do for those devices, since the timeout handling is
based on requests.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:10 +02:00
Tejun Heo
71982a409f block: include empty disks in /proc/diskstats
/proc/diskstats used to show stats for all disks whether they're
zero-sized or not and their non-zero partitions.  Commit
074a7aca7a accidentally changed the
behavior such that it doesn't print out zero sized disks.  This patch
implements DISK_PITER_INCL_EMPTY_PART0 flag to partition iterator and
uses it in diskstats_show() such that empty part0 is shown in
/proc/diskstats.

Reported and bisectd by Dianel Collins.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Collins <solemnwarning@solemnwarning.no-ip.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:10 +02:00
Tejun Heo
cd0aca2d55 block: fix queue bounce limit setting
Impact: don't set GFP_DMA in q->bounce_gfp unnecessarily

All DMA address limits are expressed in terms of the last addressable
unit (byte or page) instead of one plus that.  However, when
determining bounce_gfp for 64bit machines in blk_queue_bounce_limit(),
it compares the specified limit against 0x100000000UL to determine
whether it's below 4G ending up falsely setting GFP_DMA in
q->bounce_gfp.

As DMA zone is very small on x86_64, this makes larger SG_IO transfers
very eager to trigger OOM killer.  Fix it.  While at it, rename the
parameter to @dma_mask for clarity and convert comment to proper
winged style.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:09 +02:00
Tejun Heo
25636e282f block: fix SG_IO vector request data length handling
Impact: fix SG_IO behavior such that it matches the documentation

SG_IO howto says that if ->dxfer_len and sum of iovec disagress, the
shorter one wins.  However, the current implementation returns -EINVAL
for such cases.  Trim iovc if it's longer than ->dxfer_len.

This patch uses iov_*() helpers which take struct iovec * by casting
struct sg_iovec * to it.  sg_iovec is always identical to iovec and
this will be further cleaned up with later patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:09 +02:00
Jens Axboe
a36e71f996 cfq-iosched: add close cooperator code
If we have processes that are working in close proximity to each
other on disk, we don't want to idle wait. Instead allow the close
process to issue a request, getting better aggregate bandwidth.
The anticipatory scheduler has similar checks, noop and deadline do
not need it since they don't care about process <-> io mappings.

The code for CFQ is a little more involved though, since we split
request queues into per-process contexts.

This fixes a performance problem with eg dump(8), since it uses
several processes in some silly attempt to speed IO up. Even if
dump(8) isn't really a valid case (it should be fixed by using
CLONE_IO), there are other cases where we see close processes
and where idling ends up hurting performance.

Credit goes to Jeff Moyer <jmoyer@redhat.com> for writing the
initial implementation.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:15:11 +02:00
Jens Axboe
9481ffdc61 cfq-iosched: log responsible 'cfqq' in idle timer arm
Makes it easier to read the traces.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:14:13 +02:00
Jens Axboe
2d87072296 cfq-iosched: tweak kick logic a bit more
We only kick the dispatch for an idling queue, if we think it's a
(somewhat) fully merged request. Also allow a kick if we have other
busy queues in the system, since we don't want to risk waiting for
a potential merge in that case. It's better to get some work done and
proceed.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:12:46 +02:00
Jens Axboe
40bb54d197 cfq-iosched: no need to save interrupts in cfq_kick_queue()
It's called from the workqueue handlers from process context, so
we always have irqs enabled when entered.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:11:10 +02:00
Nikanth Karthikesan
15afd1cc7b block: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_KERNEL implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
FUJITA Tomonori
91e463c8f5 block: fix SG_IO to return a proper error value
blk_rq_unmap_user() returns -EFAULT if a program passes an invalid
address to kernel. SG_IO path needs to pass the returned value to user
space instead of ignoring it.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Jens Axboe
d6ceb25e8d cfq-iosched: don't delay queue kick for a merged request
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> reports that commit
b029195dda introduced a regression
of about 50% with sequential threaded read workloads. The test
case is:

tiotest -k0 -k1 -k3 -f 80 -t 32

which starts 32 threads each reading a 80MB file. Twiddle the kick
queue logic so that we do start IO immediately, if it appears to be
a fully merged request. We can't really detect that, so just check
if the request is bigger than a page or not. The assumption is that
since single bio issues will first queue a single request with just
one page attached and then later do merges on that, if we already
have more than a page worth of data in the request, then the request
is most likely good to go.

Verified that this doesn't cause a regression with the test case that
commit b029195dda was fixing. It does not,
we still see maximum sized requests for the queue-then-merge cases.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:12 +02:00
Jens Axboe
1d6bfbdf38 as-iosched: get rid of private REQ_SYNC/REQ_ASYNC defines
We can just use the block layer BLK_RW_SYNC/ASYNC defines now.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:10 +02:00
Jens Axboe
ff6657c6c8 cfq-iosched: get rid of private SYNC/ASYNC defines
We can just use the block layer BLK_RW_SYNC/ASYNC defines now.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:10 +02:00
Jens Axboe
b0b78f81a5 cfq-iosched: use rw_is_sync() to see if rw flags are sync or not
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:10 +02:00
Jens Axboe
f600abe2de block: fix bad spelling of quiesce
Credit goes to Andrew Morton for spotting this one.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:09 +02:00
Linus Torvalds
c93f216b5b Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  branch tracer, intel-iommu: fix build with CONFIG_BRANCH_TRACER=y
  branch tracer: Fix for enabling branch profiling makes sparse unusable
  ftrace: Correct a text align for event format output
  Update /debug/tracing/README
  tracing/ftrace: alloc the started cpumask for the trace file
  tracing, x86: remove duplicated #include
  ftrace: Add check of sched_stopped for probe_sched_wakeup
  function-graph: add proper initialization for init task
  tracing/ftrace: fix missing include string.h
  tracing: fix incorrect return type of ns2usecs()
  tracing: remove CALLER_ADDR2 from wakeup tracer
  blktrace: fix pdu_len when tracing packet command requests
  blktrace: small cleanup in blk_msg_write()
  blktrace: NUL-terminate user space messages
  tracing: move scripts/trace/power.pl to scripts/tracing/power.pl
2009-04-07 14:10:10 -07:00
Linus Torvalds
6a5d263866 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  loop: mutex already unlocked in loop_clr_fd()
  cfq-iosched: don't let idling interfere with plugging
  block: remove unused REQ_UNPLUG
  cfq-iosched: kill two unused cfqq flags
  cfq-iosched: change dispatch logic to deal with single requests at the time
  mflash: initial support
  cciss: change to discover first memory BAR
  cciss: kernel scan thread for MSA2012
  cciss: fix residual count for block pc requests
  block: fix inconsistency in I/O stat accounting code
  block: elevator quiescing helpers
2009-04-07 11:06:41 -07:00
Linus Torvalds
132ea5e9aa Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  sata_mv: shorten register names
  sata_mv: workaround errata SATA#13
  sata_mv: cosmetic renames
  sata_mv: workaround errata SATA#26
  sata_mv: workaround errata PCI#7
  sata_mv: replace 0x1f with ATA_PIO4 (v2)
  sata_mv: fix irq mask races
  sata_mv: revert SoC irq breakage
  libata: ahci enclosure management bios workaround
  ata: Add TRIM infrastructure
  ata_piix: VGN-BX297XP wants the controller power up on suspend
  libata: Remove some redundant casts from pata_octeon_cf.c
  pata_artop: typo
2009-04-07 08:53:02 -07:00
Jens Axboe
b029195dda cfq-iosched: don't let idling interfere with plugging
When CFQ is waiting for a new request from a process, currently it'll
immediately restart queuing when it sees such a request. This doesn't
work very well with streamed IO, since we then end up splitting IO
that would otherwise have been merged nicely. For a simple dd test,
this causes 10x as many requests to be issued as we should have.
Normally this goes unnoticed due to the low overhead of requests
at the device side, but some hardware is very sensitive to request
sizes and there it can cause big slow downs.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-07 11:38:31 +02:00