[ Upstream commit a958937ff166fc60d1c3a721036f6ff41bfa2821 ]
When a stacked block device inserts a request into another block device
using blk_insert_cloned_request, the request's nr_phys_segments field gets
recalculated by a call to blk_recalc_rq_segments in
blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to
handle multi-segment discards. For disk types which can handle
multi-segment discards like nvme, this results in discard requests which
claim a single segment when it should report several, triggering a warning
in nvme and causing nvme to fail the discard from the invalid state.
WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
...
nvme_setup_cmd+0x217/0x270 [nvme_core]
nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
__blk_mq_try_issue_directly+0xe7/0x1b0
blk_mq_request_issue_directly+0x41/0x70
? blk_account_io_start+0x40/0x50
dm_mq_queue_rq+0x200/0x3e0
blk_mq_dispatch_rq_list+0x10a/0x7d0
? __sbitmap_queue_get+0x25/0x90
? elv_rb_del+0x1f/0x30
? deadline_remove_request+0x55/0xb0
? dd_dispatch_request+0x181/0x210
__blk_mq_do_dispatch_sched+0x144/0x290
? bio_attempt_discard_merge+0x134/0x1f0
__blk_mq_sched_dispatch_requests+0x129/0x180
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x47/0xe0
__blk_mq_delay_run_hw_queue+0x15b/0x170
blk_mq_sched_insert_requests+0x68/0xe0
blk_mq_flush_plug_list+0xf0/0x170
blk_finish_plug+0x36/0x50
xlog_cil_committed+0x19f/0x290 [xfs]
xlog_cil_process_committed+0x57/0x80 [xfs]
xlog_state_do_callback+0x1e0/0x2a0 [xfs]
xlog_ioend_work+0x2f/0x80 [xfs]
process_one_work+0x1b6/0x350
worker_thread+0x53/0x3e0
? process_one_work+0x350/0x350
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
This patch fixes blk_recalc_rq_segments to be aware of devices which can
have multi-segment discards. It calculates the correct discard segment
count by counting the number of bio as each discard bio is considered its
own segment.
Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9ec491447b90ad6a4056a9656b13f0b3a1e83043 ]
register_disk() suppress uevents for devices with the GENHD_FL_HIDDEN
but enables uevents at the end again in order to announce disk after
possible partitions are created.
When the device is removed the uevents are still on and user land sees
'remove' messages for devices which were never 'add'ed to the system.
KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)
Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the
uevents at all.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 97f433c3601a24d3513d06f575a389a2ca4e11e4 upstream.
We get I/O errors when we run md-raid1 on the top of dm-integrity on the
top of ramdisk.
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb
The ramdisk device has logical_block_size 512 and max_sectors 255. The
dm-integrity device uses logical_block_size 4096 and it doesn't affect the
"max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
a device with max_sectors not aligned on logical_block_size.
The md-raid device sees that the underlying leg has max_sectors 255 and it
will split the bios on 255-sector boundary, making the bios unaligned on
logical_block_size.
In order to fix the bug, we round down max_sectors to logical_block_size.
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 41e76c85660c022c6bf5713bfb6c21e64a487cec upstream.
bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:
[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read
[reader]
numjobs=4
directory=/mnt
where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).
Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.
Fixes: 058fdecc6d ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 388c705b95f23f317fa43e6abf9ff07b583b721a ]
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
sbitmap_get_shallow, which uses just the number to limit the scan depth of
each bitmap word, formula:
scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%
That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
But after commit patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead, as a example in following case:
sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
nothing.
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c635caef410aa757befbd8857c1eadde5cc22ed ]
On !PREEMPT kernel, we can get below softlockup when doing stress
testing with creating and destroying block cgroup repeatly. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.
[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700] blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701] cgwb_release_workfn+0x104/0x158
[ 4757.010702] process_one_work+0x1bc/0x3f0
[ 4757.010704] worker_thread+0x164/0x468
[ 4757.010705] kthread+0x108/0x138
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit bba91cdba6 which is
commit b0beb28097fa04177b3769f4bb7a0d0d9c4ae76e upstream.
It breaks things in 5.4.y, so let's drop it.
Reported-by: Andres Freund <andres@anarazel.de>
Cc: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
CC: Jens Axboe <axboe@kernel.dk>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ]
BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit aebf5db917055b38f4945ed6d621d9f07a44ff30 upstream.
Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.
Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fa4d0f1992a96f6d7c988ef423e3127e613f6ac9 upstream.
With the current implementation the following race can happen:
* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns
true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does
not call pm_request_resume() because the queue runtime status is
RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.
Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b7c ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ab3cee3762e5e69f27c302c43691289fdfc12316 ]
In blk_mq_dispatch_rq_list(), if blk_mq_sched_needs_restart() returns
true and the driver returns BLK_STS_RESOURCE then we'll kick the
queue. However, there's another case where we might need to kick it.
If we were unable to get budget we can be in much the same state as
when the driver returns BLK_STS_RESOURCE, so we should treat it the
same.
It should be noted that even if we add a whole bunch of extra kicking
to the queue in other patches this patch is still important.
Specifically any kicking that happened before we re-spliced leftover
requests into 'hctx->dispatch' wouldn't have found any work, so we
really need to make sure we kick ourselves after we've done the
splicing.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c92a41031a6d57395889b5c87cea359220a24d2a ]
Factor out the requeue handling from the dispatch code, this will make
subsequent addition of different requeueing schemes easier.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7a1d926dc4076aadad187614500afcd8de78818 ]
There is no need for the function __blkdev_reset_all_zones() as
REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones()
bio loop with an early break from the loop. This patch removes this
function and modifies blkdev_reset_zones(), simplifying the code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f255c19b3ab46d3cad3b1b2e1036f4c926cb1d0c ]
Similarly to commit 457e490f2b ("blkcg: allocate struct blkcg_gq
outside request queue spinlock"), blkg_create can also trigger
occasional -ENOMEM failures at the radix insertion because any
allocation inside blkg_create has to be non-blocking, making it more
likely to fail. This causes trouble for userspace tools trying to
configure io weights who need to deal with this condition.
This patch reduces the occurrence of -ENOMEMs on this path by preloading
the radix tree element on a GFP_KERNEL context, such that we guarantee
the later non-blocking insertion won't fail.
A similar solution exists in blkcg_init_queue for the same situation.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f4ac712e4fe009635344b9af5d890fe25fcc8c0d ]
syzbot is reporting unkillable task [1], for the caller is failing to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() message is unlikely useful.
[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 47ce030b7ac5a5259a9a5919f230b52497afc31a ]
blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.
Fixes: 1b97871b50 ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2b8bd423614c595540eaadcfbc702afe8e155e50 upstream.
Currently io_ticks is approximated by adding one at each start and end of
requests if jiffies counter has changed. This works perfectly for requests
shorter than a jiffy or if one of requests starts/ends at each jiffy.
If disk executes just one request at a time and they are longer than two
jiffies then only first and last jiffies will be accounted.
Fix is simple: at the end of request add up into io_ticks jiffies passed
since last update rather than just one jiffy.
Example: common HDD executes random read 4k requests around 12ms.
fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
iostat -x 10 sdb
Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
Before:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43
After:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99
Now io_ticks does not loose time between start and end of requests, but
for queue-depth > 1 some I/O time between adjacent starts might be lost.
For load estimation "%util" is not as useful as average queue length,
but it clearly shows how often disk queue is completely empty.
Fixes: 5b18b5a737 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From: "Banerjee, Debabrata" <dbanerje@akamai.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e8a8a185051a460e3eb0617dca33f996f4e31516 ]
Yang Yang reported the following crash caused by requeueing a flush
request in Kyber:
[ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
...
[ 2.517468] pc : clear_bit+0x18/0x2c
[ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
[ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
...
[ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
[ 2.517602] Call trace:
[ 2.517606] clear_bit+0x18/0x2c
[ 2.517619] kyber_finish_request+0x74/0x80
[ 2.517627] blk_mq_requeue_request+0x3c/0xc0
[ 2.517637] __scsi_queue_insert+0x11c/0x148
[ 2.517640] scsi_softirq_done+0x114/0x130
[ 2.517643] blk_done_softirq+0x7c/0xb0
[ 2.517651] __do_softirq+0x208/0x3bc
[ 2.517657] run_ksoftirqd+0x34/0x60
[ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
[ 2.517667] kthread+0x110/0x120
[ 2.517669] ret_from_fork+0x10/0x18
This happens because Kyber doesn't track flush requests, so
kyber_finish_request() reads a garbage domain token. Only call the
scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
the finish_request() hook in blk_mq_free_request()). Now that we're
handling it in blk-mq, also remove the check from BFQ.
Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2cd896a5e86fc326bda8614b96c0401dcc145868 ]
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway
not merging this page in this bio, then it make sense to make same_page
also as false before returning.
Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.
WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
Call Trace:
__remove_mapping+0x154/0x320 (unreliable)
iomap_releasepage+0x80/0x180
try_to_release_page+0x94/0xe0
invalidate_inode_page+0xc8/0x110
invalidate_mapping_pages+0x1dc/0x540
generic_fadvise+0x3c8/0x450
xfs_file_fadvise+0x2c/0xe0 [xfs]
vfs_fadvise+0x3c/0x60
ksys_fadvise64_64+0x68/0xe0
sys_fadvise64+0x28/0x40
system_call_exception+0xf8/0x1c0
system_call_common+0xf0/0x278
Fixes: cc90bc68422 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5aeac7c4b16069aae49005f0a8d4526baa83341b upstream.
ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de1b0ee490eafdf65fac9eef9925391a8369f2dc upstream.
If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.
Initialize the default value just like we do for ->ra_pages.
Cc: stable@vger.kernel.org
Fixes: 9491ae4aad ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream.
SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.
memory barriers have to be used for ordering the following two pair of OPs:
1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()
2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().
Without the added memory barrier, either:
1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side
or
2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.
IO hang in ltp/fs_fill test is reported by kernel test robot:
https://lkml.org/lkml/2020/7/26/77
Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e4b469c66f3cbb81c2e94d31123d7bcdf3c1dabd upstream.
A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.
Fixes: 9cc5169cd4 ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 27029b4b18aa5d3b060f0bf2c26dae254132cfce ]
Normally, blkcg_iolatency_exit() will free related memory in iolatency
when cleanup queue. But if blk_throtl_init() return error and queue init
fail, blkcg_iolatency_exit() will not do that for us. Then it cause
memory leak.
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit db03f88fae8a2c8007caafa70287798817df2875 ]
c616cbee97 ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.
Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.
High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.
Fixes: c616cbee97 ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2de791ab4918969d8108f15238a701968375f235 ]
Changes from v1:
- update commit description with proper ref-accounting justification
commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
introduce leak forbfq_group and blkcg_gq objects because of get/put
imbalance.
In fact whole idea of original commit is wrong because bfq_group entity
can not dissapear under us because it is referenced by child bfq_queue's
entities from here:
-> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity
-> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);
So parent entity can not disappear while child entity is in tree,
and child entities already has proper protection.
This patch revert commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
bfq_group leak trace caused by bad commit:
-> blkg_alloc
-> bfq_pq_alloc
-> bfqg_get (+1)
->bfq_activate_bfqq
->bfq_activate_requeue_entity
-> __bfq_activate_entity
->bfq_get_entity
->bfqg_and_blkg_get (+1) <==== : Note1
->bfq_del_bfqq_busy
->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
-> bfq_forget_entity(is_in_service = true)
entity->on_st_or_in_serv = false <=== :Note2
if (is_in_service)
return; ==> do not touch reference
-> blkcg_css_offline
-> blkcg_destroy_blkgs
-> blkg_destroy
-> bfq_pd_offline
-> __bfq_deactivate_entity
if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
-> bfq_pd_free
-> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq will leak forever, see test-case below.
##TESTCASE_BEGIN:
#!/bin/bash
max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler
grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir -p /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:
Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d81665198b83e55a28339d1f3e4890ed8a434556 ]
If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used. Use a slightly more
obvious test that the two pages are compatible with each other.
Fixes: 52d52d1c98 ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 943b40c832beb71115e38a1c4d99b640b5342738 ]
When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
return false for discard request, then normal request merge is applied.
However, only queue_max_segments() is checked, so max discard segment
limit isn't respected.
Check max discard segment limit in the request merge code for fixing
the issue.
Discard request failure of virtio_blk is fixed.
Fixes: 6984046608 ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4a2f704eb2d831a2d73d7f4cdd54f45c49c3c353 upstream.
Commit 429120f3df2d starts to take account of segment's start dma address
when computing max segment size, and data type of 'unsigned long'
is used to do that. However, the segment mask may be 0xffffffff, so
the figured out segment size may be overflowed in case of zero physical
address on 32bit arch.
Fix the issue by returning queue_max_segment_size() directly when that
happens.
Fixes: 429120f3df2d ("block: fix splitting segments on boundary masks")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 429120f3df2dba2bf3a4a19f4212a53ecefc7102 upstream.
We ran into a problem with a mpt3sas based controller, where we would
see random (and hard to reproduce) file corruption). The issue seemed
specific to this controller, but wasn't specific to the file system.
After a lot of debugging, we find out that it's caused by segments
spanning a 4G memory boundary. This shouldn't happen, as the default
setting for segment boundary masks is 4G.
Turns out there are two issues in get_max_segment_size():
1) The default segment boundary mask is bypassed
2) The segment start address isn't taken into account when checking
segment boundary limit
Fix these two issues by removing the bypass of the segment boundary
check even if the mask is set to the default value, and taking into
account the actual start address of the request when checking if a
segment needs splitting.
Cc: stable@vger.kernel.org # v5.1+
Reviewed-by: Chris Mason <clm@fb.com>
Tested-by: Chris Mason <clm@fb.com>
Fixes: dcebd75592 ("block: use bio_for_each_bvec() to compute multi-page bvec count")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dropped const on the page pointer, ppc page_to_phys() doesn't mark the
page as const...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ Upstream commit bfe373f608cf81b7626dfeb904001b0e867c5110 ]
Else there may be magic numbers in /sys/kernel/debug/block/*/state.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 05a4fed69ff00a8bd83538684cb602a4636b07a7 upstream.
dm-multipath is the only user of blk_mq_queue_inflight(). When
dm-multipath calls blk_mq_queue_inflight() to check if it has
outstanding IO it can get a false negative. The reason for this is
blk_mq_rq_inflight() doesn't consider requests that are no longer
MQ_RQ_IN_FLIGHT but that are now MQ_RQ_COMPLETE (->complete isn't
called or finished yet) as "inflight".
This causes request-based dm-multipath's dm_wait_for_completion() to
return before all outstanding dm-multipath requests have actually
completed. This breaks DM multipath's suspend functionality because
blk-mq requests complete after DM's suspend has finished -- which
shouldn't happen.
Fix this by considering any request not in the MQ_RQ_IDLE state
(so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
blk_mq_rq_inflight().
Fixes: 3c94d83cb3 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0b8eb629a700c0ef15a437758db8255f8444e76c ]
Release bip using kfree() in error path when that was allocated
by kmalloc().
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fe35ec58f0d339221643287bbb7cee15c93a5389 ]
There is an issue when tune the number for read and write queues,
if the total queue count was not changed. The hctx->type cannot
be updated, since __blk_mq_update_nr_hw_queues will return directly
if the total queue count has not been changed.
Reproduce:
dmesg | grep "default/read/poll"
[ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default
tune the write queues to 24:
echo 24 > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller
dmesg | grep "default/read/poll"
[ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues
cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
48 default
The driver's hardware queue mapping is not same as block layer.
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a75ca9303175d36af93c0937dd9b1a6422908b8d upstream.
commit e7bf90e5af ("block/bio-integrity: fix a memory leak bug") added
a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However,
the object will be freed in bio_integrity_free() since 'bio->bi_opf' and
'bio->bi_integrity' were set previousy in bio_integrity_alloc().
Fixes: commit e7bf90e5af ("block/bio-integrity: fix a memory leak bug")
Signed-off-by: yu kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ]
When the QoS targets are met and nothing is being throttled, there's
no way to tell how saturated the underlying device is - it could be
almost entirely idle, at the cusp of saturation or anywhere inbetween.
Given that there's no information, it's best to keep vrate as-is in
this state. Before 7cd806a9a9 ("iocost: improve nr_lagging
handling"), this was the case - if the device isn't missing QoS
targets and nothing is being throttled, busy_level was reset to zero.
While fixing nr_lagging handling, 7cd806a9a9 ("iocost: improve
nr_lagging handling") broke this. Now, while the device is hitting
QoS targets and nothing is being throttled, vrate keeps getting
adjusted according to the existing busy_level.
This led to vrate keeping climing till it hits max when there's an IO
issuer with limited request concurrency if the vrate started low.
vrate starts getting adjusted upwards until the issuer can issue IOs
w/o being throttled. From then on, QoS targets keeps getting met and
nothing on the system needs throttling and vrate keeps getting
increased due to the existing busy_level.
This patch makes the following changes to the busy_level logic.
* Reset busy_level if nr_shortages is zero to avoid the above
scenario.
* Make non-zero nr_lagging block lowering nr_level but still clear
positive busy_level if there's clear non-saturation signal - QoS
targets are met and nr_shortages is non-zero. nr_lagging's role is
preventing adjusting vrate upwards while there are long-running
commands and it shouldn't keep busy_level positive while there's
clear non-saturation signal.
* Restructure code for clarity and add comments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Newell <newella@fb.com>
Fixes: 7cd806a9a9 ("iocost: improve nr_lagging handling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aa880ad690ab6d4c53934af85fb5a43e69ecb0f5 ]
When we increase hardware queue count, blk_mq_update_queue_map will
reset the mapping between cpu and hardware queue base on the hardware
queue count(set->nr_hw_queues). The mapping cannot be reset if it
encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will
continue using it, then blk_mq_map_swqueue will touch a invalid memory,
because the mapping points to a wrong hctx.
blktest block/030:
null_blk: module loaded
Increasing nr_hw_queues to 8 fails, fallback to 1
==================================================================
BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
Read of size 8 at addr 0000000000000128 by task nproc/8541
CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0xa5/0xe6
__kasan_report.cold+0x65/0xbb
kasan_report+0x45/0x60
check_memory_region+0x15e/0x1c0
__kasan_check_read+0x15/0x20
blk_mq_map_swqueue+0x2f2/0x830
__blk_mq_update_nr_hw_queues+0x3df/0x690
blk_mq_update_nr_hw_queues+0x32/0x50
nullb_device_submit_queues_store+0xde/0x160 [null_blk]
configfs_write_file+0x1c4/0x250 [configfs]
__vfs_write+0x4c/0x90
vfs_write+0x14b/0x2d0
ksys_write+0xdd/0x180
__x64_sys_write+0x47/0x50
do_syscall_64+0x6f/0x310
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b0beb28097fa04177b3769f4bb7a0d0d9c4ae76e ]
This reverts commit c58c1f83436b501d45d4050fd1296d71a9760bcb.
io_uring does do the right thing for this case, and we're still returning
-EAGAIN to userspace for the cases we don't support. Revert this change
to avoid doing endless spins of resubmits.
Cc: stable@vger.kernel.org # v5.6
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.
However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.
1. The iocg is in the process of being deactivated by the periodic timer.
2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
without anything because it still sees that the iocg is already active.
3. The iocg is deactivated.
4. The bio from #2 is over budget but needs to be forced. It increases
abs_vdebt and goes over the threshold and enables use_delay.
5. IO control is enabled for the iocg's subtree and now IOs are attributed
to the descendant cgroups and the iocg itself no longer issues IOs.
This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.
The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.
This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vlad Dmitriev <vvd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Fixes: e1518f63f2 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5fe56de799ad03e92d794c7936bf363922b571df ]
If in blk_mq_dispatch_rq_list() we find no budget, then we break of the
dispatch loop, but the request may keep the driver tag, evaulated
in 'nxt' in the previous loop iteration.
Fix by putting the driver tag for that request.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.
Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
argument of the iocost_ioc_vrate_adj trace entry defined in
include/trace/events/iocost.h leading to the following error:
/tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
, u32[]* __tracepoint_arg_missed_ppm
That argument type is indeed rather complex and hard to read. Looking
at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
the argument to a simple "u32 *missed_ppm" and adjusting the trace
entry accordingly, the compilation error was gone.
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d38a87fbb77fb9ff2ff4e914162a8ae6453eff5 upstream.
In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
flush the rb tree that contains all idle entities belonging to the pd
(cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
invoked before bfq_reparent_active_queues(). Yet the latter may happen
to add some entities to the idle tree. It happens if, in some of the
calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
the queue to move is empty and gets expired.
This commit simply reverses the invocation order between
bfq_flush_idle_tree() and bfq_reparent_active_queues().
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 576682fa52cbd95deb3773449566274f206acc58 upstream.
bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
entity represents just a bfq_queue in an entity tree). Yet, the input
entity is guaranteed to always be a leaf entity only in two-level
entity trees. In this respect, because of the error fixed by
commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
to actually have only two levels. After the latter commit, this does not
hold any longer.
This commit fixes this problem by modifying
bfq_reparent_leaf_entity(), so that it searches an active leaf entity
down the path that stems from the input entity. Such a leaf entity is
guaranteed to exist when bfq_reparent_leaf_entity() is invoked.
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c8997736650060594845e42c5d01d3118aec8d25 upstream.
A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
goal of this put is to release a process reference to a bfq_queue. But
process-reference releases may trigger also some extra operation, and,
to this goal, are handled through bfq_release_process_ref(). So, turn
the invocation of bfq_put_queue() into an invocation of
bfq_release_process_ref().
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]
In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is
not in bfqd-lock critical section. The bfqq, which is not
equal to NULL in bfq_idle_slice_timer, may be freed after passing
to bfq_idle_slice_timer_body. So we will access the freed memory.
In addition, considering the bfqq may be in race, we should
firstly check whether bfqq is in service before doing something
on it in bfq_idle_slice_timer_body func. If the bfqq in race is
not in service, it means the bfqq has been expired through
__bfq_bfqq_expire func, and wait_request flags has been cleared in
__bfq_bfqd_reset_in_service func. So we do not need to re-clear the
wait_request of bfqq which is not in service.
KASAN log is given as follows:
[13058.354613] ==================================================================
[13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
[13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
[13058.354646]
[13058.354655] CPU: 96 PID: 19767 Comm: fork13
[13058.354661] Call trace:
[13058.354667] dump_backtrace+0x0/0x310
[13058.354672] show_stack+0x28/0x38
[13058.354681] dump_stack+0xd8/0x108
[13058.354687] print_address_description+0x68/0x2d0
[13058.354690] kasan_report+0x124/0x2e0
[13058.354697] __asan_load8+0x88/0xb0
[13058.354702] bfq_idle_slice_timer+0xac/0x290
[13058.354707] __hrtimer_run_queues+0x298/0x8b8
[13058.354710] hrtimer_interrupt+0x1b8/0x678
[13058.354716] arch_timer_handler_phys+0x4c/0x78
[13058.354722] handle_percpu_devid_irq+0xf0/0x558
[13058.354731] generic_handle_irq+0x50/0x70
[13058.354735] __handle_domain_irq+0x94/0x110
[13058.354739] gic_handle_irq+0x8c/0x1b0
[13058.354742] el1_irq+0xb8/0x140
[13058.354748] do_wp_page+0x260/0xe28
[13058.354752] __handle_mm_fault+0x8ec/0x9b0
[13058.354756] handle_mm_fault+0x280/0x460
[13058.354762] do_page_fault+0x3ec/0x890
[13058.354765] do_mem_abort+0xc0/0x1b0
[13058.354768] el0_da+0x24/0x28
[13058.354770]
[13058.354773] Allocated by task 19731:
[13058.354780] kasan_kmalloc+0xe0/0x190
[13058.354784] kasan_slab_alloc+0x14/0x20
[13058.354788] kmem_cache_alloc_node+0x130/0x440
[13058.354793] bfq_get_queue+0x138/0x858
[13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
[13058.354801] bfq_init_rq+0x1f4/0x1180
[13058.354806] bfq_insert_requests+0x264/0x1c98
[13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
[13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
[13058.354826] blk_flush_plug_list+0x230/0x548
[13058.354830] blk_finish_plug+0x60/0x80
[13058.354838] read_pages+0xec/0x2c0
[13058.354842] __do_page_cache_readahead+0x374/0x438
[13058.354846] ondemand_readahead+0x24c/0x6b0
[13058.354851] page_cache_sync_readahead+0x17c/0x2f8
[13058.354858] generic_file_buffered_read+0x588/0xc58
[13058.354862] generic_file_read_iter+0x1b4/0x278
[13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
[13058.354972] __vfs_read+0x238/0x320
[13058.354976] vfs_read+0xbc/0x1c0
[13058.354980] ksys_read+0xdc/0x1b8
[13058.354984] __arm64_sys_read+0x50/0x60
[13058.354990] el0_svc_common+0xb4/0x1d8
[13058.354994] el0_svc_handler+0x50/0xa8
[13058.354998] el0_svc+0x8/0xc
[13058.354999]
[13058.355001] Freed by task 19731:
[13058.355007] __kasan_slab_free+0x120/0x228
[13058.355010] kasan_slab_free+0x10/0x18
[13058.355014] kmem_cache_free+0x288/0x3f0
[13058.355018] bfq_put_queue+0x134/0x208
[13058.355022] bfq_exit_icq_bfqq+0x164/0x348
[13058.355026] bfq_exit_icq+0x28/0x40
[13058.355030] ioc_exit_icq+0xa0/0x150
[13058.355035] put_io_context_active+0x250/0x438
[13058.355038] exit_io_context+0xd0/0x138
[13058.355045] do_exit+0x734/0xc58
[13058.355050] do_group_exit+0x78/0x220
[13058.355054] __wake_up_parent+0x0/0x50
[13058.355058] el0_svc_common+0xb4/0x1d8
[13058.355062] el0_svc_handler+0x50/0xa8
[13058.355066] el0_svc+0x8/0xc
[13058.355067]
[13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464
[13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
[13058.355077] The buggy address belongs to the page:
[13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
[13058.366175] flags: 0x2ffffe0000008100(slab|head)
[13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
[13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
[13058.370789] page dumped because: kasan: bad access detected
[13058.370791]
[13058.370792] Memory state around the buggy address:
[13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
[13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370808] ^
[13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[13058.370817] ==================================================================
[13058.370820] Disabling lock debugging due to kernel taint
Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
--
V2->V3: rewrite the comment as suggested by Paolo Valente
V1->V2: add one comment, and add Fixes and Reported-by tag.
Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reported-by: Wang Wang <wangwang2@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]
There is a potential race between ioc_release_fn() and
ioc_clear_queue() as shown below, due to which below kernel
crash is observed. It also can result into use-after-free
issue.
context#1: context#2:
ioc_release_fn() __ioc_clear_queue() gets the same icq
->spin_lock(&ioc->lock); ->spin_lock(&ioc->lock);
->ioc_destroy_icq(icq);
->list_del_init(&icq->q_node);
->call_rcu(&icq->__rcu_head,
icq_free_icq_rcu);
->spin_unlock(&ioc->lock);
->ioc_destroy_icq(icq);
->hlist_del_init(&icq->ioc_node);
This results into below crash as this memory
is now used by icq->__rcu_head in context#1.
There is a chance that icq could be free'd
as well.
22150.386550: <6> Unable to handle kernel write to read-only memory
at virtual address ffffffaa8d31ca50
...
Call trace:
22150.607350: <2> ioc_destroy_icq+0x44/0x110
22150.611202: <2> ioc_clear_queue+0xac/0x148
22150.615056: <2> blk_cleanup_queue+0x11c/0x1a0
22150.619174: <2> __scsi_remove_device+0xdc/0x128
22150.623465: <2> scsi_forget_host+0x2c/0x78
22150.627315: <2> scsi_remove_host+0x7c/0x2a0
22150.631257: <2> usb_stor_disconnect+0x74/0xc8
22150.635371: <2> usb_unbind_interface+0xc8/0x278
22150.639665: <2> device_release_driver_internal+0x198/0x250
22150.644897: <2> device_release_driver+0x24/0x30
22150.649176: <2> bus_remove_device+0xec/0x140
22150.653204: <2> device_del+0x270/0x460
22150.656712: <2> usb_disable_device+0x120/0x390
22150.660918: <2> usb_disconnect+0xf4/0x2e0
22150.664684: <2> hub_event+0xd70/0x17e8
22150.668197: <2> process_one_work+0x210/0x480
22150.672222: <2> worker_thread+0x32c/0x4c8
Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
indicate this icq is once marked as destroyed. Also, ensure
__ioc_clear_queue() is accessing icq within rcu_read_lock/unlock so
that icq doesn't get free'd up while it is still using it.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>