Skip to content
Snippets Groups Projects
  1. Jun 29, 2018
    • Jens Axboe's avatar
      blk-mq: don't queue more if we get a busy return · 1f57f8d4
      Jens Axboe authored
      
      Some devices have different queue limits depending on the type of IO. A
      classic case is SATA NCQ, where some commands can queue, but others
      cannot. If we have NCQ commands inflight and encounter a non-queueable
      command, the driver returns busy. Currently we attempt to dispatch more
      from the scheduler, if we were able to queue some commands. But for the
      case where we ended up stopping due to BUSY, we should not attempt to
      retrieve more from the scheduler. If we do, we can get into a situation
      where we attempt to queue a non-queueable command, get BUSY, then
      successfully retrieve more commands from that scheduler and queue those.
      This can repeat forever, starving the non-queuable command indefinitely.
      
      Fix this by NOT attempting to pull more commands from the scheduler, if
      we get a BUSY return. This should also be more optimal in terms of
      letting requests stay in the scheduler for as long as possible, if we
      get a BUSY due to the regular out-of-tags condition.
      
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1f57f8d4
  2. Jun 28, 2018
  3. Jun 23, 2018
  4. Jun 20, 2018
  5. Jun 19, 2018
    • Bart Van Assche's avatar
      Revert "block: Add warning for bi_next not NULL in bio_endio()" · 9c24c10a
      Bart Van Assche authored
      
      Commit 0ba99ca4 ("block: Add warning for bi_next not NULL in
      bio_endio()") breaks the dm driver. end_clone_bio() detects whether
      or not a bio is the last bio associated with a request by checking
      the .bi_next field. Commit 0ba99ca4 clears that field before
      end_clone_bio() has had a chance to inspect that field. Hence revert
      commit 0ba99ca4.
      
      This patch avoids that KASAN reports the following complaint when
      running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):
      
      ==================================================================
      BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
      Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9
      
      CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       dump_stack+0xa4/0xf5
       print_address_description+0x6f/0x270
       kasan_report+0x241/0x360
       __asan_load4+0x78/0x80
       bio_advance+0x11b/0x1d0
       blk_update_request+0xa7/0x5b0
       scsi_end_request+0x56/0x320 [scsi_mod]
       scsi_io_completion+0x7d6/0xb20 [scsi_mod]
       scsi_finish_command+0x1c0/0x280 [scsi_mod]
       scsi_softirq_done+0x19a/0x230 [scsi_mod]
       blk_mq_complete_request+0x160/0x240
       scsi_mq_done+0x50/0x1a0 [scsi_mod]
       srp_recv_done+0x515/0x1330 [ib_srp]
       __ib_process_cq+0xa0/0xf0 [ib_core]
       ib_poll_handler+0x38/0xa0 [ib_core]
       irq_poll_softirq+0xe8/0x1f0
       __do_softirq+0x128/0x60d
       run_ksoftirqd+0x3f/0x60
       smpboot_thread_fn+0x352/0x460
       kthread+0x1c1/0x1e0
       ret_from_fork+0x24/0x30
      
      Allocated by task 1918:
       save_stack+0x43/0xd0
       kasan_kmalloc+0xad/0xe0
       kasan_slab_alloc+0x11/0x20
       kmem_cache_alloc+0xfe/0x350
       mempool_alloc_slab+0x15/0x20
       mempool_alloc+0xfb/0x270
       bio_alloc_bioset+0x244/0x350
       submit_bh_wbc+0x9c/0x2f0
       __block_write_full_page+0x299/0x5a0
       block_write_full_page+0x16b/0x180
       blkdev_writepage+0x18/0x20
       __writepage+0x42/0x80
       write_cache_pages+0x376/0x8a0
       generic_writepages+0xbe/0x110
       blkdev_writepages+0xe/0x10
       do_writepages+0x9b/0x180
       __filemap_fdatawrite_range+0x178/0x1c0
       file_write_and_wait_range+0x59/0xc0
       blkdev_fsync+0x46/0x80
       vfs_fsync_range+0x66/0x100
       do_fsync+0x3d/0x70
       __x64_sys_fsync+0x21/0x30
       do_syscall_64+0x77/0x230
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 9:
       save_stack+0x43/0xd0
       __kasan_slab_free+0x137/0x190
       kasan_slab_free+0xe/0x10
       kmem_cache_free+0xd3/0x380
       mempool_free_slab+0x17/0x20
       mempool_free+0x63/0x160
       bio_free+0x81/0xa0
       bio_put+0x59/0x60
       end_bio_bh_io_sync+0x5d/0x70
       bio_endio+0x1a7/0x360
       blk_update_request+0xd0/0x5b0
       end_clone_bio+0xa3/0xd0 [dm_mod]
       bio_endio+0x1a7/0x360
       blk_update_request+0xd0/0x5b0
       scsi_end_request+0x56/0x320 [scsi_mod]
       scsi_io_completion+0x7d6/0xb20 [scsi_mod]
       scsi_finish_command+0x1c0/0x280 [scsi_mod]
       scsi_softirq_done+0x19a/0x230 [scsi_mod]
       blk_mq_complete_request+0x160/0x240
       scsi_mq_done+0x50/0x1a0 [scsi_mod]
       srp_recv_done+0x515/0x1330 [ib_srp]
       __ib_process_cq+0xa0/0xf0 [ib_core]
       ib_poll_handler+0x38/0xa0 [ib_core]
       irq_poll_softirq+0xe8/0x1f0
       __do_softirq+0x128/0x60d
      
      The buggy address belongs to the object at ffff8801300e0640
       which belongs to the cache bio-0 of size 200
      The buggy address is located 144 bytes inside of
       200-byte region [ffff8801300e0640, ffff8801300e0708)
      The buggy address belongs to the page:
      page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
      flags: 0x8000000000008100(slab|head)
      raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
      raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
       ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      >ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
       ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Fixes: 0ba99ca4 ("block: Add warning for bi_next not NULL in bio_endio()")
      Acked-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9c24c10a
    • Christoph Hellwig's avatar
      block: fix timeout changes for legacy request drivers · 0cc61e64
      Christoph Hellwig authored
      
      blk_mq_complete_request can only be called for blk-mq drivers, but when
      removing the BLK_EH_HANDLED return value, two legacy request timeout
      methods incorrectly got switched to call blk_mq_complete_request.
      Call __blk_complete_request instead to reinstance the previous behavior.
      For that __blk_complete_request needs to be exported.
      
      Fixes: 1fc2b62e ("scsi_transport_fc: complete requests from ->timeout")
      Fixes: 0df0bb08 ("null_blk: complete requests from ->timeout")
      Reported-by: default avatarJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0cc61e64
  6. Jun 15, 2018
  7. Jun 14, 2018
  8. Jun 11, 2018
    • Roman Pen's avatar
      blk-mq: reinit q->tag_set_list entry only after grace period · a347c7ad
      Roman Pen authored
      
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
      
      Fix is simple: reinit list entry after an RCU grace period elapsed.
      
      Fixes: Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a347c7ad
  9. Jun 09, 2018
  10. Jun 08, 2018
  11. Jun 07, 2018
  12. Jun 06, 2018
  13. Jun 05, 2018
  14. Jun 04, 2018
  15. Jun 03, 2018
  16. Jun 02, 2018
  17. Jun 01, 2018
  18. May 31, 2018
    • Davide Sapienza's avatar
      block, bfq: prevent soft_rt_next_start from being stuck at infinity · f6c3ca0e
      Davide Sapienza authored
      
      BFQ can deem a bfq_queue as soft real-time only if the queue
      - periodically becomes completely idle, i.e., empty and with
        no still-outstanding I/O request;
      - after becoming idle, gets new I/O only after a special reference
        time soft_rt_next_start.
      
      In this respect, after commit "block, bfq: consider also past I/O in
      soft real-time detection", the value of soft_rt_next_start can never
      decrease. This causes a problem with the following special updating
      case for soft_rt_next_start: to prevent queues that are not completely
      idle to be wrongly detected as soft real-time (when they become
      non-empty again), soft_rt_next_start is temporarily set to infinity
      for empty queues with still outstanding I/O requests. But, if such an
      update is actually performed, then, because of the above commit,
      soft_rt_next_start will be stuck at infinity forever, and the queue
      will have no more chance to be considered soft real-time.
      
      On slow systems, this problem does cause actual soft real-time
      applications to be occasionally not detected as such.
      
      This commit addresses this issue by eliminating the pushing of
      soft_rt_next_start to infinity, and by changing the way non-empty
      queues are prevented from being wrongly detected as soft
      real-time. Simply, a queue that becomes non-empty again can now be
      detected as soft real-time only if it has no outstanding I/O request.
      
      Signed-off-by: default avatarDavide Sapienza <sapienza.dav@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f6c3ca0e
    • Davide Sapienza's avatar
      block, bfq: increase weight-raising duration for interactive apps · d450542e
      Davide Sapienza authored
      
      The maximum possible duration of the weight-raising period for
      interactive applications is limited to 13 seconds, as this is the time
      needed to load the largest application that we considered when tuning
      weight raising. Unfortunately, in such an evaluation, we did not
      consider the case of very slow virtual machines.
      
      For example, on a QEMU/KVM virtual machine
      - running in a slow PC;
      - with a virtual disk stacked on a slow low-end 5400rpm HDD;
      - serving a heavy I/O workload, such as the sequential reading of
      several files;
      mplayer takes 23 seconds to start, if constantly weight-raised.
      
      To address this issue, this commit conservatively sets the upper limit
      for weight-raising duration to 25 seconds.
      
      Signed-off-by: default avatarDavide Sapienza <sapienza.dav@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d450542e
    • Paolo Valente's avatar
      block, bfq: remove slow-system class · e24f1c24
      Paolo Valente authored
      
      BFQ computes the duration of weight raising for interactive
      applications automatically, using some reference parameters. In
      particular, BFQ uses the best durations (see comments in the code for
      how these durations have been assessed) for two classes of systems:
      slow and fast ones. Examples of slow systems are old phones or systems
      using micro HDDs. Fast systems are all the remaining ones. Using these
      parameters, BFQ computes the actual duration of the weight raising,
      for the system at hand, as a function of the relative speed of the
      system w.r.t. the speed of a reference system, belonging to the same
      class of systems as the system at hand.
      
      This slow vs fast differentiation proved to be useful in the past, but
      happens to have little meaning with current hardware. Even worse, it
      does cause problems in virtual systems, where the speed of the system
      can vary frequently, and so widely to just confuse the class-detection
      mechanism, and, as we have verified experimentally, to cause BFQ to
      compute non-sensical weight-raising durations.
      
      This commit addresses this issue by removing the slow class and the
      class-detection mechanism.
      
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e24f1c24
    • Paolo Valente's avatar
      block, bfq: add description of weight-raising heuristics · 4029eef1
      Paolo Valente authored
      
      A description of how weight raising works is missing in BFQ
      sources. In addition, the code for handling weight raising is
      scattered across a few functions. This makes it rather hard to
      understand the mechanism and its rationale. This commits adds such a
      description at the beginning of the main source file.
      
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4029eef1
    • Adam Manzanares's avatar
      block: add ioprio_check_cap function · aa434577
      Adam Manzanares authored
      
      Aio per command iopriority support introduces a second interface between
      userland and the kernel capable of passing iopriority. The aio interface also
      needs the ability to verify that the submitting context has sufficient
      privileges to submit IOPRIO_RT commands. This patch creates the
      ioprio_check_cap function to be used by the ioprio_set system call and also by
      the aio interface.
      
      Signed-off-by: default avatarAdam Manzanares <adam.manzanares@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      aa434577
    • Filippo Muzzini's avatar
      block, bfq: remove the removal of 'next' rq in bfq_requests_merged · ac857e0d
      Filippo Muzzini authored
      
      Since bfq_finish_request() is always called on the request 'next',
      after bfq_requests_merged() is finished, and bfq_finish_request()
      removes 'next' from its bfq_queue if needed, it isn't necessary to do
      such a removal in advance in bfq_merged_requests().
      
      This commit removes such a useless 'next' removal.
      
      Signed-off-by: default avatarFilippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ac857e0d
    • Paolo Valente's avatar
      block, bfq: remove wrong check in bfq_requests_merged · 8abfa4d6
      Paolo Valente authored
      
      The request rq passed to the function bfq_requests_merged is always in
      a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
      beginning of bfq_requests_merged always succeeds, and the control
      flow systematically skips to the end of the function.  This implies
      that the body of the function is never executed, i.e., the
      repositioning of rq is never performed.
      
      On the opposite end, a control is missing in the body of the function:
      'next' must be removed only if it is inside a bfq_queue.
      
      This commit removes the wrong check on rq, and adds the missing check
      on 'next'. In addition, this commit adds comments on
      bfq_requests_merged.
      
      Signed-off-by: default avatarFilippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8abfa4d6
    • Filippo Muzzini's avatar
      block, bfq: remove wrong lock in bfq_requests_merged · a12bffeb
      Filippo Muzzini authored
      
      In bfq_requests_merged(), there is a deadlock because the lock on
      bfqq->bfqd->lock is held by the calling function, but the code of
      this function tries to grab the lock again.
      
      This deadlock is currently hidden by another bug (fixed by next commit
      for this source file), which causes the body of bfq_requests_merged()
      to be never executed.
      
      This commit removes the deadlock by removing the lock/unlock pair.
      
      Signed-off-by: default avatarFilippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a12bffeb
    • Jens Axboe's avatar
      block: fixup bioset_integrity_create() call · 04c4950d
      Jens Axboe authored
      
      Missed converting the bioset_integrity_create() bounce bio set
      call.
      
      Fixes: 338aa96d ("block: convert bounce, q->bio_split to bioset_init()/mempool_init()")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      04c4950d
  19. May 30, 2018
Loading