  1. May 23, 2019
    • blk-mq: fix hang caused by freeze/unfreeze sequence · 7996a8b5
      Bob Liu authored
      The following is a description of a hang in blk_mq_freeze_queue_wait(),
      which happens on an attempt to freeze a queue while another task is
      unfreezing it.

      The root cause is that nothing enforces the ordering of
      percpu_ref_kill() and percpu_ref_resurrect(), so the two calls can be
      swapped:
      
       CPU#0                         CPU#1
       ----------------              -----------------
       q1 = blk_mq_init_queue(shared_tags)
      
                                      q2 = blk_mq_init_queue(shared_tags):
                                        blk_mq_add_queue_tag_set(shared_tags):
                                          blk_mq_update_tag_set_depth(shared_tags):
                                             list_for_each_entry()
                                            blk_mq_freeze_queue(q1)
                                             > percpu_ref_kill()
                                             > blk_mq_freeze_queue_wait()
      
       blk_cleanup_queue(q1)
        blk_mq_freeze_queue(q1)
         > percpu_ref_kill()
                       ^^^^^^ freeze_depth can't guarantee the order
      
                                            blk_mq_unfreeze_queue()
                                              > percpu_ref_resurrect()
      
         > blk_mq_freeze_queue_wait()
                       ^^^^^^ Hang here!!!!
      
      This wrong sequence raises a kernel warning:
      percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
      WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0
      
      But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
      which waits for q_usage_counter to reach zero; that never happens,
      because the percpu-ref was reinitialized (instead of being killed) and
      stays in PERCPU mode forever.
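
      A minimal sketch of the fix idea: serialize the kill/resurrect pair
      with a per-queue mutex, so the two calls can never be reordered
      between tasks. This is a sketch assuming a mq_freeze_lock mutex and a
      plain mq_freeze_depth counter; names are illustrative, not the exact
      upstream diff.
      ------
      #include <linux/blkdev.h>
      #include <linux/blk-mq.h>

      /*
       * Sketch only: with freeze/unfreeze serialized by q->mq_freeze_lock,
       * percpu_ref_kill() and percpu_ref_resurrect() cannot be swapped.
       */
      void blk_freeze_queue_start(struct request_queue *q)
      {
              mutex_lock(&q->mq_freeze_lock);
              if (++q->mq_freeze_depth == 1) {
                      percpu_ref_kill(&q->q_usage_counter);
                      mutex_unlock(&q->mq_freeze_lock);
                      if (queue_is_mq(q))
                              blk_mq_run_hw_queues(q, false);
              } else {
                      mutex_unlock(&q->mq_freeze_lock);
              }
      }

      void blk_mq_unfreeze_queue(struct request_queue *q)
      {
              mutex_lock(&q->mq_freeze_lock);
              if (--q->mq_freeze_depth == 0) {
                      percpu_ref_resurrect(&q->q_usage_counter);
                      wake_up_all(&q->mq_freeze_wq);
              }
              mutex_unlock(&q->mq_freeze_lock);
      }
      ------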
      
      How to reproduce:
       - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
       - cpu0: python Script.py 0; use taskset to pin the process to cpu0
       - cpu1: python Script.py 1; use taskset to pin the process to cpu1
      
       Script.py:
       ------
      #!/usr/bin/python3
      # Toggle the 'power' attribute of nullb device sys.argv[1] in a tight
      # loop, repeatedly creating and destroying its request queue.

      import os
      import sys
      
      while True:
          on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
          off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
          os.system(on)
          os.system(off)
      ------
      
      This bug was first reported and fixed by Roman; previous discussion:
      [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
      [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
      [3] https://patchwork.kernel.org/patch/9268199/
      
      
      
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7996a8b5
    • block: remove the bi_seg_{front,back}_size fields in struct bio · 6869875f
      Christoph Hellwig authored
      
      At this point these fields aren't used for anything, so we can remove
      them.
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6869875f
    • block: remove the segment size check in bio_will_gap · 200a9aff
      Christoph Hellwig authored
      
      We fundamentally do not have a maximum segment size for devices with
      a virt boundary. So don't bother checking it, especially given that
      the existing checks didn't properly work to begin with: we never
      fully update the front/back segment size, and miss the
      bi_seg_front_size update that would have been required for some
      cases.
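
      For context, the virt-boundary gap check that remains is essentially
      an address-mask test. A simplified sketch, modeled on the kernel's
      __bvec_gap_to_prev()-style helpers (not the exact upstream code):
      ------
      #include <linux/blkdev.h>

      /*
       * Sketch: two adjacent segments "gap" on a virt boundary if the new
       * one does not start exactly where the previous one ended, relative
       * to the device's virt boundary mask.
       */
      static inline bool bvec_gap_to_prev_sketch(struct request_queue *q,
                      struct bio_vec *bprv, unsigned int offset)
      {
              if (!queue_virt_boundary(q))
                      return false;
              return (offset & queue_virt_boundary(q)) ||
                      ((bprv->bv_offset + bprv->bv_len) &
                       queue_virt_boundary(q));
      }
      ------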
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      200a9aff
    • block: force an unlimited segment size on queues with a virt boundary · 09324d32
      Christoph Hellwig authored
      
      We currently fail to update the front/back segment size in the bio
      when deciding to allow an otherwise gappy segment on a device with a
      virt boundary. The reason this did not cause problems is that devices
      with a virt boundary fundamentally don't use segments as we know
      them, and thus don't care. Make that assumption formal by forcing an
      unlimited segment size in this case.
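
      A sketch of what forcing an unlimited segment size can look like in
      blk_queue_virt_boundary() (an illustration of the idea described
      above, not quoted from the patch):
      ------
      #include <linux/blkdev.h>

      /*
       * Sketch: when a virt boundary mask is set, per-segment size limits
       * are meaningless for the device, so lift the limit entirely.
       */
      void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask)
      {
              q->limits.virt_boundary_mask = mask;

              if (mask)
                      q->limits.max_segment_size = UINT_MAX;
      }
      ------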
      
      Fixes: f6970f83 ("block: don't check if adjacent bvecs in one bio can be mergeable")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      09324d32
    • block: don't decrement nr_phys_segments for physically contiguous segments · eded341c
      Christoph Hellwig authored
      
      Currently ll_merge_requests_fn, unlike all other merge functions,
      reduces nr_phys_segments by one if the last segment of the previous
      request and the first segment of the next one are contiguous. While
      this seems like a nice way to avoid building smaller requests than
      necessary, it causes a mismatch between the segments actually present
      in the request and those iterated over by the bvec iterators,
      including __rq_for_each_bio. This can, for example, wrongly trigger
      the single-segment optimization in the nvme-pci driver, and might
      lead to a mismatching nr_phys_segments count when recalculating the
      number of segments of a cloned request.

      We could possibly work around this by making the bvec iterators take
      the front and back segment sizes into account, but that would require
      moving them from the bio to the bio_iter and spreading this mess over
      all users of bvecs. Or we could simply remove this optimization under
      the assumption that most users already build good enough bvecs, and
      that the bio merge path never cared about this optimization either.
      The latter is what this patch does, as shown in the sketch below.
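
      For reference, the removed optimization looked roughly like the
      following (a simplified sketch of the deleted logic, not the exact
      hunk):
      ------
      /*
       * Sketch of the old logic in ll_merge_requests_fn(): if the last
       * bvec of req and the first bvec of next are physically contiguous,
       * count them as one segment. This is what made nr_phys_segments
       * disagree with what the bvec iterators actually walk.
       */
      total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
      if (blk_phys_contig_segment(q, req->biotail, next->bio))
              total_phys_segments--;
      ------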
      
      See dff824b2 ("nvme-pci: optimize mapping of small single segment requests").
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eded341c
  2. May 04, 2019
    • block: don't drain in-progress dispatch in blk_cleanup_queue() · 66215664
      Ming Lei authored
      
      Now that freeing the hw queue resource has been moved to the hctx's
      release handler, we no longer need to worry about the race between
      blk_cleanup_queue() and running the queue.

      So don't drain in-progress dispatch in blk_cleanup_queue().

      This is basically a revert of c2856ae2 ("blk-mq: quiesce queue before
      freeing queue").
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      66215664
    • blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release · 1b97871b
      Ming Lei authored
      
      The hctx is always released after the request queue is freed.

      While holding the queue's kobject refcount, it is safe for a driver
      to run the queue, so a run-queue work item might still be scheduled
      after blk_sync_queue() is done.

      So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release(),
      to avoid running a released queue.
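
      A sketch of where the cancel ends up (assuming hctx->run_work is a
      delayed work item, as in this era of blk-mq; teardown details
      elided):
      ------
      #include <linux/blk-mq.h>

      /*
       * Sketch: the hctx's kobject release runs only when nothing can
       * hold a reference to this hw queue any more, so it is the last
       * safe point to cancel a pending run_work.
       */
      static void blk_mq_hw_sysfs_release(struct kobject *kobj)
      {
              struct blk_mq_hw_ctx *hctx = container_of(kobj,
                              struct blk_mq_hw_ctx, kobj);

              cancel_delayed_work_sync(&hctx->run_work);

              /* ... free per-hctx resources, then the hctx itself ... */
      }
      ------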
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1b97871b
    • blk-mq: always free hctx after request queue is freed · 2f8f1336
      Ming Lei authored
      
      In the normal queue cleanup path, the hctx is released after the
      request queue is freed; see blk_mq_release().

      However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed
      because the number of hw queues is shrinking. This can easily cause a
      use-after-free: one implicit rule is that it is safe to call almost
      all block layer APIs while the request queue is alive, so an hctx may
      be retrieved by one such API and then freed under it by
      blk_mq_update_nr_hw_queues(), triggering the use-after-free.

      Fix this issue by always freeing the hctx after the request queue is
      released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
      introduce a per-queue list to hold them, and try to reuse these hctxs
      when the NUMA node matches.
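
      A minimal sketch of the idea, assuming an unused_hctx_list and
      unused_hctx_lock on the queue (illustrative names based on the
      description above):
      ------
      #include <linux/blk-mq.h>
      #include <linux/slab.h>

      /*
       * Sketch: instead of freeing an hctx while the queue is alive, park
       * it on a per-queue list; later allocations prefer a parked hctx on
       * the same NUMA node. Parked hctxs are freed for real only after
       * the request queue itself is released.
       */
      static struct blk_mq_hw_ctx *blk_mq_get_hctx_sketch(
                      struct request_queue *q, int node)
      {
              struct blk_mq_hw_ctx *hctx = NULL, *tmp;

              spin_lock(&q->unused_hctx_lock);
              list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
                      if (tmp->numa_node == node) {
                              hctx = tmp;
                              list_del_init(&hctx->hctx_list);
                              break;
                      }
              }
              spin_unlock(&q->unused_hctx_lock);

              if (!hctx)
                      hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL, node);
              return hctx;
      }
      ------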
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2f8f1336
    • blk-mq: split blk_mq_alloc_and_init_hctx into two parts · 7c6c5b7c
      Ming Lei authored
      
      Split blk_mq_alloc_and_init_hctx() into two parts: blk_mq_alloc_hctx()
      for allocating all hctx resources, and blk_mq_init_hctx() for
      initializing the hctx, which serves as the counterpart of
      blk_mq_exit_hctx().
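
      The resulting shape, as a sketch (signatures simplified from the
      description; blk_mq_free_hctx() is an illustrative teardown helper,
      not necessarily the real symbol):
      ------
      #include <linux/blk-mq.h>

      /*
       * Sketch: the combined helper becomes a thin wrapper over the two
       * parts, with blk_mq_init_hctx() pairing with blk_mq_exit_hctx().
       */
      static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
                      struct blk_mq_tag_set *set, struct request_queue *q,
                      int hctx_idx, int node)
      {
              struct blk_mq_hw_ctx *hctx;

              hctx = blk_mq_alloc_hctx(q, set, node);
              if (!hctx)
                      return NULL;

              if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
                      blk_mq_free_hctx(hctx);
                      return NULL;
              }
              return hctx;
      }
      ------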
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7c6c5b7c
    • blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b
      Ming Lei authored
      Once blk_cleanup_queue() returns, tags shouldn't be used any more,
      because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
      ("blk-mq: Fix a use-after-free") fixes exactly this issue.

      However, that commit introduces another issue. Before 45a9c9d9, we
      were allowed to run the queue during queue cleanup as long as the
      queue's kobj refcount was held. After that commit, the queue can't be
      run during cleanup; otherwise an oops can be triggered easily,
      because some fields of the hctx are freed by blk_mq_free_queue() in
      blk_cleanup_queue().

      We have invented ways of addressing this kind of issue before, such
      as:

      	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")

      But these still can't cover all cases; recently James reported
      another issue of this kind:

      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

      This issue can be quite hard to address with the previous approaches,
      given that scsi_run_queue() may run requeues for other LUNs.

      Fix the above issue by freeing the hctx's resources in its release
      handler. This is safe because tags aren't needed for freeing such
      hctx resources.

      This approach follows the typical design pattern for a kobject's
      release handler.
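
      The pattern, sketched (which resources move here is per the
      description above; the field names are the usual blk-mq ones, shown
      as an illustration):
      ------
      #include <linux/blk-mq.h>
      #include <linux/slab.h>

      /*
       * Sketch: resources tied to the hctx's lifetime are freed in its
       * kobject ->release() callback, which runs only once the last
       * reference is dropped, so no tags are needed and no user can still
       * be running the hw queue.
       */
      static void blk_mq_hw_sysfs_release(struct kobject *kobj)
      {
              struct blk_mq_hw_ctx *hctx = container_of(kobj,
                              struct blk_mq_hw_ctx, kobj);

              free_cpumask_var(hctx->cpumask);
              kfree(hctx->ctxs);
              kfree(hctx);
      }
      ------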
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reported-by: James Smart <james.smart@broadcom.com>
      Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
      Cc: stable@vger.kernel.org
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c7e2d94b
    • blk-mq: move cancel of requeue_work into blk_mq_release · fbc2a15e
      Ming Lei authored
      
      While holding the queue's kobject refcount, it is safe for a driver
      to schedule a requeue. However, blk_mq_kick_requeue_list() may be
      called after blk_sync_queue() is done, because of concurrent requeue
      activity, so the requeue work may not be completed when the queue is
      freed, and a kernel oops is triggered.

      So move the cancel of requeue_work into blk_mq_release() to avoid the
      race between requeue and freeing the queue.
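
      A sketch of the change (requeue_work is the queue's delayed work
      item for requeues; the rest of blk_mq_release() is elided):
      ------
      #include <linux/blk-mq.h>

      /*
       * Sketch: blk_mq_release() runs when the queue's refcount hits
       * zero, so cancelling requeue_work here closes the window in which
       * a late blk_mq_kick_requeue_list() could leave work running on a
       * freed queue.
       */
      void blk_mq_release(struct request_queue *q)
      {
              cancel_delayed_work_sync(&q->requeue_work);

              /* ... release hctxs and the rest of the mq resources ... */
      }
      ------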
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fbc2a15e
    • blk-mq: grab .q_usage_counter when queuing request from plug code path · e87eb301
      Ming Lei authored
      
      Just like aio/io_uring, we need to grab two refcounts for queuing
      one request: one for submission, another for completion.

      If the request isn't queued from the plug code path, the refcount
      grabbed in generic_make_request() serves for submission. In theory,
      this refcount should have been released after the submission (async
      run queue) is done. blk_freeze_queue() works together with
      blk_sync_queue() to avoid the race between queue cleanup and IO
      submission: async run-queue activity is canceled because
      hctx->run_work is scheduled with the refcount held, so it is fine not
      to hold the refcount while running the run-queue work function to
      dispatch IO.

      However, if a request is staged in the plug list and finally queued
      from the plug code path, the submission-side refcount is actually
      missed, and we may start to run the queue after the queue has been
      removed, because the queue's kobject refcount isn't guaranteed to be
      grabbed in the plug-flushing context. A kernel oops is then
      triggered; see the following race:

      blk_mq_flush_plug_list():
              blk_mq_sched_insert_requests()
                      insert requests to sw queue or scheduler queue
                      blk_mq_run_hw_queue

      Because of a concurrent run queue, all requests inserted above may be
      completed before the call to blk_mq_run_hw_queue, and the queue can
      then be freed during that blk_mq_run_hw_queue().

      Fix the issue by grabbing .q_usage_counter before calling
      blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
      safe because the queue is certainly alive when the requests are
      inserted.
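
      A simplified sketch of the fix (a plain percpu_ref_get() suffices
      here, rather than a tryget, because the queue is known to be alive
      at this point, per the reasoning above):
      ------
      #include <linux/blk-mq.h>

      /*
       * Sketch: pin q_usage_counter across insert + run, so the queue
       * cannot be cleaned up while the plug-flush path is still running
       * it.
       */
      void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
                      struct blk_mq_ctx *ctx, struct list_head *list,
                      bool run_queue_async)
      {
              struct request_queue *q = hctx->queue;

              percpu_ref_get(&q->q_usage_counter);

              /* ... insert the requests into the sw/scheduler queue ... */
              blk_mq_run_hw_queue(hctx, run_queue_async);

              percpu_ref_put(&q->q_usage_counter);
      }
      ------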
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e87eb301