- Oct 13, 2014
-
Jens Axboe authored
All other allocations are done on the specific node; somehow the cpumask for hw queue runs was missed. Fix that by using zalloc_cpumask_var_node() in blk_mq_init_queue().
Signed-off-by: Jens Axboe <axboe@fb.com>
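A minimal sketch of the fix (error label and surrounding context are illustrative, not the exact kernel code): allocate the per-hctx cpumask on the hardware queue's home NUMA node rather than whichever node happens to run the init path.

  /* node is the hw queue's home NUMA node */
  if (!zalloc_cpumask_var_node(&hctx->cpumask, GFP_KERNEL, node))
          goto err_hctxs;        /* label name illustrative */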
-
Gu Zheng authored
bip_slab is created with SLAB_PANIC, so the failure handler is unneeded.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
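For context, a hedged sketch of why the check is redundant: a cache created with SLAB_PANIC panics the kernel on allocation failure rather than returning NULL, so the return value never needs checking.

  /* Panics on failure rather than returning NULL: */
  bip_slab = kmem_cache_create("bio_integrity_payload",
                               sizeof(struct bio_integrity_payload),
                               0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);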
-
Robert Elliott authored
In __get_request() calls to printk_ratelimited(), include the function name so the "callbacks suppressed" message matches the messages that are printed, and add "dev" before the device name so it matches other block layer messages.
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
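An illustrative call shape (the message text and the disk_name variable are hypothetical; the __func__ and "dev" prefixing is the point):

  /* Prefix with the function name so it matches the ratelimit
   * header ("__get_request: N callbacks suppressed"), and say
   * "dev %s" like other block layer messages. */
  printk_ratelimited(KERN_WARNING "%s: dev %s: request allocation failed\n",
                     __func__, disk_name);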
-
Robert Elliott authored
In blk_update_request(), change the printk_ratelimited() prefix from end_request to blk_update_request so it matches the name printed if rate limiting occurs.

Old:
  [10234.933106] blk_update_request: 174 callbacks suppressed
  [10234.934940] end_request: critical target error, dev sdr, sector 16
  [10234.949788] end_request: critical target error, dev sdr, sector 16

New:
  [16863.445173] blk_update_request: 398 callbacks suppressed
  [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176
  [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888
  [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456

Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- Oct 09, 2014
-
Ming Lei authored
It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt if the bio is cloned.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
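A hypothetical helper illustrating the underlying rule (the helper itself is not from the patch): a cloned bio shares its parent's bvec table, so bi_vcnt says nothing about the range its bi_iter actually covers.

  static inline bool bio_vcnt_usable(struct bio *bio)
  {
          /* bi_vcnt counts the shared table's entries, not the
           * clone's own segments, so it can't be trusted here. */
          return !bio_flagged(bio, BIO_CLONED);
  }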
-
Mike Snitzer authored
The math in both blk_stack_limits() and queue_limit_alignment_offset() assumes that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math so that it also works for a non-power-of-2 io_min. This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit fdfb4c8c ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
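A hedged sketch of the corrected calculation: a true modulo via sector_div() replaces masking with (io_min - 1), which is only valid for powers of two. The shape follows the description above; the merged code may differ in detail.

  static inline int queue_limit_alignment_offset(struct queue_limits *lim,
                                                 sector_t sector)
  {
          unsigned int granularity = max(lim->physical_block_size, lim->io_min);
          /* remainder in 512-byte sectors, converted back to bytes */
          unsigned int alignment = sector_div(sector, granularity >> 9) << 9;

          return (granularity + lim->alignment_offset - alignment) % granularity;
  }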
-
- Oct 07, 2014
-
Bart Van Assche authored
Eliminate a backwards goto statement from bt_clear_tag().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
We currently divide the queue depth by 4 as our batch wakeup count, but we split the wakeups over BT_WAIT_QUEUES number of wait queues. This defaults to 8. If the product of the resulting batch wake count and BT_WAIT_QUEUES is higher than the device queue depth, we can get into a situation where a task goes to sleep waiting for a request, but never gets woken up.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 4bb659b1
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
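To make the constraint concrete: with a depth of 32, the old batch was 32/4 = 8, and 8 * 8 wait queues = 64 > 32, so a sleeper's wait queue might never accumulate enough freed tags to fire. A hypothetical sketch of the invariant (the function and its clamping policy are illustrative, not the exact fix):

  static unsigned int bt_wake_batch(unsigned int depth)
  {
          unsigned int batch = depth / 4;

          /* Keep batch * BT_WAIT_QUEUES <= depth so every wait queue
           * can see enough freed tags to trigger a wakeup. */
          if (batch * BT_WAIT_QUEUES > depth)
                  batch = max(1U, depth / BT_WAIT_QUEUES);

          return batch;
  }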
-
- Oct 03, 2014
-
Junichi Nomura authored
Users of bio_clone_fast() do not want bios with their own bvecs. Allocating a bvec mempool as part of the bioset intended for such users is a waste of memory. bioset_create_nobvec() creates a bioset that doesn't have the bvec mempool.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
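A hedged usage sketch (pool size and error handling chosen for illustration):

  /* A bioset for bio_clone_fast() users: clones borrow the parent's
   * bvec table, so no bvec mempool is created behind this bioset. */
  struct bio_set *bs = bioset_create_nobvec(BIO_POOL_SIZE, 0);
  if (!bs)
          return -ENOMEM;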
-
Junichi Nomura authored
Request cloning clones bios in the request to track the completion of each bio. For that purpose, we can use bio_clone_fast() instead of bio_clone() to avoid unnecessary allocation and copying of bvecs. This patch reduces the memory footprint of request-based device-mapper (about 1-4KB for each request) and is a preparation for further reduction of memory usage by removing the unused bvec mempool.
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
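For reference, the cheap clone path looks roughly like this (surrounding context illustrative):

  struct bio *clone;

  /* Shares the parent's bvec table instead of copying it, which is
   * where the 1-4KB per request mentioned above is saved. */
  clone = bio_clone_fast(bio, GFP_ATOMIC, bs);
  if (!clone)
          return -ENOMEM;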
-
- Oct 01, 2014
-
Hannes Reinecke authored
The rq_complete tracepoint was never issued for empty requests, causing the resulting blktrace information to never show any completion for those requests.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- Sep 27, 2014
-
Rasmus Villemoes authored
The kernel used to contain two functions for length-delimited, case-insensitive string comparison, strnicmp with correct semantics and a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper for the new strncasecmp to avoid breaking existing users. To allow the compat wrapper strnicmp to be removed at some point in the future, and to avoid the extra indirection cost, do s/strnicmp/strncasecmp/g.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
The T10 Protection Information format is also used by some devices that do not go through the SCSI layer (virtual block devices, NVMe). Relocate the relevant functions to a block layer library that can be used without involving SCSI.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
We'd occasionally merge requests with conflicting integrity flags. Introduce a merge helper which checks that the requests have compatible integrity payloads.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
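A hedged sketch of such a compatibility check (the helper name is hypothetical; the real helper lives in the blk-integrity code):

  static bool integrity_merge_ok(struct request *req, struct bio *bio)
  {
          struct bio_integrity_payload *a = bio_integrity(req->bio);
          struct bio_integrity_payload *b = bio_integrity(bio);

          if (!a && !b)                /* neither carries PI: fine */
                  return true;
          if (!a || !b)                /* only one carries PI: refuse */
                  return false;
          return a->bip_flags == b->bip_flags;    /* flags must agree */
  }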
-
Martin K. Petersen authored
Make the choice of checksum a per-I/O property by introducing a flag that can be inspected by the SCSI layer. There are several reasons for this:

1. It allows us to switch choice of checksum without unloading and reloading the HBA driver.

2. During error recovery we need to be able to tell the HBA that checksums read from disk should not be verified and converted to IP checksums.

3. For error injection purposes we need to be able to write a bad guard tag to storage. Since the storage device only supports T10 CRC we need to be able to disable IP checksum conversion on the HBA.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
Move flags affecting the integrity code out of the bio bi_flags and into the block integrity payload.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
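A small sketch tying this to the previous patch (BIP_IP_CHECKSUM is one of the flags from this series; the surrounding code is illustrative):

  struct bio_integrity_payload *bip = bio_integrity(bio);

  /* Integrity flags now live in the payload, not in bio->bi_flags;
   * e.g. ask the HBA to treat the guard tags as IP checksums: */
  bip->bip_flags |= BIP_IP_CHECKSUM;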
-
Martin K. Petersen authored
So far we have relied on the app tag size to determine whether a disk has been formatted with T10 protection information or not. However, not all target devices provide application tag storage. Add a flag to the block integrity profile that indicates whether the disk has been formatted with protection information.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Jens Axboe <axboe@fb.com>
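A hypothetical helper showing how such a flag might be consulted (the flag name is as merged in this series; the helper itself is illustrative):

  static bool disk_formatted_with_pi(struct blk_integrity *bi)
  {
          /* Set by the LLD when the medium was formatted with T10 PI,
           * instead of inferring it from the app tag size. */
          return bi && (bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE);
  }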
-
Martin K. Petersen authored
Add a BLK_ prefix to the integrity profile flags. Also rename the flags to be more consistent with the generate/verify terminology in the rest of the integrity code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
Instead of the "operate" parameter we pass in a seed value and a pointer to a function that can be used to process the integrity metadata. The generation function is changed to have a return value to fit into this scheme.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
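The resulting interface, roughly (field layout reconstructed from the description; treat it as a sketch):

  struct blk_integrity_iter {
          void            *prot_buf;
          void            *data_buf;
          sector_t        seed;           /* initial reference tag */
          unsigned int    data_size;
          unsigned short  interval;
          const char      *disk_name;
  };

  /* One callback type serves both generate and verify; generate can
   * now report errors through the return value as well. */
  typedef int (integrity_processing_fn) (struct blk_integrity_iter *);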
-
Martin K. Petersen authored
Now that the protection interval has been detached from the sector size we need to be able to handle sizes that are different from 4K and 512. Make the interval calculation generic.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
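A sketch of the generic conversion from 512-byte sectors to protection intervals (this matches the shape of the merged helper, modulo details):

  static inline unsigned int bio_integrity_intervals(struct blk_integrity *bi,
                                                     unsigned int sectors)
  {
          /* bi->interval need not be 512 or 4096; any power-of-2 size
           * >= 512 works with this shift-based conversion. */
          return sectors >> (ilog2(bi->interval) - 9);
  }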
-
Martin K. Petersen authored
The protection interval is not necessarily tied to the logical block size of a block device. Stop using the terms "sector" and "sectors". Going forward we will use the term "seed" to describe the initial reference tag value for a given I/O. "Interval" will be used to describe the portion of the data buffer that a given piece of protection information is associated with.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
bip_buf is not really needed so we can remove it.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
None of the filesystems appear interested in using the integrity tagging feature, potentially because very few storage devices actually permit using the application tag space. Remove the tagging functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
For commands like REQ_COPY we need a way to pass extra information along with each bio. Like integrity metadata, this information must be available at the bottom of the stack, so bi_private does not suffice. Rename the existing bi_integrity field to bi_special and make it a union so we can have different bio extensions for each class of command. We previously used bi_integrity != NULL as a way to identify whether a bio had integrity metadata or not. Introduce a REQ_INTEGRITY flag to be the indicator now that bi_special can contain different things. In addition, bio_integrity(bio) will now return a pointer to the integrity payload (when applicable).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
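A rough sketch of the reshaped fields (member names are illustrative; the merged code may differ):

  struct bio {
          /* ... */
          union {
  #if defined(CONFIG_BLK_DEV_INTEGRITY)
                  struct bio_integrity_payload *integrity;
  #endif
                  /* room for other per-command extensions, e.g. REQ_COPY */
          } bi_special;
          /* ... */
  };

  /* PI presence is now signalled by a request flag, not a pointer: */
  #define bio_integrity(bio) \
          ((bio)->bi_rw & REQ_INTEGRITY ? (bio)->bi_special.integrity : NULL)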
-
Martin K. Petersen authored
bdev_integrity_enabled() is only used by bio_integrity_enabled(). Combine these two functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- Sep 25, 2014
-
Ming Lei authored
This patch supports running a single flush machinery for each blk-mq dispatch queue, so that:

- the current init_request and exit_request callbacks can cover the flush request too, so the buggy copying way of initializing the flush request's pdu can be fixed

- flushing performance gets improved in the multi hw-queue case

In an fio sync write test over virtio-blk (4 hw queues, ioengine=sync, iodepth=64, numjobs=4, bs=4K), throughput increases significantly in my test environment:

- throughput: +70% in case of virtio-blk over null_blk
- throughput: +30% in case of virtio-blk over SSD image

The multi virtqueue feature isn't merged into QEMU yet; patches for the feature can be found in the tree below:

git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4

Simply passing 'num_queues=4 vectors=5' should be enough to enable the multi-queue (quad queue) feature for QEMU virtio-blk.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
This patch adds a 'blk_mq_ctx' parameter to blk_get_flush_queue(), so that this function can find the blk_flush_queue bound to the current mq context, since the flush queue will become per hw-queue. For a legacy queue the parameter can simply be NULL. For the multiqueue case the parameter should be the context from which the related request originated; with this context info, the hw queue and related flush queue can be found easily.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
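A sketch of the described lookup (the mapping call reflects the mq_ops interface of that era; member names may differ from the merged code):

  static struct blk_flush_queue *
  blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
  {
          if (!q->mq_ops)
                  return q->fq;        /* legacy path: ctx == NULL */

          /* map the originating sw context to its hw queue */
          return q->mq_ops->map_queue(q, ctx->cpu)->fq;
  }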
-
Ming Lei authored
Figure out the flush queue once, at the entry points of the flush machinery and the request completion handler, and then pass it through.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
Now that the mission of the two helpers is over, just call blk_alloc_flush_queue() and blk_free_flush_queue() directly.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
This patch introduces 'struct blk_flush_queue' and puts all flush machinery related fields into this structure, so that:

- flush implementation details aren't exposed to drivers
- it is easy to convert to per dispatch-queue flush machinery

This patch is basically a mechanical replacement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
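The container looks roughly like this (field list reconstructed from the description; treat it as a sketch):

  struct blk_flush_queue {
          unsigned int            flush_queue_delayed:1;
          unsigned int            flush_pending_idx:1;
          unsigned int            flush_running_idx:1;
          unsigned long           flush_pending_since;
          struct list_head        flush_queue[2];
          struct list_head        flush_data_in_flight;
          struct request          *flush_rq;
          spinlock_t              mq_flush_lock;
  };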
-
Ming Lei authored
This patch tries to use a local variable to access the flush request, so that we can convert to the per-queue flush machinery a bit more easily.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
These fields are always used with the flush request, so initialize them together.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
These two temporary functions are introduced to hold flush initialization and de-initialization, so that we can introduce the 'flush queue' more easily in the following patch. Once the 'flush queue' and its allocation/free functions are ready, they will be removed for the sake of code readability.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
It is reasonable to allocate the flush req in blk_mq_init_flush().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Ming Lei authored
Failure to initialize one hctx isn't handled, so this patch introduces blk_mq_init_hctx() and its teardown counterpart to handle it explicitly. This patch also makes the code cleaner.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- Sep 24, 2014
-
Tejun Heo authored
blk-mq uses percpu_ref for its usage counter, which tracks the number of in-flight commands and is used to synchronously drain the queue on freeze. percpu_ref shutdown takes measurable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq queue takes measurable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take above ten seconds when scsi-mq is used.

  [ 0.949892] scsi host0: Virtio SCSI HBA
  [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
  [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
  [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz

  <stall>

  [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
  [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
  [ 16.194099] osd: LOADED open-osd 0.2.1
  [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
  [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
  [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
  [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

This is also the reason why request_queues start in bypass mode which is ended on blk_register_queue(), as shutting down a fully functional queue also involves a RCU grace period and the queues for non-existent SCSI devices never reach registration. blk-mq basically needs to do the same thing - start the mq in a degraded mode which is faster to shut down and then make it fully functional only after the queue reaches registration. percpu_ref recently grew facilities to force atomic operation until explicitly switched to percpu mode, which can be used for this purpose. This patch makes blk-mq initialize q->mq_usage_counter in atomic mode and switch it to percpu mode only once blk_register_queue() is reached.

Note that this issue was previously worked around by 0a30288d ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") for v3.17. The temp fix was reverted in preparation of adding persistent atomic mode to percpu_ref by 9eca8046 ("Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe""). This patch and the prerequisite percpu_ref changes will be merged during the v3.18 devel cycle.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
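A sketch of the scheme (the percpu_ref API names are real; the blk-mq call sites are paraphrased):

  /* At init time the usage counter starts in atomic mode, which is
   * slower to operate but cheap to drain: */
  percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
                  PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

  /* Only once blk_register_queue() is reached does it switch to the
   * fast percpu mode: */
  percpu_ref_switch_to_percpu(&q->mq_usage_counter);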
-
Tejun Heo authored
With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the following flags:

* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD   : start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
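A hedged example of the off-by-default persistent switch use case (the release callback name is hypothetical):

  /* Start killed and drained, as if percpu_ref_kill() had already
   * completed: */
  percpu_ref_init(&ref, my_release, PERCPU_REF_INIT_DEAD, GFP_KERNEL);

  /* Turning the switch on later: */
  percpu_ref_reinit(&ref);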
-
Tejun Heo authored
This reverts commit 0a30288d, which was a temporary fix for the SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
-
Tejun Heo authored
blk-mq uses percpu_ref for its usage counter, which tracks the number of in-flight commands and is used to synchronously drain the queue on freeze. percpu_ref shutdown takes measurable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq queue takes measurable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used.

This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which is difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
-
- Sep 22, 2014
-
Jens Axboe authored
Commit 2da78092 changed the locking from a mutex to a spinlock, so we no longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it.
Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-