- 08 Jan, 2018 10 commits
-
-
Minwoo Im authored
The local variable __size__ will be set a bit later in a for-loop. Remove the explicit initialization at the beginning of this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Roy Shterman authored
NVMe transport driver module unload may (and usually does) trigger iteration over the active controllers and delete them all (sometimes under a mutex). However, a controller can be created concurrently with module unload which can lead to leakage of resources (most important char device node leakage) in case the controller creation occured after the unload delete and drain sequence. To protect against this, we take a module reference to guarantee that the nvme transport driver is not unloaded while creating a controller. Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
The current fc transport add_port routine validates that there is a matching port to the target port config. It then takes a reference on the targetport. The del_port removes the reference. Unfortunately, if the LLDD undergoes a hw reset or driver unload and wants to unreg the targetport, due to the reference, the targetport effectively can't be removed. It requires the admin to remove the port from the nvmet config first, which calls the del_port. Note: it appears nvmetcli clear skips over the del_port call (I'm not attempting to change that). There's no real reason to take the reference. With FC, there is nothing to enable or disable as the presence of the FC targetport implicitly means its enabled, and removal of the targtport means its disabled. Change add_port to simply validate and change remove_port to a noop. No references are taken on the targetport. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
The split between what the host accesses on its flows vs what the target side accesses was flawed. Abort handling didn't properly clear initiator vs target structure cross-reference and locks weren't used for synchronization. Thus, there were issues of freeing structures too soon and access after free. A couple of these existed pre the IN_ISR mods, but when the target upcalls were converted to work items, thus adding delays between the 2 sides of accesses, the problems became pronounced. Resolve by: - tracking io state mainly in the tgt-side io structure. - make the tgt-side io structure released by reference not by code flow. - when changing initiator structures, use locks for synchronization - aborts are clearly tracked for which side saw the abort, and after seeing the abort, cross-references are cleared under lock. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
The existing fcloop driver expects the target side upcalls to the transport to context switch, thus the calls into the nvmet layer are not done in the calling context of the host/initiator down calls. The xxx_IN_ISR feature flags are used to select this logic. The xxx_IN_ISR feature flags should go away in the nvmet_fc transport as no other lldd utilizes them. Both Broadcom and Cavium lldds have their own non-ISR deferred handlers thus the nvmet calls can be made directly. This patch converts the paths that make the target upcalls (command receive, abort receive) such that they schedule a work item rather than expecting the transport to schedule the work item. The patch also cleans up the following: - The completion path from target to host scheduled a host work element called "work". Rename it "tio_done_work" for code clarity. - The abort io path called a iniwork item to call the host side io done. This is no longer needed as the abort routine can make the same call. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
The current fcloop driver gets its lport structure from the private area co-allocated with the fc_localport. All is fine except the teardown path, which wants to wait on the completion, which is marked complete by the delete_localport callback performed after unregister_localport. The issue is, the nvme_fc transport frees the localport structure immediately after delete_localport is called, meaning the original routine is trying to wait on a complete that was just freed. Change such that a lport struct is allocated coincident with the addition and registration of a localport. The private area of the localport now contains just a backpointer to the real lport struct. Now, the completion can be waited for, and after completing, the new structure can be kfree'd. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
James Smart authored
A test case revealed a race condition of an i/o completing on a thread parallel to the delete_association generating the aborts for the outstanding ios on the controller. The i/o completion was freeing the target fcloop context, thus the abort task referenced the just-freed memory. Correct by clearing the target/initiator cross pointers in the io completion and abort tasks before calling the callbacks. On aborts that detect already finished io's, ensure the complete context is called. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
It is a bit chatty to report on each queue, log it only for debug purposes. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
It is a bit chatty to report on every deleted queue, so keep it for debug purposes only. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Sagi Grimberg authored
We already do that when we are notified in device removal which is triggered when unregistering as an ib client. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 06 Jan, 2018 26 commits
-
-
Ming Lei authored
In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags can be allocated, and q->nr_hw_queue is used, and race is inevitable, for example: blk_mq_init_sched() may trigger use-after-free on hctx, which is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased. This patch fixes the race be holding q->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Yi Zhang <yi.zhang@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its previous map isn't cleared yet, and may point to one stale hw queue index. This patch fixes the following issue by clearing the mapping table before setting it up in blk_mq_pci_map_queues(). This patches fixes this following issue reported by Zhang Yi: [ 101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f [ 101.211487] IP: blk_mq_map_swqueue+0xbc/0x200 [ 101.216346] PGD 0 P4D 0 [ 101.219171] Oops: 0000 [#1] SMP [ 101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod [ 101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1 [ 101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017 [ 101.301001] Workqueue: nvme-wq nvme_reset_work [nvme] [ 101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9 [ 101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200 [ 101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282 [ 101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000 [ 101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008 [ 101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280 [ 101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008 [ 101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558 [ 101.364314] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000 [ 101.373342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0 [ 101.387714] Call Trace: [ 101.390445] blk_mq_update_nr_hw_queues+0xbf/0x130 [ 101.395791] nvme_reset_work+0x6f4/0xc06 [nvme] [ 101.400848] ? pick_next_task_fair+0x290/0x5f0 [ 101.405807] ? __switch_to+0x1f5/0x430 [ 101.409988] ? put_prev_entity+0x2f/0xd0 [ 101.414365] process_one_work+0x141/0x340 [ 101.418836] worker_thread+0x47/0x3e0 [ 101.422921] kthread+0xf5/0x130 [ 101.426424] ? rescuer_thread+0x380/0x380 [ 101.430896] ? kthread_associate_blkcg+0x90/0x90 [ 101.436048] ret_from_fork+0x1f/0x30 [ 101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44 [ 101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70 [ 101.468205] CR2: 0000000000000098 [ 101.471907] ---[ end trace 5fe710f98228a3ca ]--- [ 101.482489] Kernel panic - not syncing: Fatal exception [ 101.488505] Kernel Offset: disabled [ 101.497752] ---[ end Kernel panic - not syncing: Fatal exception Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Christoph Hellwig <hch@lst.de> Reported-by: Yi Zhang <yi.zhang@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Dispatch may still be in-progress after queue is frozen, so we have to quiesce queue before switching IO scheduler and updating nr_requests. Also when switching io schedulers, blk_mq_run_hw_queue() may still be called somewhere(such as from nvme_reset_work()), and io scheduler's per-hctx data may not be setup yet, so cause oops even inside blk_mq_hctx_has_pending(), such as it can be run just between: ret = e->ops.mq.init_sched(q, e); AND ret = e->ops.mq.init_hctx(hctx, i) inside blk_mq_init_sched(). This reverts commit 7a148c2f(block: don't call blk_mq_quiesce_queue() after queue is frozen) basically, and makes sure blk_mq_hctx_has_pending won't be called if queue is quiesced. Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 7a148c2f(block: don't call blk_mq_quiesce_queue() after queue is frozen) Reported-by: Yi Zhang <yi.zhang@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
After queue is frozen, dispatch still may happen, for example: 1) requests are submitted from several contexts 2) requests from all these contexts are inserted to queue, but may dispatch to LLD in one of these paths, but other paths sill need to move on even all these requests are completed(that means blk_mq_freeze_queue_wait() returns at that time) 3) dispatch after queue freezing still moves on and causes use-after-free, because request queue is freed This patch quiesces queue after it is frozen, and makes sure all in-progress dispatch are completed. This patch fixes the following kernel crash when running heavy IOs vs. deleting device: [ 36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 36.720318] IP: kyber_has_work+0x14/0x40 [ 36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0 [ 36.721584] Oops: 0000 [#1] PREEMPT SMP [ 36.722105] Dumping ftrace buffer: [ 36.722570] (ftrace buffer empty) [ 36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs [ 36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714 [ 36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014 [ 36.736688] RIP: 0010:kyber_has_work+0x14/0x40 [ 36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202 [ 36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0 [ 36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8 [ 36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480 [ 36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8 [ 36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000 [ 36.744189] FS: 00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000 [ 36.746617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0 [ 36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 36.752796] Call Trace: [ 36.753992] blk_mq_do_dispatch_sched+0x7f/0xe0 [ 36.755110] blk_mq_sched_dispatch_requests+0x119/0x190 [ 36.756179] __blk_mq_run_hw_queue+0x83/0x90 [ 36.757144] __blk_mq_delay_run_hw_queue+0xaf/0x110 [ 36.758046] blk_mq_run_hw_queue+0x24/0x70 [ 36.758845] blk_mq_flush_plug_list+0x1e7/0x270 [ 36.759676] blk_flush_plug_list+0xd6/0x240 [ 36.760463] blk_finish_plug+0x27/0x40 [ 36.761195] do_io_submit+0x19b/0x780 [ 36.761921] ? entry_SYSCALL_64_fastpath+0x1a/0x7d [ 36.762788] entry_SYSCALL_64_fastpath+0x1a/0x7d [ 36.763639] RIP: 0033:0x7f9b9699f697 [ 36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1 [ 36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697 [ 36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000 [ 36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420 [ 36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70 [ 36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000 [ 36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 <48> 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10 [ 36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0 [ 36.776012] CR2: 0000000000000008 [ 36.776690] ---[ end trace 4045cbce364ff2a4 ]--- [ 36.777527] Kernel panic - not syncing: Fatal exception [ 36.778526] Dumping ftrace buffer: [ 36.779313] (ftrace buffer empty) [ 36.780081] Kernel Offset: disabled [ 36.780877] ---[ end Kernel panic - not syncing: Fatal exception Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Don't pass in the hardware queue to __dd_dispatch_request(), since it leads the reader to believe that we are returning a request for that specific hardware queue. That's not how mq-deadline works, the state for determining which request to serve next is shared across all hardware queues for a device. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Use the sgl_alloc_order() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: Nicholas A. Bellinger <nab@linux-iscsi.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Bart Van Assche authored
Many kernel drivers contain code that allocates and frees both a scatterlist and the pages that populate that scatterlist. Introduce functions in lib/scatterlist.c that perform these tasks instead of duplicating this functionality in multiple drivers. Only include these functions in the build if CONFIG_SGL_ALLOC=y to avoid that the kernel size increases if this functionality is not used. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Wang Long authored
The @head can be wb->b_dirty_time, so update the comment. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Wang Long <wanglong19@meituan.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Arnd Bergmann authored
When CONFIG_KASAN is set, all the local variables in this function are allocated on the stack together, leading to a warning about possible kernel stack overflow: drivers/block/DAC960.c: In function 'DAC960_gam_ioctl': drivers/block/DAC960.c:7061:1: error: the frame size of 2240 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] By splitting up the function into smaller chunks, we can avoid that and make the code slightly more readable at the same time. The coding style in this file is completely nonstandard, and I chose to not touch that at all, leaving the unconventional intendation unchanged to make it easier to review the diff. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
In this case, 'sectors' can't be zero at all, so remove the check and let the bio be split. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
When merging one bvec into segment, if the bvec is too big to merge, current policy is to move the whole bvec into another new segment. This patchset changes the policy into trying to maximize size of front segments, that means in above situation, part of bvec is merged into current segment, and the remainder is put into next segment. This patch prepares for support multipage bvec because it can be quite common to see this case and we should try to make front segments in full size. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
It is enough to check and compute bio->bi_seg_front_size just after the 1st segment is found, but current code checks that for each bvec, which is inefficient. This patch follows the way in __blk_recalc_rq_segments() for computing bio->bi_seg_front_size, and it is more efficient and code becomes more readable too. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
The bio is always freed after running crypt_free_buffer_pages(), so it isn't necessary to clear bv->bv_page. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Commit 17347cec(Btrfs: change how we iterate bios in endio) mentioned that for dio the submitted bio may be fast cloned, we can't access the bvec table directly for a cloned bio, so use bio_get_first_bvec() to retrieve the 1st bvec. Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: linux-btrfs@vger.kernel.org Cc: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Acked: David Sterba <dsterba@suse.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no longer valid once we start enabling multipage bvecs. correct once we start to enable multipage bvec. Use bio_nr_pages() to do that instead. Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: linux-btrfs@vger.kernel.org Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
bcache is the only user of bio_alloc_pages(), so move this function into bcache, and avoid it being misused in the future. Also rename it to bch_bio_allo_pages() since it is bcache only. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
All direct access to bvec table are safe even after multipage bvec is supported. Cc: linux-bcache@vger.kernel.org Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
For BIO based DM, some targets aren't ready for dealing with bigger incoming bio than 1Mbyte, such as crypt target. Cc: Mike Snitzer <snitzer@redhat.com> Cc:dm-devel@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Firstly this patch introduces BVEC_ITER_ALL_INIT for iterating one bio from start to end. As we need to support multipage bvecs, don't access bio->bi_io_vec in copy_to_high_bio_irq(), and just use the standard iterator for that. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
We will support multipage bvecs in the future, so change to iterator way for getting bv_page of bvec from original bio. Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
This patch converts 3 users to bio_last_bvec_all(), so that we can go ahead and convert to multipage bvec. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
This patch converts to bio_first_bvec_all() & bio_first_page_all() for retrieving the 1st bvec/page, and prepares for supporting multipage bvec. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
The following helpers are introduced for converting current users of direct access to bvec table, and prepares for supporting multipage bvec: bio_pages_all() bio_first_bvec_all() bio_first_page_all() bio_last_bvec_all() All are named as bio_*_all() to following bio_for_each_segment_all(), they can only be used on bio of !bio_flagged(bio, BIO_CLONED), that means the whole bvec table is covered. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Jan, 2018 4 commits
-
-
Paolo Valente authored
Commit a33801e8 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs: one reported in [1], plus a similar one in another function. This commit removes both batches, in the way suggested in [1]. [1] https://www.spinics.net/lists/linux-block/msg20043.html Fixes: a33801e8 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
BFQ privileges the I/O of soft real-time applications, such as video players, to guarantee to these application a high bandwidth and a low latency. In this respect, it is not easy to correctly detect when an application is soft real-time. A particularly nasty false positive is that of an I/O-bound application that occasionally happens to meet all requirements to be deemed as soft real-time. After being detected as soft real-time, such an application monopolizes the device. Fortunately, BFQ will realize soon that the application is actually not soft real-time and suspend every privilege. Yet, the application may happen again to be wrongly detected as soft real-time, and so on. As highlighted by our tests, this problem causes BFQ to occasionally fail to guarantee a high responsiveness, in the presence of heavy background I/O workloads. The reason is that the background workload happens to be detected as soft real-time, more or less frequently, during the execution of the interactive task under test. To give an idea, because of this problem, Libreoffice Writer occasionally takes 8 seconds, instead of 3, to start up, if there are sequential reads and writes in the background, on a Kingston SSDNow V300. This commit addresses this issue by leveraging the following facts. The reason why some applications are detected as soft real-time despite all BFQ checks to avoid false positives, is simply that, during high CPU or storage-device load, I/O-bound applications may happen to do I/O slowly enough to meet all soft real-time requirements, and pass all BFQ extra checks. Yet, this happens only for limited time periods: slow-speed time intervals are usually interspersed between other time intervals during which these applications do I/O at a very high speed. To exploit these facts, this commit introduces a little change, in the detection of soft real-time behavior, to systematically consider also the recent past: the higher the speed was in the recent past, the later next I/O should arrive for the application to be considered as soft real-time. At the beginning of a slow-speed interval, the minimum arrival time allowed for the next I/O usually happens to still be so high, to fall *after* the end of the slow-speed period itself. As a consequence, the application does not risk to be deemed as soft real-time during the slow-speed interval. Then, during the next high-speed interval, the application cannot, evidently, be deemed as soft real-time (exactly because of its speed), and so on. This extra filtering proved to be rather effective: in the above test, the frequency of false positives became so low that the start-up time was 3 seconds in all iterations (apart from occasional outliers, caused by page-cache-management issues, which are out of the scope of this commit, and cannot be solved by an I/O scheduler). Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Angelo Ruocco authored
When two or more processes do I/O in a way that the their requests are sequential in respect to one another, BFQ merges the bfq_queues associated with the processes. This way the overall I/O pattern becomes sequential, and thus there is a boost in througput. These cooperating processes usually start or restart to do I/O shortly after each other. So, in order to avoid merging non-cooperating processes, BFQ ensures that none of these queues has been in weight raising for too long. In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged only shortly after being created", BFQ checks whether any queue (and not only weight-raised ones) is doing I/O continuously from too long to be merged. This new additional check makes the first one useless: a queue doing I/O from long enough, if being weight-raised, is also a queue in weight raising for too long to be merged. Accordingly, this commit removes the first check. Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
In BFQ and CFQ, two processes are said to be cooperating if they do I/O in such a way that the union of their I/O requests yields a sequential I/O pattern. To get such a sequential I/O pattern out of the non-sequential pattern of each cooperating process, BFQ and CFQ merge the queues associated with these processes. In more detail, cooperating processes, and thus their associated queues, usually start, or restart, to do I/O shortly after each other. This is the case, e.g., for the I/O threads of KVM/QEMU and of the dump utility. Basing on this assumption, this commit allows a bfq_queue to be merged only during a short time interval (100ms) after it starts, or re-starts, to do I/O. This filtering provides two important benefits. First, it greatly reduces the probability that two non-cooperating processes have their queues merged by mistake, if they just happen to do I/O close to each other for a short time interval. These spurious merges cause loss of service guarantees. A low-weight bfq_queue may unjustly get more than its expected share of the throughput: if such a low-weight queue is merged with a high-weight queue, then the I/O for the low-weight queue is served as if the queue had a high weight. This may damage other high-weight queues unexpectedly. For instance, because of this issue, lxterminal occasionally took 7.5 seconds to start, instead of 6.5 seconds, when some sequential readers and writers did I/O in the background on a FUJITSU MHX2300BT HDD. The reason is that the bfq_queues associated with some of the readers or the writers were merged with the high-weight queues of some processes that had to do some urgent but little I/O. The readers then exploited the inherited high weight for all or most of their I/O, during the start-up of terminal. The filtering introduced by this commit eliminated any outlier caused by spurious queue merges in our start-up time tests. This filtering also provides a little boost of the throughput sustainable by BFQ: 3-4%, depending on the CPU. The reason is that, once a bfq_queue cannot be merged any longer, this commit makes BFQ stop updating the data needed to handle merging for the queue. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-