Commits · 0db10944a76ba09f37d43b99d0fe085a18307f22 · nexedi / linux

20 Apr, 2017 25 commits

nfs: Convert to separately allocated bdi · 0db10944

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Anna Schumaker <anna.schumaker@netapp.com>
CC: linux-nfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


ncpfs: Convert to separately allocated bdi · a0349ec0

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Petr Vandrovec <petr@vandrovec.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


nilfs2: Convert to properly refcounting bdi · 0546c537

Jan Kara authored Apr 12, 2017

Similarly to set_bdev_super(), NILFS2 just used the block device's
reference to the bdi. Convert it to take a proper bdi reference of its
own. The reference is dropped automatically on superblock destruction.

CC: linux-nilfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>


gfs2: Convert to properly refcounting bdi · 95fe66de

Jan Kara authored Apr 12, 2017

Similarly to set_bdev_super(), GFS2 just used the block device's
reference to the bdi. Convert it to take a proper bdi reference of its
own. The reference is dropped automatically on superblock destruction.

CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


fuse: Get rid of bdi_initialized · 7fbbe972

Jan Kara authored Apr 12, 2017

It is not needed anymore since bdi is initialized whenever superblock
exists.

CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


fuse: Convert to separately allocated bdi · 5f7f7543

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


exofs: Convert to separately allocated bdi · c7f01477

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Boaz Harrosh <ooo@electrozaur.com>
CC: Benny Halevy <bhalevy@primarydata.com>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


coda: Convert to separately allocated bdi · a5695a79

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Jan Harkes <jaharkes@cs.cmu.edu>
CC: coda@cs.cmu.edu
CC: codalist@coda.cs.cmu.edu
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


mtd: Convert to dynamically allocated bdi infrastructure · fa06052d

Jan Kara authored Apr 12, 2017

MTD already allocates backing_dev_info dynamically. Convert it to use
generic infrastructure for this including proper refcounting. We drop
mtd->backing_dev_info as its only use was to pass mtd_bdi pointer from
one file into another and if we wanted to keep that in a clean way, we'd
have to make mtd hold and drop bdi reference as needed which seems
pointless for passing one global pointer...

CC: David Woodhouse <dwmw2@infradead.org>
CC: Brian Norris <computersforpeace@gmail.com>
CC: linux-mtd@lists.infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


afs: Convert to separately allocated bdi · edd3ba94

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: David Howells <dhowells@redhat.com>
CC: linux-afs@lists.infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


ecryptfs: Convert to separately allocated bdi · e836818b

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Tyler Hicks <tyhicks@canonical.com>
CC: ecryptfs@vger.kernel.org
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


cifs: Convert to separately allocated bdi · 851ea086

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


ceph: Convert to separately allocated bdi · 09dc9fc2

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: "Yan, Zheng" <zyan@redhat.com>
CC: Sage Weil <sage@redhat.com>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


btrfs: Convert to separately allocated bdi · 9e11ceee

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


9p: Convert to separately allocated bdi · 71304feb

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside session. This unifies handling of bdi among users.

CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


lustre: Convert to separately allocated bdi · 9594caf2

Jan Kara authored Apr 12, 2017

Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Oleg Drokin <oleg.drokin@intel.com>
CC: Andreas Dilger <andreas.dilger@intel.com>
CC: James Simmons <jsimmons@infradead.org>
CC: lustre-devel@lists.lustre.org
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


fs: Get proper reference for s_bdi · 13eec236

Jan Kara authored Apr 12, 2017

So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


fs: Provide infrastructure for dynamic BDIs in filesystems · fca39346

Jan Kara authored Apr 12, 2017

Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-mtd@lists.infradead.org
CC: linux-nfs@vger.kernel.org
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-nilfs@vger.kernel.org
CC: cluster-devel@redhat.com
CC: osd-dev@open-osd.org
CC: codalist@coda.cs.cmu.edu
CC: linux-afs@lists.infradead.org
CC: ecryptfs@vger.kernel.org
CC: linux-cifs@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-developer@lists.sourceforge.net
CC: lustre-devel@lists.lustre.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
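
As a rough userspace illustration of the pattern this message describes (not the kernel code itself), the sketch below models a superblock that holds a pointer to a separately allocated, reference-counted bdi. A setup helper allocates it, and the superblock teardown path drops the reference. All `model_*` names are hypothetical stand-ins introduced for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model only; every name here is a hypothetical stand-in. */
struct model_bdi {
	int refcount;
};

struct model_super {
	struct model_bdi *s_bdi;	/* pointer, not an embedded struct */
};

static int model_bdi_freed;	/* counts frees, for demonstration only */

/* Drop one reference; free the bdi when the last one goes away. */
static void model_bdi_put(struct model_bdi *bdi)
{
	if (--bdi->refcount == 0) {
		model_bdi_freed++;
		free(bdi);
	}
}

/* Setup helper: allocate a bdi and let the superblock own one reference. */
static int model_super_setup_bdi(struct model_super *sb)
{
	struct model_bdi *bdi = calloc(1, sizeof(*bdi));

	if (!bdi)
		return -1;
	bdi->refcount = 1;
	sb->s_bdi = bdi;
	return 0;
}

/* Teardown: the generic superblock path drops the reference. */
static void model_super_destroy(struct model_super *sb)
{
	if (sb->s_bdi) {
		model_bdi_put(sb->s_bdi);
		sb->s_bdi = NULL;
	}
}
```

The point of the design is that a filesystem only calls the setup helper; it does not need its own cleanup code for the bdi.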


bdi: Export bdi_alloc_node() and bdi_put() · 62bf42ad

Jan Kara authored Apr 12, 2017

MTD will want to call bdi_alloc_node() and bdi_put() directly. Export
these functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: Unregister bdi on last reference drop · 5af110b2

Jan Kara authored Apr 12, 2017

Most users will want to unregister bdi when dropping last reference to a
bdi. Only a few users (like block devices) want to play more complex
tricks with bdi registration and unregistration. So unregister bdi when
the last reference to bdi is dropped and just make sure we don't
unregister the bdi a second time if it is already unregistered.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
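
The behavior described here can be sketched in userspace as follows. This is a simplified model under assumed names (`model_bdi`, `model_bdi_put`, and so on are hypothetical), showing only the two properties the message mentions: the last put unregisters the bdi, and unregistering an already unregistered bdi is a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace model; names are hypothetical, not the kernel's. */
struct model_bdi {
	int refcount;
	bool registered;
};

static int model_unregister_calls;	/* counts real unregistrations */

/* Safe to call more than once: only the first call does any work. */
static void model_bdi_unregister(struct model_bdi *bdi)
{
	if (!bdi->registered)
		return;
	bdi->registered = false;
	model_unregister_calls++;
}

/* Returns true when the last reference was dropped and the bdi freed. */
static bool model_bdi_put(struct model_bdi *bdi)
{
	if (--bdi->refcount > 0)
		return false;
	model_bdi_unregister(bdi);	/* unregister on last reference drop */
	free(bdi);
	return true;
}

/* Allocate a bdi that starts out registered with one reference. */
static struct model_bdi *model_bdi_alloc_registered(void)
{
	struct model_bdi *bdi = calloc(1, sizeof(*bdi));

	if (bdi) {
		bdi->refcount = 1;
		bdi->registered = true;
	}
	return bdi;
}
```

A user with more complex needs can still call the unregister step early; the final put then skips it.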


bdi: Provide bdi_register_va() and bdi_alloc() · baf7a616

Jan Kara authored Apr 12, 2017

Add function that registers bdi and takes va_list instead of variable
number of arguments.

Add bdi_alloc() as a simple wrapper for NUMA-unaware users allocating a BDI.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
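
The va_list variant follows a common C idiom: a variadic function is a thin wrapper around a function taking `va_list`, so that other variadic helpers can forward their arguments. A minimal userspace sketch of that idiom is below; `model_register_va` and `model_register` are hypothetical names, and formatting a name into a buffer merely stands in for whatever the real registration does.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for a "register under a formatted name" call.
 * Takes a va_list so that other variadic helpers can forward to it.
 * Returns 0 on success, -1 if formatting failed or the name was truncated.
 */
static int model_register_va(char *name, size_t len, const char *fmt,
			     va_list args)
{
	int n = vsnprintf(name, len, fmt, args);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Variadic convenience wrapper around the va_list variant. */
static int model_register(char *name, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = model_register_va(name, len, fmt, args);
	va_end(args);
	return ret;
}
```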


blk-throttle: fix unused variable warning with BLK_DEV_THROTTLING_LOW=n · 2bc19cd5

Jens Axboe authored Apr 20, 2017

We trigger this warning:

block/blk-throttle.c: In function ‘blk_throtl_bio’:
block/blk-throttle.c:2042:6: warning: variable ‘ret’ set but not used [-Wunused-but-set-variable]
  int ret;
      ^~~

because when BLK_DEV_THROTTLING_LOW is off we only assign 'ret' and
never check it.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


bfq: fix compile error if CONFIG_CGROUPS=n · 659b3394

Jens Axboe authored Apr 20, 2017

If we don't have CGROUPS enabled, the compile ends in the
following misery:

In file included from ../block/bfq-iosched.c:105:0:
../block/bfq-iosched.h:819:22: error: array type has incomplete element type
 extern struct cftype bfq_blkcg_legacy_files[];
                      ^
../block/bfq-iosched.h:820:22: error: array type has incomplete element type
 extern struct cftype bfq_blkg_files[];
                      ^

Move the declarations under the right ifdef.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
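
The error arises because C does not allow declaring an array whose element type is incomplete. The sketch below illustrates the general shape of the fix, assuming a hypothetical `MODEL_CGROUPS` macro in place of the real config option and a hypothetical `model_cftype` in place of `struct cftype`: the array declaration sits under the same ifdef that makes the struct definition visible.

```c
#include <assert.h>

/* Hypothetical stand-in for the config option; remove to model CGROUPS=n. */
#define MODEL_CGROUPS 1

#ifdef MODEL_CGROUPS
/* The element type is complete here, so the array declaration is legal. */
struct model_cftype {
	const char *name;
};

extern struct model_cftype model_files[];
struct model_cftype model_files[] = { { "weight" }, { 0 } };
#endif

/* Code outside the ifdef must not mention the array at all. */
static int model_file_count(void)
{
#ifdef MODEL_CGROUPS
	int n = 0;

	while (model_files[n].name)
		n++;
	return n;
#else
	return 0;
#endif
}
```

If the `extern` line were placed outside the ifdef while the struct definition stayed inside, building without `MODEL_CGROUPS` would fail with the same class of error as quoted above.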


block, bfq: don't dereference bic before null checking it · 8c9ff1ad

Colin Ian King authored Apr 20, 2017

The call to bfq_check_ioprio_change will dereference bic; however,
the null check for bic is after this call. Move the null
check on bic to before the call to avoid any potential null
pointer dereference issues.

Detected by CoverityScan, CID#1430138 ("Dereference before null check")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
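
A minimal sketch of the corrected ordering, using hypothetical `model_*` names rather than the actual BFQ functions: the null check comes first, and the helper that dereferences the pointer is only reached afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process I/O context data. */
struct model_bic {
	int checks;
};

/* Dereferences bic unconditionally, like the call named in the message. */
static void model_check_ioprio_change(struct model_bic *bic)
{
	bic->checks++;
}

/* Returns -1 if there is no context, 0 otherwise. */
static int model_init_queue(struct model_bic *bic)
{
	if (!bic)		/* check before any dereference */
		return -1;
	model_check_ioprio_change(bic);
	return 0;
}
```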


lightnvm: fix double blk_put_queue on same queue · 75ba4ada

Rakesh Pandit authored Apr 20, 2017

On an error path in the NVM_DEV_CREATE ioctl, blk_put_queue is called
twice: once via blk_cleanup_queue and again via put_disk. The
straightforward fix is to clear the queue pointer so that disk_release
never ends up calling blk_put_queue again.

  [  391.808827] WARNING: CPU: 1 PID: 1250 at lib/refcount.c:128 refcount_sub_and_test+0x70/0x80
  [  391.808830] refcount_t: underflow; use-after-free.
  [ 391.808832] Modules linked in: nf_conntrack_netbios_ns............
  [  391.809052] CPU: 1 PID: 1250 Comm: nvme Not tainted.........
  [  391.809057] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
             BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
  [  391.809060] Call Trace:
  [  391.809079]  dump_stack+0x63/0x86
  [  391.809094]  __warn+0xcb/0xf0
  [  391.809103]  warn_slowpath_fmt+0x5f/0x80
  [  391.809118]  refcount_sub_and_test+0x70/0x80
  [  391.809125]  refcount_dec_and_test+0x11/0x20
  [  391.809136]  kobject_put+0x1f/0x60
  [  391.809149]  blk_put_queue+0x15/0x20
  [  391.809159]  disk_release+0xae/0xf0
  [  391.809172]  device_release+0x32/0x90
  [  391.809184]  kobject_release+0x6a/0x170
  [  391.809196]  kobject_put+0x2f/0x60
  [  391.809206]  put_disk+0x17/0x20
  [  391.809219]  nvm_ioctl_dev_create.isra.16+0x897/0xa30
  [  391.809236]  nvm_ctl_ioctl+0x23c/0x4c0
  [  391.809248]  do_vfs_ioctl+0xa3/0x5f0
  [  391.809258]  SyS_ioctl+0x79/0x90
  [  391.809271]  entry_SYSCALL_64_fastpath+0x1a/0xa9
  [  391.809280] RIP: 0033:0x7f5d3ef363c7
  [  391.809286] RSP: 002b:00007ffc72ed8d78 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [  391.809296] RAX: ffffffffffffffda RBX: 00007ffc72edb552 RCX: 00007f5d3ef363c7
  [  391.809301] RDX: 00007ffc72ed8d90 RSI: 0000000040804c22 RDI: 0000000000000003
  [  391.809306] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000001
  [  391.809311] R10: 000000000000053f R11: 0000000000000206 R12: 0000000000000000
  [  391.809316] R13: 0000000000000000 R14: 00007ffc72edb58d R15: 00007ffc72edb581

Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Fixes: 7d1ef2f4 "lightnvm: fix cleanup order of disk on init error"
Signed-off-by: Jens Axboe <axboe@fb.com>
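
The double put and its fix can be modeled in userspace as below. This is only a sketch with hypothetical `model_*` names: cleanup drops one queue reference, and the disk release path drops another only if the disk still points at a queue, so clearing the pointer in between avoids the underflow.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model; names are hypothetical stand-ins. */
struct model_queue {
	int refcount;
};

struct model_disk {
	struct model_queue *queue;
};

static void model_put_queue(struct model_queue *q)
{
	q->refcount--;
}

/* Cleanup path: drops the queue reference itself. */
static void model_cleanup_queue(struct model_queue *q)
{
	model_put_queue(q);
}

/* Disk release: drops a queue reference only if one is still attached. */
static void model_put_disk(struct model_disk *d)
{
	if (d->queue)
		model_put_queue(d->queue);
}

/* Error path with the fix: detach the queue before releasing the disk. */
static void model_error_path(struct model_disk *d)
{
	model_cleanup_queue(d->queue);
	d->queue = NULL;	/* prevents the second put */
	model_put_disk(d);
}
```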


19 Apr, 2017 15 commits

block: Optimize ioprio_best() · 9a87182c

Bart Van Assche authored Apr 19, 2017

Since ioprio_best() translates IOPRIO_CLASS_NONE into IOPRIO_CLASS_BE,
and since lower numerical priority values represent a higher priority,
a simple numerical comparison is sufficient.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Tested-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
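
To see why a plain comparison works, consider the sketch below. It is a userspace model, not the kernel function: the macros mirror the ioprio encoding (class in the high bits, level in the low bits) as an assumption, and `model_ioprio_best` is a hypothetical name. Because the class occupies the most significant bits and RT < BE < IDLE numerically, the smaller encoded value is always the better priority once NONE has been mapped to BE.

```c
#include <assert.h>

/* Assumed to mirror the kernel's encoding; treat the values as illustrative. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_NORM		4

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

#define IOPRIO_PRIO_VALUE(cls, data)	(((cls) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(mask)		((mask) >> IOPRIO_CLASS_SHIFT)

/* Hypothetical model: pick the higher of two I/O priorities. */
static unsigned short model_ioprio_best(unsigned short a, unsigned short b)
{
	/* An unset priority is treated as best-effort, normal level. */
	if (IOPRIO_PRIO_CLASS(a) == IOPRIO_CLASS_NONE)
		a = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
	if (IOPRIO_PRIO_CLASS(b) == IOPRIO_CLASS_NONE)
		b = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);

	/* Lower encoded value means higher priority. */
	return a < b ? a : b;
}
```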


block: Inline blk_rq_set_prio() · 0be0dee6

Bart Van Assche authored Apr 19, 2017

Since only a single caller remains, inline blk_rq_set_prio(). Initialize
req->ioprio even if no I/O priority has been set in the bio nor in the
I/O context.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com>
Tested-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>


lightnvm: Use blk_init_request_from_bio() instead of open-coding it · 9460e280

Bart Van Assche authored Apr 19, 2017

This patch changes the behavior of the lightnvm driver as follows:
* REQ_FAILFAST_MASK is set for read-ahead requests.
* If no I/O priority has been set in the bio, the I/O priority is
  copied from the I/O context.
* The rq_disk member is initialized if bio->bi_bdev != NULL.
* The bio sector offset is copied into req->__sector instead of
  retaining the value -1 set by blk_mq_alloc_request().
* req->errors is initialized to zero.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


null_blk: Use blk_init_request_from_bio() instead of open-coding it · 2644a3cc

Bart Van Assche authored Apr 19, 2017

This patch changes the behavior of the null_blk driver for the
LightNVM mode as follows:
* REQ_FAILFAST_MASK is set for read-ahead requests.
* If no I/O priority has been set in the bio, the I/O priority is
  copied from the I/O context.
* The rq_disk member is initialized if bio->bi_bdev != NULL.
* req->errors is initialized to zero.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: Export blk_init_request_from_bio() · da8d7f07

Bart Van Assche authored Apr 19, 2017

Export this function such that it becomes available to block
drivers.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


lightnvm: assume 64-bit lba numbers · ef697902

Arnd Bergmann authored Apr 19, 2017

The driver uses both u64 and sector_t to refer to offsets, and assigns between the
two. This causes one harmless warning when sector_t is 32-bit:

drivers/lightnvm/pblk-rb.c: In function 'pblk_rb_write_entry_gc':
include/linux/lightnvm.h:215:20: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
drivers/lightnvm/pblk-rb.c:324:22: note: in expansion of macro 'ADDR_EMPTY'

As the driver is already doing this inconsistently, changing the type
won't make it worse and is an easy way to avoid the warning.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
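
The compiler complaint comes from storing an all-ones 64-bit constant in a 32-bit type. The userspace sketch below shows the truncation in isolation; `MODEL_ADDR_EMPTY` and `model_sector32_t` are hypothetical stand-ins for the driver's sentinel and a 32-bit `sector_t`, and the comparison is only there to make the truncation observable, not a claim about what the driver does with the value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an all-ones 64-bit "empty" sentinel. */
#define MODEL_ADDR_EMPTY	(~0ULL)

/* Hypothetical stand-in for sector_t on a 32-bit configuration. */
typedef uint32_t model_sector32_t;

/* With a 32-bit type the sentinel is truncated to 0xffffffff. */
static int model_matches_sentinel_32(void)
{
	model_sector32_t lba = (model_sector32_t)MODEL_ADDR_EMPTY;

	return lba == MODEL_ADDR_EMPTY;
}

/* With a 64-bit type the sentinel round-trips unchanged. */
static int model_matches_sentinel_64(void)
{
	uint64_t lba = MODEL_ADDR_EMPTY;

	return lba == MODEL_ADDR_EMPTY;
}
```

Using a 64-bit type throughout removes the implicit narrowing, which is the approach the message takes.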


block: make __blk_end_bidi_request private · d0fac025

Christoph Hellwig authored Apr 12, 2017

blk_insert_flush should be using __blk_end_request to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: remove blk_end_request_cur · fa1a15c0

Christoph Hellwig authored Apr 12, 2017

This function is not used anywhere in the kernel.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: remove blk_end_request_err and __blk_end_request_err · 314fe91b

Christoph Hellwig authored Apr 12, 2017

Both functions are entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: remove the osdblk driver · 10081552

Christoph Hellwig authored Apr 12, 2017

This was just a proof of concept user for the SCSI OSD library, and
never had any real users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


block: Make writeback throttling defaults consistent for SQ devices · 8330cdb0

Jan Kara authored Apr 19, 2017

When CFQ is used as an elevator, it disables writeback throttling
because they don't play well together. Later when a different elevator
is chosen for the device, writeback throttling doesn't get enabled
again as it should. Make sure CFQ enables writeback throttling (if it
should be enabled by default) when we switch from it to another IO
scheduler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


block, bfq: split bfq-iosched.c into multiple source files · ea25da48

Paolo Valente authored Apr 19, 2017

The BFQ I/O scheduler features an optimal fair-queuing
(proportional-share) scheduling algorithm, enriched with several
mechanisms to boost throughput and reduce latency for interactive and
real-time applications. This makes BFQ a large and complex piece of
code. This commit addresses this issue by splitting BFQ into three
main, independent components, and by moving each component into a
separate source file:
1. Main algorithm: handles the interaction with the kernel, and
   decides which requests to dispatch; it uses the following two further
   components to achieve its goals.
2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
   computes the schedule, using weights and budgets provided by the above
   component.
3. cgroups support: handles group operations (creation, destruction,
   move, ...).

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


block, bfq: remove all get and put of I/O contexts · 6fa3e8d3

Paolo Valente authored Apr 12, 2017

When a bfq queue is set in service and when it is merged, a reference
to the I/O context associated with the queue is taken. This reference
is then released when the queue is deselected from service or
split. More precisely, the release of the reference is postponed to
when the scheduler lock is released, to avoid nesting between the
scheduler and the I/O-context lock. In fact, such nesting would lead
to deadlocks, because of other code paths that take the same locks in
the opposite order. This postponing of I/O-context releases does
complicate code.

This commit addresses this issue by modifying the involved operations
so that they no longer need to get the above I/O-context references.
It then removes every get and release of these references.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>


block, bfq: handle bursts of queue activations · e1b2324d

Arianna Avanzini authored Apr 12, 2017

Many popular I/O-intensive services or applications spawn or
reactivate many parallel threads/processes during short time
intervals. Examples are systemd during boot or git grep.  These
services or applications benefit mostly from a high throughput: the
quicker the I/O generated by their processes is cumulatively served,
the sooner the target job of these services or applications gets
completed. As a consequence, it is almost always counterproductive to
weight-raise any of the queues associated to the processes of these
services or applications: in most cases it would just lower the
throughput, mainly because weight-raising also implies device idling.

To address this issue, an I/O scheduler first needs to detect which
queues are associated with these services or applications. From the
I/O-scheduler standpoint, these services or applications cause bursts
of activations, i.e., activations of different queues occurring shortly
after each other. However, a shorter burst of activations may also be
caused by the start of an application that does not consist of many
parallel I/O-bound threads (see the comments on the function
bfq_handle_burst for details).

In view of these facts, this commit introduces:
1) a heuristic to detect (only) bursts of queue activations caused by
   services or applications consisting of many parallel I/O-bound
   threads;
2) the prevention of device idling and weight-raising for the queues
   belonging to these bursts.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
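
The general idea of such a heuristic can be sketched as follows. This is a deliberately simplified userspace model, not the logic of bfq_handle_burst: the interval and threshold values, the `model_*` names, and the single-counter bookkeeping are all assumptions made for illustration. Activations that follow each other closely extend the current burst; once the burst reaches a size threshold it is treated as large, and queues in it would be denied idling and weight-raising.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical parameters chosen for illustration only. */
#define MODEL_BURST_INTERVAL_MS		180	/* max gap inside one burst */
#define MODEL_LARGE_BURST_THRESH	8	/* size at which a burst is large */

struct model_burst {
	long last_activation_ms;	/* time of the previous activation */
	int size;			/* activations in the current burst */
	bool large;			/* current burst crossed the threshold */
};

/*
 * Record one queue activation at time now_ms. Returns true if the
 * activated queue belongs to a large burst.
 */
static bool model_handle_activation(struct model_burst *b, long now_ms)
{
	if (b->size == 0 ||
	    now_ms - b->last_activation_ms > MODEL_BURST_INTERVAL_MS) {
		/* Too long since the last activation: start a new burst. */
		b->size = 1;
		b->large = false;
	} else if (!b->large && ++b->size >= MODEL_LARGE_BURST_THRESH) {
		b->large = true;
	}
	b->last_activation_ms = now_ms;
	return b->large;
}
```

In this model a short burst (fewer activations than the threshold) never sets the flag, which corresponds to the case of an ordinary application start mentioned above.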


block, bfq: boost the throughput with random I/O on NCQ-capable HDDs · e01eff01

Paolo Valente authored Apr 12, 2017

This patch is basically the counterpart, for NCQ-capable rotational
devices, of the previous patch. Exactly as the previous patch does on
flash-based devices and for any workload, this patch disables device
idling on rotational devices, but only for random I/O. In fact, on
NCQ-capable rotational devices, disabling idling boosts the throughput
only for queues doing random I/O. To not break service guarantees,
idling is disabled for NCQ-enabled rotational devices only when the
same symmetry conditions considered in the previous patches hold.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
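
The decision described in this message can be summarized as a small predicate. The sketch below is a userspace model with hypothetical names; `symmetric` stands in for the symmetry conditions the message refers to, and the assumption that idling helps on devices without NCQ is mine, not stated in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs; a real scheduler derives these from device state. */
struct model_dev {
	bool rotational;	/* HDD rather than flash */
	bool ncq;		/* device queues commands internally */
};

/* Returns true if the scheduler should idle the device for a queue. */
static bool model_should_idle(const struct model_dev *dev, bool random_io,
			      bool symmetric)
{
	bool idling_boosts_throughput;

	if (!dev->ncq)
		idling_boosts_throughput = true;	/* assumption */
	else if (!dev->rotational)
		idling_boosts_throughput = false;	/* flash: any workload */
	else
		idling_boosts_throughput = !random_io;	/* HDD: random I/O only */

	/* Keep idling if it helps throughput or guarantees depend on it. */
	return idling_boosts_throughput || !symmetric;
}
```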
