blkcg: don't call into policy draining if root_blkg is already gone · 3f2c76f9
Tejun Heo authored
    While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL.  If someone else starts to drain
    while the queue is in this state, the following oops happens.
    
      NULL pointer dereference at 0000000000000028
      IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
      PGD e4a1067 PUD b773067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
      CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
      RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
      RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
      R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
      R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
      FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
      Stack:
       ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
       ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
       ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
      Call Trace:
       [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
       [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
       [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
       [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
       [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
       [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
       [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
       [<ffffffff81454968>] kobject_cleanup+0x38/0x70
       [<ffffffff81454848>] kobject_put+0x28/0x60
       [<ffffffff81427505>] blk_put_queue+0x15/0x20
       [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
       [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
       [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
       [<ffffffff817930e2>] device_release+0x32/0xa0
       [<ffffffff81454968>] kobject_cleanup+0x38/0x70
       [<ffffffff81454848>] kobject_put+0x28/0x60
       [<ffffffff817934d7>] put_device+0x17/0x20
       [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
       [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
       [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
       [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
       [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
       [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
       [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
       [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
       [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
    
    776687bc ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.
    
Fix it by skipping the call into policy draining if all the blkgs are
already gone.
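
A minimal sketch of the guard this adds to blkcg_drain_queue(),
simplified from the fix described above:

	void blkcg_drain_queue(struct request_queue *q)
	{
		lockdep_assert_held(q->queue_lock);

		/*
		 * @q could be exiting and may already have destroyed all
		 * blkgs, as indicated by a NULL ->root_blkg.  If so, don't
		 * call into the policies' drain methods at all.
		 */
		if (!q->root_blkg)
			return;

		blk_throtl_drain(q);
	}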
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Jet Chen <jet.chen@intel.com>
Cc: stable@vger.kernel.org
Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment"
    
    This reverts commit 254c4407.
    
    It causes crashes with cryptsetup, even after a few iterations and
    updates. Drop it for now.
    
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of whether the page
being added can be merged into the last segment.
    
    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.
    
This patch modifies the function so that it refuses to add a page only
when the page starts a new segment and the maximum number of segments
has already been reached.
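
In rough pseudocode the revised check looks like the sketch below;
page_can_merge_into_last_segment() is a hypothetical stand-in for the
block layer's physical-merge tests, not an actual helper:

	if (bio_phys_segments(q, bio) >= queue_max_segments(q) &&
	    !page_can_merge_into_last_segment(q, bio, page, len, offset))
		return 0;	/* caller still sees the old EBUSY behaviour */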
    
    The bug can be easily reproduced with the st driver:
    
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
       dd: error writing `/dev/st0': Device or resource busy
    
    [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments]
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Dongsu Park <dongsu.park@profitbricks.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge
    
The SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access the
reserved buffer size in bytes as an int.  The value needs to be capped
at the request queue's max_sectors, but integer overflow is not
correctly handled in the calculation when converting max_sectors from
sectors to bytes.
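
The fix amounts to clamping max_sectors before the sector-to-byte
shift, roughly along these lines (a sketch of the approach, not the
verbatim patch):

	static int max_sectors_bytes(struct request_queue *q)
	{
		unsigned int max_sectors = queue_max_sectors(q);

		/* clamp so that the shift below cannot overflow an int */
		max_sectors = min_t(unsigned int, max_sectors, INT_MAX >> 9);

		return max_sectors << 9;
	}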
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX
    
The BLKSECTGET ioctl stores the request queue's max_sectors through
the argument pointer as an unsigned short.  So if max_sectors is
greater than USHRT_MAX, its upper 16 bits are simply discarded.

In such a case, returning USHRT_MAX is preferable to returning the
lower 16 bits of max_sectors.
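
Roughly, the clamping looks like this (a sketch, not the verbatim
patch):

	case BLKSECTGET:
		/* clamp instead of silently truncating to 16 bits */
		max_sectors = min_t(unsigned int, USHRT_MAX,
				    queue_max_sectors(q));
		return put_user(max_sectors, (unsigned short __user *)arg);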
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: linux-scsi@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block/partitions/efi.c: kerneldoc fixing
    
Add function documentation and fix kerneldoc warnings
(making 'field: description' entries uniform).
    
    Cc: Davidlohr Bueso <davidlohr@hp.com>
    Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block/partitions/msdos.c: code clean-up
    
    checkpatch fixing:
    WARNING: Missing a blank line after declarations
    WARNING: space prohibited between function name and open parenthesis '('
    ERROR: spaces required around that '<' (ctx:VxV)
    
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block/partitions/amiga.c: replace nolevel printk by pr_err
    
Also add a no-prefix pr_fmt to avoid any future default format update.
    
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block/partitions/aix.c: replace count*size kzalloc by kcalloc
    
kcalloc checks the count * size multiplication for overflow.
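
The change is of this general shape (sketch; the actual struct in
aix.c may differ):

	/* before: the multiplication can overflow unnoticed */
	ptr = kzalloc(count * sizeof(struct lvname), GFP_KERNEL);

	/* after: kcalloc() returns NULL if count * size would overflow */
	ptr = kcalloc(count, sizeof(struct lvname), GFP_KERNEL);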
    
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload
    
Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from
Martin introduced bip_integrity_vecs() (which returns the number of
useful vectors) to fix the nr_vecs issue for inline integrity vectors
reported by David Milburn.

But bip_integrity_vecs() returns the wrong number if the bio is not
based on any bio_set for some reason (bio->bi_pool == NULL), because in
that case bip_inline_vecs[0] is allocated directly.  So add bip_max_vcnt
to record the number of vector slots, and clean up
bip_integrity_vecs().
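
With the count recorded at allocation time, the cleaned-up helper
reduces to something like this sketch:

	static inline unsigned int bip_integrity_vecs(struct bio_integrity_payload *bip)
	{
		/* bip_max_vcnt is filled in when the payload is allocated */
		return bip->bip_max_vcnt;
	}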
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    blk-mq: use percpu_ref for mq usage count
    
    Currently, blk-mq uses a percpu_counter to keep track of how many
    usages are in flight.  The percpu_counter is drained while freezing to
    ensure that no usage is left in-flight after freezing is complete.
    blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
    per-cpu gating mechanism.
    
This type of code has a relatively high chance of subtle bugs which are
extremely difficult to trigger, and it's way too hairy to be open coded
    in blk-mq.  percpu_ref can serve the same purpose after the recent
    changes.  This patch replaces the open-coded per-cpu usage counting
    and draining mechanism with percpu_ref.
    
    blk_mq_queue_enter() performs tryget_live on the ref and exit()
    performs put.  blk_mq_freeze_queue() kills the ref and waits until the
    reference count reaches zero.  blk_mq_unfreeze_queue() revives the ref
    and wakes up the waiters.
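
In outline, the mapping described above looks roughly like this
(locking and the dying checks are elided):

	/* enter: fails once the ref has been killed (queue freezing) */
	if (!percpu_ref_tryget_live(&q->mq_usage_counter))
		/* wait for unfreeze, or fail if the queue is dying */;

	/* exit */
	percpu_ref_put(&q->mq_usage_counter);

	/* freeze: kill the ref and wait for in-flight users to drain */
	percpu_ref_kill(&q->mq_usage_counter);
	wait_event(q->mq_freeze_wq,
		   percpu_ref_is_zero(&q->mq_usage_counter));

	/* unfreeze: revive the ref and wake up the waiters */
	percpu_ref_reinit(&q->mq_usage_counter);
	wake_up_all(&q->mq_freeze_wq);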
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue()
    
    Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
    anything and it's gonna be further simplified.  Let's flatten it into
    its caller.
    
    This patch doesn't make any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
blk-mq: decouple blk-mq freezing from generic bypassing
    
    blk_mq freezing is entangled with generic bypassing which bypasses
    blkcg and io scheduler and lets IO requests fall through the block
    layer to the drivers in FIFO order.  This allows forward progress on
    IOs with the advanced features disabled so that those features can be
    configured or altered without worrying about stalling IO which may
    lead to deadlock through memory allocation.
    
    However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypassing to
    freezing, which blocks request processing and drains all the in-flight
    ones.  This causes problems as bypassing assumes that request
    processing is online.  blk-mq works around this by conditionally
    allowing request processing for the problem case - during queue
    initialization.
    
Another oddity is that, except during queue cleanup, bypassing
    started on the generic side prevents blk-mq from processing new
    requests but doesn't drain the in-flight ones.  This shouldn't break
    anything but again highlights that something isn't quite right here.
    
    The root cause is conflating blk-mq freezing and generic bypassing
    which are two different mechanisms.  The only intersecting purpose
that they serve is during queue cleanup.  Let's properly separate
blk-mq freezing from generic bypassing and use freezing only where it
is actually necessary.
    
    * request_queue->mq_freeze_depth is added and
      blk_mq_[un]freeze_queue() now operate on this counter instead of
      ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
      but the counter is tested directly.  This will be further updated by
      later changes.
    
    * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
      blk_mq_freeze_queue().  Queue cleanup path now calls
      blk_mq_freeze_queue() directly.
    
* blk_queue_enter()'s fast path condition is simplified to simply
  check @q->mq_freeze_depth, as sketched after this list.  Previously,
  the condition was

	!blk_queue_dying(q) &&
	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))

  mq_freeze_depth is incremented right after dying is set, and the
  blk_queue_init_done() exception isn't necessary as blk-mq doesn't
  start frozen.  That only leaves the blk_queue_bypass() test, which
  can be replaced by a @q->mq_freeze_depth test.
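
A rough sketch of the simplified fast path (not the verbatim code):

	__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
	smp_mb();

	if (!q->mq_freeze_depth)
		return 0;	/* fast path: not frozen */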
    
    This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    block, blk-mq: draining can't be skipped even if bypass_depth was non-zero
    
    Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
    skip queue draining if bypass_depth was already above zero.  The
    assumption is that the one which bumped the bypass_depth should have
    performed draining already; however, there's nothing which prevents a
    new instance of bypassing/freezing from starting before the previous
    one finishes draining.  The current code may allow the later
    bypassing/freezing instances to complete while there still are
    in-flight requests which haven't finished draining.
    
    Fix it by draining regardless of bypass_depth.  We still skip draining
    from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot.  INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip in between.
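
After the change, blk_queue_bypass_start() looks roughly like this
sketch:

	void blk_queue_bypass_start(struct request_queue *q)
	{
		spin_lock_irq(q->queue_lock);
		q->bypass_depth++;
		queue_flag_set(QUEUE_FLAG_BYPASS, q);
		spin_unlock_irq(q->queue_lock);

		/*
		 * Drain regardless of bypass_depth, but only once init
		 * is done, to avoid lengthy delays during boot.
		 */
		if (blk_queue_init_done(q)) {
			spin_lock_irq(q->queue_lock);
			__blk_drain_queue(q, false);
			spin_unlock_irq(q->queue_lock);

			/* ensure blk_queue_bypass() is %true inside RCU read lock */
			synchronize_rcu();
		}
	}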
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    blk-mq: fix a memory ordering bug in blk_mq_queue_enter()
    
    blk-mq uses a percpu_counter to keep track of how many usages are in
    flight.  The percpu_counter is drained while freezing to ensure that
    no usage is left in-flight after freezing is complete.
    
    blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter,
and if freezing happens in between, the caller can slip through and
freezing can complete while there are active users.
    
    Use smp_mb() instead so that bypass_depth and mq_usage_counter
    modifications and tests are properly interlocked.
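
A sketch of the ordering in blk_mq_queue_enter() after the fix:

	__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);

	/*
	 * Was smp_wmb(): a write barrier doesn't stop the reads below
	 * from being reordered before the increment; a full barrier does.
	 */
	smp_mb();

	if (!blk_queue_dying(q) &&
	    (!blk_queue_bypass(q) || !blk_queue_init_done(q)))
		return 0;	/* fast path */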
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
    
    Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.17/core
    
Merge the percpu_ref changes from Tejun; he says they are stable now.
    
    percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
    
Now that explicit invocation of percpu_ref_exit() is necessary to free
the percpu counter, we can implement percpu_ref_reinit(), which
reinitializes a released percpu_ref.  This can be used to implement a
scalable gating switch which can be drained and then re-opened without
worrying about memory allocation failures.
    
    percpu_ref_is_zero() is added to be used in a sanity check in
    percpu_ref_exit().  As this function will be useful for other purposes
    too, make it a public interface.
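
A hedged usage sketch of the resulting gating pattern; gate is a
hypothetical structure embedding a percpu_ref and a waitqueue:

	/* drain: kill the ref and wait until all users are gone */
	percpu_ref_kill(&gate->ref);
	wait_event(gate->waitq, percpu_ref_is_zero(&gate->ref));

	/* ... quiescent section ... */

	/* re-open without any memory allocation */
	percpu_ref_reinit(&gate->ref);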
    
v2: Use smp_read_barrier_depends() instead of smp_load_acquire().  We
    only need a data dependency barrier, and smp_load_acquire() is
    stronger and heavier on some archs.  Spotted by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    
    percpu-refcount: require percpu_ref to be exited explicitly
    
    Currently, a percpu_ref undoes percpu_ref_init() automatically by
    freeing the allocated percpu area when the percpu_ref is killed.
    While seemingly convenient, this has the following niggles.
    
    * It's impossible to re-init a released reference counter without
      going through re-allocation.
    
* In a similar vein, it's impossible to initialize a percpu_ref
  count with static percpu variables.
    
    * We need and have an explicit destructor anyway for failure paths -
      percpu_ref_cancel_init().
    
    This patch removes the automatic percpu counter freeing in
    percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
    generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
was considered, but it gets confusing with percpu_ref_kill(), while
    "exit" clearly indicates that it's the counterpart of
    percpu_ref_init().
    
    All percpu_ref_cancel_init() users are updated to invoke
    percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
    added to the destruction path of all percpu_ref users.
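
The resulting lifecycle, sketched (my_release is a caller-supplied
release callback, named here only for illustration):

	percpu_ref_init(&ref, my_release);	/* allocates the percpu counter */
	/* ... use: percpu_ref_get()/put()/tryget() ... */
	percpu_ref_kill(&ref);			/* my_release() runs once count hits zero */
	/* ... */
	percpu_ref_exit(&ref);			/* now required: frees the percpu counter */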
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
    Cc: Li Zefan <lizefan@huawei.com>
    
    percpu-refcount: use unsigned long for pcpu_count pointer
    
percpu_ref->pcpu_count is a percpu pointer with a status flag in its
lowest bit.  As such, it always goes through arithmetic operations,
which are very cumbersome to do on a pointer: it first has to be cast
to unsigned long and then back.
    
Let's just make the field unsigned long so that we can skip the first
cast.  While at it, rename it to pcpu_counter_ptr to clarify that
    it's a pointer value.
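
An illustrative before/after, using the field names from the
description above:

	/* before: ->pcpu_count is a pointer; flag math needs casts both ways */
	unsigned __percpu *pcpu_count = (unsigned __percpu *)
		((unsigned long)ref->pcpu_count & ~PCPU_REF_DEAD);

	/* after: the field is unsigned long; cast only when dereferencing */
	unsigned __percpu *pcpu_count =
		(unsigned __percpu *)(ref->pcpu_counter_ptr & ~PCPU_REF_DEAD);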
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    
    percpu-refcount: add helpers for ->percpu_count accesses
    
    * All four percpu_ref_*() operations implemented in the header file
      perform the same operation to determine whether the percpu_ref is
      alive and extract the percpu pointer.  Factor out the common logic
      into __pcpu_ref_alive().  This doesn't change the generated code.
    
* There are a couple of places in percpu-refcount.c which mask out
  PCPU_REF_DEAD to obtain the percpu pointer.  Factor this out into
  pcpu_count_ptr().
    
    * The above changes make the WARN_ON_ONCE() conditional at the top of
      percpu_ref_kill_and_confirm() the only user of REF_STATUS().  Test
      PCPU_REF_DEAD directly and remove REF_STATUS().
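
Roughly, the two helpers look like this (a sketch; at this point
->pcpu_count is still a percpu pointer carrying the flag bit):

	static unsigned __percpu *pcpu_count_ptr(struct percpu_ref *ref)
	{
		return (unsigned __percpu *)
			((unsigned long)ref->pcpu_count & ~PCPU_REF_DEAD);
	}

	static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
					    unsigned __percpu **pcpu_countp)
	{
		unsigned __percpu *pcpu_count = ACCESS_ONCE(ref->pcpu_count);

		if (unlikely((unsigned long)pcpu_count & PCPU_REF_DEAD))
			return false;

		*pcpu_countp = pcpu_count;
		return true;
	}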
    
    This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    
    percpu-refcount: one bit is enough for REF_STATUS
    
percpu-refcount currently reserves the two lowest bits of its percpu
    pointer to indicate its state; however, only one bit is used for
    PCPU_REF_DEAD.
    
    Simplify it by removing PCPU_STATUS_BITS/MASK and testing
    PCPU_REF_DEAD directly.  This also allows the compiler to choose a
    more efficient instruction depending on the architecture.
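
Roughly (a sketch of the simplification):

	/* before: two reserved bits and a status-extraction macro */
	#define PCPU_STATUS_BITS	2
	#define PCPU_STATUS_MASK	((1 << PCPU_STATUS_BITS) - 1)
	#define PCPU_REF_DEAD		1
	#define REF_STATUS(count)	(((unsigned long)(count)) & PCPU_STATUS_MASK)

	/* after: only the flag bit remains; test it directly */
	#define PCPU_REF_DEAD		1

	if ((unsigned long)ref->pcpu_count & PCPU_REF_DEAD)
		return;		/* already killed */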
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    
    percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
    
    ioctx_alloc() reaches inside percpu_ref and directly frees
    ->pcpu_count in its failure path, which is quite gross.  percpu_ref
    has been providing a proper interface to do this,
    percpu_ref_cancel_init(), for quite some time now.  Let's use that
    instead.
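
The failure path changes along these lines (sketch; ctx is the kioctx
being torn down):

	/* before: poking at percpu_ref internals */
	free_percpu(ctx->users.pcpu_count);

	/* after: the dedicated teardown interface */
	percpu_ref_cancel_init(&ctx->users);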
    
    This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    
    workqueue: stronger test in process_one_work()
    
    After the recent changes, when POOL_DISASSOCIATED is cleared, the
    running worker's local CPU should be the same as pool->cpu without any
    exception even during cpu-hotplug.  Update the sanity check in
    process_one_work() accordingly.
    
    This patch changes "(proposition_A && proposition_B && proposition_C)"
    to "(proposition_B && proposition_C)", so if the old compound
    proposition is true, the new one must be true too. so this will not
    hide any possible bug which can be caught by the old test.
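
A sketch of the strengthened check, with the POOL_DISASSOCIATED term
(proposition_A) dropped:

	/* before */
	WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
		     !(worker->flags & WORKER_UNBOUND) &&
		     raw_smp_processor_id() != pool->cpu);

	/* after */
	WARN_ON_ONCE(!(worker->flags & WORKER_UNBOUND) &&
		     raw_smp_processor_id() != pool->cpu);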
    
    tj: Minor updates to the description.
    
    CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
    CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
    
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    
    The commit a9ab775b ("workqueue: directly restore CPU affinity of
    workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
    without also moving "pool->flags &= ~POOL_DISASSOCIATED".
    
    There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
    being moved together, but there isn't any benefit either. We move it
    into rebind_workers() and achieve these benefits:
    
    1) Better readability.  POOL_DISASSOCIATED is cleared in
       rebind_workers() as expected.
    
    2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
       running workers of the pool are on the local CPU (pool->cpu).
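
Sketch of the resulting placement inside rebind_workers() (most of the
surrounding code elided):

	spin_lock_irq(&pool->lock);

	/*
	 * All running workers of the pool are guaranteed to be on
	 * pool->cpu once the flag is cleared here.
	 */
	pool->flags &= ~POOL_DISASSOCIATED;

	for_each_pool_worker(worker, pool) {
		/* ... adjust worker flags for concurrency management ... */
	}

	spin_unlock_irq(&pool->lock);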
    
    tj: Cosmetic updates to the code and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: Use ALIGN macro instead of hand coding alignment calculation
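
The change is of this general shape (an illustrative sketch, not the
exact expression from the allocator):

	/* before: hand-coded alignment arithmetic */
	off = (off + align - 1) & ~(align - 1);

	/* after: the intent is explicit */
	off = ALIGN(off, align);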
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
    
    percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
    
__verify_pcpu_ptr() is used by percpu accessor and operation
implementations to verify that the specified parameter is actually a
percpu pointer.  Currently, where it's called isn't clearly defined
and we just ensure that it's invoked at least once for all accessors
and operations.
    
    The lack of clarity on when it should be called isn't nice and given
    that this is a completely generic issue, there's no reason to make
    archs worry about it.
    
    This patch updates __verify_pcpu_ptr() invocations such that it's
    always invoked from the final generic wrapper once per access or
    operation.  As this is already the case for {raw|this}_cpu_*()
    definitions through __pcpu_size_*(), only the {raw|per|this}_cpu_ptr()
    accessors need to be updated.
    
    This change makes it unnecessary for archs to worry about
    __verify_pcpu_ptr().  x86's arch_raw_cpu_ptr() is updated accordingly.
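
For example, the SMP per_cpu_ptr() wrapper now performs the
verification itself (a sketch):

	#define per_cpu_ptr(ptr, cpu)						\
	({									\
		__verify_pcpu_ptr(ptr);						\
		SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));			\
	})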
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    
percpu: prettify percpu header files
    
    percpu macros are difficult to read.  It's partly because they're
    fairly complex but also because they simply lack visual and
    conventional consistency to an unusual degree.  The preceding patches
    tried to organize macro definitions consistently by their roles.  This
    patch makes the following cosmetic changes to improve overall
    readability.
    
* Use a consistent convention for multi-line macro definitions - "do {"
  or "({" are now put on their own lines and the line-continuation '\'
  characters are all aligned on the same column.

* Temp variables used inside macros are consistently given the "__"
  prefix.

* When a macro argument is passed to another macro or a function,
  putting extra parentheses around it doesn't help anything; don't put
  them.
    
    * _this_cpu_generic_*() are renamed to this_cpu_generic_*() so that
      they're consistent with raw_cpu_generic_*().
    
    * Reorganize raw_cpu_*() and this_cpu_*() definitions so that trivial
      wrappers are collected in one place after actual operation
      definitions.
    
    * Other misc cleanups including reorganizing comments.
    
    All changes in this patch are cosmetic and cause no functional
    difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: use raw_cpu_*() to define __this_cpu_*()
    
__this_cpu_*() operations are the same as raw_cpu_*() operations
except for the added __this_cpu_preempt_check().  Curiously, these
were defined using __pcpu_size_call_*() instead of being layered on
top of raw_cpu_*().
    
    Let's layer them so that __this_cpu_*() are defined in terms of
    raw_cpu_*().  It's simpler and less error-prone this way.
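
Sketch of the layering for one operation:

	#define __this_cpu_read(pcp)						\
	({									\
		__this_cpu_preempt_check("read");				\
		raw_cpu_read(pcp);						\
	})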
    
    This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: reorder macros in percpu header files
    
    * In include/asm-generic/percpu.h, collect {raw|_this}_cpu_generic*()
      macros into one place.  They were dispersed through
  {raw|this}_cpu_*_N() definitions and the visual inconsistency was
      making following the code unnecessarily difficult.
    
    * In include/linux/percpu-defs.h, move __verify_pcpu_ptr() later in
      the file so that it's right above accessor definitions where it's
      actually used.
    
    This is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
    
    We're in the process of moving all percpu accessors and operations to
    include/linux/percpu-defs.h so that they're available to arch headers
    without having to include full include/linux/percpu.h which may cause
    cyclic inclusion dependency.
    
    This patch moves {raw|this}_cpu_*() definitions from
    include/linux/percpu.h to include/linux/percpu-defs.h.  The code is
    moved mostly verbatim; however, raw_cpu_*() are placed above
    this_cpu_*() which is more conventional as the raw operations may be
used to define other variants.
    
    This is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
    
    {raw|this}_cpu_*_N() operations are expected to be provided by archs
    and the generic definitions are provided as fallbacks.  As such, these
    firmly belong to include/asm-generic/percpu.h.
    
    Move the generic definitions to include/asm-generic/percpu.h.  The
    code is moved mostly verbatim; however, raw_cpu_*_N() are placed above
    this_cpu_*_N() which is more conventional as the raw operations may be
used to define other variants.
    
    This is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
    
Currently, percpu allows two separate methods for overriding
{raw|this}_cpu_*() ops - for a given operation, an arch can provide a
whole replacement or sized sub-operations to override specific parts
of it.  e.g. an arch can provide either this_cpu_add() or
this_cpu_add_4() to override only the 4 byte operation.
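
Roughly, the whole-operation override worked through an #ifndef guard,
which is what goes away (a sketch):

	/* before: an arch could replace the whole operation */
	#ifndef this_cpu_add
	#define this_cpu_add(pcp, val)	__pcpu_size_call(this_cpu_add_, (pcp), (val))
	#endif

	/* after: only the sized slots (this_cpu_add_1/2/4/8) are overridable */
	#define this_cpu_add(pcp, val)	__pcpu_size_call(this_cpu_add_, (pcp), (val))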
    
While quite flexible at a glance, the dual-override scheme
complicates the code path for no actual gain.  It complicates the
already complex operation definitions, and if an arch wants to
override all sizes, it can easily provide all variants anyway.  In
fact, no arch is actually making use of whole operation override.
    
Another oddity is that __this_cpu_*() operations are defined in the
same way as raw_cpu_*() but ignore full overrides of raw_cpu_*() and
don't allow full operation override themselves, so if an arch provides
whole overrides for raw_cpu_*() operations, __this_cpu_*() ends up
using the generic implementations.
    
    More importantly, it takes away the layering between arch-specific and
    generic parts making it impossible for the generic part to implement
    arch-independent features on top of arch-specific overrides.
    
    This patch removes the support for whole operation overrides.  As no
    arch is using it, this doesn't cause any actual difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: reorganize include/linux/percpu-defs.h
    
    Reorganize for better readability.
    
    * Accessor definitions are collected into one place and SMP and UP now
      define them in the same order.
    
    * Definitions are layered when possible - e.g. per_cpu() is now
      defined in terms of this_cpu_ptr().
    
* A rather pointless comment is dropped.
    
    * per_cpu(), __raw_get_cpu_var() and __get_cpu_var() are defined in a
      way which can be shared between SMP and UP and moved out of
      CONFIG_SMP blocks.
    
    This patch doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    
    percpu: move accessors from include/linux/percpu.h to percpu-defs.h
    
    include/linux/percpu-defs.h is gonna host all accessors and operations
    so that arch headers can make use of them too without worrying about
    circular dependency through include/linux/percpu.h.
    
    This patch moves the following accessors from include/linux/percpu.h
    to include/linux/percpu-defs.h.
    
    * get/put_cpu_var()
    * get/put_cpu_ptr()
    * per_cpu_ptr()
    
This is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
    
The roles of the various percpu header files have become unclear.
    There are four header files involved.
    
     include/linux/percpu-defs.h
     include/linux/percpu.h
     include/asm-generic/percpu.h
     arch/*/include/asm/percpu.h
    
The original intention for include/asm-generic/percpu.h was to provide
generic definitions for arch-overridable parts; however, it now hosts
various stuff which can't be overridden by archs.
    
Also, include/linux/percpu-defs.h was initially added to contain
section and percpu variable definition macros so that arch header
files could make use of them without worrying about introducing a
cyclic inclusion dependency by including include/linux/percpu.h;
however, arch headers sometimes need to access percpu variables too,
and this is one of the reasons why some accessors were implemented in
include/asm-generic/percpu.h.
    
    Let's clear up the situation by making include/asm-generic/percpu.h
    contain only arch-overridable parts and moving accessors and
operations into include/linux/percpu-defs.h.  Note that this patch only
    moves things from include/asm-generic/percpu.h.
    include/linux/percpu.h will be taken care of by later patches.
    
This patch moves the following.
    
    * SHIFT_PERCPU_PTR() / VERIFY_PERCPU_PTR()
    * per_cpu()
    * raw_cpu_ptr()
    * this_cpu_ptr()
    * __get_cpu_var()
    * __raw_get_cpu_var()
    * __this_cpu_ptr()
* PER_CPU_[SHARED_]ALIGNED_SECTION
    * PER_CPU_FIRST_SECTION
    
    This patch is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    percpu: introduce arch_raw_cpu_ptr()
    
    Currently, archs can override raw_cpu_ptr() directly; however, we
    wanna build a layer of indirection in the generic part of percpu so
    that we can implement generic features there without affecting archs.
    
    Introduce arch_raw_cpu_ptr() which is used to define raw_cpu_ptr() by
    generic percpu code.  The two are identical for now.  x86 is currently
    the only arch which overrides raw_cpu_ptr() and is converted to
    define arch_raw_cpu_ptr() instead.
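
Sketch of the indirection:

	/* generic percpu code; an arch may provide only arch_raw_cpu_ptr() */
	#ifndef arch_raw_cpu_ptr
	#define arch_raw_cpu_ptr(ptr)	SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
	#endif

	#define raw_cpu_ptr(ptr)	arch_raw_cpu_ptr(ptr)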
    
    This doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    
    percpu: disallow archs from overriding SHIFT_PERCPU_PTR()
    
    It has been about half a decade since all archs started using the
    dynamic percpu allocator and thus the same SHIFT_PERCPU_PTR()
    implementation.  There's no benefit in overriding SHIFT_PERCPU_PTR()
    anymore.
    
    Remove #ifndef around it to clarify that this is identical regardless
    of the arch.
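
The definition, no longer wrapped in #ifndef:

	#define SHIFT_PERCPU_PTR(__p, __offset)					\
		RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))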
    
    This patch doesn't cause any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
    
    (cherry picked from commit 2a1b4cf2
    0b462c89)
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>