1. 09 Jul, 2019 5 commits
  2. 05 Jul, 2019 1 commit
    • Dennis Zhou's avatar
      blk-iolatency: fix STS_AGAIN handling · c9b3007f
      Dennis Zhou authored
      The iolatency controller is based on rq_qos. It increments on
      rq_qos_throttle() and decrements on either rq_qos_cleanup() or
      rq_qos_done_bio(). a3fb01ba fixes the double accounting issue where
      blk_mq_make_request() may call both rq_qos_cleanup() and
      rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the
      double decrement.
      
      The above works upstream as the only way we can get STS_AGAIN is from
      blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
      problem as bio_endio() skipping only happens on reserved tag allocation
      failures which can only be caused by driver bugs and already triggers
      WARN.
      
      However, the fix creates a not so great dependency on how STS_AGAIN can
      be propagated. Internally, we (Facebook) carry a patch that kills read
      ahead if a cgroup is io congested or a fatal signal is pending. This
      combined with chained bios progagate their bi_status to the parent is
      not already set can can cause the parent bio to not clean up properly
      even though it was successful. This consequently leaks the inflight
      counter and can hang all IOs under that blkg.
      
      To nip the adverse interaction early, this removes the rq_qos_cleanup()
      callback in iolatency in favor of cleaning up always on the
      rq_qos_done_bio() path.
      
      Fixes: a3fb01ba ("blk-iolatency: only account submitted bios")
      Debugged-by: default avatarTejun Heo <tj@kernel.org>
      Debugged-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDennis Zhou <dennis@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c9b3007f
  3. 03 Jul, 2019 3 commits
  4. 01 Jul, 2019 3 commits
    • Pavel Begunkov's avatar
      sbitmap: Replace cmpxchg with xchg · 41723288
      Pavel Begunkov authored
      cmpxchg() with an immediate value could be replaced with less expensive
      xchg(). The same true if new value don't _depend_ on the old one.
      
      In the second block, atomic_cmpxchg() return value isn't checked, so
      after atomic_cmpxchg() ->  atomic_xchg() conversion it could be replaced
      with atomic_set(). Comparison with atomic_read() in the second chunk was
      left as an optimisation (if that was the initial intention).
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      41723288
    • Ming Lei's avatar
      block: fix .bi_size overflow · 79d08f89
      Ming Lei authored
      'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1
      bytes.
      
      Before 07173c3e ("block: enable multipage bvecs"), one bio can
      include very limited pages, and usually at most 256, so the fs bio
      size won't be bigger than 1M bytes most of times.
      
      Since we support multi-page bvec, in theory one fs bio really can
      be added > 1M pages, especially in case of hugepage, or big writeback
      with too many dirty pages. Then there is chance in which .bi_size
      is overflowed.
      
      Fixes this issue by using bio_full() to check if the added segment may
      overflow .bi_size.
      
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3e ("block: enable multipage bvecs")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79d08f89
    • Jens Axboe's avatar
      Merge tag 'v5.2-rc6' into for-5.3/block · 5be1f9d8
      Jens Axboe authored
      Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
      fix. Otherwise we end up having conflicts with future patches between
      for-5.3/block and master that touch this area. In particular, it makes
      the bio_full() fix hard to backport to stable.
      
      * tag 'v5.2-rc6': (482 commits)
        Linux 5.2-rc6
        Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
        Bluetooth: Fix regression with minimum encryption key size alignment
        tcp: refine memory limit test in tcp_fragment()
        x86/vdso: Prevent segfaults due to hoisted vclock reads
        SUNRPC: Fix a credential refcount leak
        Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
        net :sunrpc :clnt :Fix xps refcount imbalance on the error path
        NFS4: Only set creation opendata if O_CREAT
        ARM: 8867/1: vdso: pass --be8 to linker if necessary
        KVM: nVMX: reorganize initial steps of vmx_set_nested_state
        KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
        habanalabs: use u64_to_user_ptr() for reading user pointers
        nfsd: replace Jeff by Chuck as nfsd co-maintainer
        inet: clear num_timeout reqsk_alloc()
        PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
        net: mvpp2: debugfs: Add pmap to fs dump
        ipv6: Default fib6_type to RTN_UNICAST when not set
        net: hns3: Fix inconsistent indenting
        net/af_iucv: always register net_device notifier
        ...
      5be1f9d8
  5. 29 Jun, 2019 21 commits
  6. 28 Jun, 2019 7 commits
    • Douglas Anderson's avatar
      block, bfq: NULL out the bic when it's no longer valid · dbc3117d
      Douglas Anderson authored
      In reboot tests on several devices we were seeing a "use after free"
      when slub_debug or KASAN was enabled.  The kernel complained about:
      
        Unable to handle kernel paging request at virtual address 6b6b6c2b
      
      ...which is a classic sign of use after free under slub_debug.  The
      stack crawl in kgdb looked like:
      
       0  test_bit (addr=<optimized out>, nr=<optimized out>)
       1  bfq_bfqq_busy (bfqq=<optimized out>)
       2  bfq_select_queue (bfqd=<optimized out>)
       3  __bfq_dispatch_request (hctx=<optimized out>)
       4  bfq_dispatch_request (hctx=<optimized out>)
       5  0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
       6  0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
       7  0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
       8  0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
       9  0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
       10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
      
      Digging in kgdb, it could be found that, though bfqq looked fine,
      bfqq->bic had been freed.
      
      Through further digging, I postulated that perhaps it is illegal to
      access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
      because the "bic" can be freed at some point in time after this call
      is made.  I confirmed that there certainly were cases where the exact
      crashing code path would access the "bic" after bfq_exit_icq() had
      been called.  Sspecifically I set the "bfqq->bic" to (void *)0x7 and
      saw that the bic was 0x7 at the time of the crash.
      
      To understand a bit more about why this crash was fairly uncommon (I
      saw it only once in a few hundred reboots), you can see that much of
      the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't
      access the ->bic anymore.  The only case it doesn't is if
      bfq_put_queue() sees a reference still held.
      
      However, even in the case when bfqq isn't freed, the crash is still
      rare.  Why?  I tracked what happened to the "bic" after the exit
      routine.  It doesn't get freed right away.  Rather,
      put_io_context_active() eventually called put_io_context() which
      queued up freeing on a workqueue.  The freeing then actually happened
      later than that through call_rcu().  Despite all these delays, some
      extra debugging showed that all the hoops could be jumped through in
      time and the memory could be freed causing the original crash.  Phew!
      
      To make a long story short, assuming it truly is illegal to access an
      icq after the "exit_icq" callback is finished, this patch is needed.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarPaolo Valente <paolo.valente@unimore.it>
      Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dbc3117d
    • Coly Li's avatar
      bcache: add reclaimed_journal_buckets to struct cache_set · dff90d58
      Coly Li authored
      Now we have counters for how many times jouranl is reclaimed, how many
      times cached dirty btree nodes are flushed, but we don't know how many
      jouranl buckets are really reclaimed.
      
      This patch adds reclaimed_journal_buckets into struct cache_set, this
      is an increasing only counter, to tell how many journal buckets are
      reclaimed since cache set runs. From all these three counters (reclaim,
      reclaimed_journal_buckets, flush_write), we can have idea how well
      current journal space reclaim code works.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dff90d58
    • Coly Li's avatar
      bcache: performance improvement for btree_flush_write() · 91be66e1
      Coly Li authored
      This patch improves performance for btree_flush_write() in following
      ways,
      - Use another spinlock journal.flush_write_lock to replace the very
        hot journal.lock. We don't have to use journal.lock here, selecting
        candidate btree nodes takes a lot of time, hold journal.lock here will
        block other jouranling threads and drop the overall I/O performance.
      - Only select flushing btree node from c->btree_cache list. When the
        machine has a large system memory, mca cache may have a huge number of
        cached btree nodes. Iterating all the cached nodes will take a lot
        of CPU time, and most of the nodes on c->btree_cache_freeable and
        c->btree_cache_freed lists are cleared and have need to flush. So only
        travel mca list c->btree_cache to select flushing btree node should be
        enough for most of the cases.
      - Don't iterate whole c->btree_cache list, only reversely select first
        BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from
        c->btree_cache and select the oldest journal pin btree nodes consumes
        huge number of CPU cycles if the list is huge (push and pop a node
        into/out of a heap is expensive). The last several dirty btree nodes
        on the tail of c->btree_cache list are earlest allocated and cached
        btree nodes, they are relative to the oldest journal pin btree nodes.
        Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of
        c->btree_cache probably includes the oldest journal pin btree nodes.
      
      In my testing, the above change decreases 50%+ CPU consumption when
      journal space is full. Some times IOPS drops to 0 for 5-8 seconds,
      comparing blocking I/O for 120+ seconds in previous code, this is much
      better. Maybe there is room to improve in future, but at this momment
      the fix looks fine and performs well in my testing.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      91be66e1
    • Coly Li's avatar
      bcache: fix race in btree_flush_write() · 50a260e8
      Coly Li authored
      There is a race between mca_reap(), btree_node_free() and journal code
      btree_flush_write(), which results very rare and strange deadlock or
      panic and are very hard to reproduce.
      
      Let me explain how the race happens. In btree_flush_write() one btree
      node with oldest journal pin is selected, then it is flushed to cache
      device, the select-and-flush is a two steps operation. Between these two
      steps, there are something may happen inside the race window,
      - The selected btree node was reaped by mca_reap() and allocated to
        other requesters for other btree node.
      - The slected btree node was selected, flushed and released by mca
        shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, firstly
      b->write_lock is held by mutex_lock(). If the race happens and the
      memory of selected btree node is allocated to other btree node, if that
      btree node's write_lock is held already, a deadlock very probably
      happens here. A worse case is the memory of the selected btree node is
      released, then all references to this btree node (e.g. b->write_lock)
      will trigger NULL pointer deference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one-by-one in a quite long time period.
      
      Such race is not easy to reproduce before. On a Lenovo SR650 server with
      48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
      device assembled by 3 NVMe SSDs as backing device, this race can be
      observed around every 10,000 times btree_flush_write() gets called. Both
      deadlock and kernel panic all happened as aftermath of the race.
      
      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
      is set when selecting btree nodes, and cleared after btree nodes
      flushed. Then when mca_reap() selects a btree node with this bit set,
      this btree node will be skipped. Since mca_reap() only reaps btree node
      without BTREE_NODE_journal_flush flag, such race is avoided.
      
      Once corner case should be noticed, that is btree_node_free(). It might
      be called in some error handling code path. For example the following
      code piece from btree_split(),
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At line 2151 and 2155, the btree node n2 and n1 are released without
      mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such error handling path,
      and the selected btree node has BTREE_NODE_journal_flush bit set, just
      delay for 1 us and retry again. In this case this btree node won't
      be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
      and free the btree node memory.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reported-and-tested-by: default avatarkbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      50a260e8
    • Coly Li's avatar
      bcache: remove retry_flush_write from struct cache_set · d91ce757
      Coly Li authored
      In struct cache_set, retry_flush_write is added for commit c4dc2497
      ("bcache: fix high CPU occupancy during journal") which is reverted in
      previous patch.
      
      Now it is useless anymore, and this patch removes it from bcache code.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d91ce757
    • Coly Li's avatar
      bcache: add comments for mutex_lock(&b->write_lock) · 41508bb7
      Coly Li authored
      When accessing or modifying BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing BTREE_NODE_dirty bit there.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      41508bb7
    • Coly Li's avatar
      bcache: only clear BTREE_NODE_dirty bit when it is set · e5ec5f47
      Coly Li authored
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always set no matter btree node is dirty or not. The code looks like
      this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
      bit is cleared, then it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e5ec5f47