1. 14 Nov, 2022 7 commits
    • md/raid0, raid10: Don't set discard sectors for request queue · 8e1a2279
      Xiao Ni authored
      raid0/raid10 should use disk_stack_limits() to derive a proper
      max_discard_sectors rather than the stacking driver setting a value
      itself.
      
      There is also a bug: raid0/raid10 set max_discard_sectors even when
      all member disks are rotational devices. The members are not SSD/NVMe,
      yet raid0/raid10 export the wrong value, and __blkdev_issue_discard
      prints warnings like this when running mkfs.xfs:
      
      [ 4616.022599] ------------[ cut here ]------------
      [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0
      [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0
      [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7
      [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246
      [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080
      [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080
      [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000
      [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000
      [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080
      [ 4616.213259] FS:  00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000
      [ 4616.222298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0
      [ 4616.236689] Call Trace:
      [ 4616.239428]  blkdev_issue_discard+0x52/0xb0
      [ 4616.244108]  blkdev_common_ioctl+0x43c/0xa00
      [ 4616.248883]  blkdev_ioctl+0x116/0x280
      [ 4616.252977]  __x64_sys_ioctl+0x8a/0xc0
      [ 4616.257163]  do_syscall_64+0x5c/0x90
      [ 4616.261164]  ? handle_mm_fault+0xc5/0x2a0
      [ 4616.265652]  ? do_user_addr_fault+0x1d8/0x690
      [ 4616.270527]  ? do_syscall_64+0x69/0x90
      [ 4616.274717]  ? exc_page_fault+0x62/0x150
      [ 4616.279097]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [ 4616.284748] RIP: 0033:0x7f9a55398c6b
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      8e1a2279
    • md/bitmap: Fix bitmap chunk size overflow issues · 45552111
      Florian-Ewald Mueller authored
      - limit the internal u64 bitmap chunk size variable to values that do
        not overflow the u32 field stored in the on-disk bitmap superblock
      - assign the internal u64 chunk size from unsigned values to avoid
        possible sign-extension artifacts when converting from an s32 value
      
      The bug has been there since at least kernel 4.0.
      Steps to reproduce it:
      1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2
      -n2 /dev/rnbd1 /dev/rnbd2
      2: resize member devices rnbd1 and rnbd2 to 8 TB
      3: mdadm --grow /dev/mdx --size=max
      
      Without this patch, bitmap_chunksize will overflow.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Song Liu <song@kernel.org>
      45552111
    • md: introduce md_ro_state · f97a5528
      Ye Bin authored
      Introduce md_ro_state for mddev->ro, so it is easy to understand.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      f97a5528
    • md: factor out __md_set_array_info() · 2f6d261e
      Ye Bin authored
      Factor out __md_set_array_info(). No functional change.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      2f6d261e
    • lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE · 42271ca3
      Giulio Benetti authored
      RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.
      Signed-off-by: Giulio Benetti <giulio.benetti@benettiengineering.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      42271ca3
    • raid5-cache: use try_cmpxchg in r5l_wake_reclaim · 9487a0f6
      Uros Bizjak authored
      Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
      r5l_wake_reclaim. The x86 CMPXCHG instruction returns success in the
      ZF flag, so this change saves a compare after cmpxchg (and a related
      move instruction in front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      
      Cc: Song Liu <song@kernel.org>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
      9487a0f6
    • drivers/md/md-bitmap: check the return value of md_bitmap_get_counter() · 3bd548e5
      Li Zhong authored
      Check the return value of md_bitmap_get_counter() in case it returns
      NULL pointer, which will result in a null pointer dereference.
      
      v2: update the check to cover the other dereference
      Signed-off-by: Li Zhong <floridsleeves@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
      3bd548e5
  2. 11 Nov, 2022 1 commit
    • sbitmap: Use single per-bitmap counting to wake up queued tags · 4f8126bb
      Gabriel Krisman Bertazi authored
      sbitmap suffers from code complexity, as demonstrated by recent fixes,
      and occasional lost wakeups on nested I/O completion.  The latter
      happens, from what I understand, due to the non-atomic nature of the
      updates to wait_cnt, which needs to be subtracted and eventually reset
      when equal to zero.  This two-step process can miss an update when a
      nested completion happens to interrupt the CPU between the wait_cnt
      updates.  This is very hard to fix, as shown by the recent changes to
      this code.
      
      The code complexity arises mostly from the corner cases to avoid missed
      wakes in this scenario.  In addition, the handling of wake_batch
      recalculation plus the synchronization with sbq_queue_wake_up is
      non-trivial.
      
      This patchset implements the idea originally proposed by Jan [1], which
      removes the need for the two-step updates of wait_cnt.  This is done by
      tracking the number of completions and wakeups in always increasing,
      per-bitmap counters.  Instead of having to reset the wait_cnt when it
      reaches zero, we simply keep counting, and attempt to wake up N threads
      in a single wait queue whenever there is enough space for a batch.
      Waking up fewer than wake_batch threads shouldn't be a problem, because
      we haven't changed the conditions for wakeup, and the existing batch
      calculation guarantees at least enough remaining completions to wake up
      a batch for each queue at any time.
      
      Performance-wise, one should expect very similar performance to the
      original algorithm for the case where there is no queueing.  In both the
      old algorithm and this implementation, the first thing is to check
      ws_active, which bails out if there is no queueing to be managed. In the
      new code, we took care to avoid accounting completions and wakeups when
      there is no queueing, to not pay the cost of atomic operations
      unnecessarily, since it doesn't skew the numbers.
      
      For more interesting cases, where there is queueing, we need to take
      into account the cross-communication of the atomic operations.  I've
      been benchmarking by running parallel fio jobs against a single hctx
      nullb in different hardware queue depth scenarios, and verifying both
      IOPS and queueing.
      
      Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
      jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
      varying only the hardware queue length per test.
      
      queue size 2                 4                 8                 16                 32                 64
      6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
      patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
      
      The following is a similar experiment, run against a nullb with a single
      bitmap shared by 20 hctx spread across 2 NUMA nodes, with 40 parallel
      fio jobs operating on the same device.
      
      queue size 2                 4                 8                 16                 32                 64
      6.1-rc2    1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
      patched    1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
      
      It has also survived blktests and a 12h-stress run against nullb. I also
      ran the code against nvme and a scsi SSD, and I didn't observe
      performance regression in those. If there are other tests you think I
      should run, please let me know and I will follow up with results.
      
      [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Liu Song <liusong@linux.alibaba.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f8126bb
  3. 10 Nov, 2022 2 commits
  4. 09 Nov, 2022 14 commits
  5. 07 Nov, 2022 1 commit
  6. 02 Nov, 2022 15 commits