1. 06 Sep, 2024 11 commits
  2. 05 Sep, 2024 1 commit
  3. 04 Sep, 2024 3 commits
  4. 03 Sep, 2024 5 commits
    • MAINTAINERS: move the BFQ io scheduler to orphan state · 761e5afb
      Jens Axboe authored
      Nobody is maintaining this code, and it just falls under the umbrella
      of block layer code. But at least mark it as such, in case anyone wants
      to care more deeply about it and assume the responsibility of doing so.
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      761e5afb
    • block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move() · f45916ae
      Yu Kuai authored
      Use the helper instead of open coding it. There are no functional changes.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f45916ae
    • block, bfq: don't break merge chain in bfq_split_bfqq() · 42c306ed
      Yu Kuai authored
      Consider the following scenario:
      
          Process 1       Process 2       Process 3       Process 4
           (BIC1)          (BIC2)          (BIC3)          (BIC4)
            Λ               |               |                |
             \-------------\ \-------------\ \--------------\|
                            V               V                V
            bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
      ref    0              1               2                4
      
      Suppose Process 1 issues a new IO, bfqq2 is found, and bfq_init_rq()
      then decides to split bfqq2 with bfq_split_bfqq(). However, the process
      reference count of bfqq2 is 1, so bfq_split_bfqq() just clears the coop
      flag, which breaks the merge chain.
      
      Expected result: caller will allocate a new bfqq for BIC1
      
          Process 1       Process 2       Process 3       Process 4
           (BIC1)          (BIC2)          (BIC3)          (BIC4)
                            |               |                |
                             \-------------\ \--------------\|
                                            V                V
            bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
      ref    0              0               1                3
      
      The condition is only meant for the last queue, bfqq4, once the previous
      bfqq2 and bfqq3 have already been split. Fix the problem by also checking
      whether bfqq is the last one in the merge chain.
      
      Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      42c306ed
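The condition described in this commit can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions, not the kernel code: `struct toy_bfqq` and `toy_can_reuse_on_split()` are hypothetical names, and the struct models just the two properties the message talks about (the process reference count and the link to the next queue in the merge chain).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a BFQ queue; NOT the kernel's struct bfq_queue. */
struct toy_bfqq {
    int process_refs;          /* processes still pointing at this queue */
    struct toy_bfqq *new_bfqq; /* next queue in the merge chain, or NULL */
};

/*
 * A queue can simply be handed back to its single remaining process only
 * if it is also the tail of the merge chain. For a queue in the middle of
 * the chain, the caller should allocate a fresh queue instead, so that the
 * chain stays linked.
 */
static bool toy_can_reuse_on_split(const struct toy_bfqq *q)
{
    assert(q != NULL);
    return q->process_refs == 1 && q->new_bfqq == NULL;
}
```

With the chain from the first diagram, bfqq2 has one process reference but still points at bfqq3, so this check rejects it, while a queue with one reference and no successor passes.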
    • block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator() · 0e456dba
      Yu Kuai authored
      Consider the following merge chain:
      
      Process 1       Process 2       Process 3       Process 4
       (BIC1)          (BIC2)          (BIC3)          (BIC4)
        Λ                |               |               |
         \--------------\ \-------------\ \-------------\|
                         V               V               V
        bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
      
      IO from Process 1 will get bfqq2 from BIC1 first, then
      bfq_setup_cooperator() will find that bfqq2 is already merged to bfqq3
      and handle this IO from bfqq3. However, the merge chain can be much
      deeper, and bfqq3 can be merged to another bfqq as well.
      
      Fix this problem by iterating to the last bfqq in
      bfq_setup_cooperator().
      
      Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0e456dba
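Iterating to the last queue of a merge chain, as this commit describes, amounts to following a singly linked list to its tail. The sketch below is a minimal illustration with hypothetical names (`struct chain_node`, `chain_last()`); it is not the code in bfq_setup_cooperator(), and it assumes the chain is acyclic.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node: only the merge-chain link is modelled (hypothetical type). */
struct chain_node {
    struct chain_node *new_bfqq; /* queue this one was merged into, or NULL */
};

/* Follow the chain until a node that has not been merged any further. */
static struct chain_node *chain_last(struct chain_node *q)
{
    assert(q != NULL);
    while (q->new_bfqq != NULL)
        q = q->new_bfqq;
    return q;
}
```

For the chain bfqq1 -> bfqq2 -> bfqq3 -> bfqq4, starting from bfqq2 yields bfqq4 rather than stopping at bfqq3.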
    • block, bfq: fix possible UAF for bfqq->bic with merge chain · 18ad4df0
      Yu Kuai authored
      1) initial state, three tasks:
      
      		Process 1       Process 2	Process 3
      		 (BIC1)          (BIC2)		 (BIC3)
      		  |  Λ            |  Λ		  |  Λ
      		  |  |            |  |		  |  |
      		  V  |            V  |		  V  |
      		  bfqq1           bfqq2		  bfqq3
      process ref:	   1		    1		    1
      
      2) bfqq1 merged to bfqq2:
      
      		Process 1       Process 2	Process 3
      		 (BIC1)          (BIC2)		 (BIC3)
      		  |               |		  |  Λ
      		  \--------------\|		  |  |
      		                  V		  V  |
      		  bfqq1--------->bfqq2		  bfqq3
      process ref:	   0		    2		    1
      
      3) bfqq2 merged to bfqq3:
      
      		Process 1       Process 2	Process 3
      		 (BIC1)          (BIC2)		 (BIC3)
      	 here -> Λ                |		  |
      		  \--------------\ \-------------\|
      		                  V		  V
      		  bfqq1--------->bfqq2---------->bfqq3
      process ref:	   0		    1		    3
      
      In this case, IO from Process 1 will get bfqq2 from BIC1 first, then
      get bfqq3 through the merge chain, and finally be handled by bfqq3.
      However, the current code will think bfqq2 is owned by BIC1, as in the
      initial state, and set bfqq2->bic to BIC1.
      
      bfq_insert_request
      -> by Process 1
       bfqq = bfq_init_rq(rq)
        bfqq = bfq_get_bfqq_handle_split
         bfqq = bic_to_bfqq
         -> get bfqq2 from BIC1
       bfqq->ref++
       rq->elv.priv[0] = bic
       rq->elv.priv[1] = bfqq
       if (bfqq_process_refs(bfqq) == 1)
        bfqq->bic = bic
        -> record BIC1 to bfqq2
      
        __bfq_insert_request
         new_bfqq = bfq_setup_cooperator
         -> get bfqq3 from bfqq2->new_bfqq
         bfqq_request_freed(bfqq)
         new_bfqq->ref++
         rq->elv.priv[1] = new_bfqq
         -> handle IO by bfqq3
      
      Fix the problem by first checking whether bfqq comes from a merge chain.
      This might also fix the following problem reported by our syzkaller
      (unreproducible):
      
      ==================================================================
      BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
      BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
      BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
      Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595
      
      CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G             L     6.6.0-07439-gba2303cacfda #6
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Workqueue: kblockd blk_mq_requeue_work
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
       print_address_description mm/kasan/report.c:364 [inline]
       print_report+0x10d/0x610 mm/kasan/report.c:475
       kasan_report+0x8e/0xc0 mm/kasan/report.c:588
       bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
       bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
       bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
       bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
       bfq_init_rq block/bfq-iosched.c:6876 [inline]
       bfq_insert_request block/bfq-iosched.c:6254 [inline]
       bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
       blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
       blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
       process_one_work kernel/workqueue.c:2627 [inline]
       process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
       worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
       kthread+0x33c/0x440 kernel/kthread.c:388
       ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
       </TASK>
      
      Allocated by task 20776:
       kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
       kasan_set_track+0x25/0x30 mm/kasan/common.c:52
       __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
       kasan_slab_alloc include/linux/kasan.h:188 [inline]
       slab_post_alloc_hook mm/slab.h:763 [inline]
       slab_alloc_node mm/slub.c:3458 [inline]
       kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
       ioc_create_icq block/blk-ioc.c:370 [inline]
       ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
       bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
       blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
       __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
       blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
       blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
       __submit_bio+0x331/0x6f0 block/blk-core.c:624
       __submit_bio_noacct_mq block/blk-core.c:703 [inline]
       submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
       submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
       xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
       xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
       xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
       process_one_work kernel/workqueue.c:2627 [inline]
       process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
       worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
       kthread+0x33c/0x440 kernel/kthread.c:388
       ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
      
      Freed by task 946:
       kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
       kasan_set_track+0x25/0x30 mm/kasan/common.c:52
       kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
       ____kasan_slab_free mm/kasan/common.c:236 [inline]
       __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
       kasan_slab_free include/linux/kasan.h:164 [inline]
       slab_free_hook mm/slub.c:1815 [inline]
       slab_free_freelist_hook mm/slub.c:1841 [inline]
       slab_free mm/slub.c:3786 [inline]
       kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
       rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
       rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
       __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553
      
      Last potentially related work creation:
       kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
       __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
       __call_rcu_common kernel/rcu/tree.c:2712 [inline]
       call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
       ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
       ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
       process_one_work kernel/workqueue.c:2627 [inline]
       process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
       worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
       kthread+0x33c/0x440 kernel/kthread.c:388
       ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
      
      Second to last potentially related work creation:
       kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
       __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
       __call_rcu_common kernel/rcu/tree.c:2712 [inline]
       call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
       ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
       ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
       process_one_work kernel/workqueue.c:2627 [inline]
       process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
       worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
       kthread+0x33c/0x440 kernel/kthread.c:388
       ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
      
      The buggy address belongs to the object at ffff888123839d68
       which belongs to the cache bfq_io_cq of size 1360
      The buggy address is located 336 bytes inside of
       freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)
      
      The buggy address belongs to the physical page:
      page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
      head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
      page_type: 0xffffffff()
      raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
      raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
       ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      ==================================================================
      
      Fixes: 36eca894 ("block, bfq: add Early Queue Merge (EQM)")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      18ad4df0
  5. 30 Aug, 2024 2 commits
  6. 29 Aug, 2024 9 commits
  7. 28 Aug, 2024 3 commits
    • Merge branch 'md-6.12-bitmap' into md-6.12 · 7f67fdae
      Song Liu authored
      From Yu Kuai (with minor changes by Song Liu):
      
      The background is that currently bitmap is using a global spin_lock,
      causing lock contention and huge IO performance degradation for all raid
      levels.
      
      However, a new lock-free bitmap cannot be implemented while md-bitmap
      exposes its internal implementation through many exported APIs. Hence
      bitmap_operations is introduced to describe the bitmap core
      implementation. A new bitmap can then be added with its own
      bitmap_operations, and we only need to switch to the new one during
      initialization.
      
      And with this we can build bitmap as kernel module, but that's not
      our concern for now.
      
      This version was tested with mdadm tests and lvm2 tests. This set does
      not introduce new errors in these tests.
      
      * md-6.12-bitmap: (42 commits)
        md/md-bitmap: make in memory structure internal
        md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
        md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
        md/md-bitmap: merge md_bitmap_free() into bitmap_operations
        md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
        md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
        md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
        md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
        md/md-bitmap: pass in mddev directly for md_bitmap_resize()
        md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
        md/md-bitmap: merge bitmap_unplug() into bitmap_operations
        md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
        md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
        md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
        md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
        md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
        md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()
        md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations
        md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations
        md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations
        ...
      Signed-off-by: Song Liu <song@kernel.org>
      7f67fdae
    • block/rnbd-srv: Add sanity check and remove redundant assignment · f6f84be0
      Md Haris Iqbal authored
      The bio->bi_iter.bi_size is updated when bio_add_page() is called, so we
      do not need to assign msg->bi_size to it again; doing so is redundant and
      can also be harmful. Instead, we can use it for a sanity check that
      compares the locally calculated bi_size with the one sent in msg.
      Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
      Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f6f84be0
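The idea of checking a locally accumulated size against the size declared by the peer can be sketched in plain C. This is a user-space illustration only: `toy_total_len()` and `toy_size_matches()` are hypothetical helpers, the segment array stands in for pages added to a bio, and none of it is the rnbd-srv code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the segment lengths, the way a running size grows as data is added. */
static uint32_t toy_total_len(const uint32_t *seg_len, size_t nsegs)
{
    uint32_t total = 0;

    assert(seg_len != NULL || nsegs == 0);
    for (size_t i = 0; i < nsegs; i++)
        total += seg_len[i];
    return total;
}

/*
 * Sanity check: trust the locally computed size and only compare it with
 * the size declared in the message, instead of overwriting it.
 */
static bool toy_size_matches(const uint32_t *seg_len, size_t nsegs,
                             uint32_t declared)
{
    return toy_total_len(seg_len, nsegs) == declared;
}
```

A caller would reject the request when `toy_size_matches()` returns false, rather than copying the declared size over the computed one.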
    • md: Remove flush handling · b75197e8
      Yu Kuai authored
      For flush requests, md has special flush handling that merges concurrent
      flush requests into a single one. However, the whole mechanism is based
      on a disk level spin_lock 'mddev->lock'. fsync can be called quite often
      in some use cases, and as a consequence, taking a spin lock in the IO
      fast path can cause performance degradation.
      
      Fortunately, the block layer already has flush handling to merge
      concurrent flush requests, and it only acquires an hctx level spin lock
      (see details in blk-flush.c).
      
      This patch removes the flush handling in md, and converts to use general
      block layer flush handling in underlying disks.
      
      Flush test for 4 nvme raid10:
      start 128 threads to do fsync 100000 times, on arm64, see how long it
      takes.
      
      Test script:
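      #include <fcntl.h>
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>
      #include <unistd.h>
      
      /* Values taken from the test description above. */
      #define THREADS     128
      #define FSYNC_COUNT 100000
      
      void* thread_func(void* arg) {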
          int fd = *(int*)arg;
          for (int i = 0; i < FSYNC_COUNT; i++) {
              fsync(fd);
          }
          return NULL;
      }
      
      int main() {
          int fd = open("/dev/md0", O_RDWR);
          if (fd < 0) {
              perror("open");
              exit(1);
          }
      
          pthread_t threads[THREADS];
          struct timeval start, end;
      
          gettimeofday(&start, NULL);
      
          for (int i = 0; i < THREADS; i++) {
              pthread_create(&threads[i], NULL, thread_func, &fd);
          }
      
          for (int i = 0; i < THREADS; i++) {
              pthread_join(threads[i], NULL);
          }
      
          gettimeofday(&end, NULL);
      
          close(fd);
      
          long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
          printf("Elapsed time: %lld microseconds\n", elapsed);
      
          return 0;
      }
      
      Test result: about 10 times faster:
      Before this patch: 50943374 microseconds
      After this patch:  5096347  microseconds
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
      Signed-off-by: Song Liu <song@kernel.org>
      b75197e8
  8. 27 Aug, 2024 6 commits