02 Aug, 2022 (40 commits)
    • rnbd-clt: kill read_only from struct rnbd_clt_dev · 017d76f4
      Guoqing Jiang authored
      The member is not needed since we can call get_disk_ro to achieve the
      same goal.
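      As a rough sketch of the idea (illustrative only, not the actual
      patch; the wrapper name below is made up), the cached flag is replaced
      by querying the gendisk directly:

        #include <linux/blkdev.h>

        /* Sketch: ask the gendisk instead of caching a read_only flag. */
        static bool rnbd_dev_read_only(struct gendisk *disk)
        {
                /* get_disk_ro() reflects the state set via set_disk_ro() */
                return get_disk_ro(disk);
        }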
      Acked-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Link: https://lore.kernel.org/r/20220706133152.12058-4-guoqing.jiang@linux.dev
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rnbd-clt: don't free rsp in msg_open_conf for map scenario · 52334f4a
      Guoqing Jiang authored
      For map scenario, rsp is freed in two places:
      
      1. msg_open_conf frees rsp if rtrs_clt_request returns 0.
      
      2. Otherwise, rsp is freed by the call sites of rtrs_clt_request.
      
      Now, we'd like to control the full lifecycle of rsp in
      rnbd_clt_map_device; with that, it becomes feasible to pass rsp to
      rnbd_client_setup_device in the next commit.
      
      For case 1, it is possible to free rsp from the caller of send_usr_msg
      because of the synchronization on iu->comp.wait. And we put iu later
      in rnbd_clt_map_device to ensure the proper release order of rsp and
      iu.
      Acked-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Link: https://lore.kernel.org/r/20220706133152.12058-3-guoqing.jiang@linux.dev
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rnbd-clt: open code send_msg_open in rnbd_clt_map_device · 9ddae3ba
      Guoqing Jiang authored
      Let's open code it in rnbd_clt_map_device; then we can use information
      from rsp to set up the gendisk and request_queue in the next commits.
      After that, we can remove some members (wc, fua, max_hw_sectors, etc.)
      from struct rnbd_clt_dev.
      Acked-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Link: https://lore.kernel.org/r/20220706133152.12058-2-guoqing.jiang@linux.dev
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: null_blk: Use the bitmap API to allocate bitmaps · eb25ad80
      Christophe JAILLET authored
      Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.
      
      It is less verbose and it improves the semantics.
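      For illustration, the conversion pattern looks roughly like this
      (a sketch, not the null_blk diff itself; nbits and tags are
      placeholder names):

        #include <linux/bitmap.h>
        #include <linux/slab.h>

        /* Before: hand-rolled zeroed bitmap of nbits bits. */
        unsigned long *tags = kcalloc(BITS_TO_LONGS(nbits),
                                      sizeof(unsigned long), GFP_KERNEL);
        /* ... use the bitmap ... */
        kfree(tags);

        /* After: the bitmap API states the intent directly. */
        unsigned long *tags2 = bitmap_zalloc(nbits, GFP_KERNEL);
        /* ... use the bitmap ... */
        bitmap_free(tags2);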
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Link: https://lore.kernel.org/r/7c4d3116ba843fc4a8ae557dd6176352a6cd0985.1656864320.git.christophe.jaillet@wanadoo.fr
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: Fix spelling mistake in comments · 9e26728b
      Zhang Jiaming authored
      There are 2 spelling mistakes in comments. Fix them.
      Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Increase restriction on max segments per request · 9ad1a74f
      Logan Gunthorpe authored
      The block layer defaults the maximum segments to 128, which means
      requests tend to get split around 512KB, depending on how many
      pages can be merged. There's no such restriction in the raid5 code,
      so increase the limit to USHRT_MAX so that larger requests can be
      sent as one.
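      Presumably this boils down to a one-line limits change of this shape
      (sketch; mddev->queue is the md array's request queue):

        #include <linux/blkdev.h>

        /* Lift the per-request segment cap for the md queue. */
        blk_queue_max_segments(mddev->queue, USHRT_MAX);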
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Improve debug prints · df1b620a
      Logan Gunthorpe authored
      Add a debug print for raid5_make_request() so that each request is
      printed and add the logical sector number to the debug print in
      __add_stripe_bio().
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Pivot raid5_make_request() · 7e55c60a
      Logan Gunthorpe authored
      raid5_make_request() loops through every page in the request,
      finds the appropriate stripe and adds the bio for that page to the
      disk.
      
      This causes a great deal of contention on the hash_lock and extra
      work, since each stripe must be found once for every data disk.
      
      The number of times a stripe must be found can be reduced by pivoting
      raid5_make_request() so that it loops through every stripe and then
      loops through every disk in that stripe to see if the bio must be
      added. This reduces the number of times the hash lock must be taken
      by a factor equal to the number of data disks.
      
      To accomplish this, the logical sectors that have already been added
      must be tracked. Tracking them is done with a bitmap: the bits
      for all pages are set at the start of the request and each bit
      is cleared once the bio is added to a stripe.
      
      Finding the next sector to be done is then just a call to
      find_first_bit() so that sectors that have been done can simply be
      skipped.
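      A minimal sketch of that tracking scheme (the names here are
      illustrative, not the raid5 ones):

        #include <linux/bitmap.h>

        #define STRIPE_PAGES_MAX 256    /* assumed cap, see below */

        DECLARE_BITMAP(pages_to_do, STRIPE_PAGES_MAX);
        unsigned int next;

        /* Mark every stripe page of the request as pending. */
        bitmap_set(pages_to_do, 0, npages);

        /* Clear a page's bit once its bio has been added to a stripe. */
        clear_bit(page_idx, pages_to_do);

        /* Find the next pending page; finished ones are simply skipped. */
        next = find_first_bit(pages_to_do, STRIPE_PAGES_MAX);
        if (next >= STRIPE_PAGES_MAX)
                return;                 /* all pages handled */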
      
      One minor downside is that the maximum sectors for a request must be
      limited so that the bitmap can be appropriately sized on the stack.
      This limit is arbitrarily chosen to be 256 stripe pages, which works
      out to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
      restrict the maximum request further, since the default block queue
      settings are used, which restrict the number of segments to 128
      (resulting in request sizes of approximately 512KB).
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Check all disks in a stripe_head for reshape progress · 486f6055
      Logan Gunthorpe authored
      When testing if a previous stripe has had reshape expand past it, use
      the earliest or latest logical sector in all the disks for that stripe
      head. This will allow adding multiple disks at a time in a subsequent
      patch.
      
      To do this more cleanly, refactor the check into a helper function called
      stripe_ahead_of_reshape().
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Refactor add_stripe_bio() · 4ad1d984
      Logan Gunthorpe authored
      Factor out two helper functions from add_stripe_bio(): one to check for
      overlap (stripe_bio_overlaps()), and one to actually add the bio to the
      stripe (__add_stripe_bio()). The latter function will always succeed.
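      Based on that description, add_stripe_bio() presumably reduces to
      something like this (sketch; the signatures follow the text above):

        static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
                                   int dd_idx, int forwrite, int previous)
        {
                if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite))
                        return false;   /* caller must wait and retry */

                __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
                return true;            /* always succeeds */
        }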
      
      This will be useful in the next patch so that overlap can be checked for
      multiple disks before adding any.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Keep a reference to last stripe_head for batch · 3312e6c8
      Logan Gunthorpe authored
      When batching, every stripe head has to find the previous stripe head to
      add to the batch list. This involves taking the hash lock which is
      highly contended during IO.
      
      Instead of finding the previous stripe_head each time, store a
      reference to the previous stripe_head in a pointer so that it doesn't
      require taking the contended lock another time.
      
      The reference to the previous stripe must be released before scheduling
      and waiting for work to get done. Otherwise, it can hold up
      raid5_activate_delayed() and deadlock.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Refactor for loop in raid5_make_request() into while loop · 0a2d1694
      Logan Gunthorpe authored
      The for loop with retry label can be more cleanly expressed as a while
      loop by moving the logical_sector increment into the success path.
      
      No functional changes intended.
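      Schematically, the transformation has this shape (a generic sketch,
      not the raid5 code; try_to_handle() is a stand-in for the loop body):

        /* Before: a for loop re-entered through a retry label. */
        for (; logical_sector < last_sector;
             logical_sector += RAID5_STRIPE_SECTORS(conf)) {
        retry:
                if (!try_to_handle(logical_sector))
                        goto retry;
        }

        /* After: a while loop; the increment moves to the success path. */
        while (logical_sector < last_sector) {
                if (!try_to_handle(logical_sector))
                        continue;       /* retry the same sector */
                logical_sector += RAID5_STRIPE_SECTORS(conf);
        }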
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Move read_seqcount_begin() into make_stripe_request() · 4f354560
      Logan Gunthorpe authored
      Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
      into make_stripe_request().
      
      No functional changes intended.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Drop the do_prepare flag in raid5_make_request() · 1cdb5b41
      Logan Gunthorpe authored
      prepare_to_wait() can be reasonably called after schedule() instead of
      setting a flag and preparing in the next loop iteration.
      
      This means that prepare_to_wait() will be called before
      read_seqcount_begin(), but there shouldn't be any reason that the order
      matters here. On the first iteration of the loop prepare_to_wait() is
      already called first.
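      The resulting wait pattern is roughly (sketch; must_retry() is an
      illustrative stand-in for the real retry condition):

        #include <linux/wait.h>

        DEFINE_WAIT(w);

        prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
        while (must_retry(logical_sector)) {
                schedule();
                /* called right after schedule(), no do_prepare flag */
                prepare_to_wait(&conf->wait_for_overlap, &w,
                                TASK_UNINTERRUPTIBLE);
        }
        finish_wait(&conf->wait_for_overlap, &w);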
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Factor out helper from raid5_make_request() loop · f4aec6a0
      Logan Gunthorpe authored
      Factor out the inner loop of raid5_make_request() into its own helper
      called make_stripe_request().
      
      The helper returns a number of statuses: SUCCESS, RETRY,
      SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
      understand and allows the SCHEDULE_AND_RETRY path to be made common.
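      Based on those names, the statuses presumably form an enum of this
      shape (sketch):

        enum stripe_result {
                STRIPE_SUCCESS = 0,
                STRIPE_RETRY,
                STRIPE_SCHEDULE_AND_RETRY,
                STRIPE_FAIL,
        };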
      
      A context structure is added to contain do_flush. It will be used
      more in subsequent patches for state that needs to be kept
      outside the loop.
      
      No functional changes intended. This will be cleaned up further in
      subsequent patches to untangle the gen_lock and do_prepare logic
      further.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Move common stripe get code into new find_get_stripe() helper · 1baa1126
      Logan Gunthorpe authored
      Both uses of find_stripe() require a fairly complicated dance to
      increment the reference count. Move this into a common find_get_stripe()
      helper.
      
      No functional changes intended.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() · 8757fef6
      Logan Gunthorpe authored
      stripe_add_to_batch_list() is better done in the loop in make_request
      instead of inside add_stripe_bio(). This is clearer and allows for
      storing the batch_head state outside the loop in a subsequent patch.
      
      The call to add_stripe_bio() in retry_aligned_read() is for reads,
      and batching only applies to writes. So it's impossible for batching
      to happen at that call site.
      
      No functional changes intended.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Refactor raid5_make_request loop · 27fb7010
      Logan Gunthorpe authored
      Break immediately if raid5_get_active_stripe() returns NULL and deindent
      the rest of the loop. Annotate this check with an unlikely().
      
      This makes the code easier to read and reduces the indentation level.
      
      No functional changes intended.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Factor out ahead_of_reshape() function · a8bb304c
      Logan Gunthorpe authored
      There are a few uses of an ugly ternary operator in raid5_make_request()
      to check if a sector is ahead of the reshape.
      
      Factor this out into a simple helper called ahead_of_reshape().
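      Given the ternary it replaces, the helper is presumably of this shape
      (sketch; reshape_backwards is the existing struct mddev field):

        static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
                                     sector_t reshape_sector)
        {
                return mddev->reshape_backwards ? sector < reshape_sector :
                                                  sector >= reshape_sector;
        }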
      
      No functional changes intended.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: Make logic blocking check consistent with logic that blocks · 6e3f50d3
      Logan Gunthorpe authored
      The check in raid5_make_request differs very slightly from the logic
      that causes it to block lower down. This likely does not cause a bug
      as the check is fuzzy anyway (as reshape may move on between the first
      check and the subsequent check). However, make it consistent so it can
      be cleaned up in a subsequent patch.
      
      The condition which causes the schedule is:
      
       !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
         logical_sector >= conf->reshape_progress) &&
        (mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
         logical_sector >= conf->reshape_safe)
      
      The condition that causes the early bailout is made to match this.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: unlock mddev before reap sync_thread in action_store · 9dfbdafd
      Guoqing Jiang authored
      Since the bug fixed by commit 8b48ec23 ("md: don't unregister
      sync_thread with reconfig_mutex held") is related to the action_store
      path, other callers which reap sync_thread don't need to be changed.
      
      Let's pull md_unregister_thread out of md_reap_sync_thread, then fix
      the previous bug as follows:
      
      1. unlock mddev before md_reap_sync_thread in action_store.
      2. save reshape_position before unlock, then restore it to ensure position
         not changed accidentally by others.
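      Putting the two steps together, the reworked action_store path looks
      roughly like this (a hedged sketch, not the verbatim patch):

        sector_t save_rp = mddev->reshape_position;     /* step 2: save */

        mddev_unlock(mddev);                            /* step 1 */
        md_unregister_thread(&mddev->sync_thread);      /* now done by callers */
        md_reap_sync_thread(mddev);
        mddev_lock_nointr(mddev);

        mddev->reshape_position = save_rp;              /* step 2: restore */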
      Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: Explicitly create command-line configured devices · 05ce7fb9
      Chris Webb authored
      Boot-time assembly of arrays with md= command-line arguments breaks when
      CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
      calls blkdev_get_by_dev(), assuming this implicitly creates the block
      device.
      
      Fix this by attempting to md_alloc() the array first. As in the probe path,
      ignore any error as failure is caught by blkdev_get_by_dev() anyway.
      Signed-off-by: Chris Webb <chris@arachsys.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: Notify sysfs sync_completed in md_reap_sync_thread() · 9973f0fa
      Logan Gunthorpe authored
      The mdadm test 07layouts randomly produces a kernel hung task deadlock.
      The deadlock is caused by the suspend_lo/suspend_hi files being set by
      the mdadm background process during reshape and not being cleared
      because the process hangs. (Leaving aside the issue of the fragility of
      freezing kernel tasks by buggy userspace processes...)
      
      When the background mdadm process hangs, it is waiting (without a
      timeout) on a change to the sync_completed file signalling that the
      reshape has completed. The process is woken up a couple of times when
      the reshape finishes, but it is woken up before MD_RECOVERY_RUNNING
      is cleared, so sync_completed_show() reports 0 instead of "none".
      
      To fix this, notify the sysfs file in md_reap_sync_thread() after
      MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
      it to continue and write to suspend_lo/suspend_hi to allow IO to
      continue.
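      The fix is essentially one notification in the right place (sketch of
      the change as described):

        /* In md_reap_sync_thread(), after the running flag is dropped: */
        clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
        sysfs_notify_dirent_safe(mddev->sysfs_completed);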
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: Ensure resync is reported after it starts · b368856a
      Logan Gunthorpe authored
      The 07layouts test in mdadm fails on some systems. The failure
      presents itself as the backup file not being removed before the next
      layout is grown into:
      
        mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
            File exists
      
      This is because the background mdadm process, which is responsible for
      cleaning up this backup file, gets into an infinite loop waiting for
      the reshape to start. mdadm checks the mdstat file to see if a reshape
      is in progress and, if it is not, waits for an event on the file or
      times out after 5 seconds. On faster machines, the reshape may complete
      before the 5-second timeout expires, and thus the background mdadm
      process loops waiting for a reshape to start that has already occurred.
      
      mdadm reads the mdstat file to start, but mdstat does not report that the
      reshape has begun, even though it has indeed begun. So the mdstat_wait()
      call (in mdadm) which polls on the mdstat file won't ever return until
      timing out.
      
      The reason mdstat does not report that the reshape has started is due
      to an issue in status_resync(): recovery_active is subtracted from
      curr_resync, which will result in a value of zero for the first chunk
      of reshaped data, and the resulting read will report no reshape in
      progress.
      
      To fix this, if "resync - recovery_active" is an overloaded value, force
      the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
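      In code terms this is roughly (sketch; MD_RESYNC_ACTIVE is the enum
      value introduced by the next commit below):

        resync -= atomic_read(&mddev->recovery_active);
        if (resync < MD_RESYNC_ACTIVE)
                resync = MD_RESYNC_ACTIVE;      /* still report "active" */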
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md: Use enum for overloaded magic numbers used by mddev->curr_resync · eac58d08
      Logan Gunthorpe authored
      Comments in the code document special values used for
      mddev->curr_resync. Make this clearer by using an enum to label these
      values.
      
      The only functional change is that a couple of places used the wrong
      comparison operator, which implied that 3 is another special value.
      They are all fixed to treat 3 or greater as an active resync.
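      The enum presumably looks something like this (sketch; the names
      mirror the special values described above):

        enum {
                MD_RESYNC_NONE = 0,     /* no resync in progress */
                MD_RESYNC_YIELDED = 1,  /* formerly magic value 1 */
                MD_RESYNC_DELAYED = 2,  /* formerly magic value 2 */
                MD_RESYNC_ACTIVE = 3,   /* 3 or greater: active resync */
        };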
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-cache: Annotate pslot with __rcu notation · 6f28c5c3
      Logan Gunthorpe authored
      The radix_tree_lookup_slot() and radix_tree_replace_slot() APIs expect
      the slot looked up and returned to be marked with __rcu. Otherwise
      sparse warnings are generated:
      
        drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
      			assignment (different address spaces)
        drivers/md/raid5-cache.c:2939:23:    expected void **pslot
        drivers/md/raid5-cache.c:2939:23:    got void [noderef] __rcu **
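      The annotation change amounts to (sketch; the radix tree field name is
      for illustration):

        /* Before: sparse complains about the dropped __rcu marking. */
        void **pslot = radix_tree_lookup_slot(&conf->big_stripe_tree, idx);

        /* After: carry the __rcu address space the API expects. */
        void __rcu **pslot = radix_tree_lookup_slot(&conf->big_stripe_tree, idx);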
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-cache: Clear conf->log after finishing work · b13015af
      Logan Gunthorpe authored
      A NULL pointer dereference on conf->log is seen randomly with
      the mdadm test 21raid5cache. KASAN reports:
      
      BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
      Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086
      
      Call Trace:
        dump_stack_lvl+0x5a/0x74
        kasan_report.cold+0x5f/0x1a9
        __asan_load8+0x69/0x90
        r5l_reclaimable_space+0xf5/0x140
        r5l_do_reclaim+0xf4/0x5e0
        r5l_reclaim_thread+0x69/0x3b0
        md_thread+0x1a2/0x2c0
        kthread+0x177/0x1b0
        ret_from_fork+0x22/0x30
      
      This is caused by conf->log being cleared in r5l_exit_log() before
      stopping the reclaim thread.
      
      To fix this, clear conf->log after the reclaim_thread is unregistered
      and after flushing disable_writeback_work.
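      Schematically, the teardown order becomes (sketch):

        struct r5l_log *log = conf->log;

        flush_work(&log->disable_writeback_work);   /* finish pending work */
        md_unregister_thread(&log->reclaim_thread); /* stop the reclaimer */

        conf->log = NULL;       /* only now is the pointer dropped */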
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-cache: Drop RCU usage of conf->log · 7769085c
      Logan Gunthorpe authored
      The only place that uses RCU to access conf->log is in
      r5l_log_disk_error(). This function is mostly used in the IO path
      and once with mddev_lock() held in raid5_change_consistency_policy().
      
      It is known that the IO will be suspended before the log is freed and
      that r5l_exit_log() is called with the mddev_lock() held.
      
      This should mean that conf->log can not be freed while the function is
      being called, so the RCU protection is not necessary. Drop the
      rcu_read_lock() as well as the synchronize_rcu() and
      rcu_assign_pointer() usage.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-cache: Take mddev_lock in r5c_journal_mode_show() · 78ede6a0
      Logan Gunthorpe authored
      The mddev->lock spinlock doesn't protect against the removal of
      conf->log in r5l_exit_log(), so conf->log may be freed before it
      is used.
      
      To fix this, take the mddev_lock() instead of the mddev->lock spinlock.
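      Schematically (sketch of the locking change; format_mode() is an
      illustrative stand-in for the formatting code):

        static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
        {
                struct r5conf *conf;
                int ret;

                /* mddev_lock() is interruptible and can fail. */
                ret = mddev_lock(mddev);
                if (ret)
                        return ret;

                conf = mddev->private;
                ret = (conf && conf->log) ? format_mode(conf, page) : 0;

                mddev_unlock(mddev);
                return ret;
        }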
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: suspend the array for calls to log_exit() · c629f345
      Logan Gunthorpe authored
      The raid5-cache code relies on there being no IO in flight when
      log_exit() is called. There are two places where this is not
      guaranteed so add mddev_suspend() and mddev_resume() calls to these
      sites.
      
      The site in raid5_change_consistency_policy() is in the error path,
      and another similar call site already has suspend/resume calls just
      below it; so it should be equally safe to make that change here.
      
      There is one remaining site, in raid5_remove_disk(), where log_exit()
      is called without suspending the array. Unfortunately, as the comment
      there states, we cannot call mddev_suspend() from raid5d.
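      The pattern added at the two fixable sites is simply (sketch):

        mddev_suspend(mddev);   /* guarantee no IO is in flight... */
        log_exit(conf);         /* ...while the log is torn down */
        mddev_resume(mddev);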
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-ppl: Drop unused argument from ppl_handle_flush_request() · e0fccdaf
      Logan Gunthorpe authored
      ppl_handle_flush_request() takes a struct r5l_log argument but doesn't
      use it. It has no business taking this argument: struct r5l_log is
      only used by raid5-cache, and there is no way to dereference it here
      anyway. Remove the argument.
      
      No functional changes intended.
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5-log: Drop extern decorators for function prototypes · ed0c6a5f
      Logan Gunthorpe authored
      extern is unnecessary, and recommended against, when declaring function
      prototypes in headers. checkpatch.pl complains about these, so remove
      them.
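      For example, one prototype of this kind, before and after (sketch):

        /* before */
        extern void ppl_quiesce(struct r5conf *conf, int quiesce);

        /* after */
        void ppl_quiesce(struct r5conf *conf, int quiesce);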
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumes · 66757001
      Lars Ellenberg authored
      We usually do all our bitmap IO in units of PAGE_SIZE.
      
      With very small or oddly sized external meta data, or with
      PAGE_SIZE != 4k, it can happen that our last on-disk bitmap page
      is not fully PAGE_SIZE aligned, so we may need to adjust the size
      of the IO.
      
      We used to do that with
        min_t(unsigned int, PAGE_SIZE,
      	last_allowed_sector - current_offset);
      And for just the right diff, (unsigned int)(diff) will result in 0.
      
      A bio of length 0 will correctly be rejected with an IO error
      (and some scary WARN_ON_ONCE()) by the scsi layer.
      
      Do the calculation properly.
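      To see how the old expression could yield 0 (illustrative values):

        #include <linux/minmax.h>

        u64 diff = 0x100000000ULL;      /* "just the right diff" */

        /* min_t() casts both operands to unsigned int first, so diff
         * truncates to 0 and the minimum becomes 0, producing a
         * zero-length bio. */
        unsigned int len = min_t(unsigned int, PAGE_SIZE, diff); /* == 0 */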
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
      Link: https://lore.kernel.org/r/20220622204932.196830-1-christoph.boehmwalder@linbit.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'for-6.0/dm-changes' of... · 8374cfe6
      Linus Torvalds authored
      Merge tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - Refactor DM core's mempool allocation so that it is clearer by not
         being split across files.
      
       - Improve DM core's BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling.
      
       - Optimize DM core's more common bio splitting by eliminating the use
         of bio cloning with bio_split+bio_chain. Shift that cloning cost to
         the relatively unlikely dm_io requeue case that only occurs during
         error handling. Introduces dm_io_rewind() that will clone a bio that
         reflects the subset of the original bio that must be requeued.
      
       - Remove DM core's dm_table_get_num_targets() wrapper and audit all
         dm_table_get_target() callers.
      
       - Fix potential for OOM with DM writecache target by setting a default
         MAX_WRITEBACK_JOBS (set to 256MiB or 1/16 of total system memory,
         whichever is smaller).
      
       - Fix DM writecache target's stats that are reported through
         DM-specific table info.
      
       - Fix use-after-free crash in dm_sm_register_threshold_callback().
      
       - Refine DM core's Persistent Reservation handling in preparation for
         broader work Mike Christie is doing to add compatibility with
         Microsoft Windows Failover Cluster.
      
       - Fix various KASAN reported bugs in the DM raid target.
      
       - Fix DM raid target crash due to md_handle_request() bio splitting
         that recurses to block core without properly initializing the bio's
         bi_dev.
      
       - Fix some code comment typos and fix some Documentation formatting.
      
      * tag 'for-6.0/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
        dm: fix dm-raid crash if md_handle_request() splits bio
        dm raid: fix address sanitizer warning in raid_resume
        dm raid: fix address sanitizer warning in raid_status
        dm: Start pr_preempt from the same starting path
        dm: Fix PR release handling for non All Registrants
        dm: Start pr_reserve from the same starting path
        dm: Allow dm_call_pr to be used for path searches
        dm: return early from dm_pr_call() if DM device is suspended
        dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
        dm writecache: count number of blocks discarded, not number of discard bios
        dm writecache: count number of blocks written, not number of write bios
        dm writecache: count number of blocks read, not number of read bios
        dm writecache: return void from functions
        dm kcopyd: use __GFP_HIGHMEM when allocating pages
        dm writecache: set a default MAX_WRITEBACK_JOBS
        Documentation: dm writecache: Render status list as list
        Documentation: dm writecache: add blank line before optional parameters
        dm snapshot: fix typo in snapshot_map() comment
        dm raid: remove redundant "the" in parse_raid_params() comment
        dm cache: fix typo in 2 comment blocks
        ...
    • Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block · c013d0af
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
      
       - Improve the type checking of request flags (Bart)
      
       - Ensure queue mapping for a single queue always picks the right queue
         (Bart)
      
       - Sanitize the io priority handling (Jan)
      
       - rq-qos race fix (Jinke)
      
       - Reserved tags handling improvements (John)
      
       - Separate memory alignment from file/disk offset alignment for O_DIRECT
         (Keith)
      
       - Add new ublk driver, userspace block driver using io_uring for
         communication with the userspace backend (Ming)
      
       - Use try_cmpxchg() to cleanup the code in various spots (Uros)
      
       - Finally remove bdevname() (Christoph)
      
       - Clean up the zoned device handling (Christoph)
      
       - Clean up independent access range support (Christoph)
      
       - Clean up and improve block sysfs handling (Christoph)
      
       - Clean up and improve teardown of block devices.
      
         This turns the usual two step process into something that is simpler
         to implement and handle in block drivers (Christoph)
      
       - Clean up chunk size handling (Christoph)
      
       - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
         Ming, Sebastian, Yang, Ying)
      
      * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
        ublk_drv: fix double shift bug
        ublk_drv: make sure that correct flags(features) returned to userspace
        ublk_drv: fix error handling of ublk_add_dev
        ublk_drv: fix lockdep warning
        block: remove __blk_get_queue
        block: call blk_mq_exit_queue from disk_release for never added disks
        blk-mq: fix error handling in __blk_mq_alloc_disk
        ublk: defer disk allocation
        ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
        ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
        ublk: cleanup ublk_ctrl_uring_cmd
        ublk: simplify ublk_ch_open and ublk_ch_release
        ublk: remove the empty open and release block device operations
        ublk: remove UBLK_IO_F_PREFLUSH
        ublk: add a MAINTAINERS entry
        block: don't allow the same type rq_qos add more than once
        mmc: fix disk/queue leak in case of adding disk failure
        ublk_drv: fix an IS_ERR() vs NULL check
        ublk: remove UBLK_IO_F_INTEGRITY
        ublk_drv: remove unneeded semicolon
        ...
    • Merge tag 'for-5.20/io_uring-zerocopy-send-2022-07-29' of git://git.kernel.dk/linux-block · 42df1cbf
      Linus Torvalds authored
      Pull io_uring zerocopy support from Jens Axboe:
       "This adds support for efficient support for zerocopy sends through
        io_uring. Both ipv4 and ipv6 is supported, as well as both TCP and
        UDP.
      
        The core network changes to support this are in a stable branch from
        Jakub that both io_uring and net-next have pulled in, and the io_uring
        changes are layered on top of that.
      
        All of the work has been done by Pavel"
      
      * tag 'for-5.20/io_uring-zerocopy-send-2022-07-29' of git://git.kernel.dk/linux-block: (34 commits)
        io_uring: notification completion optimisation
        io_uring: export req alloc from core
        io_uring/net: use unsigned for flags
        io_uring/net: make page accounting more consistent
        io_uring/net: checks errors of zc mem accounting
        io_uring/net: improve io_get_notif_slot types
        selftests/io_uring: test zerocopy send
        io_uring: enable managed frags with register buffers
        io_uring: add zc notification flush requests
        io_uring: rename IORING_OP_FILES_UPDATE
        io_uring: flush notifiers after sendzc
        io_uring: sendzc with fixed buffers
        io_uring: allow to pass addr into sendzc
        io_uring: account locked pages for non-fixed zc
        io_uring: wire send zc request type
        io_uring: add notification slot registration
        io_uring: add rsrc referencing for notifiers
        io_uring: complete notifiers in tw
        io_uring: cache struct io_notif
        io_uring: add zc notification infrastructure
        ...
    • Merge tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block · 98e24746
      Linus Torvalds authored
      Pull io_uring buffered writes support from Jens Axboe:
       "This contains support for buffered writes, specifically for XFS. btrfs
        is in progress, will be coming in the next release.
      
        io_uring does support buffered writes on any file type, but since the
        buffered write path just always returned -EAGAIN (or -EOPNOTSUPP) for
        any attempt to do so when IOCB_NOWAIT is set, any buffered write would
        effectively be handled by io-wq offload. This isn't very efficient,
        and we even have specific code in io-wq to serialize buffered writes
        to the same inode to avoid further inefficiencies with thread offload.
      
        This is particularly sad since most buffered writes don't block; they
        simply copy data to a page and dirty it. With this pull request, we
        can handle buffered writes a lot more efficiently.
      
        If balance_dirty_pages() needs to block, we back off on writes as
        indicated.
      
        This improves buffered write support by 2-3x.
      
        Jan Kara helped with the mm bits for this, and Stefan handled the
        fs/iomap/xfs/io_uring parts of it"
      
      * tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block:
        mm: honor FGP_NOWAIT for page cache page allocation
        xfs: Add async buffered write support
        xfs: Specify lockmode when calling xfs_ilock_for_iomap()
        io_uring: Add tracepoint for short writes
        io_uring: fix issue with io_write() not always undoing sb_start_write()
        io_uring: Add support for async buffered writes
        fs: Add async write file modification handling.
        fs: Split off inode_needs_update_time and __file_update_time
        fs: add __remove_file_privs() with flags parameter
        fs: add a FMODE_BUF_WASYNC flags for f_mode
        iomap: Return -EAGAIN from iomap_write_iter()
        iomap: Add async buffered write support
        iomap: Add flags parameter to iomap_page_create()
        mm: Add balance_dirty_pages_ratelimited_flags() function
        mm: Move updates of dirty_exceeded into one place
        mm: Move starting of background writeback into the main balancing loop
    • Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block · b349b118
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
      
       - As per (valid) complaint in the last merge window, fs/io_uring.c has
         grown quite large these days. io_uring isn't really tied to fs
         either, as it supports a wide variety of functionality outside of
         that.
      
         Move the code to io_uring/ and split it into files that either
         implement a specific request type, and split some code into helpers
         as well. The code is organized a lot better like this, and io_uring.c
         is now < 4K LOC (me).
      
       - Deprecate the epoll_ctl opcode. It'll still work, just trigger a
         warning once if used. If we don't get any complaints on this, and I
         don't expect any, then we can fully remove it in a future release
         (me).
      
       - Improve the cancel hash locking (Hao)
      
       - kbuf cleanups (Hao)
      
       - Efficiency improvements to the task_work handling (Dylan, Pavel)
      
       - Provided buffer improvements (Dylan)
      
       - Add support for recv/recvmsg multishot. This is similar to the
         accept (or poll) support we have for multishot, where a single
         SQE can trigger every time data is received. For applications that
         expect to do more than a few receives on an instantiated socket,
         this greatly improves efficiency (Dylan).
      
       - Efficiency improvements for poll handling (Pavel)
      
       - Poll cancelation improvements (Pavel)
      
       - Allow specifying a range for direct descriptor allocations (Pavel)
      
       - Cleanup the cqe32 handling (Pavel)
      
       - Move io_uring types to greatly cleanup the tracing (Pavel)
      
       - Tons of great code cleanups and improvements (Pavel)
      
       - Add a way to do sync cancelations rather than through the sqe -> cqe
         interface, as that's a lot easier to use for some use cases (me).
      
       - Add support to IORING_OP_MSG_RING for sending direct descriptors to a
         different ring. This avoids the usually problematic SCM case, as we
         disallow those. (me)
      
       - Make the per-command alloc cache we use for apoll generic, place
         limits on it, and use it for netmsg as well (me).
      
       - Various cleanups (me, Michal, Gustavo, Uros)
      
      * tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits)
        io_uring: ensure REQ_F_ISREG is set async offload
        net: fix compat pointer in get_compat_msghdr()
        io_uring: Don't require reinitable percpu_ref
        io_uring: fix types in io_recvmsg_multishot_overflow
        io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
        io_uring: support multishot in recvmsg
        net: copy from user before calling __get_compat_msghdr
        net: copy from user before calling __copy_msghdr
        io_uring: support 0 length iov in buffer select in compat
        io_uring: fix multishot ending when not polled
        io_uring: add netmsg cache
        io_uring: impose max limit on apoll cache
        io_uring: add abstraction around apoll cache
        io_uring: move apoll cache to poll.c
        io_uring: consolidate hash_locked io-wq handling
        io_uring: clear REQ_F_HASH_LOCKED on hash removal
        io_uring: don't race double poll setting REQ_F_ASYNC_DATA
        io_uring: don't miss setting REQ_F_DOUBLE_POLL
        io_uring: disable multishot recvmsg
        io_uring: only trace one of complete or overflow
        ...
    • Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux · efb28830
      Linus Torvalds authored
      Pull turbostat updates from Len Brown:
       "Only updating the turbostat tool here, no kernel changes"
      
      * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
        tools/power turbostat: version 2022.07.28
        tools/power turbostat: do not decode ACC for ICX and SPR
        tools/power turbostat: fix SPR PC6 limits
        tools/power turbostat: cleanup 'automatic_cstate_conversion_probe()'
        tools/power turbostat: separate SPR from ICX
        tools/power turbosstat: fix comment
        tools/power turbostat: Support RAPTORLAKE P
        tools/power turbostat: add support for ALDERLAKE_N
        tools/power turbostat: dump secondary Turbo-Ratio-Limit
        tools/power turbostat: simplify dump_turbo_ratio_limits()
        tools/power turbostat: dump CPUID.7.EDX.Hybrid
        tools/power turbostat: update turbostat.8
        tools/power turbostat: Show uncore frequency
        tools/power turbostat: Fix file pointer leak
        tools/power turbostat: replace strncmp with single character compare
        tools/power turbostat: print the kernel boot commandline
        tools/power turbostat: Introduce support for RaptorLake