1. 10 Oct, 2024 4 commits
    • Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 0edab8d1
      Linus Torvalds authored
      
      Pull tracing fix from Steven Rostedt:
       "Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
        callbacks
      
        When a ring buffer is mapped to memory assigned at boot, it also
        splits it up evenly between the possible CPUs. But the allocation code
        still attached a CPU notifier callback to this ring buffer. When a CPU
        is added, the callback will happen and another per-cpu buffer is
        created for the ring buffer.
      
        But for boot mapped buffers, there is no room to add another one (as
        they were all created already). The result of calling the CPU hotplug
        notifier on a boot mapped ring buffer is unpredictable and could lead
        to a system crash.
      
        If the ring buffer is boot mapped, simply do not attach the CPU
        notifier to it"
      
      * tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
      0edab8d1
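The behaviour described in the ring-buffer pull message above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the struct, the function names, and the `initial_buffers` parameter are hypothetical and are not taken from the kernel's ring-buffer code. It shows only the decision the message describes, namely that a boot-mapped buffer never registers for CPU hotplug events and therefore never grows.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct ring_buffer_model {
    bool boot_mapped;        /* backed by memory assigned at boot */
    bool hotplug_registered; /* would receive CPU-online callbacks */
    int  nr_cpu_buffers;     /* per-CPU buffers that currently exist */
};

/*
 * Set up the model. For a boot-mapped buffer, initial_buffers stands for
 * the even split across all possible CPUs, so there is no room for more.
 */
void ring_buffer_model_init(struct ring_buffer_model *rb, bool boot_mapped,
                            int initial_buffers)
{
    rb->boot_mapped = boot_mapped;
    rb->nr_cpu_buffers = initial_buffers;
    /* The fix: only buffers that can still grow hook into CPU hotplug. */
    rb->hotplug_registered = !boot_mapped;
}

/* Simulate a CPU coming online; returns the per-CPU buffer count afterwards. */
int ring_buffer_model_cpu_online(struct ring_buffer_model *rb)
{
    if (rb->hotplug_registered)
        rb->nr_cpu_buffers++; /* a new per-CPU buffer is created */
    return rb->nr_cpu_buffers;
}
```

In this model a CPU-online event leaves a boot-mapped buffer untouched, while a normally allocated buffer gains one per-CPU buffer.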
    • Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · eb952c47
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - update fstrim loop and add more cancellation points, fix reported
         delayed or blocked suspend if there's a huge chunk queued
      
       - fix error handling in recent qgroup xarray conversion
      
       - in zoned mode, fix warning printing device path without RCU
         protection
      
       - again fix invalid extent xarray state (6252690f), lost due to
         refactoring
      
      * tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
        btrfs: zoned: fix missing RCU locking in error message when loading zone info
        btrfs: fix missing error handling when adding delayed ref with qgroups enabled
        btrfs: add cancellation points to trim loops
        btrfs: split remaining space to discard in chunks
      eb952c47
    • Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · 5870963f
      Linus Torvalds authored
      Pull nfsd fixes from Chuck Lever:
      
       - Fix NFSD bring-up / shutdown
      
       - Fix a UAF when releasing a stateid
      
      * tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        nfsd: fix possible badness in FREE_STATEID
        nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
        NFSD: Mark filecache "down" if init fails
      5870963f
    • Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 825ec756
      Linus Torvalds authored
      Pull xfs fixes from Carlos Maiolino:
      
       - A few small typo fixes
      
       - fstests xfs/538 DEBUG-only fix
      
       - Performance fix on blockgc on COW'ed files, by skipping trims on
         cowblock inodes currently opened for write
      
       - Prevent cowblocks from being freed under dirty pagecache during unshare
      
       - Update MAINTAINERS file to quote the new maintainer
      
      * tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: fix a typo
        xfs: don't free cowblocks from under dirty pagecache on unshare
        xfs: skip background cowblock trims on inodes open for write
        xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
        xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
        xfs: don't ifdef around the exact minlen allocations
        xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
        xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
        xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
        xfs: return bool from xfs_attr3_leaf_add
        xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
        xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
        xfs: scrub: convert comma to semicolon
        xfs: Remove empty declartion in header file
        MAINTAINERS: add Carlos Maiolino as XFS release manager
      825ec756
  2. 09 Oct, 2024 21 commits
  3. 08 Oct, 2024 4 commits
    • Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext · 75b607fa
      Linus Torvalds authored
      
      Pull sched_ext fixes from Tejun Heo:
      
       - ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
         and thus ops.select() were skipped. Some schedulers were incorrectly
         using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
         it.
      
       - Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()
      
       - Fix error information clobbering during load
      
       - Add missing __weak markers to BPF helper declarations
      
       - Doc update
      
      * tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
        sched_ext: Documentation: Update instructions for running example schedulers
        sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
        sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
        sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
        sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
        sched_ext: Improve error reporting during loading
        sched_ext: Add __weak markers to BPF helper function decalarations
      75b607fa
    • sched_ext: Documentation: Update instructions for running example schedulers · e0ed5215
      Devaansh-Kumar authored
      Since the artifact paths for tools changed, we need to update the documentation to reflect that path.
      Signed-off-by: Devaansh-Kumar <devaanshk840@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e0ed5215
    • Merge tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3 · 5b7c893e
      Linus Torvalds authored
      Pull ntfs3 updates from Konstantin Komarov:
      "New:
         - implement fallocate for compressed files
         - add support for the compression attribute
         - optimize large writes to sparse files
      
       Fixes:
         - fix several potential deadlock scenarios
         - fix various internal bugs detected by syzbot
         - add checks before accessing NTFS structures during parsing
         - correct the format of output messages
      
        Refactoring:
         - replace fsparam_flag_no with fsparam_flag in options parser
         - remove unused functions and macros"
      
      * tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits)
        fs/ntfs3: Format output messages like others fs in kernel
        fs/ntfs3: Additional check in ntfs_file_release
        fs/ntfs3: Fix general protection fault in run_is_mapped_full
        fs/ntfs3: Sequential field availability check in mi_enum_attr()
        fs/ntfs3: Additional check in ni_clear()
        fs/ntfs3: Fix possible deadlock in mi_read
        ntfs3: Change to non-blocking allocation in ntfs_d_hash
        fs/ntfs3: Remove unused al_delete_le
        fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr
        fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag
        fs/ntfs3: Add support for the compression attribute
        fs/ntfs3: Implement fallocate for compressed files
        fs/ntfs3: Make checks in run_unpack more clear
        fs/ntfs3: Add rough attr alloc_size check
        fs/ntfs3: Stale inode instead of bad
        fs/ntfs3: Refactor enum_rstbl to suppress static checker
        fs/ntfs3: Fix sparse warning in ni_fiemap
        fs/ntfs3: Fix warning possible deadlock in ntfs_set_state
        fs/ntfs3: Fix sparse warning for bigendian
        fs/ntfs3: Separete common code for file_read/write iter/splice
        ...
      5b7c893e
    • Merge tag 'perf-tools-fixes-for-v6.12-1-2024-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools · b2760b83
      Linus Torvalds authored
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix an assert() to handle captured and unprocessed ARM CoreSight CPU
         traces
      
       - Fix static build compilation error when libdw isn't installed or is
         too old
      
       - Add missing include when building with
         !HAVE_DWARF_GETLOCATIONS_SUPPORT
      
       - Add missing refcount put on 32-bit DSOs
      
       - Fix disassembly of user space binaries by setting the binary_type of
         DSO when loading
      
       - Update headers with the kernel sources, including asound.h, sched.h,
         fcntl, msr-index.h, irq_vectors.h, socket.h, list_sort.c and arm64's
         cputype.h
      
      * tag 'perf-tools-fixes-for-v6.12-1-2024-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
        perf cs-etm: Fix the assert() to handle captured and unprocessed cpu trace
        perf build: Fix build feature-dwarf_getlocations fail for old libdw
        perf build: Fix static compilation error when libdw is not installed
        perf dwarf-aux: Fix build with !HAVE_DWARF_GETLOCATIONS_SUPPORT
        tools headers arm64: Sync arm64's cputype.h with the kernel sources
        perf tools: Cope with differences for lib/list_sort.c copy from the kernel
        tools check_headers.sh: Add check variant that excludes some hunks
        perf beauty: Update copy of linux/socket.h with the kernel sources
        tools headers UAPI: Sync the linux/in.h with the kernel sources
        perf trace beauty: Update the arch/x86/include/asm/irq_vectors.h copy with the kernel sources
        tools arch x86: Sync the msr-index.h copy with the kernel sources
        tools include UAPI: Sync linux/fcntl.h copy with the kernel sources
        tools include UAPI: Sync linux/sched.h copy with the kernel sources
        tools include UAPI: Sync sound/asound.h copy with the kernel sources
        perf vdso: Missed put on 32-bit dsos
        perf symbol: Set binary_type of dso when loading
      b2760b83
  4. 07 Oct, 2024 11 commits
    • btrfs: fix missing error handling when adding delayed ref with qgroups enabled · 6ef8fbce
      Filipe Manana authored
      When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
      if we fail to insert the qgroup record we don't error out, we ignore it.
      In fact we treat it as if there was no error and there was already an
      existing record - we don't distinguish between the cases where
      btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
      existed and we can free the given record, and the case where it returns
      a negative error value, meaning the insertion into the xarray that is
      used to track records failed.
      
      Effectively we end up ignoring that we are lacking a qgroup record in the
      dirty extents xarray, resulting in incorrect qgroup accounting.
      
      Fix this by checking for errors and returning them to the callers.
      
      Fixes: 3cce39a8 ("btrfs: qgroup: use xarray to track dirty extents in transaction")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ef8fbce
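The three-way return convention described in the commit above (0 for a new record, 1 for an existing one, negative for failure) can be illustrated with a small userspace sketch. Everything here is hypothetical: the struct and function names are invented, the result of the insertion is passed in as `trace_ret` rather than produced by a real xarray, and freeing the record on the error path is an assumption of this model, not something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a qgroup extent record. */
struct qgroup_record_model {
    unsigned long long bytenr;
};

/*
 * trace_ret models the result of the insertion helper:
 *   0  -> record inserted, the tracker now owns it
 *   1  -> a record already existed, the given one can be freed
 *   <0 -> the insertion itself failed
 * Returns 0 on success or the negative error, and reports via
 * *record_freed whether this function released the record.
 */
int add_ref_head_model(int trace_ret, struct qgroup_record_model *record,
                       bool *record_freed)
{
    *record_freed = false;

    if (trace_ret < 0) {
        /* Previously ignored: now the error reaches the caller. */
        free(record); /* assumption: the error path releases the record */
        *record_freed = true;
        return trace_ret;
    }

    if (trace_ret > 0) {
        /* Duplicate: the existing record stays, ours is dropped. */
        free(record);
        *record_freed = true;
    }

    return 0;
}
```

The point of the sketch is only that a negative value and the value 1 take different paths. Collapsing them into one "non-zero" branch is the bug the commit describes.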
    • btrfs: add cancellation points to trim loops · 69313850
      Luca Stefani authored
      There are reports that the system cannot suspend while a trim is running,
      because the task responsible for trimming the device isn't able to finish
      in time, especially since we have a free extent discarding phase, which
      can trim a lot of unallocated space. There are no limits on the trim size
      (unlike the block group part).
      
      Since trim isn't a critical call, it can be interrupted at any time;
      in such cases we stop the trim, report the amount of discarded bytes and
      return an error.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      69313850
    • btrfs: split remaining space to discard in chunks · a99fcb01
      Luca Stefani authored
      Per Qu Wenruo, if we have a very large and mostly empty disk (e.g. an
      8TiB device), then although we split the discard according to our super
      block locations, the last super block ends at 256G, so we can submit a
      huge discard for the range [256G, 8T), causing a large delay.
      
      Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
      preparation for the introduction of cancellation points to trim. The
      value of the chunk size is arbitrary; it could be higher or derived from
      actual device capabilities, but we can't easily read that using
      bio_discard_limit().
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a99fcb01
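The two trim commits above combine into one pattern: break a large discard into bounded chunks and check for cancellation between chunks. The following is a minimal userspace sketch of that pattern, not the btrfs implementation. The chunk size of 1 GiB, the `-EINTR` return value, the callback type, and all names are assumptions made for illustration. The commit messages say only that the chunk size is arbitrary and that an error is returned along with the number of bytes discarded so far.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed chunk size for this model only. */
#define MODEL_MAX_DISCARD_CHUNK (1ULL << 30)

/* Hypothetical cancellation hook, e.g. "a suspend is pending". */
typedef bool (*cancel_fn)(void *ctx);

struct trim_result_model {
    uint64_t discarded;   /* bytes discarded before stopping */
    uint64_t next_offset; /* where a later trim could resume */
};

/* Discard [start, start + len) in chunks, checking for cancellation first. */
int discard_range_model(uint64_t start, uint64_t len, cancel_fn should_cancel,
                        void *ctx, struct trim_result_model *res)
{
    res->discarded = 0;
    res->next_offset = start;

    while (len > 0) {
        uint64_t chunk = len < MODEL_MAX_DISCARD_CHUNK ?
                         len : MODEL_MAX_DISCARD_CHUNK;

        /* Cancellation point: stop early but keep the progress so far. */
        if (should_cancel != NULL && should_cancel(ctx))
            return -EINTR; /* assumed error code */

        /* A real implementation would issue the discard for this chunk. */
        res->discarded += chunk;
        res->next_offset += chunk;
        len -= chunk;
    }
    return 0;
}

/* Example hook: allow *ctx checks to pass, then request cancellation. */
bool cancel_after_n_checks(void *ctx)
{
    int *checks_left = ctx;

    if (*checks_left == 0)
        return true;
    (*checks_left)--;
    return false;
}
```

Because the cancellation check runs once per chunk, the worst-case delay before the loop notices a request is bounded by the time to discard one chunk rather than the whole remaining range.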
    • sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED · 9b671793
      Tejun Heo authored
      scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
      tell whether ops.select_cpu() was called. This is incorrect as
      ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
      incorrectly skipping direct dispatch for tasks that are bound to a single
      CPU.
      
      sched core has been updated to specify ENQUEUE_RQ_SELECTED if
      ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
      scx_qmap to test it instead of SCX_ENQ_WAKEUP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      9b671793
    • sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called · f207dc2d
      Tejun Heo authored
      During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
      migration is disabled. sched_ext schedulers may perform operations such as
      direct dispatch from ->select_task_rq() path and it is useful for them to
      know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
      
      Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
      and end up assuming incorrectly that ->select_task_rq() was called for tasks
      that are bound to a single CPU or migration disabled.
      
      Make select_task_rq() indicate whether ->select_task_rq() was called by
      setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
      to ENQUEUE_RQ_SELECTED for ->enqueue_task().
      
      This will be used by sched_ext to fix ->select_task_rq() skip detection.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      f207dc2d
    • sched/core: Make select_task_rq() take the pointer to wake_flags instead of value · b62933ee
      Tejun Heo authored
      This will be used to allow select_task_rq() to indicate whether
      ->select_task_rq() was called by modifying *wake_flags.
      
      This makes try_to_wake_up() call all functions that take wake_flags with
      WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
      more consistent, and, as the flag is only tested by ->select_task_rq()
      implementations, it doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      b62933ee
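The flag plumbing described in the three scheduler commits above can be sketched as a userspace model. The flag values, the struct, and the function names below are all hypothetical and do not match the kernel's definitions. The sketch shows only the flow the messages describe: the selection step records through a pointer whether the class callback actually ran, and the enqueue step translates that into a separate enqueue flag instead of inferring it from the wakeup flag.

```c
#include <stdbool.h>

/* Flag values are arbitrary for this model and are not the kernel's. */
#define MODEL_WF_TTWU             0x08
#define MODEL_WF_RQ_SELECTED      0x80
#define MODEL_ENQUEUE_WAKEUP      0x01
#define MODEL_ENQUEUE_RQ_SELECTED 0x40

struct task_model {
    int  nr_cpus_allowed;
    bool migration_disabled;
};

/*
 * Takes wake_flags by pointer so it can report back whether the class
 * callback (modelled here by class_choice) was consulted.
 */
int select_task_rq_model(const struct task_model *p, int prev_cpu,
                         int class_choice, int *wake_flags)
{
    if (p->nr_cpus_allowed > 1 && !p->migration_disabled) {
        *wake_flags |= MODEL_WF_RQ_SELECTED;
        return class_choice; /* stands in for ->select_task_rq() */
    }
    /* Selection skipped: the task cannot move anyway. */
    return prev_cpu;
}

/* Map the wake-time flag onto the enqueue-time flag. */
int enqueue_flags_model(int wake_flags)
{
    int enq_flags = MODEL_ENQUEUE_WAKEUP;

    if (wake_flags & MODEL_WF_RQ_SELECTED)
        enq_flags |= MODEL_ENQUEUE_RQ_SELECTED;
    return enq_flags;
}
```

For a task bound to a single CPU, the enqueue flags in this model still carry the wakeup bit but not the selected bit, which is the distinction a scheduler testing only the wakeup flag could not make.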
    • Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · 87d6aab2
      Linus Torvalds authored
      Pull virtio fixes from Michael Tsirkin:
       "Several small bugfixes all over the place.
      
        Most notably, fixes the vsock allocation with GFP_KERNEL in atomic
        context, which has been triggering warnings for lots of testers"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()
        vsock/virtio: use GFP_ATOMIC under RCU read lock
        virtio_console: fix misc probe bugs
        virtio_ring: tag event_triggered as racy for KCSAN
        vdpa/octeon_ep: Fix format specifier for pointers in debug messages
      87d6aab2
    • vhost/scsi: null-ptr-dereference in vhost_scsi_get_req() · 221af82f
      Haoran Zhang authored
      Since commit 3f8ca2e1 ("vhost/scsi: Extract common handling code
      from control queue handler") a null pointer dereference bug can be
      triggered when a guest sends a SCSI AN request.
      
      In vhost_scsi_ctl_handle_vq(), `vc.target` is assigned with
      `&v_req.tmf.lun[1]` within a switch-case block and is then passed to
      vhost_scsi_get_req() which extracts `vc->req` and `tpg`. However, for
      a `VIRTIO_SCSI_T_AN_*` request, tpg is not required, so `vc.target` is
      set to NULL in this branch. Later, in vhost_scsi_get_req(),
      `vc->target` is dereferenced without being checked, leading to a null
      pointer dereference bug. This bug can be triggered from the guest.
      
      When this bug occurs, the vhost_worker process is killed while holding
      `vq->mutex` and the corresponding tpg will remain occupied
      indefinitely.
      
      Below is the KASAN report:
      Oops: general protection fault, probably for non-canonical address
      0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
      CPU: 1 PID: 840 Comm: poc Not tainted 6.10.0+ #1
      Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS
      1.16.3-debian-1.16.3-2 04/01/2014
      RIP: 0010:vhost_scsi_get_req+0x165/0x3a0
      Code: 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 2b 02 00 00
      48 b8 00 00 00 00 00 fc ff df 4d 8b 65 30 4c 89 e2 48 c1 ea 03 <0f> b6
      04 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 be 01 00 00
      RSP: 0018:ffff888017affb50 EFLAGS: 00010246
      RAX: dffffc0000000000 RBX: ffff88801b000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888017affcb8
      RBP: ffff888017affb80 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff888017affc88 R14: ffff888017affd1c R15: ffff888017993000
      FS:  000055556e076500(0000) GS:ffff88806b100000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000200027c0 CR3: 0000000010ed0004 CR4: 0000000000370ef0
      Call Trace:
       <TASK>
       ? show_regs+0x86/0xa0
       ? die_addr+0x4b/0xd0
       ? exc_general_protection+0x163/0x260
       ? asm_exc_general_protection+0x27/0x30
       ? vhost_scsi_get_req+0x165/0x3a0
       vhost_scsi_ctl_handle_vq+0x2a4/0xca0
       ? __pfx_vhost_scsi_ctl_handle_vq+0x10/0x10
       ? __switch_to+0x721/0xeb0
       ? __schedule+0xda5/0x5710
       ? __kasan_check_write+0x14/0x30
       ? _raw_spin_lock+0x82/0xf0
       vhost_scsi_ctl_handle_kick+0x52/0x90
       vhost_run_work_list+0x134/0x1b0
       vhost_task_fn+0x121/0x350
      ...
       </TASK>
      ---[ end trace 0000000000000000 ]---
      
      Let's add a check in vhost_scsi_get_req.
      
      Fixes: 3f8ca2e1 ("vhost/scsi: Extract common handling code from control queue handler")
      Signed-off-by: Haoran Zhang <wh1sper@zju.edu.cn>
      [whitespace fixes]
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Message-Id: <b26d7ddd-b098-4361-88f8-17ca7f90adf7@oracle.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      221af82f
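The shape of the vhost/scsi fix above, a guard before dereferencing an optional pointer, can be shown with a small userspace model. The struct and function names are hypothetical, and the real function does considerably more than this. The sketch assumes only what the commit message states: some request types leave the target pointer NULL because no target lookup is needed for them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request context; target is NULL for requests without one. */
struct req_ctx_model {
    const unsigned char *target;
};

/*
 * Look up the target only when the request carries one.
 * Returns 0 on success; *have_target says whether *target_out is valid.
 */
int get_req_model(const struct req_ctx_model *vc, unsigned char *target_out,
                  bool *have_target)
{
    *have_target = false;

    /* The added check: skip the dereference when no target was set. */
    if (vc->target != NULL) {
        *target_out = *vc->target;
        *have_target = true;
    }
    return 0;
}
```

Without the guard, the request type that sets the pointer to NULL would reach the dereference directly, which is the crash shown in the KASAN report.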
    • vsock/virtio: use GFP_ATOMIC under RCU read lock · a194c985
      Michael S. Tsirkin authored
      virtio_transport_send_pkt is now called on the transport fast path,
      under the RCU read lock. In that case, we have a bug: virtio_add_sgs
      is called with GFP_KERNEL, and might sleep.
      
      Pass the gfp flags as an argument, and use GFP_ATOMIC on
      the fast path.
      
      Link: https://lore.kernel.org/all/hfcr2aget2zojmqpr4uhlzvnep4vgskblx5b6xf2ddosbsrke7@nt34bxgp7j2x
      Fixes: efcd71af ("vsock/virtio: avoid queuing packets when intermediate queue is empty")
      Reported-by: Christian Brauner <brauner@kernel.org>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Cc: Luigi Leonardi <luigi.leonardi@outlook.com>
      Message-ID: <3fbfb6e871f625f89eb578c7228e127437b1975a.1727876449.git.mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      a194c985
    • xfs: skip background cowblock trims on inodes open for write · 90a71daa
      Brian Foster authored
      The background blockgc scanner runs on a 5m interval by default and
      trims preallocation (post-eof and cow fork) from inodes that are
      otherwise idle. Idle effectively means that iolock can be acquired
      without blocking and that the inode has no dirty pagecache or I/O in
      flight.
      
      This simple mechanism and heuristic has worked fairly well for
      post-eof speculative preallocations. Support for reflink and COW
      fork preallocations came sometime later and plugged into the same
      mechanism, with similar heuristics. Some recent testing has shown
      that COW fork preallocation may be notably more sensitive to blockgc
      processing than post-eof preallocation, however.
      
      For example, consider an 8GB reflinked file with a COW extent size
      hint of 1MB. A worst case fully randomized overwrite of this file
      results in ~8k extents of an average size of ~1MB. If the same
      workload is interrupted a couple times for blockgc processing
      (assuming the file goes idle), the resulting extent count explodes
      to over 100k extents with an average size <100kB. This is
      significantly worse than ideal and essentially defeats the COW
      extent size hint mechanism.
      
      While this particular test is instrumented, it reflects a fairly
      reasonable pattern in practice where random I/Os might spread out
      over a large period of time with varying periods of (in)activity.
      For example, consider a cloned disk image file for a VM or container
      with long uptime and variable and bursty usage. A background blockgc
      scan that races and processes the image file when it happens to be
      clean and idle can have a significant effect on the future
      fragmentation level of the file, even when still in use.
      
      To help combat this, update the heuristic to skip cowblocks inodes
      that are currently opened for write access during non-sync blockgc
      scans. This allows COW fork preallocations to persist for as long as
      possible unless otherwise needed for functional purposes (i.e. a
      sync scan), the file is idle and closed, or the inode is being
      evicted from cache. While here, update the comments to help
      distinguish performance oriented heuristics from the logic that
      exists to maintain functional correctness.
      Suggested-by: Darrick Wong <djwong@kernel.org>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Carlos Maiolino <cem@kernel.org>
      90a71daa
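The updated blockgc heuristic from the commit above can be expressed as a small predicate. This is a simplified userspace model, not the XFS code: the struct fields, the function name, and the order of the checks are assumptions, and the lock acquisition mentioned in the message is left out entirely.

```c
#include <stdbool.h>

/* Hypothetical summary of the inode state the heuristic looks at. */
struct inode_model {
    bool has_cowblocks;   /* COW fork preallocation is present */
    int  writers;         /* number of opens with write access */
    bool dirty_pagecache;
    bool io_in_flight;
};

/* Decide whether a blockgc scan should trim this inode's cowblocks. */
bool should_trim_cowblocks_model(const struct inode_model *ip, bool sync_scan)
{
    if (!ip->has_cowblocks)
        return false;

    /* Functional correctness: never trim under dirty data or active I/O. */
    if (ip->dirty_pagecache || ip->io_in_flight)
        return false;

    /*
     * Performance heuristic: background scans leave files that are
     * open for write alone so their preallocation can persist.
     */
    if (!sync_scan && ip->writers > 0)
        return false;

    return true;
}
```

The split between the two early returns mirrors the distinction the commit message draws between logic that maintains correctness and heuristics that only affect fragmentation.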
    • xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc · 6aac7705
      Christoph Hellwig authored
      Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
      variant fails to drop into the lowmode last resort allocator, and
      thus can sometimes fail allocations for which the caller has a
      transaction block reservation.
      
      Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Carlos Maiolino <cem@kernel.org>
      6aac7705