1. 07 Apr, 2021 3 commits
    • xfs: scrub: Remove incorrect check executed on block format directories · e773f880
      Chandan Babu R authored
      A directory with one directory block, which in turn consists of two or more fs
      blocks, is incorrectly flagged as corrupt by scrub, since scrub assumes that
      "Block" format directories have a single data fork extent spanning the file
      offset range of [0, Dir block size - 1].
      
      This commit fixes the bug by removing the incorrect check.
      Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
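To illustrate the commit above, here is a minimal sketch of a block-format directory whose single directory block spans two fs blocks that happen to be mapped by two extents. The struct and function names are hypothetical and this is not the scrub code; it only contrasts the "single extent" assumption with a plain coverage test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical data fork extent: file offset and length, in fs blocks. */
struct ext {
	unsigned long startoff;
	unsigned long blockcount;
};

/* The removed assumption: exactly one extent covering [0, dirblk_fsbs). */
static bool single_extent_covers(const struct ext *e, size_t n,
				 unsigned long dirblk_fsbs)
{
	return n == 1 && e[0].startoff == 0 && e[0].blockcount == dirblk_fsbs;
}

/* Weaker test: sorted extents exactly cover [0, dirblk_fsbs) with no holes. */
static bool extents_cover(const struct ext *e, size_t n,
			  unsigned long dirblk_fsbs)
{
	unsigned long next = 0;

	assert(dirblk_fsbs > 0);
	for (size_t i = 0; i < n; i++) {
		if (e[i].startoff != next)
			return false;
		next += e[i].blockcount;
	}
	return next == dirblk_fsbs;
}
```

With extents {0, 1} and {1, 1} and a two-fs-block directory block, the first test rejects a layout that the second accepts, which is the false positive described above.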
    • xfs: Initialize xfs_alloc_arg->total correctly when allocating minlen extents · 6e8bd39d
      Chandan Babu R authored
      xfs/538 can cause the following call trace to be printed when executing on a
      multi-block directory configuration,
      
       WARNING: CPU: 1 PID: 2578 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
       Call Trace:
        ? xfs_buf_rele+0x4f/0x450
        xfs_bmap_add_extent_hole_real+0x747/0x960
        xfs_bmapi_allocate+0x39a/0x440
        xfs_bmapi_write+0x507/0x9e0
        xfs_da_grow_inode_int+0x1cd/0x330
        ? up+0x12/0x60
        xfs_dir2_grow_inode+0x62/0x110
        ? xfs_trans_log_inode+0x234/0x2d0
        xfs_dir2_sf_to_block+0x103/0x940
        ? xfs_dir2_sf_check+0x8c/0x210
        ? xfs_da_compname+0x19/0x30
        ? xfs_dir2_sf_lookup+0xd0/0x3d0
        xfs_dir2_sf_addname+0x10d/0x910
        xfs_dir_createname+0x1ad/0x210
        xfs_create+0x404/0x620
        xfs_generic_create+0x24c/0x320
        path_openat+0xda6/0x1030
        do_filp_open+0x88/0x130
        ? kmem_cache_alloc+0x50/0x210
        ? __cond_resched+0x16/0x40
        ? kmem_cache_alloc+0x50/0x210
        do_sys_openat2+0x97/0x150
        __x64_sys_creat+0x49/0x70
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This occurs because xfs_bmap_exact_minlen_extent_alloc() initializes
      xfs_alloc_arg->total to xfs_bmalloca->minlen. In the context of
      xfs_bmap_exact_minlen_extent_alloc(), xfs_bmalloca->minlen has a value of 1
      and hence the space allocator could choose an AG which has less than
      xfs_bmalloca->total number of free blocks available. As the transaction
      proceeds, one of the future space allocation requests could fail due to
      non-availability of free blocks in the AG that was originally chosen.
      
      This commit fixes the bug by assigning xfs_alloc_arg->total to the value of
      xfs_bmalloca->total.
      
      Fixes: 30151967 ("xfs: Introduce error injection to allocate only minlen size extents for files")
      Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
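The effect of the "total" hint in the commit above can be sketched with a toy AG selector. This is a simplified model under stated assumptions, not the XFS allocator: pick_ag() and the free-block array are hypothetical, and "first AG with enough free blocks" stands in for the real selection logic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy AG selection: return the index of the first AG that has at least
 * 'total' free blocks, or -1 if none qualifies.
 */
static int pick_ag(const unsigned int *agfree, size_t nags, unsigned int total)
{
	assert(total > 0);
	for (size_t i = 0; i < nags; i++) {
		if (agfree[i] >= total)
			return (int)i;
	}
	return -1;
}
```

If AG 0 has 2 free blocks, AG 1 has 50, and the transaction needs 10 blocks overall, passing total = 1 (the minlen) selects AG 0, which cannot satisfy the later requests. Passing total = 10 selects AG 1.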
    • xfs: Fix dax inode extent calculation when direct write is performed on an unwritten extent · 5147ef30
      Chandan Babu R authored
      With dax enabled filesystems, a direct write operation into an existing
      unwritten extent results in xfs_iomap_write_direct() zero-ing and converting
      the extent into a normal extent before the actual data is copied from the
      userspace buffer.
      
      The inode extent count can increase by 2 if the extent range being written to
      maps to the middle of the existing unwritten extent range. Hence this commit
      uses XFS_IEXT_WRITE_UNWRITTEN_CNT as the extent count delta when such a write
      operation is being performed.
      
      Fixes: 727e1acd ("xfs: Check for extent overflow when trivally adding a new extent")
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
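The "increase by 2" case in the commit above can be sketched as follows. The function name is hypothetical, and the model ignores merging with neighbouring extents; it only counts the unwritten remainders left on either side of the converted range.

```c
#include <assert.h>

/*
 * Number of extra extent records created when the block range
 * [wr_start, wr_start + wr_len) inside one unwritten extent
 * [ext_start, ext_start + ext_len) is converted to written.
 */
static int unwritten_convert_delta(unsigned long ext_start, unsigned long ext_len,
				   unsigned long wr_start, unsigned long wr_len)
{
	int delta = 0;

	/* The sketch only handles a non-empty write fully inside the extent. */
	assert(wr_len > 0);
	assert(wr_start >= ext_start);
	assert(wr_start + wr_len <= ext_start + ext_len);

	if (wr_start > ext_start)
		delta++;	/* unwritten remainder on the left */
	if (wr_start + wr_len < ext_start + ext_len)
		delta++;	/* unwritten remainder on the right */
	return delta;
}
```

A write to blocks 3..6 of a ten-block unwritten extent leaves remainders on both sides, so the extent count grows by 2. A write touching either edge grows it by 1, and a write covering the whole extent by 0.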
  2. 25 Mar, 2021 29 commits
    • xfs: fix xfs_trans slab cache name · 25dfa65f
      Anthony Iliopoulos authored
      Removal of kmem_zone_init wrappers accidentally changed a slab cache
      name from "xfs_trans" to "xf_trans". Fix this so that userspace
      consumers of /proc/slabinfo and /sys/kernel/slab can find it again.
      
      Fixes: b1231760 ("xfs: Remove slab init wrappers")
      Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: add error injection for per-AG resv failure · 2b92faed
      Gao Xiang authored
      Per-AG reservation failure after fixing up freespace is hard to test in
      an effective way, so add an error injection point directly, so that we
      can observe that this error handling path works as expected.
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: support shrinking unused space in the last AG · fb2fc172
      Gao Xiang authored
      As the first step of shrinking, this attempts to enable shrinking
      unused space in the last allocation group by fixing up the freespace
      btree, agi and agf and adjusting the super block, using a helper
      xfs_ag_shrink_space() to fix up the last AG.
      
      This can be all done in one transaction for now, so I think no
      additional protection is needed.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: introduce xfs_ag_shrink_space() · 46141dc8
      Gao Xiang authored
      This patch introduces a helper to shrink unused space in the last AG
      by fixing up the freespace btree.
      
      Also make sure that the per-AG reservation works under the new AG
      size. If such per-AG reservation or extent allocation fails, roll
      the transaction so the new transaction could cancel without any side
      effects.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: hoist out xfs_resizefs_init_new_ags() · c789c83c
      Gao Xiang authored
      Move the logic for initializing newly added AGs out into a new helper
      in preparation for shrinking. No logic changes.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: update lazy sb counters immediately for resizefs · 014695c0
      Gao Xiang authored
      sb_fdblocks will be updated lazily if lazysbcount is enabled,
      therefore when shrinking the filesystem sb_fdblocks could be
      larger than sb_dblocks and xfs_validate_sb_write() would fail.
      
      Even for growfs case, it'd be better to update lazy sb counters
      immediately to reflect the real sb counters.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
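The failure mode in the commit above can be sketched with two counters. This is a toy model, not the XFS superblock code: the struct, the shrink() helper and the numbers are hypothetical, and sb_write_ok() implements only the one rule the message names (free blocks must not exceed total blocks).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of superblock counters, in fs blocks. */
struct sb_counts {
	unsigned long long dblocks;	/* total data blocks */
	unsigned long long fdblocks;	/* free blocks as recorded in the sb */
};

/* The consistency rule named in the commit message. */
static bool sb_write_ok(const struct sb_counts *sb)
{
	return sb->fdblocks <= sb->dblocks;
}

/*
 * Shrink by 'delta' free blocks. 'live_free' is the up-to-date in-memory
 * free count. If 'sync_counters' is false, the stale recorded fdblocks
 * value is left alone, as with lazily updated counters.
 */
static void shrink(struct sb_counts *sb, unsigned long long live_free,
		   unsigned long long delta, bool sync_counters)
{
	assert(delta <= live_free && delta <= sb->dblocks);
	sb->dblocks -= delta;
	if (sync_counters)
		sb->fdblocks = live_free - delta;
}
```

Starting from 1000 total blocks with a stale recorded free count of 900 and a live free count of 500, shrinking by 300 leaves 700 total blocks. Without syncing, the recorded 900 free blocks now exceed the total and validation fails. With syncing, the recorded value becomes 200 and the check passes.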
    • xfs: Fix a typo · f9dd7ba4
      Bhaskar Chowdhury authored
      s/strutures/structures/
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Reviewed-by: Pavel Reichl <preichl@redhat.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Rudimentary spelling fix · 0145225e
      Bhaskar Chowdhury authored
      s/sytemcall/syscall/
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Rudimentary typo fixes · bd24a4f5
      Bhaskar Chowdhury authored
      s/filesytem/filesystem/
      s/instrumention/instrumentation/
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: __percpu_counter_compare() inode count debug too expensive · 5825bea0
      Dave Chinner authored
      - 21.92% __xfs_trans_commit
           - 21.62% xfs_log_commit_cil
      	- 11.69% xfs_trans_unreserve_and_mod_sb
      	   - 11.58% __percpu_counter_compare
      	      - 11.45% __percpu_counter_sum
      		 - 10.29% _raw_spin_lock_irqsave
      		    - 10.28% do_raw_spin_lock
      			 __pv_queued_spin_lock_slowpath
      
      We debated just getting rid of it last time this came up and
      there was no real objection to removing it. Now it's the biggest
      scalability limitation for debug kernels even on smallish machines,
      so let's just get rid of it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reduce debug overhead of dir leaf/node checks · 1fea323f
      Dave Chinner authored
      On debug kernels, we call xfs_dir3_leaf_check_int() multiple times
      on every directory modification. The robust hash ordering checks it
      does on every entry in the leaf on every call results in a massive
      CPU overhead which slows down debug kernels by a large amount.
      
      We use xfs_dir3_leaf_check_int() for the verifiers as well, so we
      can't just gut the function to reduce overhead. What we can do,
      however, is reduce the work it does when it is called from the
      debug interfaces, just leaving the high level checks in place and
      leaving the robust validation to the verifiers. This means the debug
      checks will catch gross errors, but subtle bugs might not be caught
      until a verifier is run.
      
      It is easy enough to restore the existing debug behaviour if the
      developer needs it (just change a call parameter in the debug code),
      but otherwise the overhead makes testing large directory block sizes
      on debug kernels very slow.
      
      Profile at an unlink rate of ~80k files/s on a 64k block size
      filesystem before the patch:
      
        40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
        10.98%  [kernel]  [k] __xfs_dir3_data_check
         8.10%  [kernel]  [k] xfs_verify_dir_ino
         4.42%  [kernel]  [k] memcpy
         2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
         1.52%  [kernel]  [k] do_raw_spin_lock
      
      Profile after, at an unlink rate of ~125k files/s (+50% improvement)
      has largely dropped the leaf verification debug overhead out of the
      profile.
      
        16.53%  [kernel]  [k] __xfs_dir3_data_check
        12.53%  [kernel]  [k] xfs_verify_dir_ino
         7.97%  [kernel]  [k] memcpy
         3.36%  [kernel]  [k] xfs_dir2_data_get_ftype
         2.86%  [kernel]  [k] __pv_queued_spin_lock_slowpath
      
      Create shows a similar change in profile and a +25% improvement in
      performance.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
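The split between cheap and robust checks described in the commit above can be sketched as a single function with a flag. The names, the entry layout and the max_entries bound are hypothetical; this is not xfs_dir3_leaf_check_int() itself, only the shape of the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical leaf check. The cheap part validates the entry count; the
 * expensive part walks every entry to confirm hash values never decrease.
 */
static bool leaf_check(const uint32_t *hashvals, size_t count,
		       size_t max_entries, bool expensive_checking)
{
	assert(hashvals != NULL || count == 0);

	if (count > max_entries)
		return false;		/* gross error, always caught */
	if (!expensive_checking)
		return true;		/* debug callers stop here */
	for (size_t i = 1; i < count; i++) {
		if (hashvals[i] < hashvals[i - 1])
			return false;	/* hash ordering violated */
	}
	return true;
}
```

A leaf with hash values 5, 3, 9 passes the cheap call but fails once the flag is set, which matches the trade-off stated above: gross errors are caught immediately, subtle ones only when the full check runs.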
    • xfs: No need for inode number error injection in __xfs_dir3_data_check · 39d3c0b5
      Dave Chinner authored
      We call xfs_dir_ino_validate() for every dir entry in a directory
      when doing validity checking of the directory. It calls
      xfs_verify_dir_ino() then emits a corruption report if bad or does
      error injection if good. It is extremely costly:
      
        43.27%  [kernel]  [k] xfs_dir3_leaf_check_int
        10.28%  [kernel]  [k] __xfs_dir3_data_check
         6.61%  [kernel]  [k] xfs_verify_dir_ino
         4.16%  [kernel]  [k] xfs_errortag_test
         4.00%  [kernel]  [k] memcpy
         3.48%  [kernel]  [k] xfs_dir_ino_validate
      
      7% of the cpu usage in this directory traversal workload is
      xfs_dir_ino_validate() doing absolutely nothing.
      
      We don't need error injection to simulate bad inode numbers in the
      directory structure because we can do that by fuzzing the structure
      on disk.
      
      And we don't need a corruption report, because the
      __xfs_dir3_data_check() will emit one if the inode number is bad.
      
      So just call xfs_verify_dir_ino() directly here, and get rid of all
      this unnecessary overhead:
      
        40.30%  [kernel]  [k] xfs_dir3_leaf_check_int
        10.98%  [kernel]  [k] __xfs_dir3_data_check
         8.10%  [kernel]  [k] xfs_verify_dir_ino
         4.42%  [kernel]  [k] memcpy
         2.22%  [kernel]  [k] xfs_dir2_data_get_ftype
         1.52%  [kernel]  [k] do_raw_spin_lock
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: type verification is expensive · ec08c14b
      Dave Chinner authored
      From a concurrent rm -rf workload:
      
        41.04%  [kernel]  [k] xfs_dir3_leaf_check_int
         9.85%  [kernel]  [k] __xfs_dir3_data_check
         5.60%  [kernel]  [k] xfs_verify_ino
         5.32%  [kernel]  [k] xfs_agino_range
         4.21%  [kernel]  [k] memcpy
         3.06%  [kernel]  [k] xfs_errortag_test
         2.57%  [kernel]  [k] xfs_dir_ino_validate
         1.66%  [kernel]  [k] xfs_dir2_data_get_ftype
         1.17%  [kernel]  [k] do_raw_spin_lock
         1.11%  [kernel]  [k] xfs_verify_dir_ino
         0.84%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
         0.83%  [kernel]  [k] xfs_buf_find
         0.64%  [kernel]  [k] xfs_log_commit_cil
      
      There's an awful lot of overhead in just range checking inode
      numbers in that, but each inode number check is not a lot of code.
      In total, a bit over 14.5% of the CPU time is spent validating
      inode numbers.
      
      The problem is that they are deeply nested global scope functions, so
      the overhead here is all in function call marshalling.
      
         text	   data	    bss	    dec	    hex	filename
         2077	      0	      0	   2077	    81d fs/xfs/libxfs/xfs_types.o.orig
         2197	      0	      0	   2197	    895	fs/xfs/libxfs/xfs_types.o
      
      There's a small increase in binary size by inlining all the local
      nested calls in the verifier functions, but the same workload now
      profiles as:
      
        40.69%  [kernel]  [k] xfs_dir3_leaf_check_int
        10.52%  [kernel]  [k] __xfs_dir3_data_check
         6.68%  [kernel]  [k] xfs_verify_dir_ino
         4.22%  [kernel]  [k] xfs_errortag_test
         4.15%  [kernel]  [k] memcpy
         3.53%  [kernel]  [k] xfs_dir_ino_validate
         1.87%  [kernel]  [k] xfs_dir2_data_get_ftype
         1.37%  [kernel]  [k] do_raw_spin_lock
         0.98%  [kernel]  [k] xfs_buf_find
         0.94%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
         0.73%  [kernel]  [k] xfs_log_commit_cil
      
      Now we only spend just over 10% of the time validating inode numbers
      for the same workload. Hence a few "inline" keywords are good enough
      to reduce the validation overhead by 30%...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
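The 14.5%, 10% and 30% figures in the commit above can be reproduced from the two profiles, assuming the inode-validation cost is the sum of xfs_verify_ino, xfs_agino_range, xfs_dir_ino_validate and xfs_verify_dir_ino before the change, and of xfs_verify_dir_ino and xfs_dir_ino_validate after it. That grouping is one reading of the listings, not something the message spells out. The helper names are hypothetical.

```c
#include <assert.h>

/* Profile shares in hundredths of a percent, copied from the two listings. */
static const int before[] = { 560, 532, 257, 111 };
static const int after[]  = { 668, 353 };

static int sum(const int *v, int n)
{
	int total = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		total += v[i];
	return total;
}

/* Relative reduction in whole percent, truncated towards zero. */
static int reduction_pct(int old_share, int new_share)
{
	assert(old_share > 0 && new_share <= old_share);
	return (old_share - new_share) * 100 / old_share;
}
```

Under that grouping the totals are 14.60% before and 10.21% after, a relative reduction of about 30%.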
    • xfs: optimise xfs_buf_item_size/format for contiguous regions · 929f8b0d
      Dave Chinner authored
      We process the buf_log_item bitmap one set bit at a time with
      xfs_next_bit() so we can detect if a region crosses a memcpy
      discontinuity in the buffer data address. This has massive overhead
      on large buffers (e.g. 64k directory blocks) because we do a lot of
      unnecessary checks and xfs_buf_offset() calls.
      
      For example, 16-way concurrent create workload on debug kernel
      running CPU bound has this at the top of the profile at ~120k
      create/s on 64kb directory block size:
      
        20.66%  [kernel]  [k] xfs_dir3_leaf_check_int
         7.10%  [kernel]  [k] memcpy
         6.22%  [kernel]  [k] xfs_next_bit
         3.55%  [kernel]  [k] xfs_buf_offset
         3.53%  [kernel]  [k] xfs_buf_item_format
         3.34%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         3.04%  [kernel]  [k] do_raw_spin_lock
         2.84%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
         2.31%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
         1.36%  [kernel]  [k] xfs_log_commit_cil
      
      (debug checks hurt large blocks)
      
      The only buffers with discontinuities in the data address are
      unmapped buffers, and they are only used for inode cluster buffers
      and only for logging unlinked pointers. IOWs, it is -rare- that we
      even need to detect a discontinuity in the buffer item formatting
      code.
      
      Optimise all this by using xfs_contig_bits() to find the size of
      the contiguous regions, then test for a discontinuity inside it. If
      we find one, do the slow "bit at a time" method we do now. If we
      don't, then just copy the entire contiguous range in one go.
      
      Profile now looks like:
      
        25.26%  [kernel]  [k] xfs_dir3_leaf_check_int
         9.25%  [kernel]  [k] memcpy
         5.01%  [kernel]  [k] __pv_queued_spin_lock_slowpath
         2.84%  [kernel]  [k] do_raw_spin_lock
         2.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
         1.88%  [kernel]  [k] xfs_buf_find
         1.53%  [kernel]  [k] memmove
         1.47%  [kernel]  [k] xfs_log_commit_cil
      ....
         0.34%  [kernel]  [k] xfs_buf_item_format
      ....
         0.21%  [kernel]  [k] xfs_buf_offset
      ....
         0.16%  [kernel]  [k] xfs_contig_bits
      ....
         0.13%  [kernel]  [k] xfs_buf_item_size_segment.isra.0
      
      So the bit scanning overhead for the dirty region tracking for the
      buffer log items is basically gone. Debug overhead hurts even more
      now...
      
      Perf comparison
      
      		dir block	 creates		unlink
      		size (kb)	time	rate		time
      
      Original	 4		4m08s	220k		 5m13s
      Original	64		7m21s	115k		13m25s
      Patched		 4		3m59s	230k		 5m03s
      Patched		64		6m23s	143k		12m33s
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
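The "one copy per contiguous run" idea in the commit above can be sketched over a plain bitmap. This is a simplified model: the helper names are hypothetical stand-ins, not the kernel's xfs_next_bit() or xfs_contig_bits(), and the discontinuity test for unmapped buffers is left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned int) * CHAR_BIT)

static bool bit_set(const unsigned int *map, size_t bit)
{
	return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1u;
}

/* Length of the run of set bits starting at 'start' (0 if that bit is clear). */
static size_t run_length(const unsigned int *map, size_t nbits, size_t start)
{
	size_t n = 0;

	while (start + n < nbits && bit_set(map, start + n))
		n++;
	return n;
}

/*
 * Count the copies needed to format the dirty chunks: one per contiguous
 * run of set bits, instead of one per set bit.
 */
static size_t copies_needed(const unsigned int *map, size_t nbits)
{
	size_t bit = 0, copies = 0;

	assert(map != NULL);
	while (bit < nbits) {
		size_t len = run_length(map, nbits, bit);

		if (len == 0) {
			bit++;		/* clean chunk, skip it */
			continue;
		}
		copies++;		/* whole run copied in one go */
		bit += len;
	}
	return copies;
}
```

For a bitmap with ten dirty chunks grouped into three runs, this needs three copies rather than ten. The saving grows with buffer size, which fits the observation that 64k directory blocks benefit most.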
    • xfs: xfs_buf_item_size_segment() needs to pass segment offset · c81ea11e
      Dave Chinner authored
      Otherwise it doesn't correctly calculate the number of vectors
      in a logged buffer that has a contiguous map that gets split into
      multiple regions because the range spans discontiguous memory.
      
      Probably never been hit in practice - we don't log contiguous ranges
      on unmapped buffers (inode clusters).
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: reduce buffer log item shadow allocations · accc661b
      Dave Chinner authored
      When we modify btrees repeatedly, we regularly increase the size of
      the logged region by a single chunk at a time (per transaction
      commit). This results in the CIL formatting code having to
      reallocate the log vector buffer every time the buffer dirty region
      grows. Hence over a typical 4kB btree buffer, we might grow the log
      vector 4096/128 = 32x over a short period where we repeatedly add
      or remove records to/from the buffer over a series of running
      transactions. This means we are doing 32 memory allocations and frees
      over this time during a performance critical path in the journal.
      
      The amount of space tracked in the CIL for the object is calculated
      during the ->iop_format() call for the buffer log item, but the
      buffer memory allocated for it is calculated by the ->iop_size()
      call. The size callout determines the size of the buffer, the format
      call determines the space used in the buffer.
      
      Hence we can oversize the buffer space required in the size
      calculation without impacting the amount of space used and accounted
      to the CIL for the changes being logged. This allows us to reduce
      the number of allocations by rounding up the buffer size to allow
      for future growth. This can save a substantial amount of CPU time in
      this path:
      
      -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
         - 44.49% xfs_log_commit_cil
            - 30.78% _raw_spin_lock
               - 30.75% do_raw_spin_lock
                    30.27% __pv_queued_spin_lock_slowpath
      
      (oh, ouch!)
      ....
            - 1.05% kmem_alloc_large
               - 1.02% kmem_alloc
                    0.94% __kmalloc
      
      This overhead here is what this patch is aimed at. After:
      
            - 0.76% kmem_alloc_large
               - 0.75% kmem_alloc
                    0.70% __kmalloc
      
      The size of 512 bytes is based on the bitmap chunk size being 128
      bytes and that random directory entry updates almost never require
      more than 3-4 128 byte regions to be logged in the directory block.
      
      The other observation is for per-ag btrees. When we are inserting
      into a new btree block, we'll pack it from the front. Hence the
      first few records land in the first 128 bytes so we log only 128
      bytes, the next 8-16 records land in the second region so now we log
      256 bytes. And so on.  If we are doing random updates, it will only
      allocate every 4 random 128 byte regions that are dirtied instead of
      every single one.
      
      Any larger than 512 bytes and I noticed an increase in memory
      footprint in my scalability workloads. Any less than this and I
      didn't really see any significant benefit to CPU usage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
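The allocation counts in the commit above (32 without rounding, one per four 128 byte regions with it) can be checked with a small simulation. The function names are hypothetical, and the model assumes the logged region grows by exactly one 128 byte chunk per commit, which the message describes as the typical btree case.

```c
#include <assert.h>
#include <stddef.h>

/* Round 'n' up to a multiple of 'granule'. */
static size_t round_up_to(size_t n, size_t granule)
{
	assert(granule > 0);
	return (n + granule - 1) / granule * granule;
}

/*
 * Count buffer (re)allocations while the logged region grows from one
 * 'chunk' to 'final' bytes, one chunk per commit, when every allocation
 * is sized to a multiple of 'granule'.
 */
static unsigned int count_allocs(size_t final, size_t chunk, size_t granule)
{
	size_t capacity = 0;
	unsigned int allocs = 0;

	assert(chunk > 0);
	for (size_t used = chunk; used <= final; used += chunk) {
		if (used > capacity) {
			capacity = round_up_to(used, granule);
			allocs++;
		}
	}
	return allocs;
}
```

For a 4096 byte buffer, count_allocs(4096, 128, 128) gives 32 and count_allocs(4096, 128, 512) gives 8, so rounding to 512 bytes removes three of every four allocations in this growth pattern.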
    • xfs: initialise attr fork on inode create · e6a688c3
      Dave Chinner authored
      When we allocate a new inode, we often need to add an attribute to
      the inode as part of the create. This can happen as a result of
      needing to add default ACLs or security labels before the inode is
      made visible to userspace.
      
      This is highly inefficient right now. We do the create transaction
      to allocate the inode, then we do an "add attr fork" transaction to
      modify the just created empty inode to set the inode fork offset to
      allow attributes to be stored, then we go and do the attribute
      creation.
      
      This means 3 transactions instead of 1 to allocate an inode, and
      this greatly increases the load on the CIL commit code, resulting in
      excessive contention on the CIL spin locks and performance
      degradation:
      
       18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
        3.57%  [kernel]                [k] do_raw_spin_lock
        2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
        2.48%  [kernel]                [k] memcpy
        2.34%  [kernel]                [k] xfs_log_commit_cil
      
      The typical profile resulting from running fsmark on a selinux enabled
      filesystem adds this overhead to the create path:
      
        - 15.30% xfs_init_security
           - 15.23% security_inode_init_security
      	- 13.05% xfs_initxattrs
      	   - 12.94% xfs_attr_set
      	      - 6.75% xfs_bmap_add_attrfork
      		 - 5.51% xfs_trans_commit
      		    - 5.48% __xfs_trans_commit
      		       - 5.35% xfs_log_commit_cil
      			  - 3.86% _raw_spin_lock
      			     - do_raw_spin_lock
      				  __pv_queued_spin_lock_slowpath
      		 - 0.70% xfs_trans_alloc
      		      0.52% xfs_trans_reserve
      	      - 5.41% xfs_attr_set_args
      		 - 5.39% xfs_attr_set_shortform.constprop.0
      		    - 4.46% xfs_trans_commit
      		       - 4.46% __xfs_trans_commit
      			  - 4.33% xfs_log_commit_cil
      			     - 2.74% _raw_spin_lock
      				- do_raw_spin_lock
      				     __pv_queued_spin_lock_slowpath
      			       0.60% xfs_inode_item_format
      		      0.90% xfs_attr_try_sf_addname
      	- 1.99% selinux_inode_init_security
      	   - 1.02% security_sid_to_context_force
      	      - 1.00% security_sid_to_context_core
      		 - 0.92% sidtab_entry_to_string
      		    - 0.90% sidtab_sid2str_get
      			 0.59% sidtab_sid2str_put.part.0
      	   - 0.82% selinux_determine_inode_label
      	      - 0.77% security_transition_sid
      		   0.70% security_compute_sid.part.0
      
      And fsmark creation rate performance drops by ~25%. The key point to
      note here is that half the additional overhead comes from adding the
      attribute fork to the newly created inode. That's crazy, considering
      we can do this same thing at inode create time with a couple of
      lines of code and no extra overhead.
      
      So, if we know we are going to add an attribute immediately after
      creating the inode, let's just initialise the attribute fork inside
      the create transaction and chop that whole chunk of code out of
      the create fast path. This completely removes the performance
      drop caused by enabling SELinux, and the profile looks like:
      
           - 8.99% xfs_init_security
               - 9.00% security_inode_init_security
                  - 6.43% xfs_initxattrs
                     - 6.37% xfs_attr_set
                        - 5.45% xfs_attr_set_args
                           - 5.42% xfs_attr_set_shortform.constprop.0
                              - 4.51% xfs_trans_commit
                                 - 4.54% __xfs_trans_commit
                                    - 4.59% xfs_log_commit_cil
                                       - 2.67% _raw_spin_lock
                                          - 3.28% do_raw_spin_lock
                                               3.08% __pv_queued_spin_lock_slowpath
                                         0.66% xfs_inode_item_format
                              - 0.90% xfs_attr_try_sf_addname
                        - 0.60% xfs_trans_alloc
                  - 2.35% selinux_inode_init_security
                     - 1.25% security_sid_to_context_force
                        - 1.21% security_sid_to_context_core
                           - 1.19% sidtab_entry_to_string
                              - 1.20% sidtab_sid2str_get
                                 - 0.86% sidtab_sid2str_put.part.0
                                    - 0.62% _raw_spin_lock_irqsave
                                       - 0.77% do_raw_spin_lock
                                            __pv_queued_spin_lock_slowpath
                     - 0.84% selinux_determine_inode_label
                        - 0.83% security_transition_sid
                             0.86% security_compute_sid.part.0
      
      Which indicates the XFS overhead of creating the selinux xattr has
      been halved. This doesn't fix the CIL lock contention problem, just
      means it's not a limiting factor for this workload. Lock contention
      in the security subsystems is going to be an issue soon, though...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [djwong: fix compilation error when CONFIG_SECURITY=n]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
    • xfs: ensure xfs_errortag_random_default matches XFS_ERRTAG_MAX · b2c2974b
      Gao Xiang authored
      Add a BUILD_BUG_ON to xfs_errortag_add() in order to make sure that
      the length of xfs_errortag_random_default matches XFS_ERRTAG_MAX at
      build time.
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
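The build-time length check in the commit above can be sketched in standard C. This uses C11 _Static_assert as a stand-in for the kernel's BUILD_BUG_ON, and the enum, table and values are hypothetical rather than the XFS error tag definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error tags; TAG_MAX must stay last. */
enum err_tag {
	TAG_NOERROR,
	TAG_ALLOC_FAIL,
	TAG_IO_FAIL,
	TAG_MAX
};

/* One default injection frequency per tag (made-up values). */
static const unsigned int tag_random_default[] = {
	[TAG_NOERROR]    = 1,
	[TAG_ALLOC_FAIL] = 100,
	[TAG_IO_FAIL]    = 4,
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Compilation fails here if a tag is appended without a default. */
_Static_assert(ARRAY_SIZE(tag_random_default) == TAG_MAX,
	       "tag_random_default must have TAG_MAX entries");

static unsigned int tag_default(enum err_tag tag)
{
	assert(tag < TAG_MAX);
	return tag_random_default[tag];
}
```

Adding a new tag before TAG_MAX without extending the table shrinks the table relative to TAG_MAX, so the mismatch is reported by the compiler instead of surfacing as an out-of-bounds read at run time.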
    • xfs: Skip repetitive warnings about mount options · 92cf7d36
      Pavel Reichl authored
      Skip the warnings about mount option being deprecated if we are
      remounting and deprecated option state is not changing.
      
      Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605
      Fix-suggested-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Pavel Reichl <preichl@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
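The rule in the commit above (stay quiet on remount when the deprecated option's state is not changing) can be sketched as a small predicate. The function name and the flag representation are hypothetical; the real code works on the mount's flag word and the filesystem context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to warn about a deprecated option.
 * 'cur_flags' are the flags of the existing mount (only meaningful on
 * remount), 'flag' is the bit the option controls, and 'value' is the
 * state being requested.
 */
static bool should_warn_deprecated(bool remount, uint64_t cur_flags,
				   uint64_t flag, bool value)
{
	assert(flag != 0);
	/* Remount that leaves the option state unchanged: stay quiet. */
	if (remount && ((cur_flags & flag) != 0) == value)
		return false;
	return true;
}
```

An initial mount always warns. A remount warns only when the requested state differs from the state recorded in the current flags.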
    • xfs: rename variable mp to parsing_mp · 0f98b4ec
      Pavel Reichl authored
      Rename the mp variable to parsing_mp so it is easy to distinguish
      between the current mount point handle and the handle for the mount
      point whose mount options are being parsed.
      Suggested-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Pavel Reichl <preichl@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0f98b4ec
    • Darrick J. Wong's avatar
      xfs: rename the blockgc workqueue · 3fef46fc
      Darrick J. Wong authored
      Since we're about to start using the blockgc workqueue to dispose of
      inactivated inodes, strip the "block" prefix from the name; now it's
      merely the general garbage collection (gc) workqueue.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      3fef46fc
    • Darrick J. Wong's avatar
      xfs: prevent metadata files from being inactivated · 383e32b0
      Darrick J. Wong authored
      Files containing metadata (quota records, rt bitmap and summary info)
      are fully managed by the filesystem, which means that all resource
      cleanup must be explicit, not automatic.  This means that they should
      never be subjected to automatic post-eof truncation, nor should they be
      freed automatically even if the link count drops to zero.
      
      In other words, xfs_inactive() should leave these files alone.  Add the
      necessary predicate functions to make this happen.  This adds a second
      layer of prevention for the kind of fs corruption that was fixed by
      commit f4c32e87.  If we ever decide to support removing metadata
      files, we should make all those metadata updates explicit.
      
      Rearrange the order of #includes to fix compiler errors, since
      xfs_mount.h is supposed to be included before xfs_inode.h.
      
      Followup-to: f4c32e87 ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      383e32b0
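A predicate of the kind this commit describes might look roughly like the following. This is a simplified sketch, not the patch's code: the struct, its field names, and both function names are hypothetical, and the real inactivation decision considers more than the link count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down mount structure holding metadata inode numbers. */
struct sketch_mount {
	uint64_t rbmino;	/* realtime bitmap inode */
	uint64_t rsumino;	/* realtime summary inode */
	uint64_t uquotino;	/* user quota inode */
	uint64_t gquotino;	/* group quota inode */
	uint64_t pquotino;	/* project quota inode */
};

/* True if the inode number belongs to a filesystem-managed metadata file. */
bool sketch_is_metadata_inode(const struct sketch_mount *mp, uint64_t ino)
{
	return ino == mp->rbmino || ino == mp->rsumino ||
	       ino == mp->uquotino || ino == mp->gquotino ||
	       ino == mp->pquotino;
}

/*
 * Decide whether automatic cleanup may run on an inode. Metadata files are
 * always skipped, even when their link count is zero. For ordinary files
 * this sketch only looks at the link count.
 */
bool sketch_needs_inactive(const struct sketch_mount *mp, uint64_t ino,
			   uint32_t nlink)
{
	if (sketch_is_metadata_inode(mp, ino))
		return false;
	return nlink == 0;
}
```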
    • Darrick J. Wong's avatar
      xfs: validate ag btree levels using the precomputed values · 973975b7
      Darrick J. Wong authored
      Use the AG btree height limits that we precomputed into the xfs_mount to
      validate the AG headers instead of using XFS_BTREE_MAXLEVELS.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      973975b7
    • Darrick J. Wong's avatar
      xfs: remove return value from xchk_ag_btcur_init · f53acfac
      Darrick J. Wong authored
      Functions called by this function cannot fail, so get rid of the return
      and error checking.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      f53acfac
    • Darrick J. Wong's avatar
      xfs: set the scrub AG number in xchk_ag_read_headers · de9d2a78
      Darrick J. Wong authored
      Since xchk_ag_read_headers initializes fields in struct xchk_ag, we
      might as well set the AG number and save the callers the trouble.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      de9d2a78
    • Darrick J. Wong's avatar
      xfs: mark a data structure sick if there are cross-referencing errors · 9de4b514
      Darrick J. Wong authored
      If scrub observes cross-referencing errors while scanning a data
      structure, mark the data structure sick.  There's /something/
      inconsistent, even if we can't really tell what it is.
      
      Fixes: 4860a05d ("xfs: scrub/repair should update filesystem metadata health")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      9de4b514
    • Darrick J. Wong's avatar
      xfs: bail out of scrub immediately if scan incomplete · 7716ee54
      Darrick J. Wong authored
      If a scrubber cannot complete its check and signals an incomplete check,
      we must bail out immediately without updating health status, trying a
      repair, etc. because our scan is incomplete and we therefore do not know
      much more.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7716ee54
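The control flow this commit asks for (stop right away when the scan is incomplete, before any health update or repair attempt) can be illustrated with a minimal sketch. The flag names, the result struct, and the function are hypothetical and only model the ordering of the checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical output flags reported by a scrubber. */
#define SKETCH_OFLAG_CORRUPT	(1u << 0)
#define SKETCH_OFLAG_INCOMPLETE	(1u << 1)

/* What the post-scan steps ended up doing. */
struct sketch_result {
	bool health_updated;
	bool repair_attempted;
};

/* Run the post-scan steps, bailing out first if the scan was incomplete. */
struct sketch_result sketch_finish_scrub(uint32_t oflags)
{
	struct sketch_result res = { false, false };

	/* An incomplete scan proves nothing, so do nothing further. */
	if (oflags & SKETCH_OFLAG_INCOMPLETE)
		return res;

	/* The scan finished, so its verdict can be recorded. */
	res.health_updated = true;

	/* Only a completed scan that found corruption justifies a repair. */
	if (oflags & SKETCH_OFLAG_CORRUPT)
		res.repair_attempted = true;

	return res;
}
```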
    • Darrick J. Wong's avatar
      xfs: fix dquot scrub loop cancellation · 05237032
      Darrick J. Wong authored
      When xchk_quota_item figures out that it needs to terminate the scrub
      operation, it needs to return some error code to abort the loop, but
      instead it returns zero and the loop keeps running.  Fix this by making
      it use ECANCELED, and fix the other loop bailout condition check at the
      bottom too.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      05237032
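The bug pattern described here is a per-item callback that returns zero when it wants the iteration to stop, so the loop keeps running. A minimal userspace sketch of the fixed pattern follows. All names are hypothetical, and a negative item value stands in for whatever condition ends the scrub.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*sketch_item_fn)(int item, void *priv);

/* Generic iterator: stops at the first nonzero return from the callback. */
int sketch_iterate(const int *items, size_t n, sketch_item_fn fn, void *priv)
{
	for (size_t i = 0; i < n; i++) {
		int error = fn(items[i], priv);

		if (error)
			return error;
	}
	return 0;
}

struct sketch_scan {
	size_t visited;	/* items examined so far */
	int stop_seen;	/* set once the scan decides to terminate */
};

/* Per-item check: a negative item means "terminate the scan". */
int sketch_check_item(int item, void *priv)
{
	struct sketch_scan *sc = priv;

	sc->visited++;
	if (item < 0) {
		sc->stop_seen = 1;
		/* Returning 0 here would let the loop continue. */
		return -ECANCELED;
	}
	return 0;
}

/* Caller: a deliberate cancellation is not reported as a failure. */
int sketch_scan_items(const int *items, size_t n, struct sketch_scan *sc)
{
	int error = sketch_iterate(items, n, sketch_check_item, sc);

	if (error == -ECANCELED)
		error = 0;
	return error;
}
```

With the input `{1, -1, 2}` the third item is never visited, because the callback's `-ECANCELED` ends the loop at the second one.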
    • Darrick J. Wong's avatar
      xfs: fix uninitialized variables in xrep_calc_ag_resblks · 1aa26707
      Darrick J. Wong authored
      If we can't read the AGF header, we never actually set a value for
      freelen and usedlen.  These two variables are used to make the worst
      case estimate of btree size, so it's safe to set them to the AG size as
      a fallback.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      1aa26707
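The fix described in this commit amounts to giving both variables a safe worst-case default before the header read is attempted. The sketch below assumes a simplified interface: the struct, the function, and its parameters are hypothetical, and only the names freelen and usedlen come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_ag_estimate {
	uint32_t freelen;	/* free blocks assumed for sizing */
	uint32_t usedlen;	/* used blocks assumed for sizing */
};

/*
 * Estimate AG usage. ag_blocks is the AG size. agf_ok says whether the
 * header could be read, and agf_length / agf_freeblks are only
 * meaningful when agf_ok is true.
 */
struct sketch_ag_estimate sketch_estimate_ag(uint32_t ag_blocks, bool agf_ok,
					     uint32_t agf_length,
					     uint32_t agf_freeblks)
{
	/* Worst-case fallback: treat the whole AG as both free and used. */
	struct sketch_ag_estimate est = {
		.freelen = ag_blocks,
		.usedlen = ag_blocks,
	};

	if (agf_ok) {
		est.freelen = agf_freeblks;
		est.usedlen = agf_length - agf_freeblks;
	}
	return est;
}
```

Because both fields are initialized up front, a failed header read yields an overestimate rather than an undefined value.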
  3. 24 Mar, 2021 1 commit
  4. 21 Mar, 2021 7 commits
    • Linus Torvalds's avatar
      Linux 5.12-rc4 · 0d02ec6b
      Linus Torvalds authored
      0d02ec6b
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d7f5f1bd
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous ext4 bug fixes for v5.12"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: initialize ret to suppress smatch warning
        ext4: stop inode update before return
        ext4: fix rename whiteout with fast commit
        ext4: fix timer use-after-free on failed mount
        ext4: fix potential error in ext4_do_update_inode
        ext4: do not try to set xattr into ea_inode if value is empty
        ext4: do not iput inode under running transaction in ext4_rename()
        ext4: find old entry again if failed to rename whiteout
        ext4: fix error handling in ext4_end_enable_verity()
        ext4: fix bh ref count on error paths
        fs/ext4: fix integer overflow in s_log_groups_per_flex
        ext4: add reclaim checks to xattr code
        ext4: shrink race window in ext4_should_retry_alloc()
      d7f5f1bd
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block · 2c41fab1
      Linus Torvalds authored
      Pull io_uring followup fixes from Jens Axboe:
      
       - The SIGSTOP change from Eric, so we properly ignore that for
         PF_IO_WORKER threads.
      
       - Disallow sending signals to PF_IO_WORKER threads in general, we're
         not interested in having them funnel back to the io_uring owning
         task.
      
       - Stable fix from Stefan, ensuring we properly break links for short
         send/sendmsg recv/recvmsg if MSG_WAITALL is set.
      
       - Catch and loop when needing to run task_work before a PF_IO_WORKER
         thread goes to sleep.
      
      * tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block:
        io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL
        io-wq: ensure task is running before processing task_work
        signal: don't allow STOP on PF_IO_WORKER threads
        signal: don't allow sending any signals to PF_IO_WORKER threads
      2c41fab1
    • Linus Torvalds's avatar
      Merge tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 1d4345eb
      Linus Torvalds authored
      Pull staging and IIO driver fixes from Greg KH:
       "Some small staging and IIO driver fixes:
      
         - MAINTAINERS changes for the move of the staging mailing list
      
         - comedi driver fixes to get request_irq() to work correctly
      
         - counter driver fixes for reported issues with iio devices
      
         - tiny iio driver fixes for reported issues.
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vt665x: fix alignment constraints
        staging: comedi: cb_pcidas64: fix request_irq() warn
        staging: comedi: cb_pcidas: fix request_irq() warn
        MAINTAINERS: move the staging subsystem to lists.linux.dev
        MAINTAINERS: move some real subsystems off of the staging mailing list
        iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler
        iio: hid-sensor-temperature: Fix issues of timestamp channel
        iio: hid-sensor-humidity: Fix alignment issue of timestamp channel
        counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register
        counter: stm32-timer-cnt: fix ceiling write max value
        counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED
        iio: adc: ab8500-gpadc: Fix off by 10 to 3
        iio:adc:stm32-adc: Add HAS_IOMEM dependency
        iio: adis16400: Fix an error code in adis16400_initial_setup()
        iio: adc: adi-axi-adc: add proper Kconfig dependencies
        iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask
        iio: hid-sensor-prox: Fix scale not correct issue
        iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
      1d4345eb
    • Linus Torvalds's avatar
      Merge tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 3001c355
      Linus Torvalds authored
      Pull USB and Thunderbolt driver fixes from Greg KH:
       "Here are some small Thunderbolt and USB driver fixes for some reported
        issues:
      
         - thunderbolt fixes for minor problems
      
         - typec fixes for power issues
      
         - usb-storage quirk addition
      
         - usbip bugfix
      
         - dwc3 bugfix when stopping transfers
      
         - cdnsp bugfix for isoc transfers
      
         - gadget use-after-free fix
      
        All have been in linux-next this week with no reported issues"
      
      * tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: tcpm: Skip sink_cap query only when VDM sm is busy
        usb: dwc3: gadget: Prevent EP queuing while stopping transfers
        usb: typec: tcpm: Invoke power_supply_changed for tcpm-source-psy-
        usb: typec: Remove vdo[3] part of tps6598x_rx_identity_reg struct
        usb-storage: Add quirk to defeat Kindle's automatic unload
        usb: gadget: configfs: Fix KASAN use-after-free
        usbip: Fix incorrect double assignment to udc->ud.tcp_rx
        usb: cdnsp: Fixes incorrect value in ISOC TRB
        thunderbolt: Increase runtime PM reference count on DP tunnel discovery
        thunderbolt: Initialize HopID IDAs in tb_switch_alloc()
      3001c355
    • Linus Torvalds's avatar
      Merge tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ee96fa9
      Linus Torvalds authored
      Pull irq fix from Ingo Molnar:
       "A change to robustify force-threaded IRQ handlers to always disable
        interrupts, plus a DocBook fix.
      
        The force-threaded IRQ handler change has been accelerated from the
        normal schedule of such a change to keep the bad pattern/workaround of
        spin_lock_irqsave() in handlers or IRQF_NOTHREAD as a kludge from
        spreading"
      
      * tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Disable interrupts for force threaded handlers
        genirq/irq_sim: Fix typos in kernel doc (fnode -> fwnode)
      5ee96fa9
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1c74516c
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Boundary condition fixes for bugs unearthed by the perf fuzzer"
      
      * tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT
        perf/x86/intel: Fix a crash caused by zero PEBS status
      1c74516c