1. 25 Mar, 2021 14 commits
    • Dave Chinner's avatar
      xfs: reduce buffer log item shadow allocations · accc661b
      Dave Chinner authored
      When we modify btrees repeatedly, we regularly increase the size of
      the logged region by a single chunk at a time (per transaction
      commit). This results in the CIL formatting code having to
      reallocate the log vector buffer every time the buffer dirty region
      grows. Hence over a typical 4kB btree buffer, we might grow the log
      vector 4096/128 = 32x over a short period where we repeatedly add
      or remove records to/from the buffer over a series of running
      transaction. This means we are doing 32 memory allocations and frees
      over this time during a performance critical path in the journal.
      
      The amount of space tracked in the CIL for the object is calculated
      during the ->iop_format() call for the buffer log item, but the
      buffer memory allocated for it is calculated by the ->iop_size()
      call. The size callout determines the size of the buffer, the format
      call determines the space used in the buffer.
      
      Hence we can oversize the buffer space required in the size
      calculation without impacting the amount of space used and accounted
      to the CIL for the changes being logged. This allows us to reduce
      the number of allocations by rounding up the buffer size to allow
      for future growth. This can safe a substantial amount of CPU time in
      this path:
      
      -   46.52%     2.02%  [kernel]                  [k] xfs_log_commit_cil
         - 44.49% xfs_log_commit_cil
            - 30.78% _raw_spin_lock
               - 30.75% do_raw_spin_lock
                    30.27% __pv_queued_spin_lock_slowpath
      
      (oh, ouch!)
      ....
            - 1.05% kmem_alloc_large
               - 1.02% kmem_alloc
                    0.94% __kmalloc
      
      This overhead here us what this patch is aimed at. After:
      
            - 0.76% kmem_alloc_large
               - 0.75% kmem_alloc
                    0.70% __kmalloc
      
      The size of 512 bytes is based on the bitmap chunk size being 128
      bytes and that random directory entry updates almost never require
      more than 3-4 128 byte regions to be logged in the directory block.
      
      The other observation is for per-ag btrees. When we are inserting
      into a new btree block, we'll pack it from the front. Hence the
      first few records land in the first 128 bytes so we log only 128
      bytes, the next 8-16 records land in the second region so now we log
      256 bytes. And so on.  If we are doing random updates, it will only
      allocate every 4 random 128 byte regions that are dirtied instead of
      every single one.
      
      Any larger than 512 bytes and I noticed an increase in memory
      footprint in my scalability workloads. Any less than this and I
      didn't really see any significant benefit to CPU usage.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarGao Xiang <hsiangkao@redhat.com>
      accc661b
    • Dave Chinner's avatar
      xfs: initialise attr fork on inode create · e6a688c3
      Dave Chinner authored
      When we allocate a new inode, we often need to add an attribute to
      the inode as part of the create. This can happen as a result of
      needing to add default ACLs or security labels before the inode is
      made visible to userspace.
      
      This is highly inefficient right now. We do the create transaction
      to allocate the inode, then we do an "add attr fork" transaction to
      modify the just created empty inode to set the inode fork offset to
      allow attributes to be stored, then we go and do the attribute
      creation.
      
      This means 3 transactions instead of 1 to allocate an inode, and
      this greatly increases the load on the CIL commit code, resulting in
      excessive contention on the CIL spin locks and performance
      degradation:
      
       18.99%  [kernel]                [k] __pv_queued_spin_lock_slowpath
        3.57%  [kernel]                [k] do_raw_spin_lock
        2.51%  [kernel]                [k] __raw_callee_save___pv_queued_spin_unlock
        2.48%  [kernel]                [k] memcpy
        2.34%  [kernel]                [k] xfs_log_commit_cil
      
      The typical profile resulting from running fsmark on a selinux enabled
      filesytem is adds this overhead to the create path:
      
        - 15.30% xfs_init_security
           - 15.23% security_inode_init_security
      	- 13.05% xfs_initxattrs
      	   - 12.94% xfs_attr_set
      	      - 6.75% xfs_bmap_add_attrfork
      		 - 5.51% xfs_trans_commit
      		    - 5.48% __xfs_trans_commit
      		       - 5.35% xfs_log_commit_cil
      			  - 3.86% _raw_spin_lock
      			     - do_raw_spin_lock
      				  __pv_queued_spin_lock_slowpath
      		 - 0.70% xfs_trans_alloc
      		      0.52% xfs_trans_reserve
      	      - 5.41% xfs_attr_set_args
      		 - 5.39% xfs_attr_set_shortform.constprop.0
      		    - 4.46% xfs_trans_commit
      		       - 4.46% __xfs_trans_commit
      			  - 4.33% xfs_log_commit_cil
      			     - 2.74% _raw_spin_lock
      				- do_raw_spin_lock
      				     __pv_queued_spin_lock_slowpath
      			       0.60% xfs_inode_item_format
      		      0.90% xfs_attr_try_sf_addname
      	- 1.99% selinux_inode_init_security
      	   - 1.02% security_sid_to_context_force
      	      - 1.00% security_sid_to_context_core
      		 - 0.92% sidtab_entry_to_string
      		    - 0.90% sidtab_sid2str_get
      			 0.59% sidtab_sid2str_put.part.0
      	   - 0.82% selinux_determine_inode_label
      	      - 0.77% security_transition_sid
      		   0.70% security_compute_sid.part.0
      
      And fsmark creation rate performance drops by ~25%. The key point to
      note here is that half the additional overhead comes from adding the
      attribute fork to the newly created inode. That's crazy, considering
      we can do this same thing at inode create time with a couple of
      lines of code and no extra overhead.
      
      So, if we know we are going to add an attribute immediately after
      creating the inode, let's just initialise the attribute fork inside
      the create transaction and chop that whole chunk of code out of
      the create fast path. This completely removes the performance
      drop caused by enabling SELinux, and the profile looks like:
      
           - 8.99% xfs_init_security
               - 9.00% security_inode_init_security
                  - 6.43% xfs_initxattrs
                     - 6.37% xfs_attr_set
                        - 5.45% xfs_attr_set_args
                           - 5.42% xfs_attr_set_shortform.constprop.0
                              - 4.51% xfs_trans_commit
                                 - 4.54% __xfs_trans_commit
                                    - 4.59% xfs_log_commit_cil
                                       - 2.67% _raw_spin_lock
                                          - 3.28% do_raw_spin_lock
                                               3.08% __pv_queued_spin_lock_slowpath
                                         0.66% xfs_inode_item_format
                              - 0.90% xfs_attr_try_sf_addname
                        - 0.60% xfs_trans_alloc
                  - 2.35% selinux_inode_init_security
                     - 1.25% security_sid_to_context_force
                        - 1.21% security_sid_to_context_core
                           - 1.19% sidtab_entry_to_string
                              - 1.20% sidtab_sid2str_get
                                 - 0.86% sidtab_sid2str_put.part.0
                                    - 0.62% _raw_spin_lock_irqsave
                                       - 0.77% do_raw_spin_lock
                                            __pv_queued_spin_lock_slowpath
                     - 0.84% selinux_determine_inode_label
                        - 0.83% security_transition_sid
                             0.86% security_compute_sid.part.0
      
      Which indicates the XFS overhead of creating the selinux xattr has
      been halved. This doesn't fix the CIL lock contention problem, just
      means it's not a limiting factor for this workload. Lock contention
      in the security subsystems is going to be an issue soon, though...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      [djwong: fix compilation error when CONFIG_SECURITY=n]
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarGao Xiang <hsiangkao@redhat.com>
      e6a688c3
    • Gao Xiang's avatar
      xfs: ensure xfs_errortag_random_default matches XFS_ERRTAG_MAX · b2c2974b
      Gao Xiang authored
      Add the BUILD_BUG_ON to xfs_errortag_add() in order to make sure that
      the length of xfs_errortag_random_default matches XFS_ERRTAG_MAX when
      building.
      Signed-off-by: default avatarGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b2c2974b
    • Pavel Reichl's avatar
      xfs: Skip repetitive warnings about mount options · 92cf7d36
      Pavel Reichl authored
      Skip the warnings about mount option being deprecated if we are
      remounting and deprecated option state is not changing.
      
      Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605Fix-suggested-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarPavel Reichl <preichl@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      92cf7d36
    • Pavel Reichl's avatar
      xfs: rename variable mp to parsing_mp · 0f98b4ec
      Pavel Reichl authored
      Rename mp variable to parsisng_mp so it is easy to distinguish
      between current mount point handle and handle for mount point
      which mount options are being parsed.
      Suggested-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarPavel Reichl <preichl@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      0f98b4ec
    • Darrick J. Wong's avatar
      xfs: rename the blockgc workqueue · 3fef46fc
      Darrick J. Wong authored
      Since we're about to start using the blockgc workqueue to dispose of
      inactivated inodes, strip the "block" prefix from the name; now it's
      merely the general garbage collection (gc) workqueue.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      3fef46fc
    • Darrick J. Wong's avatar
      xfs: prevent metadata files from being inactivated · 383e32b0
      Darrick J. Wong authored
      Files containing metadata (quota records, rt bitmap and summary info)
      are fully managed by the filesystem, which means that all resource
      cleanup must be explicit, not automatic.  This means that they should
      never be subjected automatic to post-eof truncation, nor should they be
      freed automatically even if the link count drops to zero.
      
      In other words, xfs_inactive() should leave these files alone.  Add the
      necessary predicate functions to make this happen.  This adds a second
      layer of prevention for the kinds of fs corruption that was fixed by
      commit f4c32e87.  If we ever decide to support removing metadata
      files, we should make all those metadata updates explicit.
      
      Rearrange the order of #includes to fix compiler errors, since
      xfs_mount.h is supposed to be included before xfs_inode.h
      
      Followup-to: f4c32e87 ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      383e32b0
    • Darrick J. Wong's avatar
      xfs: validate ag btree levels using the precomputed values · 973975b7
      Darrick J. Wong authored
      Use the AG btree height limits that we precomputed into the xfs_mount to
      validate the AG headers instead of using XFS_BTREE_MAXLEVELS.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      973975b7
    • Darrick J. Wong's avatar
      xfs: remove return value from xchk_ag_btcur_init · f53acfac
      Darrick J. Wong authored
      Functions called by this function cannot fail, so get rid of the return
      and error checking.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      f53acfac
    • Darrick J. Wong's avatar
      xfs: set the scrub AG number in xchk_ag_read_headers · de9d2a78
      Darrick J. Wong authored
      Since xchk_ag_read_headers initializes fields in struct xchk_ag, we
      might as well set the AG number and save the callers the trouble.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      de9d2a78
    • Darrick J. Wong's avatar
      xfs: mark a data structure sick if there are cross-referencing errors · 9de4b514
      Darrick J. Wong authored
      If scrub observes cross-referencing errors while scanning a data
      structure, mark the data structure sick.  There's /something/
      inconsistent, even if we can't really tell what it is.
      
      Fixes: 4860a05d ("xfs: scrub/repair should update filesystem metadata health")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      9de4b514
    • Darrick J. Wong's avatar
      xfs: bail out of scrub immediately if scan incomplete · 7716ee54
      Darrick J. Wong authored
      If a scrubber cannot complete its check and signals an incomplete check,
      we must bail out immediately without updating health status, trying a
      repair, etc. because our scan is incomplete and we therefore do not know
      much more.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      7716ee54
    • Darrick J. Wong's avatar
      xfs: fix dquot scrub loop cancellation · 05237032
      Darrick J. Wong authored
      When xchk_quota_item figures out that it needs to terminate the scrub
      operation, it needs to return some error code to abort the loop, but
      instead it returns zero and the loop keeps running.  Fix this by making
      it use ECANCELED, and fix the other loop bailout condition check at the
      bottom too.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      05237032
    • Darrick J. Wong's avatar
      xfs: fix uninitialized variables in xrep_calc_ag_resblks · 1aa26707
      Darrick J. Wong authored
      If we can't read the AGF header, we never actually set a value for
      freelen and usedlen.  These two variables are used to make the worst
      case estimate of btree size, so it's safe to set them to the AG size as
      a fallback.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      1aa26707
  2. 24 Mar, 2021 1 commit
  3. 21 Mar, 2021 23 commits
  4. 20 Mar, 2021 2 commits
    • Thomas Gleixner's avatar
      genirq: Disable interrupts for force threaded handlers · 81e2073c
      Thomas Gleixner authored
      With interrupt force threading all device interrupt handlers are invoked
      from kernel threads. Contrary to hard interrupt context the invocation only
      disables bottom halfs, but not interrupts. This was an oversight back then
      because any code like this will have an issue:
      
      thread(irq_A)
        irq_handler(A)
          spin_lock(&foo->lock);
      
      interrupt(irq_B)
        irq_handler(B)
          spin_lock(&foo->lock);
      
      This has been triggered with networking (NAPI vs. hrtimers) and console
      drivers where printk() happens from an interrupt which interrupted the
      force threaded handler.
      
      Now people noticed and started to change the spin_lock() in the handler to
      spin_lock_irqsave() which affects performance or add IRQF_NOTHREAD to the
      interrupt request which in turn breaks RT.
      
      Fix the root cause and not the symptom and disable interrupts before
      invoking the force threaded handler which preserves the regular semantics
      and the usefulness of the interrupt force threading as a general debugging
      tool.
      
      For not RT this is not changing much, except that during the execution of
      the threaded handler interrupts are delayed until the handler
      returns. Vs. scheduling and softirq processing there is no difference.
      
      For RT kernels there is no issue.
      
      Fixes: 8d32a307 ("genirq: Provide forced interrupt threading")
      Reported-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJohan Hovold <johan@kernel.org>
      Acked-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de
      81e2073c
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 812da4d3
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
       "A handful of fixes for 5.12:
      
         - fix the SBI remote fence numbers for hypervisor fences, which had
           been transcribed in the wrong order in Linux. These fences are only
           used with the KVM patches applied.
      
         - fix a whole host of build warnings, these should have no functional
           change.
      
         - fix init_resources() to prevent an off-by-one error from causing an
           out-of-bounds array reference. This was manifesting during boot on
           vexriscv.
      
         - ensure the KASAN mappings are visible before proceeding to use
           them"
      
      * tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Correct SPARSEMEM configuration
        RISC-V: kasan: Declare kasan_shallow_populate() static
        riscv: Ensure page table writes are flushed when initializing KASAN vmalloc
        RISC-V: Fix out-of-bounds accesses in init_resources()
        riscv: Fix compilation error with Canaan SoC
        ftrace: Fix spelling mistake "disabed" -> "disabled"
        riscv: fix bugon.cocci warnings
        riscv: process: Fix no prototype for arch_dup_task_struct
        riscv: ftrace: Use ftrace_get_regs helper
        riscv: process: Fix no prototype for show_regs
        riscv: syscall_table: Reduce W=1 compilation warnings noise
        riscv: time: Fix no prototype for time_init
        riscv: ptrace: Fix no prototype warnings
        riscv: sbi: Fix comment of __sbi_set_timer_v01
        riscv: irq: Fix no prototype warning
        riscv: traps: Fix no prototype warnings
        RISC-V: correct enum sbi_ext_rfence_fid
      812da4d3