1. 19 Aug, 2023 2 commits
  2. 16 Aug, 2023 1 commit
  3. 15 Aug, 2023 6 commits
  4. 09 Aug, 2023 1 commit
  5. 08 Aug, 2023 1 commit
      fs: use __fput_sync in close(2) · 021a160a
      Linus Torvalds authored
      close(2) is a special case which guarantees a shallow kernel stack,
      making delegation to task_work machinery unnecessary. Said delegation is
      problematic as it involves atomic ops and interrupt masking trips, none
      of which are cheap on x86-64. Forcing close(2) to do it looks like an
      oversight in the original work.
      
      Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
      -> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
      thread returning to userspace land in resume_user_mode_work(), where
      rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
      the thread (and it is by default with contemporary glibc).
      
      Sample results from benchmarking open1_processes -t 1 from will-it-scale
      (that's an open + close loop) with tmpfs on /tmp, running on a Sapphire
      Rapids CPU (ops/s):
      stock+RSEQ:     1329857
      stock-RSEQ:     1421667 (+7%)
      patched:        1523521 (+14.5% / +7%) (with / without rseq)
      
      Patched result is the same regardless of rseq as the codepath is avoided.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
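For readers who want to see the kind of workload that "fs: use __fput_sync in close(2)" measures, here is a minimal userspace sketch of an open + close loop. It is not the will-it-scale source; the helper name `open_close_loop` and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `iters` open(2) + close(2) pairs on `path`
 * and return how many pairs succeeded. The file is created if missing.
 */
static long open_close_loop(const char *path, long iters)
{
	long done = 0;

	for (long i = 0; i < iters; i++) {
		int fd = open(path, O_RDWR | O_CREAT, 0600);

		if (fd < 0)
			break;
		/* close(2) is the syscall whose final fput() the commit above changes. */
		if (close(fd) != 0)
			break;
		done++;
	}
	return done;
}
```

Timing a large number of iterations on a tmpfs-backed path, before and after the patch, would roughly correspond to the ops/s figures quoted in the commit message.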
  6. 04 Aug, 2023 1 commit
      file: mostly eliminate spurious relocking in __range_close · ed192c59
      Mateusz Guzik authored
      Stock code takes a lock trip for every fd in range, but this can be
      trivially avoided and real-world consumers do have plenty of already
      closed cases.
      
      Just booting Debian 12 with a debug printk shows:
      (sh) min 3 max 17 closed 15 empty 0
      (sh) min 19 max 63 closed 31 empty 14
      (sh) min 4 max 63 closed 0 empty 60
      (spawn) min 3 max 63 closed 13 empty 48
      (spawn) min 3 max 63 closed 13 empty 48
      (mount) min 3 max 17 closed 15 empty 0
      (mount) min 19 max 63 closed 32 empty 13
      
      and so on.
      
      While here use more idiomatic naming.
      
      An avoidable relock is left in place to avoid uglifying the code.
      The code was not switched to bitmap traversal for the same reason.
      
      Tested with ltp kernel/syscalls/close_range
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Message-Id: <20230727113809.800067-1-mjguzik@gmail.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
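The saving described in "file: mostly eliminate spurious relocking in __range_close" can be illustrated with a simplified userspace model. This is not the kernel code: the fd table is reduced to a `bool` array, and both function names are hypothetical. The model only counts lock acquisitions under two strategies, one lock trip per fd versus holding the lock across the scan and dropping it only to close a file.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the stock behavior: one lock/unlock trip for every fd in range. */
static unsigned int lock_trips_per_fd(const bool *open_fd,
				      unsigned int min, unsigned int max)
{
	unsigned int trips = 0;

	for (unsigned int fd = min; fd <= max; fd++) {
		(void)open_fd[fd];	/* lookup happens under the lock */
		trips++;
	}
	return trips;
}

/*
 * Model of the reworked loop: take the lock once, scan the range, and
 * drop/retake the lock only around closing a file that was actually open.
 */
static unsigned int lock_trips_hold_across_scan(const bool *open_fd,
						unsigned int min, unsigned int max)
{
	unsigned int trips = 1;	/* initial acquisition */

	for (unsigned int fd = min; fd <= max; fd++) {
		if (open_fd[fd])
			trips++;	/* relock after closing outside the lock */
	}
	return trips;
}
```

With the "(spawn) min 3 max 63 closed 13 empty 48" case from the commit message, this model gives 61 lock trips for the per-fd strategy and 14 when the lock is held across the scan.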
  7. 02 Aug, 2023 1 commit
  8. 26 Jul, 2023 1 commit
  9. 14 Jul, 2023 2 commits
      fs: Fix error checking for d_hash_and_lookup() · 0d5a4f8f
      Wang Ming authored
      The d_hash_and_lookup() function returns error pointers or NULL.
      Most incorrect error checks were fixed, but the one in path_pts()
      was forgotten.
      
      Fixes: eedf265a ("devpts: Make each mount of devpts an independent filesystem.")
      Signed-off-by: Wang Ming <machel@vivo.com>
      Message-Id: <20230713120555.7025-1-machel@vivo.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
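Why a NULL-only check is insufficient for a function that "returns error pointers or NULL", as described in "fs: Fix error checking for d_hash_and_lookup()", can be shown with a small userspace sketch. The `model_*` helpers are simplified re-creations of the kernel's error-pointer idiom, written here for illustration only; the two `lookup_status_*` functions are hypothetical callers, not the code from path_pts().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace model). */
static inline void *model_err_ptr(long error)
{
	return (void *)error;
}

/* An error pointer lives in the top MODEL_MAX_ERRNO bytes of the address space. */
static inline bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static inline bool model_is_err_or_null(const void *ptr)
{
	return !ptr || model_is_err(ptr);
}

/* Buggy pattern: an error pointer is non-NULL, so it is treated as valid. */
static int lookup_status_null_only(const void *child)
{
	return child ? 0 : -ENOENT;
}

/* Checked pattern: both NULL and error pointers are rejected. */
static int lookup_status_checked(const void *child)
{
	return model_is_err_or_null(child) ? -ENOENT : 0;
}
```

In this model an error pointer passes the NULL-only check and would be dereferenced later, while the combined check rejects it.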
      attr: block mode changes of symlinks · 5d1f903f
      Christian Brauner authored
      Changing the mode of symlinks is meaningless as the vfs doesn't take the
      mode of a symlink into account during path lookup permission checking.
      
      However, the vfs doesn't block mode changes on symlinks. This has led
      to an untenable mess roughly classifiable into the following two
      categories:
      
      (1) Filesystems that don't implement an i_op->setattr() for symlinks.
      
          Such filesystems may or may not know that without i_op->setattr()
          defined, notify_change() falls back to simple_setattr() causing the
          inode's mode in the inode cache to be changed.
      
          That's a generic issue as this will affect all non-size changing
          inode attributes including ownership changes.
      
          Example: afs
      
      (2) Filesystems that fail with EOPNOTSUPP but change the mode of the
          symlink nonetheless.
      
          Some filesystems will happily update the mode of a symlink but still
          return EOPNOTSUPP. This is the biggest source of confusion for
          userspace.
      
          The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
          comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
      
              if (!err && attr->ia_valid & ATTR_MODE)
                      err = posix_acl_chmod(idmap, dentry, inode->i_mode);
      
          Filesystems including btrfs don't implement i_op->set_acl() so
          posix_acl_chmod() will report EOPNOTSUPP.
      
          When posix_acl_chmod() is called, most filesystems will have
          finished updating the inode.
      
          Perversely, this has the consequence that this behavior may depend
          on two kconfig options and on mount options:
      
          * CONFIG_POSIX_ACL={y,n}
          * CONFIG_${FSTYPE}_POSIX_ACL={y,n}
          * Opt_acl, Opt_noacl
      
          Example: btrfs, ext4, xfs
      
      The only way to change the mode on a symlink currently involves abusing
      an O_PATH file descriptor in the following manner:
      
              fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
      
              char path[PATH_MAX];
              snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
              chmod(path, 0000);
      
      But for most major filesystems with POSIX ACL support such as btrfs,
      ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
      the mode still updated due to the aforementioned posix_acl_chmod()
      nonsense.
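
The O_PATH technique from the commit message can be written as a self-contained function. This is a sketch under assumptions: the helper name `chmod_symlink_via_opath` and the negative-errno return convention are made up for illustration, and plain open(2) is used instead of openat(-1, ...), which behaves the same for absolute paths. It requires /proc to be mounted.

```c
#define _GNU_SOURCE /* for O_PATH */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to change the mode of the symlink itself by
 * going through /proc/self/fd/<fd>. Returns 0 on success or a negative
 * errno value (e.g. -EOPNOTSUPP) on failure.
 */
static int chmod_symlink_via_opath(const char *link, mode_t mode)
{
	char path[PATH_MAX];
	int fd, ret = 0;

	/* O_NOFOLLOW + O_PATH yields a descriptor for the link, not its target. */
	fd = open(link, O_CLOEXEC | O_PATH | O_NOFOLLOW);
	if (fd < 0)
		return -errno;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (chmod(path, mode) != 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

The result depends on the kernel and filesystem: as the commit message explains, many filesystems report EOPNOTSUPP here, and a caller that wants to know whether the mode changed anyway would have to lstat() the link afterwards.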
      
      So, given that for all major filesystems this would fail with EOPNOTSUPP
      and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
      changes on symlinks we should just try and block mode changes on
      symlinks directly in the vfs and have a clean break with this nonsense.
      
      If this causes any regressions, we do the next best thing and fix up all
      filesystems that do return EOPNOTSUPP with the mode updated to not call
      posix_acl_chmod() on symlinks.
      
      But as usual, let's try the clean cut solution first. It's a simple
      patch that can be easily reverted. Not marking this for backport as I'll
      do that manually if we're reasonably sure that this works and there are
      no strong objections.
      
      We could block this in chmod_common() but it's more appropriate to do it
      in notify_change() as it will also mean that we catch filesystems that
      change symlink permissions explicitly or accidentally.
      
      Similar proposals were floated in the past as in [3] and [4] and again
      recently in [5]. There's also a couple of bugs about this inconsistency
      as in [6] and [7].
      
      Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
      Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
      Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
      Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
      Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
      Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
      Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
      Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Suggested-by: Florian Weimer <fweimer@redhat.com>
      Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  10. 11 Jul, 2023 1 commit
      eventfd: prevent underflow for eventfd semaphores · 758b4920
      Wen Yang authored
      For eventfd with flag EFD_SEMAPHORE, when its ctx->count is 0, calling
      eventfd_ctx_do_read will cause ctx->count to underflow and wrap around
      to ULLONG_MAX.
      
      An underflow can happen with EFD_SEMAPHORE eventfds in at least the
      following three subsystems:
      
      (1) virt/kvm/eventfd.c
      (2) drivers/vfio/virqfd.c
      (3) drivers/virt/acrn/irqfd.c
      
      where (2) and (3) are just modeled after (1). An eventfd must be
      specified for use with the KVM_IRQFD ioctl(). This can also be an
      EFD_SEMAPHORE eventfd. When the eventfd count is zero or has been
      decremented to zero an underflow can be triggered when the irqfd is shut
      down by raising the KVM_IRQFD_FLAG_DEASSIGN flag in the KVM_IRQFD
      ioctl():
      
              // ctx->count == 0
              kvm_vm_ioctl()
              -> kvm_irqfd()
                 -> kvm_irqfd_deassign()
                    -> irqfd_deactivate()
                       -> irqfd_shutdown()
                          -> eventfd_ctx_remove_wait_queue(&cnt)
                             -> eventfd_ctx_do_read(&cnt)
      
      Userspace polling on the eventfd wouldn't notice the underflow because 1
      is always returned as the value from eventfd_read() while ctx->count
      would've underflowed. It's not a huge deal because this should only
      happen when the irqfd is shut down, but we should still fix it and
      avoid the spurious wakeup.
      
      Fixes: cb289d62 ("eventfd - allow atomic read and waitqueue remove")
      Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dylan Yudaken <dylany@fb.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <tencent_7588DFD1F365950A757310D764517A14B306@qq.com>
      [brauner: rewrite commit message and add explanation how this underflow can happen]
      Signed-off-by: Christian Brauner <brauner@kernel.org>
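The wraparound described in "eventfd: prevent underflow for eventfd semaphores" can be reproduced with a small userspace model of the counter arithmetic. The struct and function below are hypothetical simplifications, not the kernel's eventfd_ctx_do_read(); the `guard_zero` parameter is an assumption added here to contrast the unguarded and guarded behavior in one function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <sys/eventfd.h> /* EFD_SEMAPHORE */

/* Simplified stand-in for the kernel's eventfd context. */
struct model_eventfd_ctx {
	unsigned long long count;
	unsigned int flags;
};

/*
 * Model of a read: in semaphore mode the reader takes 1, otherwise it
 * takes the whole count. With guard_zero set, a semaphore whose count
 * is already 0 yields 0 instead of 1, so the subtraction cannot wrap.
 */
static void model_do_read(struct model_eventfd_ctx *ctx,
			  unsigned long long *cnt, bool guard_zero)
{
	bool sem = ctx->flags & EFD_SEMAPHORE;

	if (sem && (!guard_zero || ctx->count))
		*cnt = 1;
	else
		*cnt = ctx->count;

	ctx->count -= *cnt;	/* unsigned arithmetic: 0 - 1 wraps to ULLONG_MAX */
}
```

Starting from a count of 0 in semaphore mode, the unguarded path reports 1 and leaves the counter at ULLONG_MAX, while the guarded path reports 0 and leaves the counter at 0.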
  11. 10 Jul, 2023 11 commits
  12. 09 Jul, 2023 10 commits
  13. 08 Jul, 2023 2 commits
      mm: lock newly mapped VMA with corrected ordering · 1c7873e3
      Hugh Dickins authored
      Lockdep is certainly right to complain about
      
        (&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
                       but task is already holding lock:
        (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
      
      Invert those to the usual ordering.
      
      Fixes: 33313a74 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
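The ordering problem in "mm: lock newly mapped VMA with corrected ordering" can be illustrated with a toy lock-rank checker. This is far simpler than lockdep, which tracks a dependency graph rather than fixed ranks. The rank values are an assumption that encodes the ordering implied by the report above (the VMA lock is taken before i_mmap_rwsem), and all names are hypothetical.

```c
#include <assert.h>

/* Assumed ordering for this sketch: lower rank must be taken first. */
enum model_lock_rank {
	RANK_VMA_LOCK = 1,
	RANK_I_MMAP_RWSEM = 2,
};

struct model_lock_state {
	int highest_held;	/* 0 means no lock is held */
};

/*
 * Record an acquisition. Returns 0 if it respects the rank order, or -1
 * if a lock of equal or higher rank is already held (an inversion).
 */
static int model_acquire(struct model_lock_state *s, enum model_lock_rank rank)
{
	if ((int)rank <= s->highest_held)
		return -1;

	s->highest_held = rank;
	return 0;
}
```

Taking i_mmap_rwsem first and then the VMA lock is flagged as an inversion by this checker, while the inverted sequence from the fix passes.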
      Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of... · 946c6b59
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull hotfixes from Andrew Morton:
       "16 hotfixes. Six are cc:stable and the remainder address post-6.4
        issues"
      
      The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since
      it was all hopefully fixed in mainline.
      
      * tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        lib: dhry: fix sleeping allocations inside non-preemptable section
        kasan, slub: fix HW_TAGS zeroing with slub_debug
        kasan: fix type cast in memory_is_poisoned_n
        mailmap: add entries for Heiko Stuebner
        mailmap: update manpage link
        bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
        MAINTAINERS: add linux-next info
        mailmap: add Markus Schneider-Pargmann
        writeback: account the number of pages written back
        mm: call arch_swap_restore() from do_swap_page()
        squashfs: fix cache race with migration
        mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
        docs: update ocfs2-devel mailing list address
        MAINTAINERS: update ocfs2-devel mailing list address
        mm: disable CONFIG_PER_VMA_LOCK until its fixed
        fork: lock VMAs of the parent process when forking