1. 30 Jun, 2023 1 commit
  2. 26 Jun, 2023 21 commits
    • Merge tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 00173879
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Updates for the interrupt subsystem:
      
        Core:
      
         - Convert the interrupt descriptor storage to a maple tree to
           overcome the limitations of the radixtree + fixed size bitmap.
      
           This allows us to handle very large servers with a huge number of
           guests without imposing a huge memory overhead on everyone
      
         - Implement optional retriggering of interrupts which utilize the
           fasteoi handler to work around a GICv3 architecture issue
      
        Drivers:
      
         - A set of fixes and updates for the Loongson/LoongArch related
           drivers
      
         - Workaround for an ASR8601 integration hiccup which ends up with CPU
           numbering which can't be represented in the GIC implementation
      
         - The usual set of boring fixes and updates all over the place"
      
      * tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
        Revert "irqchip/mxs: Include linux/irqchip/mxs.h"
        irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
        irqchip/stm32-exti: Fix warning on initialized field overwritten
        irqchip/stm32-exti: Add STM32MP15xx IWDG2 EXTI to GIC map
        irqchip/gicv3: Add a iort_pmsi_get_dev_id() prototype
        irqchip/mxs: Include linux/irqchip/mxs.h
        irqchip/clps711x: Remove unused clps711x_intc_init() function
        irqchip/mmp: Remove non-DT codepath
        irqchip/ftintc010: Mark all function static
        irqdomain: Include internals.h for function prototypes
        irqchip/loongson-eiointc: Add DT init support
        dt-bindings: interrupt-controller: Add Loongson EIOINTC
        irqchip/loongson-eiointc: Fix irq affinity setting during resume
        irqchip/loongson-liointc: Add IRQCHIP_SKIP_SET_WAKE flag
        irqchip/loongson-liointc: Fix IRQ trigger polarity
        irqchip/loongson-pch-pic: Fix potential incorrect hwirq assignment
        irqchip/loongson-pch-pic: Fix initialization of HT vector register
        irqchip/gic-v3-its: Enable RESEND_WHEN_IN_PROGRESS for LPIs
        genirq: Allow fasteoi handler to resend interrupts on concurrent handling
        genirq: Expand doc for PENDING and REPLAY flags
        ...
    • Merge tag 'core-debugobjects-2023-06-26' of... · cef2dd76
      Linus Torvalds authored
      Merge tag 'core-debugobjects-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
      
      Pull debugobjects update from Thomas Gleixner:
       "A single update for debug objects:
      
         - Recheck whether debug objects is enabled before reporting a problem
           to avoid spamming the logs with messages which are caused by a
           concurrent OOM"
      
      * tag 'core-debugobjects-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        debugobjects: Recheck debug_objects_enabled before reporting
    • Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux · a0433f8c
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
      
       - NVMe pull request via Keith:
            - Various cleanups all around (Irvin, Chaitanya, Christophe)
            - Better struct packing (Christophe JAILLET)
            - Reduce controller error logs for optional commands (Keith)
            - Support for >=64KiB block sizes (Daniel Gomez)
            - Fabrics fixes and code organization (Max, Chaitanya, Daniel
              Wagner)
      
       - bcache updates via Coly:
            - Fix a race at init time (Mingzhe Zou)
            - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
      
       - use page pinning in the block layer for dio (David)
      
       - convert old block dio code to page pinning (David, Christoph)
      
       - cleanups for pktcdvd (Andy)
      
       - cleanups for rnbd (Guoqing)
      
       - use the unchecked __bio_add_page() for the initial single page
         additions (Johannes)
      
       - fix overflows in the Amiga partition handling code (Michael)
      
       - improve mq-deadline zoned device support (Bart)
      
       - keep passthrough requests out of the IO schedulers (Christoph, Ming)
      
       - improve support for flush requests, making them less special to deal
         with (Christoph)
      
       - add bdev holder ops and shutdown methods (Christoph)
      
       - fix the name_to_dev_t() situation and use cases (Christoph)
      
       - decouple the block open flags from fmode_t (Christoph)
      
       - ublk updates and cleanups, including adding user copy support (Ming)
      
       - BFQ sanity checking (Bart)
      
       - convert brd from radix to xarray (Pankaj)
      
       - constify various structures (Thomas, Ivan)
      
       - more fine grained persistent reservation ioctl capability checks
         (Jingbo)
      
       - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
         Jordy, Li, Min, Yu, Zhong, Waiman)
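
The Amiga partition item above is about arithmetic that overflows when partition geometry is multiplied out. The following is a minimal sketch of the general technique only, not the kernel patch: the helper `geometry_to_sectors` and its parameters are hypothetical, and `__builtin_mul_overflow` is the GCC/Clang builtin rather than the kernel's own overflow helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a cylinder number into a starting sector for a
 * given geometry. Returns false instead of silently wrapping when the
 * product does not fit in 64 bits.
 */
static bool geometry_to_sectors(uint32_t cylinder, uint32_t heads,
                                uint32_t sectors_per_track, uint64_t *out)
{
    uint64_t per_cylinder;

    assert(out != NULL);

    /* heads * sectors_per_track, computed with an overflow check */
    if (__builtin_mul_overflow((uint64_t)heads,
                               (uint64_t)sectors_per_track, &per_cylinder))
        return false;

    /* per_cylinder * cylinder can exceed 64 bits for hostile on-disk values */
    if (__builtin_mul_overflow(per_cylinder, (uint64_t)cylinder, out))
        return false;

    return true;
}
```

With sane geometry (cylinder 2, 16 heads, 63 sectors per track) the helper succeeds and stores 2016. With all three inputs at `0xFFFFFFFF` it reports failure, which is the point at which a parser would reject the partition table entry.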
      
      * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
        scsi/sg: don't grab scsi host module reference
        ext4: Fix warning in blkdev_put()
        block: don't return -EINVAL for not found names in devt_from_devname
        cdrom: Fix spectre-v1 gadget
        block: Improve kernel-doc headers
        blk-mq: don't insert passthrough request into sw queue
        bsg: make bsg_class a static const structure
        ublk: make ublk_chr_class a static const structure
        aoe: make aoe_class a static const structure
        block/rnbd: make all 'class' structures const
        block: fix the exclusive open mask in disk_scan_partitions
        block: add overflow checks for Amiga partition support
        block: change all __u32 annotations to __be32 in affs_hardblocks.h
        block: fix signed int overflow in Amiga partition support
        block: add capacity validation in bdev_add_partition()
        block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
        block: disallow Persistent Reservation on partitions
        reiserfs: fix blkdev_put() warning from release_journal_dev()
        block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
        block: document the holder argument to blkdev_get_by_path
        ...
    • Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux · 0aa69d53
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
       "Nothing major in this release, just a bunch of cleanups and some
        optimizations around networking mostly.
      
         - clean up file request flags handling (Christoph)
      
         - clean up request freeing and CQ locking (Pavel)
      
         - support for pre-registering the io_uring fd at setup time
           (Josh)
      
         - Add support for user allocated ring memory, rather than having the
           kernel allocate it. Mostly for packing rings into a huge page (me)
      
         - avoid an unnecessary double retry on receive (me)
      
         - maintain ordering for task_work, which also improves performance
           (me)
      
         - misc cleanups/fixes (Pavel, me)"
      
      * tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
        io_uring: merge conditional unlock flush helpers
        io_uring: make io_cq_unlock_post static
        io_uring: inline __io_cq_unlock
        io_uring: fix acquire/release annotations
        io_uring: kill io_cq_unlock()
        io_uring: remove IOU_F_TWQ_FORCE_NORMAL
        io_uring: don't batch task put on reqs free
        io_uring: move io_clean_op()
        io_uring: inline io_dismantle_req()
        io_uring: remove io_free_req_tw
        io_uring: open code io_put_req_find_next
        io_uring: add helpers to decode the fixed file file_ptr
        io_uring: use io_file_from_index in io_msg_grab_file
        io_uring: use io_file_from_index in __io_sync_cancel
        io_uring: return REQ_F_ flags from io_file_get_flags
        io_uring: remove io_req_ffs_set
        io_uring: remove a confusing comment above io_file_get_flags
        io_uring: remove the mode variable in io_file_get_flags
        io_uring: remove __io_file_supports_nowait
        io_uring: wait interruptibly for request completions on exit
        ...
    • Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux · 3eccc0c8
      Linus Torvalds authored
      Pull splice updates from Jens Axboe:
       "This kills off ITER_PIPE to avoid a race between truncate,
        iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
        with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
        memory corruption.
      
        Instead, we either use (a) filemap_splice_read(), which invokes the
        buffered file reading code and splices from the pagecache into the
        pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
        into it and then pushes the filled pages into the pipe; or (c) handle
        it in filesystem-specific code.
      
        Summary:
      
         - Rename direct_splice_read() to copy_splice_read()
      
         - Simplify the calculations for the number of pages to be reclaimed
           in copy_splice_read()
      
         - Turn do_splice_to() into a helper, vfs_splice_read(), so that it
           can be used by overlayfs and coda to perform the checks on the
           lower fs
      
         - Make vfs_splice_read() jump to copy_splice_read() to handle
           direct-I/O and DAX
      
         - Provide shmem with its own splice_read to handle non-existent pages
           in the pagecache. We don't want a ->read_folio() as we don't want
           to populate holes, but filemap_get_pages() requires it
      
         - Provide overlayfs with its own splice_read to call down to a lower
           layer as overlayfs doesn't provide ->read_folio()
      
         - Provide coda with its own splice_read to call down to a lower layer
           as coda doesn't provide ->read_folio()
      
         - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
           and random files as they just copy to the output buffer and don't
           splice pages
      
         - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
           ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
      
         - Make cifs use filemap_splice_read()
      
         - Replace pointers to generic_file_splice_read() with pointers to
           filemap_splice_read() as DIO and DAX are handled in the caller;
           filesystems can still provide their own alternate ->splice_read()
           op
      
         - Remove generic_file_splice_read()
      
         - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
           was the only user"
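
The choice between the three read paths described above can be pictured as a small decision function. This is a toy userspace model under stated assumptions, not kernel code: `struct toy_file`, `enum splice_strategy` and `pick_splice_strategy()` are hypothetical names, and the only ordering taken from the text is that direct I/O and DAX are handled in the caller before any per-filesystem handler is consulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three strategies named in the text. */
enum splice_strategy {
    SPLICE_FILEMAP, /* (a) splice pages out of the pagecache */
    SPLICE_COPY,    /* (b) read into fresh pages, push them into the pipe */
    SPLICE_FS_OWN,  /* (c) filesystem-specific handler */
};

/* Hypothetical, heavily simplified view of an open file. */
struct toy_file {
    bool o_direct;       /* opened for direct I/O */
    bool dax;            /* backed by DAX */
    bool has_own_splice; /* filesystem supplies its own handler */
};

static enum splice_strategy pick_splice_strategy(const struct toy_file *f)
{
    assert(f != NULL);

    /* Direct I/O and DAX bypass the pagecache, so the caller copies. */
    if (f->o_direct || f->dax)
        return SPLICE_COPY;

    /* Otherwise a filesystem may still provide its own handler. */
    if (f->has_own_splice)
        return SPLICE_FS_OWN;

    /* Default: buffered read, spliced from the pagecache. */
    return SPLICE_FILEMAP;
}
```

In this model a file opened for direct I/O always takes the copy path, even on a filesystem that supplies its own handler, mirroring the statement that DIO and DAX are dealt with in the caller.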
      
      * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
        splice: kdoc for filemap_splice_read() and copy_splice_read()
        iov_iter: Kill ITER_PIPE
        splice: Remove generic_file_splice_read()
        splice: Use filemap_splice_read() instead of generic_file_splice_read()
        cifs: Use filemap_splice_read()
        trace: Convert trace/seq to use copy_splice_read()
        zonefs: Provide a splice-read wrapper
        xfs: Provide a splice-read wrapper
        orangefs: Provide a splice-read wrapper
        ocfs2: Provide a splice-read wrapper
        ntfs3: Provide a splice-read wrapper
        nfs: Provide a splice-read wrapper
        f2fs: Provide a splice-read wrapper
        ext4: Provide a splice-read wrapper
        ecryptfs: Provide a splice-read wrapper
        ceph: Provide a splice-read wrapper
        afs: Provide a splice-read wrapper
        9p: Add splice_read wrapper
        net: Make sock_splice_read() use copy_splice_read() by default
        tty, proc, kernfs, random: Use copy_splice_read()
        ...
    • Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · cc423f63
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "Mainly core changes, refactoring and optimizations.
      
        Performance is improved in some areas; overall there may be a
        cumulative improvement due to refactoring that removed lookups in the
        IO path or simplified IO submission tracking.
      
        Core:
      
         - submit IO synchronously for fast checksums (crc32c and xxhash),
           remove high priority worker kthread
      
         - read extent buffer in one go, simplify IO tracking, bio submission
           and locking
      
         - remove additional tracking of redirtied extent buffers, originally
           added for zoned mode but actually not needed
      
         - track ordered extent pointer in bio to avoid rbtree lookups during
           IO
      
         - scrub, use recovered data stripes as cache to avoid unnecessary
           read
      
         - in zoned mode, optimize logical to physical mappings of extents
      
         - remove PageError handling, not set by VFS nor writeback
      
         - cleanups, refactoring, better structure packing
      
         - lots of error handling improvements
      
         - more assertions, lockdep annotations
      
         - print assertion failure with the exact line where it happens
      
         - tracepoint updates
      
         - more debugging prints
      
        Performance:
      
         - speedup in fsync(), better tracking of inode logged status can
           avoid transaction commit
      
         - IO path structures track logical offsets in data structures and
           do not need to look them up
      
        User visible changes:
      
         - don't commit transaction for every created subvolume, this can
           reduce time when many subvolumes are created in a batch
      
         - print affected files when relocation fails
      
         - trigger orphan file cleanup during START_SYNC ioctl
      
        Notable fixes:
      
         - fix crash when disabling quota and relocation
      
         - fix crashes when removing roots from dirty list
      
         - fix transaction abort during relocation when converting from newer
           profiles not covered by fallback
      
         - in zoned mode, stop reclaiming block groups if filesystem becomes
           read-only
      
         - fix rare race condition in tree mod log rewind that can miss some
           btree node slots
      
         - with enabled fsverity, drop up-to-date page bit in case the
           verification fails"
      
      * tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
        btrfs: fix race between quota disable and relocation
        btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
        btrfs: fix race when deleting free space root from the dirty cow roots list
        btrfs: fix race when deleting quota root from the dirty cow roots list
        btrfs: tracepoints: also show actual number of the outstanding extents
        btrfs: update i_version in update_dev_time
        btrfs: make btrfs_compressed_bioset static
        btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
        btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
        btrfs: scrub: remove scrub_ctx::csum_list member
        btrfs: do not BUG_ON after failure to migrate space during truncation
        btrfs: do not BUG_ON on failure to get dir index for new snapshot
        btrfs: send: do not BUG_ON() on unexpected symlink data extent
        btrfs: do not BUG_ON() when dropping inode items from log root
        btrfs: replace BUG_ON() at split_item() with proper error handling
        btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
        btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
        btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
        btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
        btrfs: abort transaction at update_ref_for_cow() when ref count is zero
        ...
    • Merge tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · e940efa9
      Linus Torvalds authored
      Pull zonefs updates from Damien Le Moal:
      
       - Modify the synchronous direct write path to use iomap instead of
         manually coding issuing zone append write BIOs (me)
      
       - Use the FMODE_CAN_ODIRECT file flag to indicate support for direct
         IO instead of using the old way with noop direct_io methods
         (Christoph)
      
      * tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
        zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
        zonefs: use iomap for synchronous direct writes
    • Merge tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 098c5dd9
      Linus Torvalds authored
      Pull erofs updates from Gao Xiang:
       "No outstanding new feature for this cycle.
      
        Most of these commits are decompression cleanups which are part of the
        ongoing development for subpage/folio compression support as well as
        xattr cleanups for the upcoming xattr bloom filter optimization [1].
      
        In addition, there are bugfixes to address some corner cases of
        compressed images due to global data de-duplication and arm64 16k
        pages.
      
        Summary:
      
         - Fix rare I/O hang on deduplicated compressed images due to loop
           hooked chains
      
         - Fix compact compression layout of 16k blocks on arm64 devices
      
         - Fix atomic context detection of async decompression
      
         - Decompression/Xattr code cleanups"
      
      Link: https://lore.kernel.org/r/20230621083209.116024-1-jefflexu@linux.alibaba.com [1]
      
      * tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: clean up zmap.c
        erofs: remove unnecessary goto
        erofs: Fix detection of atomic context
        erofs: use separate xattr parsers for listxattr/getxattr
        erofs: unify inline/shared xattr iterators for listxattr/getxattr
        erofs: make the size of read data stored in buffer_ofs
        erofs: unify xattr_iter structures
        erofs: use absolute position in xattr iterator
        erofs: fix compact 4B support for 16k block size
        erofs: convert erofs_read_metabuf() to erofs_bread() for xattr
        erofs: use poison pointer to replace the hard-coded address
        erofs: use struct lockref to replace handcrafted approach
        erofs: adapt managed inode operations into folios
        erofs: kill hooked chains to avoid loops on deduplicated compressed images
        erofs: avoid on-stack pagepool directly passed by arguments
        erofs: allocate extra bvec pages directly instead of retrying
        erofs: clean up z_erofs_pcluster_readmore()
        erofs: remove the member readahead from struct z_erofs_decompress_frontend
        erofs: fold in z_erofs_decompress()
    • Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux · 74774e24
      Linus Torvalds authored
      Pull fsverity updates from Eric Biggers:
       "Several updates for fs/verity/:
      
         - Do all hashing with the shash API instead of with the ahash API.
      
           This simplifies the code and reduces API overhead. It should also
           make things slightly easier for XFS's upcoming support for
           fsverity. It does drop fsverity's support for off-CPU hash
           accelerators, but that support was incomplete and not known to be
           used
      
         - Update and export fsverity_get_digest() so that it's ready for
           overlayfs's upcoming support for fsverity checking of lowerdata
      
         - Improve the documentation for builtin signature support
      
         - Fix a bug in the large folio support"
      
      * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
        fsverity: improve documentation for builtin signature support
        fsverity: rework fsverity_get_digest() again
        fsverity: simplify error handling in verify_data_block()
        fsverity: don't use bio_first_page_all() in fsverity_verify_bio()
        fsverity: constify fsverity_hash_alg
        fsverity: use shash API instead of ahash API
    • Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux · 4d483ab7
      Linus Torvalds authored
      Pull fscrypt update from Eric Biggers:
       "Just one flex array conversion patch"
      
      * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
        fscrypt: Replace 1-element array with flexible array
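
For readers unfamiliar with the idiom, the difference between a 1-element trailing array and a C99 flexible array member can be sketched as follows. The structs and the `blob_flex_new()` helper are hypothetical illustrations and are not the fscrypt definitions touched by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Old idiom: a 1-element array standing in for variable-length data. */
struct blob_old {
    unsigned short len;
    char data[1]; /* sizeof() counts this byte, so size math is off by one */
};

/* C99 flexible array member: the array adds nothing to sizeof(). */
struct blob_flex {
    unsigned short len;
    char data[];
};

/* Allocate a blob holding a copy of len bytes from src (hypothetical helper). */
static struct blob_flex *blob_flex_new(const char *src, unsigned short len)
{
    struct blob_flex *b;

    assert(src != NULL);

    /* Header plus payload: no "- 1" correction is needed. */
    b = malloc(sizeof(*b) + len);
    if (!b)
        return NULL;

    b->len = len;
    memcpy(b->data, src, len);
    return b;
}
```

With the flexible member, `sizeof(struct blob_flex)` covers only the header, so allocation sizes are simply header plus payload, and compilers and bounds checkers can tell that `data` has no fixed length.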
    • Merge tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · f7976a64
      Linus Torvalds authored
      Pull nfsd updates from Chuck Lever:
      
       - Clean-ups in the READ path in anticipation of MSG_SPLICE_PAGES
      
       - Better NUMA awareness when allocating pages and other objects
      
       - A number of minor clean-ups to XDR encoding
      
       - Elimination of a race when accepting a TCP socket
      
       - Numerous observability enhancements
      
      * tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (46 commits)
        nfsd: remove redundant assignments to variable len
        svcrdma: Fix stale comment
        NFSD: Distinguish per-net namespace initialization
        nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net
        SUNRPC: Address RCU warning in net/sunrpc/svc.c
        SUNRPC: Use sysfs_emit in place of strlcpy/sprintf
        SUNRPC: Remove transport class dprintk call sites
        SUNRPC: Fix comments for transport class registration
        svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()
        svcrdma: trace cc_release calls
        svcrdma: Convert "might sleep" comment into a code annotation
        NFSD: Add an nfsd4_encode_nfstime4() helper
        SUNRPC: Move initialization of rq_stime
        SUNRPC: Optimize page release in svc_rdma_sendto()
        svcrdma: Prevent page release when nothing was received
        svcrdma: Revert 2a1e4f21 ("svcrdma: Normalize Send page handling")
        SUNRPC: Revert 57990067 ("svcrdma: Remove unused sc_pages field")
        SUNRPC: Revert cc93ce95 ("svcrdma: Retain the page backing rq_res.head[0].iov_base")
        NFSD: add encoding of op_recall flag for write delegation
        NFSD: Add "official" reviewers for this subsystem
        ...
    • Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · c0a572d9
      Linus Torvalds authored
      Pull vfs mount updates from Christian Brauner:
       "This contains the work to extend move_mount() to allow adding a mount
        beneath the topmost mount of a mount stack.
      
        There are two LWN articles about this. One covers the original patch
        series in [1]. The other in [2] summarizes the session and roughly the
        discussion between Al and me at LSFMM. The second article also goes
        into some good questions from attendees.
      
        Since all details are found in the relevant commit, with a technical
        dive into semantics and locking at the end, I'm only adding the
        motivation and core functionality from the commit message and
        leaving out the invasive details. The code is also heavily commented
        and annotated, which was explicitly requested.
      
        TL;DR:
      
          > mount -t ext4 /dev/sda /mnt
            |
            └─/mnt    /dev/sda    ext4
      
          > mount --beneath -t xfs /dev/sdb /mnt
            |
            └─/mnt    /dev/sdb    xfs
              └─/mnt  /dev/sda    ext4
      
          > umount /mnt
            |
            └─/mnt    /dev/sdb    xfs
      
        The longer motivation is that various distributions are adding or are
        in the process of adding support for system extensions and in the
        future configuration extensions through various tools. A more detailed
        explanation on system and configuration extensions can be found on the
        manpage which is listed below at [3].
      
        System extension images may, dynamically at runtime, extend the
        /usr/ and /opt/ directory hierarchies with additional files. This is
        particularly useful on immutable system images where a /usr/ and/or
        /opt/ hierarchy residing on a read-only file system shall be extended
        temporarily at runtime without making any persistent modifications.
      
        When one or more system extension images are activated, their /usr/
        and /opt/ hierarchies are combined via overlayfs with the same
        hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
        with it ("merging"). When they are deactivated, the mount point is
        disassembled, again revealing the unmodified original host version of
        the hierarchy ("unmerging"). Merging thus makes the extension's
        resources suddenly appear below the /usr/ and /opt/ hierarchies as if
        they were included in the base OS image itself. Unmerging makes them
        disappear again, leaving in place only the files that were shipped
        with the base OS image itself.
      
        System configuration images are similar but operate on directories
        containing system or service configuration.
      
        On nearly all modern distributions mount propagation plays a crucial
        role and the rootfs of the OS is a shared mount in a peer group
        (usually with peer group id 1):
      
           TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /       /       ext4    shared:1     29      1
      
        On such systems all services and containers run in a separate mount
        namespace and are pivot_root()ed into their rootfs. A separate mount
        namespace is almost always used as it is the minimal isolation
        mechanism services have. But usually they are even much more isolated
        up to the point where they almost become indistinguishable from
        containers.
      
        Mount propagation again plays a crucial role here. The rootfs of all
        these services is a slave mount to the peer group of the host rootfs.
        This is done so the service will receive mount propagation events from
        the host when certain files or directories are updated.
      
        In addition, the rootfs of each service, container, and sandbox is
        also a shared mount in its separate peer group:
      
           TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
           /       /       ext4    shared:24 master:1  71      47
      
        For people not too familiar with mount propagation, the master:1 means
        that this is a slave mount to peer group 1, which, as one can see, is
        the host rootfs as indicated by shared:1 above. The shared:24
        indicates that the service rootfs is a shared mount in a separate peer
        group with peer group id 24.
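
Reading the PROPAGATION column as described above can be sketched in a few lines of C. This is a minimal illustration, assuming the field only carries `shared:N` and `master:N` tokens as in the tables shown here; `struct propagation` and `parse_propagation()` are hypothetical names, and 0 is used to mean "not present".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: 0 means the token was absent. */
struct propagation {
    int shared; /* peer group the mount is shared in */
    int master; /* peer group the mount is a slave to */
};

static struct propagation parse_propagation(const char *field)
{
    struct propagation p = { 0, 0 };
    const char *s;

    assert(field != NULL);

    /* "shared:N": the mount is a member of peer group N. */
    s = strstr(field, "shared:");
    if (s && sscanf(s, "shared:%d", &p.shared) != 1)
        p.shared = 0;

    /* "master:N": the mount receives propagation from peer group N. */
    s = strstr(field, "master:");
    if (s && sscanf(s, "master:%d", &p.master) != 1)
        p.master = 0;

    return p;
}
```

Applied to the service rootfs entry, `parse_propagation("shared:24 master:1")` yields shared group 24 and master group 1, while the host rootfs entry `shared:1` yields shared group 1 and no master.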
      
        A service may run other services. Such nested services will also have
        a rootfs mount that is a slave to the peer group of the outer service
        rootfs mount.
      
        For containers things are just slightly different. A container's rootfs
        isn't a slave to the service's or host rootfs' peer group. The rootfs
        mount of a container is simply a shared mount in its own peer group:
      
           TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /home/ubuntu/debian-tree  /       ext4    shared:99    61      60
      
        So whereas services are isolated OS components a container is treated
        like a separate world and mount propagation into it is restricted to a
        single well known mount that is a slave to the peer group of the
        shared mount /run on the host:
      
           TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68
      
        Here, the master:5 indicates that this mount is a slave to the peer
        group with peer group id 5. This allows propagating mounts into the
        container and served as a workaround for not being able to insert
        mounts into mount namespaces directly. But the new mount API does
        support inserting mounts directly. For the interested reader the
        blog post in [4] might be worth reading where I explain the old and the
        new approach to inserting mounts into mount namespaces.
      
        Containers, of course, can themselves be run as services. They often
        run full systems themselves which means they again run services and
        containers with the exact same propagation settings explained above.
      
        The whole system is designed so that it can be easily updated,
        including all services in various fine-grained ways without having to
        enter every single service's mount namespace which would be
        prohibitively expensive. The mount propagation layout has been
        carefully chosen so it is possible to propagate updates for system
        extensions and configurations from the host into all services.
      
        The simplest model to update the whole system is to mount on top of
        /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
        will then propagate into every service. This works cleanly the first
        time. However, when the system is updated multiple times it becomes
        necessary to unmount the first update on /opt, /usr, /etc and then
        propagate the new update. But this means there's an interval where
        the old base system is accessible. This has to be avoided to protect
        against downgrade attacks.
      
        The vfs already exposes a mechanism to userspace whereby mounts can be
        mounted beneath an existing mount. Such mounts are internally referred
        to as "tucked". The patch series exposes the ability to mount beneath
        a top mount through the new MOVE_MOUNT_BENEATH flag for the
        move_mount() system call. This allows userspace to seamlessly upgrade
        mounts. After this series the only thing that will have changed is
        that mounting beneath an existing mount can be done explicitly instead
        of just implicitly.
      
        The crux is that the proposed mechanism already exists and that it is
        so powerful as to cover cases where mounts are supposed to be updated
        with new versions. Crucially, it offers important flexibility:
        updates to a system may either be forced or delayed, with the umount
        of the top mount left to a service if it is a cooperative one"
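A minimal userspace sketch of the mount-beneath call described above. It assumes a kernel that carries this series and a caller with sufficient privileges. The helper name mount_beneath() is hypothetical. The flag values are fallback copies of the uapi definitions, since older userspace headers may not provide them, and the syscall number fallback applies to most architectures only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed values from the uapi headers). */
#ifndef SYS_move_mount
#define SYS_move_mount 429 /* unified syscall number on most architectures */
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif

/*
 * Hypothetical helper: attach the detached mount referred to by mnt_fd
 * beneath the current top mount at `target`.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int mount_beneath(int mnt_fd, const char *target)
{
    return (int)syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, target,
                        MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
}
```

A caller would typically obtain mnt_fd from fsmount() or open_tree() with OPEN_TREE_CLONE, call mount_beneath(mnt_fd, "/usr"), and later unmount the top mount so that the new mount becomes visible without exposing the old base system in between.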
      
      Link: https://lwn.net/Articles/927491 [1]
      Link: https://lwn.net/Articles/934094 [2]
      Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
      Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
      Link: https://github.com/flatcar/sysext-bakery
      Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
      Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
      Link: https://github.com/systemd/systemd/pull/26013
      
      * tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: allow to mount beneath top mount
        fs: use a for loop when locking a mount
        fs: properly document __lookup_mnt()
        fs: add path_mounted()
      c0a572d9
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 1f2300a7
      Linus Torvalds authored
      Pull vfs file handling updates from Christian Brauner:
       "This contains Amir's work to fix a long-standing problem where an
        unprivileged overlayfs mount can be used to avoid fanotify permission
        events that were requested for an inode or superblock on the
        underlying filesystem.
      
        Some background about files opened in overlayfs. If a file is opened
        in overlayfs @file->f_path will refer to a "fake" path. What this
        means is that while @file->f_inode will refer to the inode of the
        underlying layer, @file->f_path refers to an overlayfs
        {dentry,vfsmount} pair. The reasons for doing this are out of scope
        here but it is the reason why the vfs has been providing the
        open_with_fake_path() helper for overlayfs for a very long time now. So
        nothing new here.
      
        This is for sure not very elegant and everyone, including the
        overlayfs maintainers, agrees. Improving this significantly would
        involve more fragile and potentially rather invasive changes.
      
        In various codepaths access to the path of the underlying filesystem
        is needed for such hybrid files. The best example is fsnotify where
        this becomes security relevant. Passing the overlayfs
        @file->f_path->dentry will cause fsnotify to skip generating fsnotify
        events registered on the underlying inode or superblock.
      
        To fix this we extend the vfs provided open_with_fake_path() concept
        for overlayfs to create a backing file container that holds the real
        path and to expose a helper that can be used by relevant callers to
        get access to the path of the underlying filesystem through the new
        file_real_path() helper. This pattern is similar to what we do in
        d_real() and d_real_inode().
      
        The first beneficiary is fsnotify, and converting it fixes the
        security-sensitive problem mentioned above.
      
        There's a couple of nice cleanups included as well.
      
        Over time, the old open_with_fake_path() helper added specifically for
        overlayfs a long time ago started to get used in other places such as
        cachefiles, even though cachefiles have nothing to do with hybrid
        files.
      
        The only reason cachefiles used that concept was that files opened
        with open_with_fake_path() aren't charged against the caller's open
        file limit by raising FMODE_NOACCOUNT. It's mere coincidence that
        both overlayfs and cachefiles need to avoid overcharging the caller
        for their internal open calls.
      
        So this work disentangles FMODE_NOACCOUNT use cases and backing file
        use-cases by adding the FMODE_BACKING flag which indicates that the
        file can be used to retrieve the backing file of another filesystem.
        (Fyi, Jens will be sending you a really nice cleanup from Christoph
        that gets rid of 3 FMODE_* flags otherwise this would be the last
        fmode_t bit we'd be using.)
      
        So now overlayfs becomes the sole user of the renamed
        open_with_fake_path() helper which is now named backing_file_open().
        For internal kernel users such as cachefiles that are only interested
        in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
        kernel_file_open() helper which opens a file without being charged
        against the caller's open file limit. All new helpers are properly
        documented and clearly annotated to mention their special uses.
      
        We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
        distinguish it from vfs_tmpfile() and align it with the other kernel_*()
        internal helpers"
      
      * tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        ovl: enable fsnotify events on underlying real files
        fs: use backing_file container for internal files with "fake" f_path
        fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
        fs: use a helper for opening kernel internal files
        fs: rename {vfs,kernel}_tmpfile_open()
      1f2300a7
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 2eedfa9e
      Linus Torvalds authored
      Pull vfs rename locking updates from Christian Brauner:
       "This contains the work from Jan to fix problems with cross-directory
        renames originally reported in [1].
      
        To quickly sum it up, some filesystems (so far we know at least about
        ext4, udf, f2fs, ocfs2, likely also reiserfs, gfs2 and others) need to
        lock the directory when it is being renamed into another directory.
      
        This is because we need to update the parent pointer in the directory
        in that case and if that races with other operations on the directory,
        in particular a conversion from one directory format into another, bad
        things can happen.
      
        So far we've done the locking in the filesystem code but recently
        Darrick pointed out in [2] that the RENAME_EXCHANGE case was missing.
        That one is particularly nasty because RENAME_EXCHANGE can arbitrarily
        mix regular files and directories and proper lock ordering is not
        achievable in the filesystems alone.
      
        This patch set adds locking into vfs_rename() so that not only parent
        directories but also moved inodes, regardless of whether they are
        directories or not, are locked when calling into the filesystem.
      
        This means establishing a locking order for unrelated directories. New
        helpers are added for this purpose and our documentation is updated to
        cover this in detail.
      
        The locking is now actually easier to follow as we now always lock
        source and target. We've always locked the target independent of
        whether it was a directory or file and we've always locked source if
        it was a regular file. The exact details for why this came about can
        be found in [3] and [4]"
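One common way to establish a locking order between two unrelated objects is to order the locks by address. The sketch below shows that idea in userspace with pthread mutexes. It is an analogy only: the text above does not say which ordering the new VFS helpers use, and lock_two() and unlock_two() are hypothetical names.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Hypothetical helper: take two locks in a globally consistent order
 * (lowest address first), so that two threads locking the same pair in
 * opposite argument order cannot deadlock against each other.
 * Returns 0 on success or a pthread error code.
 */
static int lock_two(pthread_mutex_t *a, pthread_mutex_t *b)
{
    int err;

    if (a == b)
        return pthread_mutex_lock(a);

    /* Compare as integers; relational comparison of unrelated pointers is not portable. */
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }

    err = pthread_mutex_lock(a);
    if (err)
        return err;

    err = pthread_mutex_lock(b);
    if (err)
        pthread_mutex_unlock(a);
    return err;
}

/* Release both locks; the unlock order does not matter for correctness. */
static void unlock_two(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}
```

Because every caller derives the order from the objects themselves rather than from the argument order, a rename-style operation that mixes sources and targets arbitrarily still acquires the locks in one consistent sequence.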
      
      Link: https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3 [1]
      Link: https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs [2]
      Link: https://lore.kernel.org/all/20230526-schrebergarten-vortag-9cd89694517e@brauner [3]
      Link: https://lore.kernel.org/all/20230530-seenotrettung-allrad-44f4b00139d4@brauner [4]
      
      * tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: Restrict lock_two_nondirectories() to non-directory inodes
        fs: Lock moved directories
        fs: Establish locking order for unrelated directories
        Revert "f2fs: fix potential corruption when moving a directory"
        Revert "udf: Protect rename against modification of moved directory"
        ext4: Remove ext4 locking of moved directory
      2eedfa9e
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 64bf6ae9
      Linus Torvalds authored
      Pull misc vfs updates from Christian Brauner:
       "Miscellaneous features, cleanups, and fixes for vfs and individual fs
      
        Features:
      
         - Use mode 0600 for files created by cachefilesd so it can be run by
           unprivileged users. This aligns them with directories which are
           already created with mode 0700 by cachefilesd
      
         - Reorder a few members in struct file to prevent some false sharing
           scenarios
      
         - Indicate that an eventfd is used as a semaphore in the eventfd's
           fdinfo procfs file
      
         - Add a missing uapi header for eventfd exposing relevant uapi
           defines
      
         - Let the VFS protect transitions of a superblock from read-only to
           read-write in addition to the protection it already provides for
           transitions from read-write to read-only. Protecting read-only to
           read-write transitions allows filesystems such as ext4 to perform
           internal writes, keeping writers away until the transition is
           completed
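To illustrate the eventfd item in the list above, here is a small userspace sketch of semaphore-mode eventfd behaviour together with a helper that prints the descriptor's fdinfo file. The helper names efd_take() and efd_dump_fdinfo() are hypothetical, and the exact name of the new fdinfo field is not given in the text, so the sketch prints the whole file instead of parsing it.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: take one unit from the eventfd.
 * Returns the value read, or 0 if nothing was available or the read failed.
 */
static uint64_t efd_take(int fd)
{
    uint64_t val = 0;

    if (read(fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
        return 0;
    return val;
}

/* Hypothetical helper: copy /proc/self/fdinfo/<fd> to stdout. Returns 0 on success. */
static int efd_dump_fdinfo(int fd)
{
    char path[64];
    char line[256];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```

With a descriptor created via eventfd(3, EFD_SEMAPHORE | EFD_NONBLOCK), each efd_take() call returns 1 until the counter reaches zero. On kernels with this change, efd_dump_fdinfo() should additionally show whether the descriptor is in semaphore mode.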
      
        Cleanups:
      
         - Arnd removed the architecture specific arch_report_meminfo()
           prototypes and added a generic one into procfs.h. Note, we got a
           report about a warning in amdgpu codepaths that suggested this was
           bisectable to this change but we concluded it was a false positive
      
         - Remove unused parameters from split_fs_names()
      
         - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
           reflect the order of the cleanup operation that has to unmap before
           the actual put
      
         - Unexport buffer_check_dirty_writeback() as it is not used outside
           of block device aops
      
         - Stop allocating aio rings from highmem
      
         - Protecting read-{only,write} transitions in the VFS used open-coded
           barriers in various places. Replace them with proper little helpers
           and document both the helpers and all barrier interactions involved
           when transitioning between read-{only,write} states
      
         - Use flexible array members in old readdir codepaths
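The flexible array member conversion mentioned in the last cleanup item can be sketched in plain C as follows. The struct and function names are hypothetical and only mirror the general shape of a directory entry; they are not the kernel's actual readdir structures.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Old style: a one-element array stands in for a variable-length name. */
struct old_entry {
    unsigned long ino;
    unsigned short namlen;
    char name[1];
};

/* New style: a C99 flexible array member states the intent directly. */
struct new_entry {
    unsigned long ino;
    unsigned short namlen;
    char name[];
};

/*
 * Hypothetical helper: allocate an entry with room for `name` plus its
 * terminating NUL. Returns NULL on allocation failure or if the name is
 * too long for namlen.
 */
static struct new_entry *entry_new(unsigned long ino, const char *name)
{
    size_t len = strlen(name);
    struct new_entry *e;

    if (len > USHRT_MAX)
        return NULL;

    /* sizeof(*e) excludes the flexible array, so add the name explicitly. */
    e = malloc(sizeof(*e) + len + 1);
    if (!e)
        return NULL;

    e->ino = ino;
    e->namlen = (unsigned short)len;
    memcpy(e->name, name, len + 1);
    return e;
}
```

With the one-element form, compilers and bounds checkers see an array of size 1 and may flag any access past name[0]. The flexible array member avoids that mismatch and keeps the size arithmetic simple.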
      
        Fixes:
      
         - Use the correct type __poll_t for epoll and eventfd
      
         - Replace all deprecated strlcpy() invocations whose return value
           isn't checked with equivalent strscpy() calls
      
         - Fix some kernel-doc warnings in fs/open.c
      
         - Reduce the stack usage in jffs2's xattr codepaths, finally getting
           rid of this royally annoying compilation warning:
           fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is
           larger than 1024 bytes [-Werror=frame-larger-than=]
      
         - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
           fmode_t is required to avoid fmode_t to integer degradation
           warnings
      
         - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
           explanation in that commit how O_RDWR is actually a bug which we
           found out with the help of Linus and git archeology
      
         - Fix "no previous prototype" warnings in the pipe codepaths
      
         - Add overflow calculations for remap_verify_area() as a signed
           addition overflow could be triggered in xfstests
      
         - Fix a null pointer dereference in sysv
      
         - Use an unsigned variable for length calculations in jfs avoiding
           compilation warnings with gcc 13
      
         - Fix a dangling pipe pointer in the watch queue codepath
      
         - The legacy mount option parser provided as a fallback by the VFS
           for filesystems not yet converted to the new mount api prefixed
           the generated mount option string with a leading ',', causing
           issues for some filesystems
      
         - Fix a repeated word in a comment in fs.h
      
         - autofs: Update the ctime when mtime is updated as mandated by
           POSIX"
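The signed addition overflow item in the fixes above comes down to checking the range before adding, instead of adding first and testing the (undefined) result. A minimal userspace sketch of such a check follows. The function name range_ok() is hypothetical and the code does not reproduce the kernel's remap_verify_area().

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return true if pos and len are non-negative and
 * pos + len is representable as a long long.
 *
 * Writing `pos + len < 0` instead would rely on signed overflow, which is
 * undefined behaviour in C. Rearranging the comparison avoids performing
 * the overflowing addition at all.
 */
static bool range_ok(long long pos, long long len)
{
    if (pos < 0 || len < 0)
        return false;

    /* pos >= 0 here, so LLONG_MAX - pos cannot itself overflow. */
    if (len > LLONG_MAX - pos)
        return false;

    return true;
}
```

Compilers that provide __builtin_add_overflow() offer an equivalent check without the manual rearrangement.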
      
      * tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
        readdir: Replace one-element arrays with flexible-array members
        fs: Provide helpers for manipulating sb->s_readonly_remount
        fs: Protect reconfiguration of sb read-write from racing writes
        eventfd: add a uapi header for eventfd userspace APIs
        autofs: set ctime as well when mtime changes on a dir
        eventfd: show the EFD_SEMAPHORE flag in fdinfo
        fs/aio: Stop allocating aio rings from HIGHMEM
        fs: Fix comment typo
        fs: unexport buffer_check_dirty_writeback
        fs: avoid empty option when generating legacy mount string
        watch_queue: prevent dangling pipe pointer
        fs.h: Optimize file struct to prevent false sharing
        highmem: Rename put_and_unmap_page() to unmap_and_put_page()
        cachefiles: Allow the cache to be non-root
        init: remove unused names parameter in split_fs_names()
        jfs: Use unsigned variable for length calculations
        fs/sysv: Null check to prevent null-ptr-deref bug
        fs: use UB-safe check for signed addition overflow in remap_verify_area
        procfs: consolidate arch_report_meminfo declaration
        fs: pipe: reveal missing function protoypes
        ...
      64bf6ae9
    • Linus Torvalds's avatar
      Merge tag 'v6.5/fs.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 5c1c88cd
      Linus Torvalds authored
      Pull ntfs updates from Christian Brauner:
       "A pile of various smaller fixes for ntfs"
      
      * tag 'v6.5/fs.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        ntfs: do not dereference a null ctx on error
        ntfs: Remove unneeded semicolon
        ntfs: Correct spelling
        ntfs: remove redundant initialization to pointer cb_sb_start
      5c1c88cd
    • Linus Torvalds's avatar
      Merge tag 'auxdisplay-6.5' of https://github.com/ojeda/linux · 1f268d6d
      Linus Torvalds authored
      Pull auxdisplay update from Miguel Ojeda:
       "A single cleanup for i2c drivers to switch them back to use
        '.probe()'"
      
      * tag 'auxdisplay-6.5' of https://github.com/ojeda/linux:
        auxdisplay: Switch i2c drivers back to use .probe()
      1f268d6d
    • Linus Torvalds's avatar
      Merge tag 'rust-6.5' of https://github.com/Rust-for-Linux/linux · a1257b5e
      Linus Torvalds authored
      Pull rust updates from Miguel Ojeda:
       "A fairly small one in terms of feature additions. Most of the changes
        in terms of lines come from the upgrade to the new version of the
        toolchain (which in turn is big due to the vendored 'alloc' crate).
      
        Upgrade to Rust 1.68.2:
      
         - This is the first such upgrade, and we will try to update it often
           from now on, in order to remain close to the latest release, until
           a minimum version (which is "in the future") can be established.
      
           The upgrade brings the stabilization of 4 features we used (and 2
           more that we used in our old 'rust' branch).
      
           Commit 3ed03f4d ("rust: upgrade to Rust 1.68.2") contains the
           details and rationale.
      
        pin-init API:
      
         - Several internal improvements and fixes to the pin-init API, e.g.
           allowing 'Self' to be used in a struct definition with '#[pin_data]'.
      
        'error' module:
      
         - New 'name()' method for the 'Error' type (with 'errname()'
           integration), used to implement the 'Debug' trait for 'Error'.
      
         - Add error codes from 'include/linux/errno.h' to the list of Rust
           'Error' constants.
      
         - Allow specifying error type on the 'Result' type (with the default
           still being our usual 'Error' type).
      
        'str' module:
      
         - 'TryFrom' implementation for 'CStr', and new 'to_cstring()' method
           based on it.
      
        'sync' module:
      
         - Implement 'AsRef' trait for 'Arc', allowing 'Arc' to be used in code
           that is generic over smart pointer types.
      
         - Add 'ptr_eq' method to 'Arc' for easier, less error prone
           comparison between two 'Arc' pointers.
      
         - Reword the 'Send' safety comment for 'Arc', and avoid referencing
           it from the 'Sync' one.
      
        'task' module:
      
         - Implement 'Send' marker for 'Task'.
      
        'types' module:
      
         - Implement 'Send' and 'Sync' markers for 'ARef<T>' when 'T' is
           'AlwaysRefCounted', 'Send' and 'Sync'.
      
        Other changes:
      
         - Documentation improvements and '.gitattributes' change to start
           using the Rust diff driver"
      
      * tag 'rust-6.5' of https://github.com/Rust-for-Linux/linux:
        rust: error: `impl Debug` for `Error` with `errname()` integration
        rust: task: add `Send` marker to `Task`
        rust: specify when `ARef` is thread safe
        rust: sync: reword the `Arc` safety comment for `Sync`
        rust: sync: reword the `Arc` safety comment for `Send`
        rust: sync: implement `AsRef<T>` for `Arc<T>`
        rust: sync: add `Arc::ptr_eq`
        rust: error: add missing error codes
        rust: str: add conversion from `CStr` to `CString`
        rust: error: allow specifying error type on `Result`
        rust: init: update macro expansion example in docs
        rust: macros: replace Self with the concrete type in #[pin_data]
        rust: macros: refactor generics parsing of `#[pin_data]` into its own function
        rust: macros: fix usage of `#[allow]` in `quote!`
        docs: rust: point directly to the standalone installers
        .gitattributes: set diff driver for Rust source code files
        rust: upgrade to Rust 1.68.2
        rust: arc: fix intra-doc link in `Arc<T>::init`
        rust: alloc: clarify what is the upstream version
      a1257b5e
    • Linus Torvalds's avatar
      Merge tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 9d9a9bf0
      Linus Torvalds authored
      Pull s390 updates from Alexander Gordeev:
      
       - Use correct type for size of memory allocated for ELF core header on
         kernel crash.
      
       - Fix insecure W+X mapping warning when KASAN shadow memory range is
         not aligned on page boundary.
      
       - Avoid allocating KASAN shadow memory that is one page short when the
         original memory range is less than (PAGE_SIZE << 3).
      
       - Fix virtual vs physical address confusion in physical memory
         enumerator. It is not a real issue, since virtual and physical
         addresses are currently the same.
      
       - Set CONFIG_NET_TC_SKB_EXT=y in s390 config files as it is required
         for offloading TC as well as bridges on switchdev capable ConnectX
         devices.
      
      * tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/defconfigs: set CONFIG_NET_TC_SKB_EXT=y
        s390/boot: fix physmem_info virtual vs physical address confusion
        s390/kasan: avoid short by one page shadow memory
        s390/kasan: fix insecure W+X mapping warning
        s390/crash: use the correct type for memory allocation
      9d9a9bf0
    • Linus Torvalds's avatar
      Merge tag 'nios2_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux · be5b52dc
      Linus Torvalds authored
      Pull nios2 updates from Dinh Nguyen:
      
       - Convert pgtable constructor/destructors to ptdesc
      
       - Replace strlcpy with strscpy
      
      * tag 'nios2_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
        nios2: Replace all non-returning strlcpy with strscpy
        nios2: Convert __pte_free_tlb() to use ptdescs
      be5b52dc
    • Thomas Gleixner's avatar
      Merge tag 'irqchip-6.5' of... · f121ab7f
      Thomas Gleixner authored
      Merge tag 'irqchip-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
      
      Pull irqchip updates from Marc Zyngier:
      
        - A number of Loongson/LoongArch fixes
      
        - Allow the core code to retrigger an interrupt that has
          fired while the same interrupt is being handled on another
          CPU, papering over a GICv3 architecture issue
      
        - Work around an integration problem on ASR8601, where the CPU
          numbering isn't representable in the GIC implementation...
      
        - Add some missing interrupts to the STM32 irqchip
      
        - A bunch of warning squashing triggered by W=1 builds
      
      Link: https://lore.kernel.org/r/20230623224345.3577134-1-maz@kernel.org
      f121ab7f
  3. 25 Jun, 2023 5 commits
  4. 23 Jun, 2023 13 commits