1. 11 Mar, 2024 24 commits
    • Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 720c8579
      Linus Torvalds authored
      Pull x86 FRED support from Thomas Gleixner:
       "Support for x86 Fast Return and Event Delivery (FRED).
      
        FRED is a replacement for IDT event delivery on x86 and addresses most
        of the technical nightmares which IDT exposes:
      
         1) Exception cause registers like CR2 need to be manually preserved
            in nested exception scenarios.
      
         2) Hardware interrupt stack switching is suboptimal for nested
            exceptions as the interrupt stack mechanism rewinds the stack on
            each entry which requires a massive effort in the low level entry
            of #NMI code to handle this.
      
         3) No hardware distinction between entry from kernel or from user
            which makes establishing kernel context more complex than it needs
            to be especially for unconditionally nestable exceptions like NMI.
      
         4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
            is a problem when the perf NMI takes a fault when collecting a
            stack trace.
      
         5) Partial restore of ESP when returning to a 16-bit segment
      
         6) Limitation of the vector space which can cause vector exhaustion
            on large systems.
      
         7) Inability to differentiate NMI sources
      
        FRED addresses these shortcomings by:
      
         1) An extended exception stack frame which the CPU uses to save
            exception cause registers. This ensures that the meta information
            for each exception is preserved on stack and avoids the extra
            complexity of preserving it in software.
      
         2) Hardware interrupt stack switching is non-rewinding if a nested
            exception uses the current interrupt stack.
      
         3) The entry points for kernel and user context are separate and GS
            BASE handling which is required to establish kernel context for
            per CPU variable access is done in hardware.
      
         4) NMIs are now nesting protected. They are only reenabled on the
            return from NMI.
      
         5) FRED guarantees full restore of ESP
      
         6) FRED does not put a limitation on the vector space by design
            because it uses central entry points for kernel and user space
            and the CPU stores the entry type (exception, trap, interrupt,
            syscall) on the entry stack along with the vector number. The
            entry code has to demultiplex this information, but this removes
            the vector space restriction.
      
            The first hardware implementations will still have the current
            restricted vector space because lifting this limitation requires
            further changes to the local APIC.
      
         7) FRED stores the vector number and meta information on stack which
            allows having more than one NMI vector in future hardware when the
            required local APIC changes are in place.
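The demultiplexing step from item 6 can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's entry code: the `event_type` values, the `event_info` layout and the handler names are hypothetical and do not reproduce the real FRED stack frame encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types; the real FRED encoding is not reproduced here. */
enum event_type { EV_EXCEPTION, EV_TRAP, EV_INTERRUPT, EV_SYSCALL, EV_TYPE_MAX };

/* Hypothetical model of what the CPU leaves on the entry stack. */
struct event_info {
    enum event_type type;
    unsigned int vector;
};

/* Last vector seen by any handler, kept only so the example is observable. */
static unsigned int last_vector;

typedef int (*handler_fn)(unsigned int vector);

static int handle_exception(unsigned int vector) { last_vector = vector; return EV_EXCEPTION; }
static int handle_trap(unsigned int vector)      { last_vector = vector; return EV_TRAP; }
static int handle_interrupt(unsigned int vector) { last_vector = vector; return EV_INTERRUPT; }
static int handle_syscall(unsigned int vector)   { last_vector = vector; return EV_SYSCALL; }

/* One table indexed by event type replaces a per-vector table of entry stubs. */
static const handler_fn handlers[EV_TYPE_MAX] = {
    [EV_EXCEPTION] = handle_exception,
    [EV_TRAP]      = handle_trap,
    [EV_INTERRUPT] = handle_interrupt,
    [EV_SYSCALL]   = handle_syscall,
};

/*
 * Central entry point of the model: pick the handler by event type and
 * pass the vector along. Returns the handler's tag, or -1 for an
 * unknown type.
 */
int dispatch_event(const struct event_info *ev)
{
    assert(ev != NULL);
    if ((unsigned int)ev->type >= EV_TYPE_MAX)
        return -1;
    return handlers[ev->type](ev->vector);
}
```

Because the vector travels as data rather than selecting one of a fixed number of entry stubs, nothing in this model limits its range. That is the property the text attributes to the FRED design.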
      
        The series implements the initial FRED support by:
      
         - Reworking the existing entry and IDT handling infrastructure to
           accommodate the alternative entry mechanism.
      
         - Expanding the stack frame to accommodate the extra 16 bytes FRED
           requires to store context and meta information
      
         - Providing FRED specific C entry points for events which have
           information pushed to the extended stack frame, e.g. #PF and #DB.
      
         - Providing FRED specific C entry points for #NMI and #MCE
      
         - Implementing the FRED specific ASM entry points and the C code to
           demultiplex the events
      
         - Providing detection and initialization mechanisms and the necessary
           tweaks in context switching, GS BASE handling etc.
      
        The FRED integration aims for maximum code reuse vs the existing IDT
        implementation to the extent possible and the deviations in hot paths
        like context switching are handled with alternatives to minimize the
        impact. The low level entry and exit paths are separate due to the
        extended stack frame and the hardware based GS BASE switching and
        therefore have no impact on IDT based systems.
      
        It has been extensively tested on existing systems and on the FRED
        simulation and as of now there are no outstanding problems"
      
      * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
        x86/fred: Fix init_task thread stack pointer initialization
        MAINTAINERS: Add a maintainer entry for FRED
        x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
        x86/fred: Invoke FRED initialization code to enable FRED
        x86/fred: Add FRED initialization functions
        x86/syscall: Split IDT syscall setup code into idt_syscall_init()
        KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
        x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
        x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
        x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
        x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
        x86/traps: Add sysvec_install() to install a system interrupt handler
        x86/fred: FRED entry/exit and dispatch code
        x86/fred: Add a machine check entry stub for FRED
        x86/fred: Add a NMI entry stub for FRED
        x86/fred: Add a debug fault entry stub for FRED
        x86/idtentry: Incorporate definitions/declarations of the FRED entries
        x86/fred: Make exc_page_fault() work for FRED
        x86/fred: Allow single-step trap and NMI when starting a new task
        x86/fred: No ESPFIX needed when FRED is enabled
        ...
    • Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ca7e9177
      Linus Torvalds authored
      Pull x86 APIC updates from Thomas Gleixner:
       "Rework of APIC enumeration and topology evaluation.
      
        The current implementation has a couple of shortcomings:
      
         - It fails to handle hybrid systems correctly.
      
         - The APIC registration code which handles CPU number assignments is
           in the middle of the APIC code and detached from the topology
           evaluation.
      
         - The various mechanisms which enumerate APICs, ACPI, MPPARSE and
           guest specific ones, tweak global variables as they see fit or in
           case of XENPV just hack around the generic mechanisms completely.
      
         - The CPUID topology evaluation code is sprinkled all over the vendor
           code and reevaluates global variables on every hotplug operation.
      
         - There is no way to analyze topology on the boot CPU before bringing
           up the APs. This causes problems for infrastructure like PERF which
           needs to size certain aspects upfront or could be simplified if
           that would be possible.
      
         - The APIC admission and CPU number association logic is
           incomprehensible and overly complex and needs to be kept around
           after boot instead of completing this right after the APIC
           enumeration.
      
        This update addresses these shortcomings with the following changes:
      
         - Rework the CPUID evaluation code so it is common for all vendors
           and provides information about the APIC ID segments in a uniform
           way independent of the number of segments (Thread, Core, Module,
           ..., Die, Package) so that this information can be computed instead
           of rewriting global variables of dubious value over and over.
      
         - A few cleanups and simplifications of the APIC, IO/APIC and related
           interfaces to prepare for the topology evaluation changes.
      
         - Separation of the parser stages so the early evaluation which tries
           to find the APIC address can be separately overridden from the late
           evaluation which enumerates and registers the local APIC as further
           preparation for sanitizing the topology evaluation.
      
         - A new registration and admission logic which
      
             - encapsulates the inner workings so that parsers and guest logic
               can no longer fiddle with it
      
             - uses the APIC ID segments to build topology bitmaps at
               registration time
      
             - provides a sane admission logic
      
             - allows detecting the crash kernel case, where CPU0 does not run
               on the real BSP, automatically. This is required to prevent
               sending INIT/SIPI sequences to the real BSP which would reset
               the whole machine. This was so far handled by a tedious command
               line parameter, which does not even work in nested crash
               scenarios.
      
             - associates CPU numbers after the enumeration has completed and
               prevents the late registration of APICs, which was somehow
               tolerated before.
      
         - Converting all parsers and guest enumeration mechanisms over to the
           new interfaces.
      
           This allows getting rid of all global variable tweaking from the
           parsers and enumeration mechanisms and sanitizes the XEN[PV]
           handling so it can use CPUID evaluation for the first time.
      
         - Mopping up existing sins by taking the information from the APIC ID
           segment bitmaps.
      
           This evaluates hybrid systems correctly on the boot CPU and allows
           for cleanups and fixes in the related drivers, e.g. PERF.
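The idea of deriving topology information from APIC ID segments can be sketched as follows. This is a simplified user-space model, not the kernel's implementation: the `apic_layout` structure, the level names and both helper functions are assumptions made for this example. It relies only on the fact that each topology level occupies a contiguous bit range of the APIC ID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of topology levels, ordered from innermost to outermost. */
enum topo_level { TOPO_THREAD, TOPO_CORE, TOPO_DIE, TOPO_PKG, TOPO_LEVELS };

/*
 * shift[l] is the number of low APIC ID bits that lie below level l.
 * shift[TOPO_THREAD] is 0. Two adjacent levels with the same shift mean
 * the inner one has no bits of its own (e.g. a single die per package).
 */
struct apic_layout {
    unsigned int shift[TOPO_LEVELS];
};

/* System-wide ID of the level-l unit that contains the given APIC ID. */
uint32_t topo_unit_id(const struct apic_layout *lay, uint32_t apicid,
                      enum topo_level l)
{
    assert(l < TOPO_LEVELS);
    return apicid >> lay->shift[l];
}

/* Number of APIC ID slots of level 'inner' inside one unit of level 'outer'. */
uint32_t topo_units_per(const struct apic_layout *lay, enum topo_level inner,
                        enum topo_level outer)
{
    assert(inner <= outer && outer < TOPO_LEVELS);
    return 1u << (lay->shift[outer] - lay->shift[inner]);
}
```

With one SMT bit and three core bits (`shift = {0, 1, 4, 4}`), APIC ID `0x1b` falls into core 13 of package 1, and a package has room for 8 cores of 2 threads each. These values are computed from the layout on demand instead of being cached in global variables.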
      
        The series has been extensively tested and the minimal late fallout
        due to a broken ACPI/MADT table has been addressed by tightening the
        admission logic further"
      
      * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
        x86/topology: Ignore non-present APIC IDs in a present package
        x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
        smp: Provide 'setup_max_cpus' definition on UP too
        smp: Avoid 'setup_max_cpus' namespace collision/shadowing
        x86/bugs: Use fixed addressing for VERW operand
        x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
        x86/cpu/topology: Provide __num_[cores|threads]_per_package
        x86/cpu/topology: Rename topology_max_die_per_package()
        x86/cpu/topology: Rename smp_num_siblings
        x86/cpu/topology: Retrieve cores per package from topology bitmaps
        x86/cpu/topology: Use topology logical mapping mechanism
        x86/cpu/topology: Provide logical pkg/die mapping
        x86/cpu/topology: Simplify cpu_mark_primary_thread()
        x86/cpu/topology: Mop up primary thread mask handling
        x86/cpu/topology: Use topology bitmaps for sizing
        x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
        x86/xen/smp_pv: Count number of vCPUs early
        x86/cpu/topology: Assign hotpluggable CPUIDs during init
        x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
        x86/topology: Add a mechanism to track topology via APIC IDs
        ...
    • Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d08c407f
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "A large set of updates and features for timers and timekeeping:
      
         - The hierarchical timer pull model
      
           When timer wheel timers are armed they are placed into the timer
           wheel of a CPU which is likely to be busy at the time of expiry.
           This is done to avoid wakeups on potentially idle CPUs.
      
           This is wrong in several aspects:
      
             1) The heuristics to select the target CPU are wrong by
                definition as the chance to get the prediction right is
                close to zero.
      
             2) Due to #1 it is possible that timers are accumulated on
                a single target CPU
      
             3) The required computation in the enqueue path is just overhead
                for dubious value especially under the consideration that the
                vast majority of timer wheel timers are either canceled or
                rearmed before they expire.
      
           The timer pull model avoids the above by removing the target
           computation on enqueue and queueing timers always on the CPU on
           which they get armed.
      
           This is achieved by having separate wheels for CPU pinned timers
           and global timers which do not care about where they expire.
      
           As long as a CPU is busy it handles both the pinned and the global
           timers which are queued on the CPU local timer wheels.
      
           When a CPU goes idle it evaluates its own timer wheels:
      
             - If the first expiring timer is a pinned timer, then the global
               timers can be ignored as the CPU will wake up before they
               expire.
      
             - If the first expiring timer is a global timer, then the expiry
               time is propagated into the timer pull hierarchy and the CPU
               makes sure to wake up for the first pinned timer.
      
           The timer pull hierarchy organizes CPUs in groups of eight at the
           lowest level and at the next levels groups of eight groups up to
           the point where no further aggregation of groups is required, i.e.
           the number of levels is log8(NR_CPUS). The magic number of eight
           has been established by experimentation, but can be adjusted if
           needed.
      
           In each group one busy CPU acts as the migrator. It's only one CPU
           to avoid lock contention on remote timer wheels.
      
           The migrator CPU checks in its own timer wheel handling whether
           there are other CPUs in the group which have gone idle and have
           global timers to expire. If there are global timers to expire, the
           migrator locks the remote CPU timer wheel and handles the expiry.
      
           Depending on the group level in the hierarchy this handling can
           require to walk the hierarchy downwards to the CPU level.
      
           Special care is taken when the last CPU goes idle. At this point
           the CPU is the systemwide migrator at the top of the hierarchy and
           it therefore cannot delegate to the hierarchy. It needs to arm its
           own timer device to expire either at the first expiring timer in
           the hierarchy or at the first CPU local timer, whichever expires
           first.
      
           This completely removes the overhead from the enqueue path, which
           is e.g. for networking a true hotpath and trades it for a slightly
           more complex idle path.
      
           This has been in development for a couple of years and the final
           series has been extensively tested by various teams from silicon
           vendors and ran through extensive CI.
      
           There have been slight performance improvements observed on network
           centric workloads and an Intel team confirmed that this allows them
           to power down a die completely on a multi-die socket for the first
           time in a mostly idle scenario.
      
           There is only one outstanding ~1.5% regression on a specific
           overloaded netperf test which is currently being investigated, but the
           rest is either positive or neutral performance wise and positive on
           the power management side.
      
         - Fixes for the timekeeping interpolation code for cross-timestamps:
      
           Cross-timestamps are used for PTP to get snapshots from hardware
           timers and interpolate them back to clock MONOTONIC. The changes
           address a few corner cases in the interpolation code which got the
           math and logic wrong.
      
         - Simplification of the clocksource watchdog retry logic to
           automatically adjust to handle larger systems correctly instead of
           having more incomprehensible command line parameters.
      
         - Treewide consolidation of the VDSO data structures.
      
         - The usual small improvements and cleanups all over the place"
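The sizing of the timer pull hierarchy described above (groups of eight, stacked until a single top group remains) can be worked through with a short calculation. This is a sketch that considers only the fan-out of eight named in the text. The function names are made up for the example, the level count is rounded up, and any other grouping criteria the real implementation may apply are ignored.

```c
#include <assert.h>

#define GROUP_SIZE 8u   /* fan-out named in the text */

/*
 * Smallest number of levels L such that GROUP_SIZE^L >= nr_cpus,
 * i.e. log8(nr_cpus) rounded up. Returns 0 for a single CPU.
 */
unsigned int hierarchy_levels(unsigned int nr_cpus)
{
    unsigned int levels = 0;
    unsigned long long capacity = 1;

    assert(nr_cpus > 0);
    while (capacity < nr_cpus) {
        capacity *= GROUP_SIZE;
        levels++;
    }
    return levels;
}

/*
 * Index of the group that contains 'cpu' at the given level, where
 * level 1 is the lowest level (groups of eight CPUs) and level 0 is
 * the CPU itself.
 */
unsigned int group_index(unsigned int cpu, unsigned int level)
{
    unsigned int i;

    for (i = 0; i < level; i++)
        cpu /= GROUP_SIZE;
    return cpu;
}
```

Under these assumptions 64 CPUs need two levels and 512 CPUs need three, so the walk a migrator performs through the hierarchy stays short even on large machines.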
      
      * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
        timer/migration: Fix quick check reporting late expiry
        tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
        vdso/datapage: Quick fix - use asm/page-def.h for ARM64
        timers: Assert no next dyntick timer look-up while CPU is offline
        tick: Assume timekeeping is correctly handed over upon last offline idle call
        tick: Shut down low-res tick from dying CPU
        tick: Split nohz and highres features from nohz_mode
        tick: Move individual bit features to debuggable mask accesses
        tick: Move got_idle_tick away from common flags
        tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
        tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
        tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
        tick: Start centralizing tick related CPU hotplug operations
        tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
        tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
        tick: Use IS_ENABLED() whenever possible
        tick/sched: Remove useless oneshot ifdeffery
        tick/nohz: Remove duplicate between lowres and highres handlers
        tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
        hrtimer: Select housekeeping CPU during migration
        ...
    • Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 80a76c60
      Linus Torvalds authored
      Pull clocksource updates from Thomas Gleixner:
       "Updates for timekeeping and PTP core.
      
        The cross-timestamp mechanism which allows correlating hardware
        clocks uses clocksource pointers for describing the correlation.
      
        That's suboptimal as drivers need to obtain the pointer, which
        requires needless exports and exposing internals. This can all be
        completely avoided by assigning clocksource IDs and using them for
        describing the correlated clock source.
      
        So this adds clocksource IDs to all clocksources in the tree which can
        be exposed to this mechanism and removes the pointer and now needless
        exports.
      
        A related improvement for the core and the correlation handling has
        not made it this time, but is expected to get ready for the next
        round"
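Why an ID is sufficient for describing the correlated clock source can be shown with a small model. This is a hedged sketch only: the enum values, both structures and `snapshot_matches()` are hypothetical stand-ins for the kernel's types, and treating the generic ID as "never matches" is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clocksource IDs used only in this sketch. */
enum cs_id { CS_ID_GENERIC = 0, CS_ID_TSC, CS_ID_KVMCLOCK };

/* Model of a counter snapshot that a driver hands to the core. */
struct counter_snapshot {
    uint64_t cycles;
    enum cs_id cs_id;   /* replaces a pointer to the clocksource */
};

/* Model of the core's current clocksource; drivers never see this. */
struct clocksource_model {
    enum cs_id id;
    const char *name;
};

/*
 * The core compares plain IDs. A driver only needs to know a constant,
 * so the clocksource object itself does not have to be exported.
 * Assumption of this sketch: the generic ID never matches.
 */
bool snapshot_matches(const struct clocksource_model *cur,
                      const struct counter_snapshot *snap)
{
    assert(cur != NULL && snap != NULL);
    return snap->cs_id != CS_ID_GENERIC && snap->cs_id == cur->id;
}
```

A driver in this model fills in `cs_id` with a constant and never touches `struct clocksource_model`, which mirrors the reasoning for removing the pointer and the exports.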
      
      * tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kvmclock: Unexport kvmclock clocksource
        treewide: Remove system_counterval_t.cs, which is never read
        timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
        ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
        x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
        x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
        timekeeping: Add clocksource ID to struct system_counterval_t
        x86/tsc: Correct kernel-doc notation
    • Merge tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 397935e3
      Linus Torvalds authored
      Pull cpu core updates from Thomas Gleixner:
       "A small boring set of cleanups for the SMP and CPU hotplug code"
      
      * tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu: Remove stray semicolon
        smp: Make __smp_processor_id() 0-argument macro
        cpu: Mark cpu_possible_mask as __ro_after_init
        kernel/cpu: Convert snprintf() to sysfs_emit()
        cpu/hotplug: Delete an extraneous kernel-doc description
    • Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4527e837
      Linus Torvalds authored
      Pull MSI updates from Thomas Gleixner:
       "Updates for the MSI interrupt subsystem and initial RISC-V MSI
        support.
      
        The core changes have been adopted from previous work which converted
        ARM[64] to the new per device MSI domain model, which was merged to
        support multiple MSI domains per device. The ARM[64] changes are being
        worked on too, but are not ready yet. The core and platform-MSI
        changes have been split out to not hold up RISC-V and to avoid
        RISC-V building on interfaces that are scheduled for removal.
      
        The core support provides new interfaces to handle wire to MSI bridges
        in a straightforward way and introduces new platform-MSI interfaces
        which are built on top of the per device MSI domain model.
      
        Once ARM[64] is converted over, the old platform-MSI interfaces and the
        related ugliness in the MSI core code will be removed.
      
        The actual MSI parts for RISC-V were finalized late and have been
        postponed to the next merge window.
      
        Drivers:
      
         - Add a new driver for the Andes hart-level interrupt controller
      
         - Rework the SiFive PLIC driver to prepare for MSI support
      
         - Expand the RISC-V INTC driver to support the new RISC-V AIA
           controller which provides the basis for MSI on RISC-V
      
         - A few fixups for the fallout of the core changes"
      
      * tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
        irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
        x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
        genirq/matrix: Dynamic bitmap allocation
        irqchip/riscv-intc: Add support for RISC-V AIA
        irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
        irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
        irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
        irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
        irqchip/sifive-plic: Use devm_xyz() for managed allocation
        irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
        irqchip/sifive-plic: Convert PLIC driver into a platform driver
        irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
        irqchip/riscv-intc: Allow large non-standard interrupt number
        genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
        irqchip/imx-intmux: Handle pure domain searches correctly
        genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
        genirq/irqdomain: Reroute device MSI create_mapping
        genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
        genirq/msi: Optionally use dev->fwnode for device domain
        genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
        ...
    • Merge tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 02d4df78
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Core:
      
         - Make affinity changes take effect immediately for interrupt
           threads. This reduces the impact on isolated CPUs as it pulls over
           the thread right away instead of doing it after the next hardware
           interrupt arrived.
      
         - Cleanup and improvements for the interrupt chip simulator
      
         - Deduplication of the interrupt descriptor initialization code so
           the sparse and non-sparse mode share more code.
      
        Drivers:
      
         - A set of conversions to platform_driver::remove_new() which gets
           rid of the pointless return value.
      
         - A new driver for the Starfive JH8100 SoC
      
         - Support for Amlogic-T7 SoCs
      
         - Improvement for the interrupt handling and EOI management for the
           loongson interrupt controller.
      
         - The usual fixes and improvements all over the place"
      
      * tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
        irqchip/ts4800: Convert to platform_driver::remove_new() callback
        irqchip/stm32-exti: Convert to platform_driver::remove_new() callback
        irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback
        irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback
        irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback
        irqchip/pruss-intc: Convert to platform_driver::remove_new() callback
        irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback
        irqchip/madera: Convert to platform_driver::remove_new() callback
        irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback
        irqchip/keystone: Convert to platform_driver::remove_new() callback
        irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback
        irqchip/imx-intmux: Convert to platform_driver::remove_new() callback
        irqchip/imgpdc: Convert to platform_driver::remove_new() callback
        irqchip: Add StarFive external interrupt controller
        dt-bindings: interrupt-controller: Add starfive,jh8100-intc
        arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs
        irqchip/meson-gpio: Add support for Amlogic-T7 SoCs
        dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs
        irqchip/vic: Fix a kernel-doc warning
        genirq: Wake interrupt threads immediately when changing affinity
        ...
    • Merge tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 045395d8
      Linus Torvalds authored
      Pull cgroup updates from Tejun Heo:
       "A quiet cycle. One trivial doc update patch. Two patches to drop the
        now defunct memory_spread_slab feature from cgroup1 cpuset"
      
      * tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Mark memory_spread_slab as obsolete
        cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
        docs: cgroup-v1: add missing code-block tags
    • Merge tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · 1a1e0989
      Linus Torvalds authored
      Pull workqueue BH conversions from Tejun Heo:
       "This contains two patches that convert tasklet users to BH workqueues:
        backtracetest and usb hcd.
      
        DM conversions are being routed through the respective subsystem tree.
        Hopefully, the next cycle will see a lot more conversions"
      
      * tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        usb: core: hcd: Convert from tasklet to BH workqueue
        backtracetest: Convert from tasklet to BH workqueue
    • Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · ff887eb0
      Linus Torvalds authored
      Pull workqueue updates from Tejun Heo:
       "This cycle, a lot of workqueue changes including some that are
        significant and invasive.
      
         - During v6.6 cycle, unbound workqueues were updated so that they are
           more topology aware and flexible, which among other things improved
           workqueue behavior on modern multi-L3 CPUs. In the process, commit
           636b927e ("workqueue: Make unbound workqueues to use per-cpu
           pool_workqueues") switched unbound workqueues to use per-CPU
           frontend pool_workqueues as a part of increasing front-back mapping
           flexibility.
      
           An unwelcome side effect of this change was that this made max
           concurrency enforcement per-CPU blowing up the maximum number of
           allowed concurrent executions. I incorrectly assumed that this
           wouldn't cause practical problems as most unbound workqueue users
           self-regulate max concurrency; however, there definitely are
           ones which don't (e.g. on IO paths) and the drastic increase in the
           allowed max concurrency led to noticeable perf regressions in some
           use cases.
      
           This is now addressed by separating out max concurrency enforcement
           to a separate struct - wq_node_nr_active - which makes @max_active
           consistently mean system-wide max concurrency regardless of the
           number of CPUs or (finally) NUMA nodes. This is rather invasive
           and, in places, a bit clunky; however, the clunkiness arises from
           the inherent requirement to handle the disagreement between the
           execution locality domain and max concurrency enforcement domain on
           some modern machines.
      
           See commit 5797b1c1 ("workqueue: Implement system-wide
           nr_active enforcement for unbound workqueues") for more details.
      
         - BH workqueue support is added.
      
           They are similar to per-CPU workqueues but execute work items in
           the softirq context. This is expected to replace tasklet. However,
           currently, it's missing the ability to disable and enable work
           items which is needed to convert many tasklet users. To avoid
           crowding this merge window too much, this will be included in the
           next merge window. A separate pull request will be sent for the
           couple of conversion patches that are currently pending.
      
         - Waiman plugged a long-standing hole in workqueue CPU isolation
           where ordered workqueues didn't follow wq_unbound_cpumask updates.
           Ordered workqueues now follow the same rules as other unbound
           workqueues.
      
         - More CPU isolation improvements: Juri fixed another deficit in
           workqueue isolation where unbound rescuers don't respect
           wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on
           isolated CPUs.
      
         - Other misc changes"
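The max concurrency problem described in the first item above comes down to simple arithmetic, which the following sketch makes explicit. It is a single-threaded toy model, not the workqueue code: `wq_model` and its helpers are hypothetical, and the real implementation tracks active work items per NUMA node in `wq_node_nr_active` rather than in one counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a workqueue with one system-wide active counter. */
struct wq_model {
    unsigned int max_active;   /* system-wide limit */
    unsigned int nr_active;    /* work items currently executing */
};

/*
 * Effective limit when every CPU enforces max_active on its own,
 * which is the unwelcome side effect described in the text.
 */
unsigned int per_cpu_effective_limit(unsigned int max_active,
                                     unsigned int nr_cpus)
{
    return max_active * nr_cpus;
}

/* System-wide enforcement: start a work item only while below the limit. */
bool wq_try_start(struct wq_model *wq)
{
    if (wq->nr_active >= wq->max_active)
        return false;
    wq->nr_active++;
    return true;
}

/* Mark one executing work item as finished. */
void wq_finish(struct wq_model *wq)
{
    assert(wq->nr_active > 0);
    wq->nr_active--;
}
```

With `max_active` set to 16 on a 64 CPU machine, per-CPU enforcement would admit up to 1024 concurrent work items, while the single counter keeps the limit at 16 regardless of the CPU count.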
      
      * tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits)
        workqueue: Drain BH work items on hot-unplugged CPUs
        workqueue: Introduce from_work() helper for cleaner callback declarations
        workqueue: Control intensive warning threshold through cmdline
        workqueue: Make @flags handling consistent across set_work_data() and friends
        workqueue: Remove clear_work_data()
        workqueue: Factor out work_grab_pending() from __cancel_work_sync()
        workqueue: Clean up enum work_bits and related constants
        workqueue: Introduce work_cancel_flags
        workqueue: Use variable name irq_flags for saving local irq flags
        workqueue: Reorganize flush and cancel[_sync] functions
        workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
        workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
        workqueue: Cosmetic changes
        workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
        workqueue: Fix queue_work_on() with BH workqueues
        async: Use a dedicated unbound workqueue with raised min_active
        workqueue: Implement workqueue_set_min_active()
        workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
        workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
        kernel/workqueue: Let rescuers follow unbound wq cpumask changes
        ...
    • Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux · 8ede842f
      Linus Torvalds authored
      Pull Rust updates from Miguel Ojeda:
       "Another routine one in terms of features. We got two version upgrades
        this time, but in terms of lines, 'alloc' changes are not very large.
      
        Toolchain and infrastructure:
      
         - Upgrade to Rust 1.76.0
      
           This time around, due to how the kernel and Rust schedules have
           aligned, there are two upgrades in fact. These allow us to remove
           two more unstable features ('const_maybe_uninit_zeroed' and
           'ptr_metadata') from the list, among other improvements
      
         - Mark 'rustc' (and others) invocations as recursive, which fixes a
           new warning and prepares us for the future in case we eventually
           take advantage of the Make jobserver
      
        'kernel' crate:
      
         - Add the 'container_of!' macro
      
         - Stop using the unstable 'ptr_metadata' feature by employing the now
           stable 'byte_sub' method to implement 'Arc::from_raw()'
      
         - Add the 'time' module with a 'msecs_to_jiffies()' conversion
           function to begin with, to be used by Rust Binder
      
         - Add 'notify_sync()' and 'wait_interruptible_timeout()' methods to
           'CondVar', to be used by Rust Binder
      
         - Update integer types for 'CondVar'
      
         - Rename 'wait_list' field to 'wait_queue_head' in 'CondVar'
      
         - Implement 'Display' and 'Debug' for 'BStr'
      
         - Add the 'try_from_foreign()' method to the 'ForeignOwnable' trait
      
         - Add reexports for macros so that they can be used from the right
           module (in addition to the root)
      
         - A series of code documentation improvements, including adding
           intra-doc links, consistency improvements, typo fixes...
      
        'macros' crate:
      
         - Place generated 'init_module()' function in '.init.text'
      
        Documentation:
      
         - Add documentation on Rust doctests and how they work"
      
      * tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux: (29 commits)
        rust: upgrade to Rust 1.76.0
        kbuild: mark `rustc` (and others) invocations as recursive
        rust: add `container_of!` macro
        rust: str: implement `Display` and `Debug` for `BStr`
        rust: module: place generated init_module() function in .init.text
        rust: types: add `try_from_foreign()` method
        docs: rust: Add description of Rust documentation test as KUnit ones
        docs: rust: Move testing to a separate page
        rust: kernel: stop using ptr_metadata feature
        rust: kernel: add reexports for macros
        rust: locked_by: shorten doclink preview
        rust: kernel: remove unneeded doclink targets
        rust: kernel: add doclinks
        rust: kernel: add blank lines in front of code blocks
        rust: kernel: mark code fragments in docs with backticks
        rust: kernel: unify spelling of refcount in docs
        rust: str: move SAFETY comment in front of unsafe block
        rust: str: use `NUL` instead of 0 in doc comments
        rust: kernel: add srctree-relative doclinks
        rust: ioctl: end top-level module docs with full stop
        ...
    • Merge tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux · 5a2a15cd
      Linus Torvalds authored
      Pull compiler attributes update from Miguel Ojeda:
       "Trivial fixes to the __counted_by comments"
      
      * tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux:
        Compiler Attributes: counted_by: fixup clang URL
        Compiler Attributes: counted_by: bump min gcc version
    • Merge tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux · e5a3878c
      Linus Torvalds authored
      Pull RCU updates from Boqun Feng:
      
       - Eliminate deadlocks involving do_exit() and RCU tasks, by Paul:
         Instead of SRCU read side critical sections, now a percpu list is
         used in do_exit() for scanning yet-to-exit tasks
      
       - Fix a deadlock due to the dependency between workqueue and RCU
         expedited grace period, reported by Anna-Maria Behnsen and Thomas
         Gleixner and fixed by Frederic: Now RCU expedited always uses its own
         kthread worker instead of a workqueue
      
       - RCU NOCB updates, code cleanups, unnecessary barrier removals and
         minor bug fixes
      
       - Maintain real-time response in rcu_tasks_postscan() and a minor fix
         for tasks trace quiescence check
      
       - Misc updates, comments and readability improvements, boot time
         parameter for lazy RCU and rcutorture improvements
      
       - Documentation updates
      
      * tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux: (34 commits)
        rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
        rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
        rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
        rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
        rcu-tasks: Initialize callback lists at rcu_init() time
        rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
        rcu-tasks: Repair RCU Tasks Trace quiescence check
        rcu/sync: remove un-used rcu_sync_enter_start function
        rcutorture: Suppress rtort_pipe_count warnings until after stalls
        srcu: Improve comments about acceleration leak
        rcu: Provide a boot time parameter to control lazy RCU
        rcu: Rename jiffies_till_flush to jiffies_lazy_flush
        doc: Update checklist.rst discussion of callback execution
        doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCU
        context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()
        doc: Add EARLY flag to early-parsed kernel boot parameters
        doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rst
        doc: Make checklist.rst note that spinlocks are implied RCU readers
        doc: Make whatisRCU.rst note that spinlocks are RCU readers
        doc: Spinlocks are implied RCU readers
        ...
    • Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux · 1ddeeb2a
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
      
       - MD pull requests via Song:
            - Cleanup redundant checks (Yu Kuai)
            - Remove deprecated headers (Marc Zyngier, Song Liu)
            - Concurrency fixes (Li Lingfeng)
            - Memory leak fix (Li Nan)
            - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
            - Clean up and fix for md_ioctl (Li Nan)
            - Other small fixes (Gui-Dong Han, Heming Zhao)
            - MD atomic limits (Christoph)
      
       - NVMe pull request via Keith:
            - RDMA target enhancements (Max)
            - Fabrics fixes (Max, Guixin, Hannes)
            - Atomic queue_limits usage (Christoph)
            - Const use for class_register (Ricardo)
            - Identification error handling fixes (Shin'ichiro, Keith)
      
       - Improvement and cleanup for cached request handling (Christoph)
      
       - Moving towards atomic queue limits. Core changes and driver bits so
         far (Christoph)
      
       - Fix UAF issues in aoeblk (Chun-Yi)
      
       - Zoned fix and cleanups (Damien)
      
       - s390 dasd cleanups and fixes (Jan, Miroslav)
      
       - Block issue timestamp caching (me)
      
       - noio scope guarding for zoned IO (Johannes)
      
       - block/nvme PI improvements (Kanchan)
      
       - Ability to terminate long running discard loop (Keith)
      
       - bdev revalidation fix (Li)
      
       - Get rid of old nr_queues hack for kdump kernels (Ming)
      
       - Support for async deletion of ublk (Ming)
      
       - Improve IRQ bio recycling (Pavel)
      
       - Factor in CPU capacity for remote vs local completion (Qais)
      
       - Add shared_tags configfs entry for null_blk (Shin'ichiro)
      
       - Fix for a regression in page refcounts introduced by the folio
         unification (Tony)
      
       - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
         Ricardo, Roman, Tang, Uwe)
      
      * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
        block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
        block/swim: Convert to platform remove callback returning void
        cdrom: gdrom: Convert to platform remove callback returning void
        block: remove disk_stack_limits
        md: remove mddev->queue
        md: don't initialize queue limits
        md/raid10: use the atomic queue limit update APIs
        md/raid5: use the atomic queue limit update APIs
        md/raid1: use the atomic queue limit update APIs
        md/raid0: use the atomic queue limit update APIs
        md: add queue limit helpers
        md: add a mddev_is_dm helper
        md: add a mddev_add_trace_msg helper
        md: add a mddev_trace_remap helper
        bcache: move calculation of stripe_size and io_opt into bcache_device_init
        virtio_blk: Do not use disk_set_max_open/active_zones()
        aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
        block: move capacity validation to blkpg_do_ioctl()
        block: prevent division by zero in blk_rq_stat_sum()
        drbd: atomically update queue limits in drbd_reconsider_queue_parameters
        ...
    • Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux · d2c84bdc
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
      
       - Make running of task_work internal loops more fair, and unify how the
         different methods deal with them (me)
      
       - Support for per-ring NAPI. The two minor networking patches are in a
         shared branch with netdev (Stefan)
      
       - Add support for truncate (Tony)
      
       - Export SQPOLL utilization stats (Xiaobing)
      
       - Multishot fixes (Pavel)
      
       - Fix for a race in manipulating the request flags via poll (Pavel)
      
       - Clean up the multishot checking by making it generic, moving it out of
         opcode handlers (Pavel)
      
       - Various tweaks and cleanups (me, Kunwu, Alexander)
      
      * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
        io_uring: Fix sqpoll utilization check racing with dying sqpoll
        io_uring/net: dedup io_recv_finish req completion
        io_uring: refactor DEFER_TASKRUN multishot checks
        io_uring: fix mshot io-wq checks
        io_uring/net: add io_req_msg_cleanup() helper
        io_uring/net: simplify msghd->msg_inq checking
        io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
        io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
        io_uring/net: correctly handle multishot recvmsg retry setup
        io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
        io_uring: fix io_queue_proc modifying req->flags
        io_uring: fix mshot read defer taskrun cqe posting
        io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
        io_uring/net: correct the type of variable
        io_uring/sqpoll: statistics of the true utilization of sq threads
        io_uring/net: move recv/recvmsg flags out of retry loop
        io_uring/kbuf: flag request if buffer pool is empty after buffer pick
        io_uring/net: improve the usercopy for sendmsg/recvmsg
        io_uring/net: move receive multishot out of the generic msghdr path
        io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
        ...
    • Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 0f1a8766
      Linus Torvalds authored
      Pull vfs uuid updates from Christian Brauner:
       "This adds two new ioctl()s for getting the filesystem uuid and
        retrieving the sysfs path based on the path of a mounted filesystem.
        Getting the filesystem uuid has been implemented in
        filesystem-specific code for a while; it's now lifted into a generic
        ioctl"
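
A rough userspace sketch of calling the two ioctls follows. The request names `FS_IOC_GETFSUUID` and `FS_IOC_GETFSSYSFSPATH`, the `struct fsuuid2` and `struct fs_sysfs_path` layouts, and the `0x15` request numbers in the fallback definitions are assumptions based on the 6.9 uapi `linux/fs.h`; none of them are spelled out in this message. The helper names `get_fs_uuid()` and `get_fs_sysfs_path()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Fallbacks for pre-6.9 headers; layouts assumed to mirror uapi linux/fs.h. */
#ifndef FS_IOC_GETFSUUID
struct fsuuid2 {
	unsigned char len;
	unsigned char uuid[16];
};
#define FS_IOC_GETFSUUID _IOR(0x15, 0, struct fsuuid2)
#endif

#ifndef FS_IOC_GETFSSYSFSPATH
struct fs_sysfs_path {
	unsigned char len;
	unsigned char name[128];
};
#define FS_IOC_GETFSSYSFSPATH _IOR(0x15, 1, struct fs_sysfs_path)
#endif

/* Open any path inside the filesystem and issue one ioctl on it. */
static int fs_ioctl_on_path(const char *path, unsigned long req, void *arg)
{
	int fd, ret, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, req, arg);
	saved = errno;		/* keep the ioctl errno across close() */
	close(fd);
	errno = saved;
	return ret < 0 ? -1 : 0;
}

/* Copy the filesystem UUID into out; returns its length or -1 (errno set). */
int get_fs_uuid(const char *path, unsigned char out[16])
{
	struct fsuuid2 u;

	assert(path != NULL && out != NULL);
	memset(&u, 0, sizeof(u));
	if (fs_ioctl_on_path(path, FS_IOC_GETFSUUID, &u) < 0)
		return -1;
	if (u.len > sizeof(u.uuid)) {
		errno = EOVERFLOW;
		return -1;
	}
	memcpy(out, u.uuid, u.len);
	return u.len;
}

/* Copy the sysfs name into out (NUL-terminated); returns its length or -1. */
int get_fs_sysfs_path(const char *path, char out[129])
{
	struct fs_sysfs_path p;

	assert(path != NULL && out != NULL);
	memset(&p, 0, sizeof(p));
	if (fs_ioctl_on_path(path, FS_IOC_GETFSSYSFSPATH, &p) < 0)
		return -1;
	if (p.len > sizeof(p.name)) {
		errno = EOVERFLOW;
		return -1;
	}
	memcpy(out, p.name, p.len);
	out[p.len] = '\0';
	return p.len;
}
```

On kernels or filesystems without support, both helpers return -1, typically with errno set to ENOTTY, so callers would still need their filesystem-specific fallback.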
      
      * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        xfs: add support for FS_IOC_GETFSSYSFSPATH
        fs: add FS_IOC_GETFSSYSFSPATH
        fat: Hook up sb->s_uuid
        fs: FS_IOC_GETUUID
        ovl: convert to super_set_uuid()
        fs: super_set_uuid()
    • Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 910202f0
      Linus Torvalds authored
      Pull block handle updates from Christian Brauner:
       "Last cycle we changed opening of block devices, and opening a block
        device would return a bdev_handle. This allowed us to implement
        support for restricting and forbidding writes to mounted block
        devices. It was accompanied by converting and adding helpers to
        operate on bdev_handles instead of plain block devices.
      
        That was already a good step forward but ultimately it isn't necessary
        to have special purpose helpers for opening block devices internally
        that return a bdev_handle.
      
        Fundamentally, opening a block device internally should just be
        equivalent to opening files. So now all internal opens of block
        devices return files just as a userspace open would. Instead of
        introducing a separate indirection into bdev_open_by_*() via struct
        bdev_handle, bdev_file_open_by_*() is made to just return a struct
        file. Opening and closing a block device just becomes equivalent to
        opening and closing a file.
      
        This all works well because internally we already have a pseudo fs for
        block devices and so opening block devices is simple. There are a few
        places where we needed to be careful such as during boot when the
        kernel is supposed to mount the rootfs directly without init doing it.
        Here we need to take care to ensure that we flush out any asynchronous
        file close. That's what we already do for opening, unpacking, and
        closing the initramfs. So nothing new here.
      
        The equivalence of opening and closing block devices to regular files
        is a win in and of itself. But it also has various other advantages.
        We can remove struct bdev_handle completely. Various low-level helpers
        are now private to the block layer. Other helpers could simply be
        removed completely.
      
        A follow-up series that is already reviewed builds on this and makes it
        possible to remove bdev->bd_inode and allows various clean ups of the
        buffer head code as well. All places where we stashed a bdev_handle
        now just stash a file and use simple accessors to get to the actual
        block device which was already the case for bdev_handle"
      
      * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
        block: remove bdev_handle completely
        block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
        bdev: remove bdev pointer from struct bdev_handle
        bdev: make struct bdev_handle private to the block layer
        bdev: make bdev_{release, open_by_dev}() private to block layer
        bdev: remove bdev_open_by_path()
        reiserfs: port block device access to file
        ocfs2: port block device access to file
        nfs: port block device access to files
        jfs: port block device access to file
        f2fs: port block device access to files
        ext4: port block device access to file
        erofs: port device access to file
        btrfs: port device access to file
        bcachefs: port block device access to file
        target: port block device access to file
        s390: port block device access to file
        nvme: port block device access to file
        block2mtd: port device access to files
        bcache: port block device access to files
        ...
    • Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 0c750012
      Linus Torvalds authored
      Pull file locking updates from Christian Brauner:
       "A few years ago struct file_lock_context was added to allow for
        separate lists to track different types of file locks instead of using
        a singly-linked list for all of them.
      
        Now leases no longer need to be tracked using struct file_lock.
        However, a lot of the infrastructure is identical for leases and locks
        so separating them isn't trivial.
      
        This splits a group of fields used by both file locks and leases into
        a new struct file_lock_core. The new core struct is embedded in struct
        file_lock. Coccinelle was used to convert a lot of the callers to deal
        with the move, with the remaining 25% or so converted by hand.
      
        Afterwards several internal functions in fs/locks.c are made to work
        with struct file_lock_core. Ultimately this allows splitting struct
        file_lock into struct file_lock and struct file_lease. The file lease
        APIs are then converted to take struct file_lease"
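
The embedding described above can be pictured with a small userspace sketch. The structs below are simplified stand-ins, not the kernel definitions: every field name, the `demo_container_of()` macro and both helper functions are hypothetical and only illustrate the shape of the split.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: all field names here are hypothetical. */
struct file_lock_core {
	int type;	/* e.g. read or write */
	int pid;	/* owner */
};

struct file_lock {
	struct file_lock_core c;	/* part shared with leases */
	long long start, end;		/* byte range, locks only */
};

struct file_lease {
	struct file_lock_core c;	/* same shared part */
	unsigned long break_time;	/* leases only */
};

/* Userspace version of the container_of() idea. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Helpers that only need the shared fields take the core struct. */
int core_owner(const struct file_lock_core *flc)
{
	assert(flc != NULL);
	return flc->pid;
}

/* Code that knows it holds a lock can get back to the outer struct. */
struct file_lock *file_lock_from_core(struct file_lock_core *flc)
{
	assert(flc != NULL);
	return demo_container_of(flc, struct file_lock, c);
}
```

With this layout a helper such as `core_owner()` accepts `&lock.c` and `&lease.c` alike, which is the property that lets shared internals stop caring whether they were handed a lock or a lease.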
      
      * tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
        filelock: fix deadlock detection in POSIX locking
        filelock: always define for_each_file_lock()
        smb: remove redundant check
        filelock: don't do security checks on nfsd setlease calls
        filelock: split leases out of struct file_lock
        filelock: remove temporary compatibility macros
        smb/server: adapt to breakup of struct file_lock
        smb/client: adapt to breakup of struct file_lock
        ocfs2: adapt to breakup of struct file_lock
        nfsd: adapt to breakup of struct file_lock
        nfs: adapt to breakup of struct file_lock
        lockd: adapt to breakup of struct file_lock
        fuse: adapt to breakup of struct file_lock
        gfs2: adapt to breakup of struct file_lock
        dlm: adapt to breakup of struct file_lock
        ceph: adapt to breakup of struct file_lock
        afs: adapt to breakup of struct file_lock
        9p: adapt to breakup of struct file_lock
        filelock: convert seqfile handling to use file_lock_core
        filelock: convert locks_translate_pid to take file_lock_core
        ...
    • Merge tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · b5683a37
      Linus Torvalds authored
      Pull pidfd updates from Christian Brauner:
      
       - Until now pidfds could only be created for thread-group leaders but
         not for threads. There was no technical reason for this. We simply
         had no users that needed support for this. Now we do have users that
         need support for this.
      
         This introduces a new PIDFD_THREAD flag for pidfd_open(). If that
         flag is set pidfd_open() creates a pidfd that refers to a specific
         thread.
      
         In addition, we now allow clone() and clone3() to be called with
         CLONE_PIDFD | CLONE_THREAD which wasn't possible before.
      
         A pidfd that refers to an individual thread differs from a pidfd that
         refers to a thread-group leader:
      
          (1) Pidfds are pollable. A task may poll a pidfd and get notified
              when the task has exited.
      
              For thread-group leader pidfds the polling task is woken if the
              thread-group is empty. In other words, if the thread-group
              leader task exits when there are still threads alive in its
              thread-group the polling task will not be woken when the
              thread-group leader exits but rather when the last thread in the
              thread-group exits.
      
              For thread-specific pidfds the polling task is woken if the
              thread exits.
      
          (2) Passing a thread-group leader pidfd to pidfd_send_signal() will
              generate thread-group directed signals like kill(2) does.
      
              Passing a thread-specific pidfd to pidfd_send_signal() will
              generate thread-specific signals like tgkill(2) does.
      
              The default scope of the signal is thus determined by the type
              of the pidfd.
      
              Since use-cases exist where the default scope of the provided
              pidfd needs to be overridden, the following flags are added to
              pidfd_send_signal():
      
               - PIDFD_SIGNAL_THREAD
                 Send a thread-specific signal.
      
               - PIDFD_SIGNAL_THREAD_GROUP
                 Send a thread-group directed signal.
      
               - PIDFD_SIGNAL_PROCESS_GROUP
                 Send a process-group directed signal.
      
              The scope change will only work if the struct pid is actually
              used for this scope.
      
              For example, in order to send a thread-group directed signal the
              provided pidfd must be used as a thread-group leader and
              similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be
              used as a process group leader.
      
       - Move pidfds from the anonymous inode infrastructure to a tiny pseudo
         filesystem. This will unblock further work that we weren't able to do
         simply because of the very justified limitations of anonymous inodes.
         Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds
         to become useful for the first time. They can now be compared by
         inode numbers, which are unique for the system lifetime.
      
         Instead of stashing struct pid in file->private_data we can now stash
         it in inode->i_private. This makes it possible to introduce concepts
         that operate on a process once all file descriptors have been closed.
         A concrete example is kill-on-last-close. Another side-effect is that
         file->private_data is now freed up for per-file options for pidfds.
      
         With this change, each struct pid refers to a different inode, but
         the same struct pid refers to the same inode if it's opened multiple
         times. Previously, every struct pid referred to the same inode.
      
         The tiny pseudo filesystem is not visible anywhere in userspace
         exactly like e.g., pipefs and sockfs. There's no lookup, there's no
         complex inode operations, nothing. Dentries and inodes are always
         deleted when the last pidfd is closed.
      
         We allocate a new inode and dentry for each struct pid and we reuse
         that inode and dentry for all pidfds that refer to the same struct
         pid. The code is entirely optional and fairly small. If it's not
         selected we fall back to anonymous inodes. Heavily inspired by nsfs.
      
         The dentry and inode allocation mechanism is moved into generic
         infrastructure that is now shared between nsfs and pidfs. The
         path_from_stashed() helper must be provided with a stashing location,
         an inode number, a mount, and the private data that is supposed to be
         used and it will provide a path that can be passed to dentry_open().
      
         The helper will try to retrieve an existing dentry from the provided
         stashing location. If a valid dentry is found it is reused. If not a
         new one is allocated and we try to stash it in the provided location.
         If this fails we retry until we either find an existing dentry or the
         newly allocated dentry could be stashed. Subsequent openers of the
         same namespace or task are then able to reuse it.
      
       - Currently it is only possible to get notified when a task has exited,
         i.e., become a zombie and userspace gets notified with EPOLLIN. We
         now also support waiting until the task has been reaped, notifying
         userspace with EPOLLHUP.
      
       - Ensure that ESRCH is reported for getfd if a task is exiting instead
         of the confusing EBADF.
      
       - Various smaller cleanups to pidfd functions.
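
A userspace sketch of the pidfd behaviour described above, under stated assumptions: the raw syscall numbers (434 and 424), `PIDFD_THREAD` being `O_EXCL`, and the numeric values of the `PIDFD_SIGNAL_*` flags are taken from the 6.9 uapi headers and are not given in this message. The `demo_*` function names are hypothetical. The second poll after reaping only checks whether POLLHUP shows up and does not rely on it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed from the 6.9 uapi headers. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL
#endif
#ifndef PIDFD_SIGNAL_THREAD
#define PIDFD_SIGNAL_THREAD		(1UL << 0)
#define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
#define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
#endif

int demo_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Open a pidfd for one specific thread; older kernels reject the flag. */
int demo_pidfd_open_thread(pid_t tid)
{
	return demo_pidfd_open(tid, PIDFD_THREAD);
}

int demo_pidfd_send_signal(int pidfd, int sig, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, flags);
}

/*
 * Fork a child, kill it through a thread-group leader pidfd and wait for
 * the exit notification (POLLIN). With sig_flags == 0 the signal scope
 * follows the pidfd type. After reaping, poll once more and record in
 * *saw_hup whether POLLHUP was reported.
 * Returns 0 on success, -1 if the pidfd calls are unavailable.
 */
int demo_kill_and_wait(unsigned int sig_flags, int *saw_hup)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret = -1;
	pid_t child;

	assert(saw_hup != NULL);
	*saw_hup = 0;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		pause();	/* wait to be signalled */
		_exit(0);
	}

	pfd.fd = demo_pidfd_open(child, 0);
	if (pfd.fd >= 0 &&
	    demo_pidfd_send_signal(pfd.fd, SIGKILL, sig_flags) == 0 &&
	    poll(&pfd, 1, 5000) == 1 && (pfd.revents & POLLIN))
		ret = 0;
	else
		kill(child, SIGKILL);	/* fallback so the child never lingers */

	waitpid(child, NULL, 0);	/* reap the zombie */

	if (ret == 0 && poll(&pfd, 1, 0) >= 0)
		*saw_hup = (pfd.revents & POLLHUP) != 0;
	if (pfd.fd >= 0)
		close(pfd.fd);
	return ret;
}
```

Passing `PIDFD_SIGNAL_THREAD`, `PIDFD_SIGNAL_THREAD_GROUP` or `PIDFD_SIGNAL_PROCESS_GROUP` as `sig_flags` would request an explicit scope instead of the default; kernels without this series are expected to reject those flags.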
      
      * tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
        libfs: improve path_from_stashed()
        libfs: add stashed_dentry_prune()
        libfs: improve path_from_stashed() helper
        pidfs: convert to path_from_stashed() helper
        nsfs: convert to path_from_stashed() helper
        libfs: add path_from_stashed()
        pidfd: add pidfs
        pidfd: move struct pidfd_fops
        pidfd: allow to override signal scope in pidfd_send_signal()
        pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
        signal: fill in si_code in prepare_kill_siginfo()
        selftests: add ESRCH tests for pidfd_getfd()
        pidfd: getfd should always report ESRCH if a task is exiting
        pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
        pidfd: exit: kill the no longer used thread_group_exited()
        pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
        pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
        pidfd: kill the no longer needed do_notify_pidfd() in de_thread()
        pidfd_poll: report POLLHUP when pid_task() == NULL
        pidfd: implement PIDFD_THREAD flag for pidfd_open()
        ...
    • Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 54126faf
      Linus Torvalds authored
      Pull iomap updates from Christian Brauner:
      
       - Restore read-write hints in struct bio through the bi_write_hint
         member for the sake of UFS devices in mobile applications. This can
         result in up to 40% lower write amplification in UFS devices. The
         patch series that builds on this will be coming in via the SCSI
         maintainers (Bart)
      
       - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
         to map multiple blocks at once as long as they're in the same folio.
         This reduces CPU usage for buffered write workloads on e.g., xfs on
         systems with lots of cores (Christoph)
      
       - Record processed bytes in iomap_iter() trace event (Kassey)
      
       - Extend iomap_writepage_map() trace event after Christoph's
         ->map_blocks() changes to map multiple blocks at once (Zhang)
      
      * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
        iomap: Add processed for iomap_iter
        iomap: add pos and dirty_len into trace_iomap_writepage_map
        block, fs: Restore the per-bio/request data lifetime fields
        fs: Propagate write hints to the struct block_device inode
        fs: Move enum rw_hint into a new header file
        fs: Split fcntl_rw_hint()
        fs: Verify write lifetime constants at compile time
        fs: Fix rw_hint validation
        iomap: pass the length of the dirty region to ->map_blocks
        iomap: map multiple blocks at a time
        iomap: submit ioends immediately
        iomap: factor out a iomap_writepage_map_block helper
        iomap: only call mapping_set_error once for each failed bio
        iomap: don't chain bios
        iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
        iomap: clean up the iomap_alloc_ioend calling convention
        iomap: move all remaining per-folio logic into iomap_writepage_map
        iomap: factor out a iomap_writepage_handle_eof helper
        iomap: move the PF_MEMALLOC check to iomap_writepages
        iomap: move the io_folios field out of struct iomap_ioend
        ...
    • Merge tag 'vfs-6.9.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 77417942
      Linus Torvalds authored
      Pull ntfs update from Christian Brauner:
       "This removes the old ntfs driver. The new ntfs3 driver is a full
        replacement that was merged over two years ago. We've gone through
        various userspaces and either they use ntfs3 or they use the fuse
        version of ntfs and thus build neither ntfs nor ntfs3. I think that's
        a clear sign that we should risk removing the legacy ntfs driver.
      
        Quoting from Arch Linux and Debian:
      
         - Debian builds neither the legacy ntfs nor the new ntfs3:
      
           "Not currently built with Debian's kernel packages, 'ntfs' has been
            symlinked to 'ntfs-3g' as it relates to fstab and mount commands.
      
            Debian kernels are built without support of the ntfs3 driver
            developed by Paragon Software."  (cf. [2])
      
         - Arch Linux has provided ntfs3 as its default since 5.15:
      
           "All officially supported kernels with versions 5.15 or newer are
            built with CONFIG_NTFS3_FS=m and thus support it. Before 5.15,
            NTFS read and write support is provided by the NTFS-3G FUSE file
            system."  (cf. [1]).
      
        The legacy driver is also unmaintained apart from various odd fixes.
        Worst case we have to reintroduce it if someone really has a valid
        dependency on it.
        But it's worth trying to see whether we can remove it"
      
      Link: https://wiki.archlinux.org/title/NTFS [1]
      Link: https://wiki.debian.org/NTFS [2]
      
      * tag 'vfs-6.9.ntfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: remove NTFS classic from docum. index
        fs: Remove NTFS classic
    • Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 7ea65c89
      Linus Torvalds authored
      Pull misc vfs updates from Christian Brauner:
       "Misc features, cleanups, and fixes for vfs and individual filesystems.
      
        Features:
      
         - Support idmapped mounts for hugetlbfs.
      
         - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
           where the passed offset is ignored if the file is O_APPEND. The new
           flag allows a caller to enforce that the offset is honored to
           conform to posix even if the file was opened in append mode.
      
         - Move i_mmap_rwsem in struct address_space to avoid false sharing
           between i_mmap and i_mmap_rwsem.
      
         - Convert efs, qnx4, and coda to use the new mount api.
      
         - Add a generic is_dot_dotdot() helper that's used by various
           filesystems and the VFS code instead of open-coding it multiple
           times.
      
         - Recently we've added stable offsets, which allow stable ordering
           when iterating directories exported through NFS on e.g., tmpfs
           filesystems. Originally an xarray was used for the offset map but
           that caused slab fragmentation issues over time. This switches the
           offset map to the maple tree which has a dense mode that handles
           this scenario a lot better. Includes tests.
      
         - Finally merge the case-insensitive improvement series Gabriel has
           been working on for a long time. This cleanly propagates case
           insensitive operations through ->s_d_op which in turn allows us to
           remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
           It also improves performance by trying a case-sensitive comparison
           first and then falling back to case-insensitive lookup if that fails.
           This also fixes a bug where overlayfs could be mounted over a
           case-insensitive directory, which would lead to all sorts of odd
           behaviors.
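
        The generic "." / ".." helper mentioned above can be sketched in
        userspace roughly as follows. This is a minimal illustration, not
        the kernel's code: the signature (a name plus an explicit length)
        is an assumption made here, chosen because directory entry names
        are not necessarily NUL-terminated.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace sketch of a "." / ".." check (signature is an assumption).
 * The name is passed with an explicit length because directory entry
 * names are not necessarily NUL-terminated.
 */
bool is_dot_dotdot(const char *name, size_t len)
{
	if (len == 0 || name[0] != '.')
		return false;
	/* "." is one byte; ".." is two bytes, both dots. */
	return len == 1 || (len == 2 && name[1] == '.');
}
```

        Having one such helper replaces the per-filesystem copies of the
        same two-character comparison.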
      
        Cleanups:
      
         - Make file_dentry() a simple accessor now that ->d_real() is
           simplified because of the backing file work we did the last two
           cycles.
      
         - Use the dedicated file_mnt_idmap helper in ntfs3.
      
         - Use smp_load_acquire/store_release() in the i_size_read/write
           helpers and thus remove the hack to handle i_size reads in the
           filemap code.
      
         - SLAB_MEM_SPREAD is a no-op now. Remove it from various places in
           fs/.
      
         - It's no longer necessary to perform a second built-in initramfs
           unpack call because we retain the contents of the previous
           extraction. Remove it.
      
         - Now that we have removed various allocators, kfree_rcu() always
           works with kmem caches and kmalloc(). So simplify various places
           that only use an rcu callback in order to handle the kmem cache
           case.
      
         - Convert the pipe code to use a lockdep comparison function instead
           of open-coding the nesting, making lockdep validation easier.
      
         - Move code into fs-writeback.c that was located in a header but can
           be made static as it's only used in that one file.
      
         - Rewrite the alignment checking iterators for iovec and bvec to be
           easier to read, and also significantly more compact in terms of
           generated code. This saves 270 bytes of text on x86-64 (with
           clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
           saves a bit of time for the same workload.
      
         - Switch various places to use KMEM_CACHE instead of
           kmem_cache_create().
      
         - Use inode_set_ctime_to_ts() in inode_set_ctime_current()
      
         - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.
      
         - Various smaller cleanups for eventfds.
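
        The acquire/release pairing used for the i_size helpers can be
        illustrated with C11 atomics as a userspace analogue. This is only
        a sketch: struct demo_inode and both function names are
        hypothetical, and the kernel uses its own smp_load_acquire() and
        smp_store_release() primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; only the size field is modelled. */
struct demo_inode {
	_Atomic int64_t size;
};

/* Writer side: publish the new size with release semantics. */
void demo_size_write(struct demo_inode *inode, int64_t new_size)
{
	atomic_store_explicit(&inode->size, new_size, memory_order_release);
}

/* Reader side: an acquire load pairs with the release store above. */
int64_t demo_size_read(struct demo_inode *inode)
{
	return atomic_load_explicit(&inode->size, memory_order_acquire);
}
```

        A reader that observes the new size through the acquire load also
        observes the writes the writer made before its release store.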
      
        Fixes:
      
         - Fix various comments and typos, and unneeded initializations.
      
         - Fix stack allocation hack for clang in the select code.
      
         - Improve dump_mapping() debug code on a best-effort basis.
      
         - Fix build errors in various selftests.
      
         - Avoid wrap-around instrumentation in various places.
      
         - Don't allow user namespaces without an idmapping to be used for
           idmapped mounts.
      
         - Fix sysv sb_read() call.
      
         - Fix fallback implementation of the get_name() export operation"
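
      The "exact match first, case-insensitive fallback" ordering described
      for casefolded lookups can be sketched as below. This is an ASCII-only
      illustration under stated assumptions: demo_ci_name_match() is a
      hypothetical name, and strncasecmp() stands in for the Unicode
      casefolding a real filesystem would use (where folded names may also
      differ in byte length).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Compare a looked-up name against a stored directory entry name.
 * ASCII-only sketch: strncasecmp() stands in for real Unicode casefolding.
 */
bool demo_ci_name_match(const char *entry, size_t entry_len,
			const char *name, size_t name_len)
{
	if (entry_len != name_len)
		return false;
	/* Fast path: a byte-for-byte match needs no case folding at all. */
	if (memcmp(entry, name, name_len) == 0)
		return true;
	/* Slow path: fall back to the case-insensitive comparison. */
	return strncasecmp(entry, name, name_len) == 0;
}
```

      The cheap memcmp() settles the common case where the caller already
      uses the stored spelling, so the folding cost is only paid on a miss.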
      
      * tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
        hugetlbfs: support idmapped mounts
        qnx4: convert qnx4 to use the new mount api
        fs: use inode_set_ctime_to_ts to set inode ctime to current time
        libfs: Drop generic_set_encrypted_ci_d_ops
        ubifs: Configure dentry operations at dentry-creation time
        f2fs: Configure dentry operations at dentry-creation time
        ext4: Configure dentry operations at dentry-creation time
        libfs: Add helper to choose dentry operations at mount-time
        libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
        fscrypt: Drop d_revalidate once the key is added
        fscrypt: Drop d_revalidate for valid dentries during lookup
        fscrypt: Factor out a helper to configure the lookup dentry
        ovl: Always reject mounting over case-insensitive directories
        libfs: Attempt exact-match comparison first during casefolded lookup
        efs: remove SLAB_MEM_SPREAD flag usage
        jfs: remove SLAB_MEM_SPREAD flag usage
        minix: remove SLAB_MEM_SPREAD flag usage
        openpromfs: remove SLAB_MEM_SPREAD flag usage
        proc: remove SLAB_MEM_SPREAD flag usage
        qnx6: remove SLAB_MEM_SPREAD flag usage
        ...
      7ea65c89
    • Linus Torvalds's avatar
      Merge tag 'linux_kselftest-kunit-6.9-rc1' of... · 97ec9715
      Linus Torvalds authored
      Merge tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit updates from Shuah Khan:
      
       - fix to make kunit_bus_type const
      
       - kunit tool change to print the UML command
      
       - DRM device creation helpers are now using the new kunit device
         creation helpers. This change resulted in DRM helpers switching from
         using a platform_device to a dedicated bus and device type used by
         kunit. kunit devices don't set a DMA mask, and this caused a
         regression in some DRM tests as they can't allocate DMA buffers. Fix
         this problem by setting DMA masks on the kunit device during
         initialization.
      
       - KUnit has several macros which accept a log message, which can
         contain printf format specifiers. Some of these (the explicit log
         macros) already use the __printf() gcc attribute to ensure the format
         specifiers are valid, but those which could fail the test, and hence
         used __kunit_do_failed_assertion() behind the scenes, did not.
      
         These include: KUNIT_EXPECT_*_MSG(), KUNIT_ASSERT_*_MSG(), and
         KUNIT_FAIL()
      
         A nine-patch series adds the __printf() attribute, and fixes all of
         the issues uncovered.
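
      What such a format annotation buys can be shown with a small
      userspace sketch. DEMO_PRINTF and demo_fail_msg() are hypothetical
      names invented for this example; only the underlying GCC/Clang
      format(printf, ...) attribute is real.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical macro name; wraps the GCC/Clang format attribute. */
#define DEMO_PRINTF(fmt_idx, args_idx) \
	__attribute__((format(printf, fmt_idx, args_idx)))

/*
 * Formats a failure message into buf. Because of the attribute, the
 * compiler checks each call's arguments against the format string
 * (reported under -Wformat, which -Wall enables).
 */
DEMO_PRINTF(3, 4)
int demo_fail_msg(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = vsnprintf(buf, len, fmt, args);
	va_end(args);
	return ret;
}
```

      With the annotation in place, a call such as
      demo_fail_msg(buf, len, "%s", 42) draws a compile-time format
      warning instead of misbehaving only when the assertion fails.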
      
      * tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: Annotate _MSG assertion variants with gnu printf specifiers
        drm: tests: Fix invalid printf format specifiers in KUnit tests
        drm/xe/tests: Fix printf format specifiers in xe_migrate test
        net: test: Fix printf format specifier in skb_segment kunit test
        rtc: test: Fix invalid format specifier.
        time: test: Fix incorrect format specifier
        lib: memcpy_kunit: Fix an invalid format specifier in an assertion msg
        lib/cmdline: Fix an invalid format specifier in an assertion msg
        kunit: test: Log the correct filter string in executor_test
        kunit: Setup DMA masks on the kunit device
        kunit: make kunit_bus_type const
        kunit: Mark filter* params as rw
        kunit: tool: Print UML command
      97ec9715
    • Linus Torvalds's avatar
      Merge tag 'linux_kselftest-next-6.9-rc1' of... · d451b075
      Linus Torvalds authored
      Merge tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest update from Shuah Khan:
      
       - livepatch restructuring to move the modules out of lib so they are
         built as out-of-tree modules during the kselftest build. This makes
         it easier to change, debug and rebuild the tests by running make in
         the selftests/livepatch directory, which is not currently possible
         since the modules in lib/livepatch are built and installed using the
         main makefile modules target.
      
       - livepatch restructuring fixes for problems found by kernel test
         robot. The change skips the test if kernel-devel isn't installed
         (default value of KDIR), or if the KDIR variable passed doesn't exist.
      
       - resctrl test restructuring and new non-contiguous CBMs CAT test
      
       - new ktap_helpers to print diagnostic messages, pass/fail tests based
         on exit code, abort test, and finish the test.
      
       - a new test to verify power supply properties.
      
       - a new ftrace test to exercise function tracer across cpu hotplug.
      
       - timeout increase for mqueue test to allow the test to run on i3.metal
         AWS instances.
      
       - minor spelling corrections in several tests.
      
       - missing gitignore files and changes to existing gitignore files.
      
      * tag 'linux_kselftest-next-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (57 commits)
        kselftest: Add basic test for probing the rust sample modules
        selftests: lib.mk: Do not process TEST_GEN_MODS_DIR
        selftests: livepatch: Avoid running the tests if kernel-devel is missing
        selftests: livepatch: Add initial .gitignore
        selftests/resctrl: Add non-contiguous CBMs CAT test
        selftests/resctrl: Add resource_info_file_exists()
        selftests/resctrl: Split validate_resctrl_feature_request()
        selftests/resctrl: Add a helper for the non-contiguous test
        selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
        selftests: sched: Fix spelling mistake "hiearchy" -> "hierarchy"
        selftests/mqueue: Set timeout to 180 seconds
        selftests/ftrace: Add test to exercize function tracer across cpu hotplug
        selftest: ftrace: fix minor typo in log
        selftests: thermal: intel: workload_hint: add missing gitignore
        selftests: thermal: intel: power_floor: add missing gitignore
        selftests: uevent: add missing gitignore
        selftests: Add test to verify power supply properties
        selftests: ktap_helpers: Add a helper to finish the test
        selftests: ktap_helpers: Add a helper to abort the test
        selftests: ktap_helpers: Add helper to pass/fail test based on exit code
        ...
      d451b075
  2. 10 Mar, 2024 7 commits
    • Linus Torvalds's avatar
      Linux 6.8 · e8f897f4
      Linus Torvalds authored
      e8f897f4
    • Linus Torvalds's avatar
      Merge tag 'trace-ring-buffer-v6.8-rc7' of... · fa4b851b
      Linus Torvalds authored
      Merge tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull tracing fixes from Steven Rostedt:
      
       - Do not allow large strings (> 4096) as single write to trace_marker
      
         The size of a string written into trace_marker was determined by the
         size of the sub-buffer in the ring buffer. That size is dependent on
         the PAGE_SIZE of the architecture as it can be mapped into user
         space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit
         for a string written into trace_marker 64K.
      
         One of the selftests looks at the size of the ring buffer sub-buffers
         and writes that plus more into the trace_marker. The write will take
         what it can and report back what it consumed so that the user space
         application (like echo) will write the rest of the string. The string
         is stored in the ring buffer and can be read via the "trace" or
         "trace_pipe" files.
      
         The reading of the ring buffer uses vsnprintf(), which uses a
         precision "%.*s" to make sure it only reads what is stored in the
         buffer, as a bug could cause the string to be non-terminated.
      
         With the combination of the precision change and the PAGE_SIZE of 64K
         allowing huge strings to be added into the ring buffer, plus the test
         that would actually stress that limit, a bug was reported that the
         precision used was too big for "%.*s" as the string was close to 64K
         in size and the max precision of vsnprintf is 32K.
      
         Linus suggested not to have that precision as it could hide a bug if
         the string was again stored without a nul byte.
      
         Another issue that was brought up is that the trace_seq buffer is
         also based on PAGE_SIZE even though it is not tied to the
         architecture limit like the ring buffer sub-buffer is. Having it be
         64K * 2 is simply too big and wastes memory on systems with 64K
         page sizes. It is now hardcoded to 8K, which is what all other
         architectures with 4K PAGE_SIZE have.
      
         Finally, the write to trace_marker is now limited to 4K as there is
         no reason to write larger strings into trace_marker.
      
       - ring_buffer_wait() should not loop.
      
         ring_buffer_wait() does not have the full context (yet) to know
         whether it should loop or not. Just exit the loop as soon as it's
         woken up and let the callers decide to loop or not (they already do,
         so it's a bit redundant).
      
       - Fix shortest_full field to be the smallest amount in the ring buffer
         that a waiter is waiting for. The "shortest_full" field is updated
         when a new waiter comes in and wants to wait for a smaller amount of
         data in the ring buffer than other waiters. But after all waiters are
         woken up, it's not reset, so if another waiter comes in wanting to
         wait for more data, it will be woken up when the ring buffer has the
         smaller amount that the previous waiters were waiting for.
      
       - The wake up of all waiters on close is incorrectly called from
         .release() and not from .flush(), so it will never wake up any waiters
         as the .release() will not get called until all .read() calls are
         finished. And the wakeup is for the waiters in those .read() calls.
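
      The precision-limited read discussed in the first item above can be
      sketched in userspace like this. demo_bounded_print() is a
      hypothetical name, and the sketch uses the C library's snprintf()
      rather than the kernel's vsnprintf(), whose precision limit is what
      the bug report ran into.

```c
#include <stdio.h>

/*
 * Copy at most src_len bytes of src into dst as a NUL-terminated
 * string. The "%.*s" precision caps how far snprintf() reads, so src
 * does not need to be NUL-terminated.
 */
int demo_bounded_print(char *dst, size_t dst_len,
		       const char *src, int src_len)
{
	return snprintf(dst, dst_len, "%.*s", src_len, src);
}
```

      The precision protects the reader from an unterminated string, which
      is also why it can mask the underlying bug of storing one.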
      
      * tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing: Use .flush() call to wake up readers
        ring-buffer: Fix resetting of shortest_full
        ring-buffer: Fix waking up ring buffer readers
        tracing: Limit trace_marker writes to just 4K
        tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE
        tracing: Remove precision vsnprintf() check from print event
      fa4b851b
    • Linus Torvalds's avatar
      Merge tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy · 210ee636
      Linus Torvalds authored
      Pull phy fixes from Vinod Koul:
      
       - fixes for Qualcomm qmp-combo driver for ordering of drm and type-c
         switch registration, as drivers may not probe defer after having
         registered child devices to avoid triggering a probe deferral loop.
      
         This fixes internal display on Lenovo ThinkPad X13s
      
      * tag 'phy-fixes3-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
        phy: qcom-qmp-combo: fix type-c switch registration
        phy: qcom-qmp-combo: fix drm bridge registration
      210ee636
    • Steven Rostedt (Google)'s avatar
      tracing: Use .flush() call to wake up readers · e5d7c191
      Steven Rostedt (Google) authored
      The .release() function does not get called until all readers of a file
      descriptor are finished.
      
      If a thread is blocked on reading a file descriptor in ring_buffer_wait(),
      and another thread closes the file descriptor, it will not wake up the
      other thread as ring_buffer_wake_waiters() is called by .release(), and
      that will not get called until the .read() is finished.
      
      The issue originally showed up in trace-cmd, but the readers are actually
      other processes with their own file descriptors. So calling close() would wake
      up the other tasks because they are blocked on another descriptor than the
      one that was closed(). But there are other wake ups that solve that issue.
      
      When a thread is blocked on a read, it can still hang even when another
      thread closed its descriptor.
      
      This is what the .flush() callback is for. Have the .flush() wake up the
      readers.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linke li <lilinke99@qq.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: f3ddb74a ("tracing: Wake up ring buffer waiters on closing of the file")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      e5d7c191
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix resetting of shortest_full · 68282dd9
      Steven Rostedt (Google) authored
      The "shortest_full" variable is used to keep track of the waiter that is
      waiting for the smallest amount on the ring buffer before being woken up.
      When a task waits on the ring buffer, it passes in a "full" value that is
      a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
      100% full buffer.
      
      As all waiters are on the same wait queue, the wake up happens for the
      waiter with the smallest percentage.
      
      The problem is that the shortest_full on the cpu_buffer that stores the
      smallest amount doesn't get reset when all the waiters are woken up. It
      does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace).
      
      This means that tasks may be woken up more often than when they want to
      be. Instead, have the shortest_full field get reset just before waking up
      all the tasks. If the tasks wait again, they will update the shortest_full
      before sleeping.
      
      Also add locking around setting of shortest_full in the poll logic, and
      change "work" to "rbwork" to match the variable name for rb_irq_work
      structures that are used in other places.
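
      The watermark bookkeeping described above can be modelled with a
      small userspace sketch. Everything here (struct demo_buffer and the
      demo_* functions) is hypothetical and omits the locking and wait
      queue that the real code needs; it only shows why the reset matters.

```c
#include <stdbool.h>

/* Hypothetical per-buffer state; only the watermark is modelled. */
struct demo_buffer {
	int shortest_full;	/* 0 means no waiter has set a watermark */
};

/* A waiter registers the percentage (0-100) it wants to be woken at. */
void demo_add_waiter(struct demo_buffer *buf, int full)
{
	if (buf->shortest_full == 0 || full < buf->shortest_full)
		buf->shortest_full = full;
}

/* The writer asks whether the fill level justifies a wakeup. */
bool demo_should_wake(const struct demo_buffer *buf, int percent_filled)
{
	return percent_filled >= buf->shortest_full;
}

/* Reset the watermark just before waking everyone, as the fix does. */
void demo_wake_all(struct demo_buffer *buf)
{
	buf->shortest_full = 0;
	/* ...wake the wait queue here; re-waiting tasks register again. */
}
```

      Without the reset in demo_wake_all(), a later waiter asking for 80%
      would inherit a stale 20% watermark and be woken far too early.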
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linke li <lilinke99@qq.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: 2c2b0a78 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      68282dd9
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 137e0ec0
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "KVM GUEST_MEMFD fixes for 6.8:
      
         - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
           to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not
           writable from userspace, so there would be no way to write to a
           read-only guest_memfd).
      
         - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
           clear that such VMs are purely for development and testing.
      
         - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term
           plan is to support confidential VMs with deterministic private
           memory (SNP and TDX) only in the TDP MMU.
      
         - Fix a bug in a GUEST_MEMFD dirty logging test that caused false
           passes.
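
        The mutual exclusion in the first item above amounts to a flag
        validation check, sketched here under loud assumptions: the
        DEMO_* bit values and demo_memslot_flags_valid() are invented for
        illustration and are not the real KVM uAPI definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the real KVM uAPI values. */
#define DEMO_MEM_READONLY	(1u << 0)
#define DEMO_MEM_GUEST_MEMFD	(1u << 1)

/* Reject any request that sets both mutually exclusive flags. */
bool demo_memslot_flags_valid(uint32_t flags)
{
	const uint32_t both = DEMO_MEM_READONLY | DEMO_MEM_GUEST_MEMFD;

	return (flags & both) != both;
}
```

        Rejecting the combination up front keeps the interface from ever
        promising a read-only guest_memfd that nothing could write to.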
      
        x86 fixes:
      
         - Fix missing marking of a guest page as dirty when emulating an
           atomic access.
      
         - Check for mmu_notifier invalidation events before faulting in the
           pfn, and before acquiring mmu_lock, to avoid unnecessary work and
           lock contention with preemptible kernels (including
           CONFIG_PREEMPT_DYNAMIC in non-preemptible mode).
      
         - Disable AMD DebugSwap by default, it breaks VMSA signing and will
           be re-enabled with a better VM creation API in 6.10.
      
         - Do the cache flush of converted pages in svm_register_enc_region()
           before dropping kvm->lock, to avoid a race with unregistering of
           the same region and the consequent use-after-free issue"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        SEV: disable SEV-ES DebugSwap by default
        KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
        KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
        KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive
        KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases
        KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
        KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
        KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
        KVM: x86: Mark target gfn of emulated atomic instruction as dirty
      137e0ec0
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix waking up ring buffer readers · b3594573
      Steven Rostedt (Google) authored
      A task can wait on a ring buffer for when it fills up to a specific
      watermark. The writer will check the minimum watermark that waiters are
      waiting for and if the ring buffer is past that, it will wake up all the
      waiters.
      
      The waiters are in a wait loop, and will first check if a signal is
      pending and then check if the ring buffer is at the desired level where it
      should break out of the loop.
      
      If a file that uses a ring buffer closes, and there are threads waiting on
      the ring buffer, it needs to wake up those threads. To do this, a
      "wait_index" was used.
      
      Before entering the wait loop, the waiter will read the wait_index. On
      wakeup, it will check if the wait_index is different than when it entered
      the loop, and will exit the loop if it is. The waker will only need to
      update the wait_index before waking up the waiters.
      
      This had a couple of bugs. One trivial one and one broken by design.
      
      The trivial bug was that the waiter checked the wait_index after the
      schedule() call. It had to be checked between the prepare_to_wait() and
      the schedule() which it was not.
      
      The main bug is that the first check to set the default wait_index will
      always be outside the prepare_to_wait() and the schedule(). That's because
      the ring_buffer_wait() doesn't have enough context to know if it should
      break out of the loop.
      
      The loop itself is not needed, because all the callers to the
      ring_buffer_wait() also have their own loop, as the callers have a better
      sense of what the context is to decide whether to break out of the loop
      or not.
      
      Just have the ring_buffer_wait() block once, and if it gets woken up, exit
      the function and let the callers decide what to do next.
      
      Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linke li <lilinke99@qq.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa ("tracing: Do not busy wait in buffer splice")
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      b3594573
  3. 09 Mar, 2024 8 commits
    • Linus Torvalds's avatar
      Merge tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 005f6f34
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two patches from Heiner for the i801 are targeting muxes discovered
        while working on some other features. Essentially, there is a
        reordering when adding optional slaves and proper cleanup upon
        registering a mux device.
      
        Christophe fixes the exit path in the wmt driver that was leaving the
        clocks hanging, and the last fix from Tommy avoids false error reports
        in IRQ"
      
      * tag 'i2c-for-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: aspeed: Fix the dummy irq expected print
        i2c: wmt: Fix an error handling path in wmt_i2c_probe()
        i2c: i801: Avoid potential double call to gpiod_remove_lookup_table
        i2c: i801: Fix using mux_pdev before it's set
      005f6f34
    • Linus Torvalds's avatar
      Merge tag 'firewire-fixes-6.8-final' of... · 66695e7d
      Linus Torvalds authored
      Merge tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
      
      Pull firewire fix from Takashi Sakamoto:
       "A fix to suppress a warning about unreleased IRQ for 1394 OHCI
        hardware when disabling MSI.
      
        In Linux kernel v6.5, a PCI driver for 1394 OHCI hardware was
        optimized into the managed device resources. Edmund Raile points out
        that the change brings the warning about unreleased IRQ at the call of
        pci_disable_msi(), since the API expects that the relevant IRQ has
        already been released in advance.
      
        As long as the API is called in .remove callback of PCI device
        operation, it is prohibited to maintain the IRQ as the part of managed
        device resource. As a workaround, the IRQ is explicitly released at
        .remove callback, before the call of pci_disable_msi().
      
        pci_disable_msi() is a legacy API nowadays in the PCI MSI
        implementation. I plan to replace it with the modern API in
        development for a future version of the Linux kernel, so at present I
        keep it as is"
      
      * tag 'firewire-fixes-6.8-final' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
        firewire: ohci: prevent leak of left-over IRQ on unbind
      66695e7d
    • Paolo Bonzini's avatar
      SEV: disable SEV-ES DebugSwap by default · 5abf6dce
      Paolo Bonzini authored
      The DebugSwap feature of SEV-ES provides a way for confidential guests to use
      data breakpoints.  However, because the status of the DebugSwap feature is
      recorded in the VMSA, enabling it by default invalidates the attestation
      signatures.  In 6.10 we will introduce a new API to create SEV VMs that
      will allow enabling DebugSwap based on what the user tells KVM to do.
      Contextually, we will change the legacy KVM_SEV_ES_INIT API to never
      enable DebugSwap.
      
      For compatibility with kernels that pre-date the introduction of DebugSwap,
      as well as with those where KVM_SEV_ES_INIT will never enable it, do not enable
      the feature by default.  If anybody wants to use it, for now they can enable
      the sev_es_debug_swap_enabled module parameter, but this will result in a
      warning.
      
      Fixes: d1f85fbe ("KVM: SEV: Enable data breakpoints in SEV-ES")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5abf6dce
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD · 39fee313
      Paolo Bonzini authored
      KVM GUEST_MEMFD fixes for 6.8:
      
       - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
         avoid creating ABI that KVM can't sanely support.
      
       - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
         clear that such VMs are purely a development and testing vehicle, and
         come with zero guarantees.
      
       - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
         is to support confidential VMs with deterministic private memory (SNP
         and TDX) only in the TDP MMU.
      
       - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
         when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
      39fee313
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-fixes-6.8-2' of https://github.com/kvm-x86/linux into HEAD · 1b6c146d
      Paolo Bonzini authored
      KVM x86 fixes for 6.8, round 2:
      
       - When emulating an atomic access, mark the gfn as dirty in the memslot
         to fix a bug where KVM could fail to mark the slot as dirty during live
         migration, ultimately resulting in guest data corruption due to a dirty
         page not being re-copied from the source to the target.
      
       - Check for mmu_notifier invalidation events before faulting in the pfn,
         and before acquiring mmu_lock, to avoid unnecessary work and lock
         contention.  Contending mmu_lock is especially problematic on preemptible
         kernels, as KVM may yield mmu_lock in response to the contention, which
         severely degrades overall performance due to vCPUs making it difficult
         for the task that triggered invalidation to make forward progress.
      
         Note, due to another kernel bug, this fix isn't limited to preemptible
         kernels, as any kernel built with CONFIG_PREEMPT_DYNAMIC=y will yield
         contended rwlocks and spinlocks.
      
         https://lore.kernel.org/all/20240110214723.695930-1-seanjc@google.com
      1b6c146d
    • Colin Ian King's avatar
      block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC · 5205a4aa
      Colin Ian King authored
      The helper function mac_fix_string is only required with CONFIG_PPC_PMAC;
      add #if CONFIG_PPC_PMAC and #endif around the function.
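
      The pattern of compiling a static helper only when its sole caller is
      compiled can be sketched as follows. DEMO_CONFIG_FEATURE,
      demo_fix_string() and demo_process_name() are hypothetical names for
      this illustration and do not reproduce the code in
      block/partitions/mac.c.

```c
#include <stddef.h>

/* Hypothetical config switch; build with -DDEMO_CONFIG_FEATURE to enable. */
#ifdef DEMO_CONFIG_FEATURE
/* Only compiled when its sole caller below is compiled too. */
static void demo_fix_string(char *s, size_t len)
{
	/* Trim trailing spaces from a fixed-width field. */
	while (len > 0 && s[len - 1] == ' ')
		s[--len] = '\0';
}
#endif

/* Returns 1 if the name was post-processed, 0 otherwise. */
int demo_process_name(char *name, size_t len)
{
#ifdef DEMO_CONFIG_FEATURE
	demo_fix_string(name, len);
	return 1;
#else
	(void)name;
	(void)len;
	return 0;
#endif
}
```

      With the guard in place, a build without the option never sees an
      unused static function, so -Wunused-function stays quiet.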
      
      Cleans up clang scan build warning:
      block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function]
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5205a4aa
    • Gabriel Krisman Bertazi's avatar
      io_uring: Fix sqpoll utilization check racing with dying sqpoll · 606559dc
      Gabriel Krisman Bertazi authored
      Commit 3fcb9d17 ("io_uring/sqpoll: statistics of the true
      utilization of sq threads"), currently in Jens' for-next branch, peeks at
      io_sq_data->thread to report utilization statistics. But if
      io_uring_show_fdinfo races with sqpoll terminating, even though we hold
      the ctx lock, sqd->thread might be NULL and we hit the Oops below.
      
      Note that we could technically just protect the getrusage() call and the
      sq total/work time calculations.  But showing some sq
      information (pid/cpu) and not other information (utilization) is more
      confusing than not reporting anything, IMO.  So let's hide it all if we
      happen to race with a dying sqpoll.
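
The all-or-nothing choice described above can be sketched in plain user-space C. This is a simplified model, not the kernel code: struct worker, struct sq_data_model, struct sq_report and fill_sq_report are hypothetical names, and the locking that the real code holds around the pointer read is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sqpoll task. */
struct worker {
	int pid;
	int cpu;
	unsigned long long work_time_us;
	unsigned long long total_time_us;
};

/* Hypothetical stand-in for io_sq_data; thread becomes NULL on teardown. */
struct sq_data_model {
	struct worker *thread;
};

struct sq_report {
	int pid;
	int cpu;
	unsigned long long work_time_us;
	unsigned long long total_time_us;
};

/*
 * Fill the report from the worker, or leave every field at its
 * "not available" default when the worker is already gone.
 * In real code the pointer read would happen under the owning lock.
 */
bool fill_sq_report(const struct sq_data_model *sqd, struct sq_report *out)
{
	assert(sqd != NULL && out != NULL);

	out->pid = -1;
	out->cpu = -1;
	out->work_time_us = 0;
	out->total_time_us = 0;

	const struct worker *w = sqd->thread;
	if (w == NULL)
		return false;	/* raced with teardown: report nothing */

	out->pid = w->pid;
	out->cpu = w->cpu;
	out->work_time_us = w->work_time_us;
	out->total_time_us = w->total_time_us;
	return true;
}
```

Checking the pointer once and gating every field on that single check keeps the output consistent: a reader sees either a full set of values or none of them.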
      
      This can be triggered consistently in my vm setup running
      sqpoll-cancel-hang.t in a loop.
      
      BUG: kernel NULL pointer dereference, address: 00000000000007b0
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17-dirty #69
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x1a/0x60
       ? page_fault_oops+0x154/0x440
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? do_user_addr_fault+0x174/0x7c0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? exc_page_fault+0x63/0x140
       ? asm_exc_page_fault+0x22/0x30
       ? getrusage+0x21/0x3e0
       ? seq_printf+0x4e/0x70
       io_uring_show_fdinfo+0x9db/0xa10
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? vsnprintf+0x101/0x4d0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_vprintf+0x34/0x50
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_printf+0x4e/0x70
       ? seq_show+0x16b/0x1d0
       ? __pfx_io_uring_show_fdinfo+0x10/0x10
       seq_show+0x16b/0x1d0
       seq_read_iter+0xd7/0x440
       seq_read+0x102/0x140
       vfs_read+0xae/0x320
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? __do_sys_newfstat+0x35/0x60
       ksys_read+0xa5/0xe0
       do_syscall_64+0x50/0x110
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      RIP: 0033:0x7f786ec1db4d
      Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
      RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d
      RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006
      RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60
      R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628
       </TASK>
      Modules linked in:
      CR2: 00000000000007b0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
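
As a reading aid for the trace above, and not part of the original message: the bytes marked <4c> 8b a7 b0 07 00 00 in the Code: line decode as a 64-bit MOV whose memory operand is RDI plus a 32-bit displacement. With RDI at 0 in the register dump, the computed address matches CR2. The sketch below only handles this one encoding shape (REX prefix, opcode 0x8b, ModRM with mod=10 and no SIB byte); fault_address and read_le32 are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value starting at bytes[0]. */
uint32_t read_le32(const uint8_t *bytes)
{
	return (uint32_t)bytes[0] |
	       ((uint32_t)bytes[1] << 8) |
	       ((uint32_t)bytes[2] << 16) |
	       ((uint32_t)bytes[3] << 24);
}

/*
 * Compute base + disp32 for "REX 8b /r" with mod=10 and no SIB byte.
 * base_reg is the value of the register selected by the ModRM rm field,
 * taken from the register dump by the caller.
 */
uint64_t fault_address(const uint8_t *insn, size_t len, uint64_t base_reg)
{
	assert(len >= 7);
	assert((insn[0] & 0xf0) == 0x40);	/* REX prefix */
	assert(insn[1] == 0x8b);		/* MOV r64, r/m64 */
	assert((insn[2] >> 6) == 2);		/* mod=10: disp32 follows */
	assert((insn[2] & 7) != 4);		/* rm=100 would need a SIB byte */

	/* The displacement is sign-extended to 64 bits. */
	int64_t disp = (int32_t)read_le32(&insn[3]);
	return base_reg + (uint64_t)disp;
}
```

Feeding in the seven bytes from the trace with a base of 0 gives 0x7b0, the same value as CR2, which is consistent with getrusage() being handed a NULL task pointer.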
      
      Fixes: 3fcb9d17 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      606559dc
    • Linus Torvalds's avatar
      Merge tag 'ceph-for-6.8-rc8' of https://github.com/ceph/ceph-client · 09e5c48f
      Linus Torvalds authored
      Pull ceph fix from Ilya Dryomov:
       "A follow-up for sparse read fixes that went into -rc4 -- msgr2 case
        was missed and is corrected here"
      
      * tag 'ceph-for-6.8-rc8' of https://github.com/ceph/ceph-client:
        libceph: init the cursor when preparing sparse read in msgr2
      09e5c48f
  4. 08 Mar, 2024 1 commit
    • Linus Torvalds's avatar
      Merge tag 'char-misc-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 10d48d70
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are a few small char/misc and other driver subsystem fixes for
        reported issues that have been in my tree.
      
        Included in here are fixes for:
      
         - iio driver fixes for reported problems
      
         - much-reported bugfix for a lis3lv02d_i2c regression
      
         - comedi driver bugfix
      
         - mei new device ids
      
         - mei driver fixes
      
         - counter core fix
      
        All of these have been in linux-next with no reported issues, some for
        many weeks"
      
      * tag 'char-misc-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        mei: gsc_proxy: match component when GSC is on different bus
        misc: fastrpc: Pass proper arguments to scm call
        comedi: comedi_test: Prevent timers rescheduling during deletion
        comedi: comedi_8255: Correct error in subdevice initialization
        misc: lis3lv02d_i2c: Fix regulators getting en-/dis-abled twice on suspend/resume
        iio: accel: adxl367: fix I2C FIFO data register
        iio: accel: adxl367: fix DEVID read after reset
        iio: pressure: dlhl60d: Initialize empty DLH bytes
        iio: imu: inv_mpu6050: fix frequency setting when chip is off
        iio: pressure: Fixes BMP38x and BMP390 SPI support
        iio: imu: inv_mpu6050: fix FIFO parsing when empty
        mei: Add Meteor Lake support for IVSC device
        mei: me: add arrow lake point H DID
        mei: me: add arrow lake point S DID
        counter: fix privdata alignment
      10d48d70