1. 12 Mar, 2024 29 commits
  2. 11 Mar, 2024 11 commits
    • Merge tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 86833aec
      Linus Torvalds authored
      Pull x86 entry update from Thomas Gleixner:
       "A single update for the x86 entry code:
      
        The current CR3 handling for kernel page table isolation in the
        paranoid return paths which are relevant for #NMI, #MCE, #VC, #DB and
        #DF is unconditionally writing CR3 with the value retrieved on
        exception entry.
      
        In the vast majority of cases when returning to the kernel this is a
        pointless exercise because CR3 was not modified on exception entry.
        The only situation where this is necessary is when the exception
        interrupts an entry from user before switching to kernel CR3 or
        interrupts an exit to user after switching back to user CR3.
      
        As CR3 writes can be expensive on some systems this becomes measurable
        overhead with high frequency #NMIs such as perf.
      
        Avoid this overhead by checking the CR3 value, which was saved on
        entry, and writing it back to CR3 only when it is a user CR3"
      
      * tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/entry: Avoid redundant CR3 write on paranoid returns
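The check described in the x86 entry update can be modeled in a few lines. This is a minimal Python sketch, not the kernel's assembly. It assumes that a user CR3 differs from the kernel CR3 by a single flag bit (bit 12 here, which is an assumption of this sketch), and `FakeCpu`, `paranoid_entry` and `paranoid_exit` are hypothetical names used only for illustration.

```python
# Assumption: user page tables are flagged by one bit in the CR3 value.
USER_PGTABLE_BIT = 12
USER_PGTABLE_MASK = 1 << USER_PGTABLE_BIT


class FakeCpu:
    """Hypothetical stand-in for a CPU that counts CR3 writes."""

    def __init__(self, cr3):
        self.cr3 = cr3
        self.cr3_writes = 0

    def write_cr3(self, value):
        self.cr3 = value
        self.cr3_writes += 1


def is_user_cr3(cr3):
    return bool(cr3 & USER_PGTABLE_MASK)


def paranoid_entry(cpu):
    """Save CR3 and switch to the kernel CR3 only if a user CR3 is active."""
    saved = cpu.cr3
    if is_user_cr3(saved):
        cpu.write_cr3(saved & ~USER_PGTABLE_MASK)
    return saved


def paranoid_exit(cpu, saved_cr3):
    """Write CR3 back only when the value saved on entry is a user CR3."""
    if is_user_cr3(saved_cr3):
        cpu.write_cr3(saved_cr3)
```

In this model an exception that interrupts kernel code performs no CR3 write at all, while one that arrives with a user CR3 still active performs one write on entry and one on exit.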
    • Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 720c8579
      Linus Torvalds authored
      Pull x86 FRED support from Thomas Gleixner:
       "Support for x86 Flexible Return and Event Delivery (FRED).
      
        FRED is a replacement for IDT event delivery on x86 and addresses most
        of the technical nightmares which IDT exposes:
      
         1) Exception cause registers like CR2 need to be manually preserved
            in nested exception scenarios.
      
         2) Hardware interrupt stack switching is suboptimal for nested
            exceptions as the interrupt stack mechanism rewinds the stack on
            each entry which requires a massive effort in the low level entry
            of #NMI code to handle this.
      
         3) No hardware distinction between entry from kernel or from user
            which makes establishing kernel context more complex than it needs
            to be especially for unconditionally nestable exceptions like NMI.
      
         4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
            is a problem when the perf NMI takes a fault when collecting a
            stack trace.
      
         5) Partial restore of ESP when returning to a 16-bit segment
      
         6) Limitation of the vector space which can cause vector exhaustion
            on large systems.
      
         7) Inability to differentiate NMI sources
      
        FRED addresses these shortcomings by:
      
         1) An extended exception stack frame which the CPU uses to save
            exception cause registers. This ensures that the meta information
            for each exception is preserved on stack and avoids the extra
            complexity of preserving it in software.
      
         2) Hardware interrupt stack switching is non-rewinding if a nested
            exception uses the current interrupt stack.
      
         3) The entry points for kernel and user context are separate and GS
            BASE handling which is required to establish kernel context for
            per CPU variable access is done in hardware.
      
         4) NMIs are now nesting protected. They are only reenabled on the
            return from NMI.
      
         5) FRED guarantees full restore of ESP
      
         6) FRED does not put a limitation on the vector space by design
            because it uses central entry points for kernel and user space
            and the CPU stores the entry type (exception, trap, interrupt,
            syscall) on the entry stack along with the vector number. The
            entry code has to demultiplex this information, but this removes
            the vector space restriction.
      
            The first hardware implementations will still have the current
            restricted vector space because lifting this limitation requires
            further changes to the local APIC.
      
         7) FRED stores the vector number and meta information on stack which
            allows having more than one NMI vector in future hardware when the
            required local APIC changes are in place.
      
        The series implements the initial FRED support by:
      
         - Reworking the existing entry and IDT handling infrastructure to
           accommodate the alternative entry mechanism.
      
         - Expanding the stack frame to accommodate the extra 16 bytes FRED
           requires to store context and meta information
      
         - Providing FRED specific C entry points for events which have
           information pushed to the extended stack frame, e.g. #PF and #DB.
      
         - Providing FRED specific C entry points for #NMI and #MCE
      
         - Implementing the FRED specific ASM entry points and the C code to
           demultiplex the events
      
         - Providing detection and initialization mechanisms and the necessary
           tweaks in context switching, GS BASE handling etc.
      
        The FRED integration aims for maximum code reuse vs the existing IDT
        implementation to the extent possible and the deviations in hot paths
        like context switching are handled with alternatives to minimize the
        impact. The low level entry and exit paths are separate due to the
        extended stack frame and the hardware based GS BASE switching and
        therefore have no impact on IDT based systems.
      
        It has been extensively tested on existing systems and on the FRED
        simulation and as of now there are no outstanding problems"
      
      * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
        x86/fred: Fix init_task thread stack pointer initialization
        MAINTAINERS: Add a maintainer entry for FRED
        x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
        x86/fred: Invoke FRED initialization code to enable FRED
        x86/fred: Add FRED initialization functions
        x86/syscall: Split IDT syscall setup code into idt_syscall_init()
        KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
        x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
        x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
        x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
        x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
        x86/traps: Add sysvec_install() to install a system interrupt handler
        x86/fred: FRED entry/exit and dispatch code
        x86/fred: Add a machine check entry stub for FRED
        x86/fred: Add a NMI entry stub for FRED
        x86/fred: Add a debug fault entry stub for FRED
        x86/idtentry: Incorporate definitions/declarations of the FRED entries
        x86/fred: Make exc_page_fault() work for FRED
        x86/fred: Allow single-step trap and NMI when starting a new task
        x86/fred: No ESPFIX needed when FRED is enabled
        ...
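The demultiplexing step from item 6 of the FRED message can be pictured with a small model. This is a hedged Python sketch, not the kernel's dispatch code: the `EventType` values, the `HANDLERS` table and `dispatch` are hypothetical, and the numeric type values are arbitrary rather than the hardware encoding. The vector 300 entry is purely illustrative, since the message notes that the first hardware implementations keep the current restricted vector space.

```python
from enum import Enum


class EventType(Enum):
    # Names follow the entry types listed in the message; the numeric
    # values are arbitrary and not the hardware encoding.
    EXCEPTION = 0
    TRAP = 1
    INTERRUPT = 2
    SYSCALL = 3


# Hypothetical handler table keyed by (entry type, vector number).
HANDLERS = {
    (EventType.EXCEPTION, 14): lambda vector: "page fault",
    (EventType.INTERRUPT, 32): lambda vector: f"interrupt {vector}",
    (EventType.INTERRUPT, 300): lambda vector: f"interrupt {vector}",
}


def dispatch(event_type, vector, handlers):
    """Route one event using the type and vector read from the stack frame."""
    handler = handlers.get((event_type, vector))
    if handler is None:
        return f"unhandled {event_type.name.lower()} vector {vector}"
    return handler(vector)
```

Because the type and vector arrive as data rather than selecting one of a fixed number of table slots, nothing in this lookup limits the vector number to a fixed range.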
    • Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ca7e9177
      Linus Torvalds authored
      Pull x86 APIC updates from Thomas Gleixner:
       "Rework of APIC enumeration and topology evaluation.
      
        The current implementation has a couple of shortcomings:
      
         - It fails to handle hybrid systems correctly.
      
         - The APIC registration code which handles CPU number assignments is
           in the middle of the APIC code and detached from the topology
           evaluation.
      
         - The various mechanisms which enumerate APICs, ACPI, MPPARSE and
           guest specific ones, tweak global variables as they see fit or in
           case of XENPV just hack around the generic mechanisms completely.
      
         - The CPUID topology evaluation code is sprinkled all over the vendor
           code and reevaluates global variables on every hotplug operation.
      
         - There is no way to analyze topology on the boot CPU before bringing
           up the APs. This causes problems for infrastructure like PERF which
           needs to size certain aspects upfront or could be simplified if
           that would be possible.
      
         - The APIC admission and CPU number association logic is
           incomprehensible and overly complex and needs to be kept around
           after boot instead of completing this right after the APIC
           enumeration.
      
        This update addresses these shortcomings with the following changes:
      
         - Rework the CPUID evaluation code so it is common for all vendors
           and provides information about the APIC ID segments in a uniform
           way independent of the number of segments (Thread, Core, Module,
           ..., Die, Package) so that this information can be computed instead
           of rewriting global variables of dubious value over and over.
      
         - A few cleanups and simplifications of the APIC, IO/APIC and related
           interfaces to prepare for the topology evaluation changes.
      
         - Separation of the parser stages so the early evaluation which tries
           to find the APIC address can be separately overridden from the late
           evaluation which enumerates and registers the local APIC as further
           preparation for sanitizing the topology evaluation.
      
         - A new registration and admission logic which
      
             - encapsulates the inner workings so that parsers and guest logic
                can no longer fiddle with it
      
             - uses the APIC ID segments to build topology bitmaps at
               registration time
      
             - provides a sane admission logic
      
             - allows to detect the crash kernel case, where CPU0 does not run
               on the real BSP, automatically. This is required to prevent
               sending INIT/SIPI sequences to the real BSP which would reset
               the whole machine. This was so far handled by a tedious command
               line parameter, which does not even work in nested crash
               scenarios.
      
              - associates CPU numbers after the enumeration has completed and
               prevents the late registration of APICs, which was somehow
               tolerated before.
      
         - Converting all parsers and guest enumeration mechanisms over to the
           new interfaces.
      
           This allows to get rid of all global variable tweaking from the
           parsers and enumeration mechanisms and sanitizes the XEN[PV]
           handling so it can use CPUID evaluation for the first time.
      
         - Mopping up existing sins by taking the information from the APIC ID
           segment bitmaps.
      
           This evaluates hybrid systems correctly on the boot CPU and allows
           for cleanups and fixes in the related drivers, e.g. PERF.
      
        The series has been extensively tested and the minimal late fallout
        due to a broken ACPI/MADT table has been addressed by tightening the
        admission logic further"
      
      * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
        x86/topology: Ignore non-present APIC IDs in a present package
        x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
        smp: Provide 'setup_max_cpus' definition on UP too
        smp: Avoid 'setup_max_cpus' namespace collision/shadowing
        x86/bugs: Use fixed addressing for VERW operand
        x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
        x86/cpu/topology: Provide __num_[cores|threads]_per_package
        x86/cpu/topology: Rename topology_max_die_per_package()
        x86/cpu/topology: Rename smp_num_siblings
        x86/cpu/topology: Retrieve cores per package from topology bitmaps
        x86/cpu/topology: Use topology logical mapping mechanism
        x86/cpu/topology: Provide logical pkg/die mapping
        x86/cpu/topology: Simplify cpu_mark_primary_thread()
        x86/cpu/topology: Mop up primary thread mask handling
        x86/cpu/topology: Use topology bitmaps for sizing
        x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
        x86/xen/smp_pv: Count number of vCPUs early
        x86/cpu/topology: Assign hotpluggable CPUIDs during init
        x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
        x86/topology: Add a mechanism to track topology via APIC IDs
        ...
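The idea of computing topology information from APIC ID segments, instead of keeping it in global variables, can be sketched as follows. This is a minimal Python model under stated assumptions: each level occupies a contiguous bit range of the APIC ID, the widths are supplied by the caller (in a real system they would come from CPUID enumeration), and `split_apic_id` and the `widths` mapping are hypothetical names.

```python
# Levels ordered from the lowest APIC ID bits to the highest.
LEVELS = ("thread", "core", "module", "die", "package")


def split_apic_id(apic_id, widths):
    """Split an APIC ID into per-level IDs.

    widths maps a level name to the number of bits its segment occupies.
    Levels missing from widths are treated as zero bits wide. The package
    segment takes all remaining high bits.
    """
    ids = {}
    shift = 0
    for level in LEVELS[:-1]:
        width = widths.get(level, 0)
        ids[level] = (apic_id >> shift) & ((1 << width) - 1)
        shift += width
    ids["package"] = apic_id >> shift
    return ids
```

For example, with one thread bit and three core bits, `split_apic_id(0b101011, {"thread": 1, "core": 3})` yields thread 1, core 5 and package 2, with module and die both 0.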
    • Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d08c407f
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "A large set of updates and features for timers and timekeeping:
      
         - The hierarchical timer pull model
      
           When timer wheel timers are armed they are placed into the timer
           wheel of a CPU which is likely to be busy at the time of expiry.
           This is done to avoid wakeups on potentially idle CPUs.
      
           This is wrong in several aspects:
      
             1) The heuristics to select the target CPU are wrong by
                definition as the chance to get the prediction right is
                close to zero.
      
             2) Due to #1 it is possible that timers are accumulated on
                a single target CPU
      
             3) The required computation in the enqueue path is just overhead
                for dubious value especially under the consideration that the
                vast majority of timer wheel timers are either canceled or
                rearmed before they expire.
      
           The timer pull model avoids the above by removing the target
           computation on enqueue and queueing timers always on the CPU on
           which they get armed.
      
           This is achieved by having separate wheels for CPU pinned timers
           and global timers which do not care about where they expire.
      
           As long as a CPU is busy it handles both the pinned and the global
           timers which are queued on the CPU local timer wheels.
      
           When a CPU goes idle it evaluates its own timer wheels:
      
             - If the first expiring timer is a pinned timer, then the global
               timers can be ignored as the CPU will wake up before they
               expire.
      
             - If the first expiring timer is a global timer, then the expiry
               time is propagated into the timer pull hierarchy and the CPU
               makes sure to wake up for the first pinned timer.
      
           The timer pull hierarchy organizes CPUs in groups of eight at the
           lowest level and at the next levels groups of eight groups up to
           the point where no further aggregation of groups is required, i.e.
           the number of levels is log8(NR_CPUS). The magic number of eight
           has been established by experimentation, but can be adjusted if
           needed.
      
           In each group one busy CPU acts as the migrator. It's only one CPU
           to avoid lock contention on remote timer wheels.
      
           The migrator CPU checks in its own timer wheel handling whether
           there are other CPUs in the group which have gone idle and have
           global timers to expire. If there are global timers to expire, the
           migrator locks the remote CPU timer wheel and handles the expiry.
      
           Depending on the group level in the hierarchy this handling can
           require to walk the hierarchy downwards to the CPU level.
      
           Special care is taken when the last CPU goes idle. At this point
           the CPU is the systemwide migrator at the top of the hierarchy and
           it therefore cannot delegate to the hierarchy. It needs to arm its
           own timer device to expire either at the first expiring timer in
           the hierarchy or at the first CPU local timer, whichever expires
           first.
      
           This completely removes the overhead from the enqueue path, which
           is e.g. for networking a true hotpath and trades it for a slightly
           more complex idle path.
      
           This has been in development for a couple of years and the final
           series has been extensively tested by various teams from silicon
           vendors and ran through extensive CI.
      
           There have been slight performance improvements observed on network
           centric workloads and an Intel team confirmed that this allows them
           to power down a die completely on a multi-die socket for the first
           time in a mostly idle scenario.
      
           There is only one outstanding ~1.5% regression on a specific
           overloaded netperf test which is currently being investigated, but the
           rest is either positive or neutral performance wise and positive on
           the power management side.
      
         - Fixes for the timekeeping interpolation code for cross-timestamps:
      
           cross-timestamps are used for PTP to get snapshots from hardware
           timers and interpolate them back to clock MONOTONIC. The changes
           address a few corner cases in the interpolation code which got the
           math and logic wrong.
      
         - Simplification of the clocksource watchdog retry logic to
           automatically adjust to handle larger systems correctly instead of
           having more incomprehensible command line parameters.
      
         - Treewide consolidation of the VDSO data structures.
      
         - The usual small improvements and cleanups all over the place"
      
      * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
        timer/migration: Fix quick check reporting late expiry
        tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
        vdso/datapage: Quick fix - use asm/page-def.h for ARM64
        timers: Assert no next dyntick timer look-up while CPU is offline
        tick: Assume timekeeping is correctly handed over upon last offline idle call
        tick: Shut down low-res tick from dying CPU
        tick: Split nohz and highres features from nohz_mode
        tick: Move individual bit features to debuggable mask accesses
        tick: Move got_idle_tick away from common flags
        tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
        tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
        tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
        tick: Start centralizing tick related CPU hotplug operations
        tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
        tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
        tick: Use IS_ENABLED() whenever possible
        tick/sched: Remove useless oneshot ifdeffery
        tick/nohz: Remove duplicate between lowres and highres handlers
        tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
        hrtimer: Select housekeeping CPU during migration
        ...
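Two pieces of the timer pull model lend themselves to a small worked example: the number of hierarchy levels for groups of eight, and the decision a CPU makes when it goes idle. The Python sketch below is a simplified model, not the kernel implementation. `hierarchy_levels` and `on_idle` are hypothetical names, expiry times are plain integers, and `None` stands for "no timer queued".

```python
GROUP_SIZE = 8  # the group size named in the message


def hierarchy_levels(nr_cpus, group_size=GROUP_SIZE):
    """Count the levels needed until a single top-level group remains."""
    levels = 0
    units = nr_cpus
    while True:
        units = -(-units // group_size)  # ceiling division
        levels += 1
        if units <= 1:
            return levels


def on_idle(next_pinned, next_global):
    """Return (local wakeup time, global expiry to propagate upward).

    If the pinned timer expires first (or there is no global timer), the
    global timers can be ignored because the CPU wakes up before them.
    Otherwise the global expiry is handed to the hierarchy and the CPU
    only arms its own wakeup for the first pinned timer.
    """
    if next_global is None:
        return next_pinned, None
    if next_pinned is not None and next_pinned <= next_global:
        return next_pinned, None
    return next_pinned, next_global
```

With these definitions 8 CPUs need one level, 64 CPUs need two and 512 CPUs need three, which matches the log8(NR_CPUS) figure quoted in the message.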
    • Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 80a76c60
      Linus Torvalds authored
      Pull clocksource updates from Thomas Gleixner:
       "Updates for timekeeping and PTP core.
      
        The cross-timestamp mechanism which allows to correlate hardware
        clocks uses clocksource pointers for describing the correlation.
      
        That's suboptimal as drivers need to obtain the pointer, which
        requires needless exports and exposing internals. This can all be
        completely avoided by assigning clocksource IDs and using them for
        describing the correlated clock source.
      
        So this adds clocksource IDs to all clocksources in the tree which can
        be exposed to this mechanism and removes the pointer and now needless
        exports.
      
        A related improvement for the core and the correlation handling has
        not made it this time, but is expected to get ready for the next
        round"
      
      * tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        kvmclock: Unexport kvmclock clocksource
        treewide: Remove system_counterval_t.cs, which is never read
        timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
        ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
        x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
        x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
        timekeeping: Add clocksource ID to struct system_counterval_t
        x86/tsc: Correct kernel-doc notation
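The switch from clocksource pointers to clocksource IDs can be illustrated with a short model. This Python sketch only shows the shape of the idea: a driver tags its counter snapshot with a small ID value and the core compares IDs, so the driver never needs a reference to the clocksource object itself. The `cs_id` field name comes from the commit titles above. The `ClocksourceId` members, `SystemCounterval` and `snapshot_usable` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ClocksourceId(Enum):
    # Hypothetical ID values for illustration only.
    GENERIC = auto()
    TSC = auto()
    KVM_CLOCK = auto()


@dataclass(frozen=True)
class SystemCounterval:
    """Counter snapshot tagged with the ID of the clocksource it came from."""
    cycles: int
    cs_id: ClocksourceId


def snapshot_usable(snapshot, current_cs_id):
    """A snapshot can be correlated only if it matches the active clocksource."""
    return snapshot.cs_id == current_cs_id
```

A driver in this model needs only the enum value, which is why the pointer and the exports that existed solely to obtain it become unnecessary.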
    • Merge tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 397935e3
      Linus Torvalds authored
      Pull cpu core updates from Thomas Gleixner:
       "A small boring set of cleanups for the SMP and CPU hotplug code"
      
      * tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu: Remove stray semicolon
        smp: Make __smp_processor_id() 0-argument macro
        cpu: Mark cpu_possible_mask as __ro_after_init
        kernel/cpu: Convert snprintf() to sysfs_emit()
        cpu/hotplug: Delete an extraneous kernel-doc description
    • Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4527e837
      Linus Torvalds authored
      Pull MSI updates from Thomas Gleixner:
       "Updates for the MSI interrupt subsystem and initial RISC-V MSI
        support.
      
        The core changes have been adopted from previous work which converted
        ARM[64] to the new per device MSI domain model, which was merged to
        support multiple MSI domains per device. The ARM[64] changes are being
        worked on too, but are not ready yet. The core and platform-MSI
        changes have been split out to not hold up RISC-V and to keep RISC-V
        from building on interfaces that are scheduled for removal.
      
        The core support provides new interfaces to handle wire to MSI bridges
        in a straightforward way and introduces new platform-MSI interfaces
        which are built on top of the per device MSI domain model.
      
        Once ARM[64] is converted over the old platform-MSI interfaces and the
        related ugliness in the MSI core code will be removed.
      
        The actual MSI parts for RISC-V were finalized late and have been
        postponed to the next merge window.
      
        Drivers:
      
         - Add a new driver for the Andes hart-level interrupt controller
      
         - Rework the SiFive PLIC driver to prepare for MSI support
      
         - Expand the RISC-V INTC driver to support the new RISC-V AIA
           controller which provides the basis for MSI on RISC-V
      
         - A few fixups for the fallout of the core changes"
      
      * tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
        irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
        x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
        genirq/matrix: Dynamic bitmap allocation
        irqchip/riscv-intc: Add support for RISC-V AIA
        irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
        irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
        irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
        irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
        irqchip/sifive-plic: Use devm_xyz() for managed allocation
        irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
        irqchip/sifive-plic: Convert PLIC driver into a platform driver
        irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
        irqchip/riscv-intc: Allow large non-standard interrupt number
        genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
        irqchip/imx-intmux: Handle pure domain searches correctly
        genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
        genirq/irqdomain: Reroute device MSI create_mapping
        genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
        genirq/msi: Optionally use dev->fwnode for device domain
        genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
        ...
    • Merge tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 02d4df78
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Core:
      
         - Make affinity changes take effect immediately for interrupt
           threads. This reduces the impact on isolated CPUs as it pulls over
           the thread right away instead of doing it after the next hardware
           interrupt has arrived.
      
         - Cleanup and improvements for the interrupt chip simulator
      
         - Deduplication of the interrupt descriptor initialization code so
           the sparse and non-sparse mode share more code.
      
        Drivers:
      
         - A set of conversions to platform_driver::remove_new() which gets
           rid of the pointless return value.
      
         - A new driver for the Starfive JH8100 SoC
      
         - Support for Amlogic-T7 SoCs
      
         - Improvement for the interrupt handling and EOI management for the
           loongson interrupt controller.
      
         - The usual fixes and improvements all over the place"
      
      * tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
        irqchip/ts4800: Convert to platform_driver::remove_new() callback
        irqchip/stm32-exti: Convert to platform_driver::remove_new() callback
        irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback
        irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback
        irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback
        irqchip/pruss-intc: Convert to platform_driver::remove_new() callback
        irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback
        irqchip/madera: Convert to platform_driver::remove_new() callback
        irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback
        irqchip/keystone: Convert to platform_driver::remove_new() callback
        irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback
        irqchip/imx-intmux: Convert to platform_driver::remove_new() callback
        irqchip/imgpdc: Convert to platform_driver::remove_new() callback
        irqchip: Add StarFive external interrupt controller
        dt-bindings: interrupt-controller: Add starfive,jh8100-intc
        arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs
        irqchip/meson-gpio: Add support for Amlogic-T7 SoCs
        dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs
        irqchip/vic: Fix a kernel-doc warning
        genirq: Wake interrupt threads immediately when changing affinity
        ...
    • Merge tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 045395d8
      Linus Torvalds authored
      Pull cgroup updates from Tejun Heo:
       "A quiet cycle. One trivial doc update patch. Two patches to drop the
        now defunct memory_spread_slab feature from cgroup1 cpuset"
      
      * tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Mark memory_spread_slab as obsolete
        cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
        docs: cgroup-v1: add missing code-block tags
    • Merge tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · 1a1e0989
      Linus Torvalds authored
      Pull workqueue BH conversions from Tejun Heo:
       "This contains two patches that convert tasklet users to BH workqueues:
        backtracetest and usb hcd.
      
        DM conversions are being routed through the respective subsystem tree.
        Hopefully, the next cycle will see a lot more conversions"
      
      * tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        usb: core: hcd: Convert from tasklet to BH workqueue
        backtracetest: Convert from tasklet to BH workqueue
    • Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · ff887eb0
      Linus Torvalds authored
      Pull workqueue updates from Tejun Heo:
       "This cycle brings a lot of workqueue changes, including some that are
        significant and invasive.
      
         - During v6.6 cycle, unbound workqueues were updated so that they are
           more topology aware and flexible, which among other things improved
           workqueue behavior on modern multi-L3 CPUs. In the process, commit
           636b927e ("workqueue: Make unbound workqueues to use per-cpu
           pool_workqueues") switched unbound workqueues to use per-CPU
           frontend pool_workqueues as a part of increasing front-back mapping
           flexibility.
      
           An unwelcome side effect of this change was that it made max
           concurrency enforcement per-CPU, blowing up the maximum number of
           allowed concurrent executions. I incorrectly assumed that this
           wouldn't cause practical problems as most unbound workqueue users
           self-regulate max concurrency; however, there definitely are users
           which don't (e.g. on IO paths) and the drastic increase in the
           allowed max concurrency led to noticeable perf regressions in some
           use cases.
      
           This is now addressed by moving max concurrency enforcement into
           a separate struct - wq_node_nr_active - which makes @max_active
           consistently mean system-wide max concurrency regardless of the
           number of CPUs or (finally) NUMA nodes. This is rather invasive
           and, in places, a bit clunky; however, the clunkiness arises from
           the inherent requirement to handle the disagreement between the
           execution locality domain and max concurrency enforcement domain on
           some modern machines.
      
           See commit 5797b1c1 ("workqueue: Implement system-wide
           nr_active enforcement for unbound workqueues") for more details.
      
         - BH workqueue support is added.
      
           They are similar to per-CPU workqueues but execute work items in
           the softirq context. This is expected to replace tasklet. However,
           currently, it's missing the ability to disable and enable work
           items which is needed to convert many tasklet users. To avoid
           crowding this merge window too much, this will be included in the
           next merge window. A separate pull request will be sent for the
           couple of conversion patches that are currently pending.
      
         - Waiman plugged a long-standing hole in workqueue CPU isolation
           where ordered workqueues didn't follow wq_unbound_cpumask updates.
           Ordered workqueues now follow the same rules as other unbound
           workqueues.
      
         - More CPU isolation improvements: Juri fixed another deficit in
           workqueue isolation where unbound rescuers don't respect
           wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on
           isolated CPUs.
      
         - Other misc changes"
      
      * tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits)
        workqueue: Drain BH work items on hot-unplugged CPUs
        workqueue: Introduce from_work() helper for cleaner callback declarations
        workqueue: Control intensive warning threshold through cmdline
        workqueue: Make @flags handling consistent across set_work_data() and friends
        workqueue: Remove clear_work_data()
        workqueue: Factor out work_grab_pending() from __cancel_work_sync()
        workqueue: Clean up enum work_bits and related constants
        workqueue: Introduce work_cancel_flags
        workqueue: Use variable name irq_flags for saving local irq flags
        workqueue: Reorganize flush and cancel[_sync] functions
        workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
        workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
        workqueue: Cosmetic changes
        workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
        workqueue: Fix queue_work_on() with BH workqueues
        async: Use a dedicated unbound workqueue with raised min_active
        workqueue: Implement workqueue_set_min_active()
        workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
        workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
        kernel/workqueue: Let rescuers follow unbound wq cpumask changes
        ...
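The difference between per-CPU and system-wide max concurrency enforcement described in the workqueue message comes down to simple arithmetic. The Python sketch below is a toy model, not the wq_node_nr_active implementation: `admitted_per_cpu` and `admitted_system_wide` are hypothetical names, and each list entry is the number of work items one CPU tries to run at once.

```python
def admitted_per_cpu(requests_per_cpu, max_active):
    """Per-CPU enforcement: every CPU admits up to max_active items."""
    return sum(min(n, max_active) for n in requests_per_cpu)


def admitted_system_wide(requests_per_cpu, max_active):
    """System-wide enforcement: all CPUs share one budget of max_active."""
    return min(sum(requests_per_cpu), max_active)
```

With 64 CPUs that each queue 20 items and a limit of 16, the per-CPU model admits 64 * 16 = 1024 concurrent items, while the system-wide model admits 16. That gap is the kind of blow-up in allowed concurrency that the message describes.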