1. 24 Apr, 2024 5 commits
  2. 17 Apr, 2024 5 commits
  3. 29 Mar, 2024 1 commit
    • sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded... · 4475cd8b
      Ingo Molnar authored
      sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
      
      The SG_OVERLOADED and SG_OVERUTILIZED flags, plus the sg_status bitmask,
      are an unnecessary complication that only makes the code harder to read
      and slower.
      
      We only ever set them separately:
      
       thule:~/tip> git grep SG_OVER kernel/sched/
       kernel/sched/fair.c:            set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
       kernel/sched/fair.c:                    *sg_status |= SG_OVERLOADED;
       kernel/sched/fair.c:                    *sg_status |= SG_OVERUTILIZED;
       kernel/sched/fair.c:                            *sg_status |= SG_OVERLOADED;
       kernel/sched/fair.c:            set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
       kernel/sched/fair.c:                                       sg_status & SG_OVERUTILIZED);
       kernel/sched/fair.c:    } else if (sg_status & SG_OVERUTILIZED) {
       kernel/sched/fair.c:            set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
       kernel/sched/sched.h:#define SG_OVERLOADED              0x1 /* More than one runnable task on a CPU. */
       kernel/sched/sched.h:#define SG_OVERUTILIZED            0x2 /* One or more CPUs are over-utilized. */
       kernel/sched/sched.h:           set_rd_overloaded(rq->rd, SG_OVERLOADED);
      
      And use them separately, which results in suboptimal code:
      
                      /* update overload indicator if we are at root domain */
                      set_rd_overloaded(env->dst_rq->rd, sg_status & SG_OVERLOADED);
      
                      /* Update over-utilization (tipping point, U >= 0) indicator */
                      set_rd_overutilized_status(env->dst_rq->rd,
      
      Introduce separate sg_overloaded and sg_overutilized flags in update_sd_lb_stats()
      and its lower level functions, and change all of them to 'bool'.
      
      Remove the now unused SG_OVERLOADED and SG_OVERUTILIZED flags.
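
      As a rough userspace illustration of the before/after shape of this
      change (a sketch, not the kernel code): struct root_domain_model and the
      helpers stats_bitmask() / stats_bools() are hypothetical names made up
      for this example. Only the two SG_* values are taken from the grep
      output above.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as shown in the grep output above. */
#define SG_OVERLOADED   0x1 /* More than one runnable task on a CPU. */
#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */

/* Hypothetical stand-in for the root-domain state (not the kernel struct). */
struct root_domain_model {
	bool overloaded;
	bool overutilized;
};

/* Before: both conditions are packed into one mask, then unpacked again. */
void stats_bitmask(struct root_domain_model *rd,
		   unsigned int nr_running, bool cpu_overutilized)
{
	int sg_status = 0;

	if (nr_running > 1)
		sg_status |= SG_OVERLOADED;
	if (cpu_overutilized)
		sg_status |= SG_OVERUTILIZED;

	rd->overloaded = sg_status & SG_OVERLOADED;
	rd->overutilized = sg_status & SG_OVERUTILIZED;
}

/* After: two independent bools, no masking on either side. */
void stats_bools(struct root_domain_model *rd,
		 unsigned int nr_running, bool cpu_overutilized)
{
	bool sg_overloaded = nr_running > 1;
	bool sg_overutilized = cpu_overutilized;

	rd->overloaded = sg_overloaded;
	rd->overutilized = sg_overutilized;
}
```

      Both helpers leave the model in the same state for the same inputs; the
      second one simply never builds or tests a mask.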
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
      Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/ZgVPhODZ8/nbsqbP@gmail.com
      4475cd8b
  4. 28 Mar, 2024 7 commits
  5. 26 Mar, 2024 3 commits
  6. 25 Mar, 2024 5 commits
    • sched/fair: Don't double balance_interval for migrate_misfit · 58eeb2d7
      Qais Yousef authored
      A misfit migration is not necessarily an indication that the system is
      busy and that the load balancer needs to back off. But pushing
      balance_interval high could mean generally delaying other misfit
      activities or other types of imbalance.
      
      Also don't pollute nr_balance_failed because of misfit failures. The
      value is used for enabling cache hot migration and in migrate_util/load
      types, none of which should be impacted (skewed) by misfit failures.
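
      A simplified model of the accounting described above, assuming a single
      helper that runs after a balance attempt moved nothing. The names
      sd_model, migration_type_model and account_failed_balance() are
      hypothetical; the kernel spreads this logic over several places in
      load_balance().

```c
#include <assert.h>

/* Hypothetical model of the migration types named in the commit text. */
enum migration_type_model {
	MIGRATE_LOAD,
	MIGRATE_UTIL,
	MIGRATE_TASK,
	MIGRATE_MISFIT,
};

/* Hypothetical subset of the per-domain balancing state. */
struct sd_model {
	unsigned long balance_interval;
	unsigned long max_interval;
	unsigned int nr_balance_failed;
};

/*
 * Account for a balance attempt that failed to move anything.
 * Misfit failures neither bump the failure counter nor back off the
 * interval, since they say nothing about how busy the system is.
 */
void account_failed_balance(struct sd_model *sd,
			    enum migration_type_model type)
{
	if (type == MIGRATE_MISFIT)
		return;

	sd->nr_balance_failed++;

	if (sd->balance_interval < sd->max_interval)
		sd->balance_interval *= 2;
}
```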
      Signed-off-by: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240324004552.999936-5-qyousef@layalina.io
      58eeb2d7
    • sched/topology: Remove root_domain::max_cpu_capacity · fa427e8e
      Qais Yousef authored
      The value is no longer used as we now keep track of max_allowed_capacity
      for each task instead.
      Signed-off-by: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io
      fa427e8e
    • sched/fair: Check if a task has a fitting CPU when updating misfit · 22d56074
      Qais Yousef authored
      If a misfit task is affined to a subset of the possible CPUs, we need to
      verify that one of these CPUs can fit it. Otherwise the load balancer
      code will keep triggering needlessly, which in turn makes balance_interval
      grow, and we eventually end up in a situation where real imbalances take
      a long time to address because of this impossible imbalance.
      
      This can happen in the Android world, where it's common for background
      tasks to be restricted to little cores.
      
      Similarly, if the task can't fit the biggest core, triggering misfit is
      pointless as it is the best we can ever get on this system.
      
      To be able to detect that, we use asym_cap_list to iterate through the
      capacities in the system to see if the task is able to run at a higher
      capacity level based on its p->cpus_ptr. We do that when the affinity
      changes, when a fair task is forked, or when a task switches to the fair
      policy. We store the max_allowed_capacity in task_struct to allow for
      cheap comparison in the fast path.
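
      The lookup described above can be sketched in plain C as follows. This
      is an illustration under simplifying assumptions, not the kernel
      implementation: CPU sets are modelled as an unsigned long bitmask, and
      cap_level_model, max_allowed_capacity() and misfit_is_actionable() are
      hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one capacity level and the CPUs that provide it. */
struct cap_level_model {
	unsigned long capacity;
	unsigned long cpus;	/* bitmask, bit N set means CPU N */
};

/*
 * Return the highest capacity the task may run at, given its affinity mask.
 * `levels` must be sorted by capacity in descending order, so the first
 * level that intersects the mask is the answer. Returns 0 if none does.
 */
unsigned long max_allowed_capacity(const struct cap_level_model *levels,
				   size_t nr_levels,
				   unsigned long allowed_cpus)
{
	for (size_t i = 0; i < nr_levels; i++) {
		if (levels[i].cpus & allowed_cpus)
			return levels[i].capacity;
	}

	return 0;
}

/* Flagging misfit only makes sense if a bigger allowed CPU exists. */
int misfit_is_actionable(unsigned long cur_cpu_capacity,
			 unsigned long task_max_allowed_capacity)
{
	return cur_cpu_capacity < task_max_allowed_capacity;
}
```

      With the result cached per task, the fast path reduces to the single
      comparison in misfit_is_actionable(): a task pinned to little cores and
      already running on one is never reported as misfit.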
      
      Improve the check_misfit_status() function by removing redundant checks:
      misfit_task_load will be 0 if the task can't move to a bigger CPU, and
      nohz_balancer_kick() already calls check_cpu_capacity() before
      calling check_misfit_status().
      
      Test:
      =====
      
      Add
      
      	trace_printk("balance_interval = %lu\n", interval)
      
      in get_sd_balance_interval().
      
      Run
      
      	if [ "$MASK" != "0" ]; then
      		adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
      	fi
      	sleep 10
      	# parse the ftrace buffer, counting the occurrences of each value
      
      Where MASK is either:
      
      	* 0: no busy task running
      	* 1: busy task is pinned to 1 cpu; handled today to not cause
      	  misfit
      	* f: busy task pinned to little cores, simulates busy background
      	  task, demonstrates the problem to be fixed
      
      Results:
      ========
      
      Note how the occurrence count of balance_interval = 128 overshoots for
      MASK=f.
      
      BEFORE
      ------
      
      	MASK=0
      
      		   1 balance_interval = 175
      		 120 balance_interval = 128
      		 846 balance_interval = 64
      		  55 balance_interval = 63
      		 215 balance_interval = 32
      		   2 balance_interval = 31
      		   2 balance_interval = 16
      		   4 balance_interval = 8
      		1870 balance_interval = 4
      		  65 balance_interval = 2
      
      	MASK=1
      
      		  27 balance_interval = 175
      		  37 balance_interval = 127
      		 840 balance_interval = 64
      		 167 balance_interval = 63
      		 449 balance_interval = 32
      		  84 balance_interval = 31
      		 304 balance_interval = 16
      		1156 balance_interval = 8
      		2781 balance_interval = 4
      		 428 balance_interval = 2
      
      	MASK=f
      
      		   1 balance_interval = 175
      		1328 balance_interval = 128
      		  44 balance_interval = 64
      		 101 balance_interval = 63
      		  25 balance_interval = 32
      		   5 balance_interval = 31
      		  23 balance_interval = 16
      		  23 balance_interval = 8
      		4306 balance_interval = 4
      		 177 balance_interval = 2
      
      AFTER
      -----
      
      Note how the high values almost disappear for all MASK values. The
      system has background tasks that can trigger the problem even with
      MASK=0, without simulating it.
      
      	MASK=0
      
      		 103 balance_interval = 63
      		  19 balance_interval = 31
      		 194 balance_interval = 8
      		4827 balance_interval = 4
      		 179 balance_interval = 2
      
      	MASK=1
      
      		 131 balance_interval = 63
      		   1 balance_interval = 31
      		  87 balance_interval = 8
      		3600 balance_interval = 4
      		   7 balance_interval = 2
      
      	MASK=f
      
      		   8 balance_interval = 127
      		 182 balance_interval = 63
      		   3 balance_interval = 31
      		   9 balance_interval = 16
      		 415 balance_interval = 8
      		3415 balance_interval = 4
      		  21 balance_interval = 2
      Signed-off-by: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240324004552.999936-3-qyousef@layalina.io
      22d56074
    • sched/topology: Export asym_cap_list · 77222b0d
      Qais Yousef authored
      So that we can use it to iterate through available capacities in the
      system. Sort asym_cap_list in descending order as expected users are
      likely to be interested in the highest capacity first.
      
      Make the list RCU protected to allow for cheap access in hot paths.
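
      Keeping a list sorted in descending order at insertion time can be
      sketched as below. This is a plain singly linked list with no RCU or
      locking, purely to show the ordering; cap_node_model and
      insert_descending() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node holding one capacity value. */
struct cap_node_model {
	unsigned long capacity;
	struct cap_node_model *next;
};

/* Insert `node` so that the list stays sorted by capacity, highest first. */
void insert_descending(struct cap_node_model **head,
		       struct cap_node_model *node)
{
	struct cap_node_model **pos = head;

	/* Skip every entry that is strictly bigger than the new one. */
	while (*pos && (*pos)->capacity > node->capacity)
		pos = &(*pos)->next;

	node->next = *pos;
	*pos = node;
}
```

      A reader that wants the biggest capacity then only has to look at the
      head of the list.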
      Signed-off-by: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io
      77222b0d
  7. 24 Mar, 2024 13 commits
    • Linux 6.9-rc1 · 4cece764
      Linus Torvalds authored
      4cece764
    • Merge tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi · ab8de2db
      Linus Torvalds authored
      Pull EFI fixes from Ard Biesheuvel:
      
       - Fix logic that is supposed to prevent placement of the kernel image
         below LOAD_PHYSICAL_ADDR
      
       - Use the firmware stack in the EFI stub when running in mixed mode
      
       - Clear BSS only once when using mixed mode
      
       - Check efi.get_variable() function pointer for NULL before trying to
         call it
      
      * tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
        efi: fix panic in kdump kernel
        x86/efistub: Don't clear BSS twice in mixed mode
        x86/efistub: Call mixed mode boot services on the firmware's stack
        efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
      ab8de2db
    • Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5e74df2f
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
       - Ensure that the encryption mask at boot is properly propagated on
         5-level page tables, otherwise the PGD entry is incorrectly set to
         non-encrypted, which causes system crashes during boot.
      
       - Undo the deferred 5-level page table setup as it cannot work with
         memory encryption enabled.
      
       - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
         to the default value but the cached variable is not, so subsequent
         comparisons might yield the wrong result and as a consequence the
         result prevents updating the MSR.
      
       - Register the local APIC address only once in the MPPARSE enumeration
         to prevent triggering the related WARN_ONs() in the APIC and topology
         code.
      
       - Handle the case where no APIC is found gracefully by registering a
         fake APIC in the topology code. That makes all related topology
         functions work correctly and does not affect the actual APIC driver
         code at all.
      
       - Don't evaluate logical IDs during early boot as the local APIC IDs
         are not yet enumerated and the invoked function returns an error
         code. Nothing requires the logical IDs before the final CPUID
         enumeration takes place, which happens after the enumeration.
      
       - Cure the fallout of the per CPU rework on UP which misplaced the
         copying of boot_cpu_data to per CPU data so that the final update to
         boot_cpu_data got lost which caused inconsistent state and boot
         crashes.
      
       - Use copy_from_kernel_nofault() in the kprobes setup as there is no
         guarantee that the address can be safely accessed.
      
       - Reorder struct members in struct saved_context to work around another
         kmemleak false positive
      
       - Remove the buggy code which tries to update the E820 kexec table for
         setup_data as that is never passed to the kexec kernel.
      
       - Update the resource control documentation to use the proper units.
      
       - Fix a Kconfig warning observed with tinyconfig
      
      * tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot/64: Move 5-level paging global variable assignments back
        x86/boot/64: Apply encryption mask to 5-level pagetable update
        x86/cpu: Add model number for another Intel Arrow Lake mobile processor
        x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
        Documentation/x86: Document that resctrl bandwidth control units are MiB
        x86/mpparse: Register APIC address only once
        x86/topology: Handle the !APIC case gracefully
        x86/topology: Don't evaluate logical IDs during early boot
        x86/cpu: Ensure that CPU info updates are propagated on UP
        kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
        x86/pm: Work around false positive kmemleak report in msr_build_context()
        x86/kexec: Do not update E820 kexec table for setup_data
        x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
      5e74df2f
    • Merge tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b136f68e
      Linus Torvalds authored
      Pull scheduler doc clarification from Thomas Gleixner:
       "A single update for the documentation of the base_slice_ns tunable to
        clarify that any value which is less than the tick slice has no effect
        because the scheduler tick is not guaranteed to happen within the set
        time slice"
      
      * tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
      b136f68e
    • Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping · 864ad046
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
       "This has a set of swiotlb alignment fixes for sometimes very long
        standing bugs from Will. We've been discussing them for a while and
        they should be solid now"
      
      * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
        swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
        iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
        swiotlb: Fix alignment checks when both allocation and DMA masks are present
        swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
        swiotlb: Enforce page alignment in swiotlb_alloc()
        swiotlb: Fix double-allocation of slots due to broken alignment handling
      864ad046
    • efi: fix panic in kdump kernel · 62b71cd7
      Oleksandr Tymoshenko authored
      Check if get_next_variable() is actually a valid pointer before
      calling it. In the kdump kernel this method is set to NULL, which causes
      a panic during the kexec-ed kernel boot.
      
      Tested with QEMU and OVMF firmware.
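
      The pattern behind the fix, checking a function pointer before calling
      through it, looks roughly like this. The struct efi_model, the stub and
      variable_services_usable() are hypothetical stand-ins for illustration
      and do not mirror the real EFI structures or signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified signature for the runtime service. */
typedef int (*get_next_variable_fn)(void);

/* Hypothetical subset of the EFI service table. */
struct efi_model {
	get_next_variable_fn get_next_variable;	/* may be NULL, e.g. in kdump */
};

/* Stand-in firmware call used only for this illustration. */
int stub_get_next_variable(void)
{
	return 0;
}

/* Return 1 if the service exists and reports success, 0 otherwise. */
int variable_services_usable(const struct efi_model *efi)
{
	/* Calling through a NULL pointer would crash, so bail out first. */
	if (!efi->get_next_variable)
		return 0;

	return efi->get_next_variable() == 0;
}
```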
      
      Fixes: bad267f9 ("efi: verify that variable services are supported")
      Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      62b71cd7
    • x86/efistub: Don't clear BSS twice in mixed mode · df7ecce8
      Ard Biesheuvel authored
      Clearing BSS should only be done once, at the very beginning.
      efi_pe_entry() is the entrypoint from the firmware, which may not clear
      BSS and so it is done explicitly. However, efi_pe_entry() is also used
      as an entrypoint by the mixed mode startup code, in which case BSS will
      already have been cleared, and doing it again at this point will corrupt
      global variables holding the firmware's GDT/IDT and segment selectors.
      
      So make the memset() conditional on whether the EFI stub is running in
      native mode.
      
      Fixes: b3810c5a ("x86/efistub: Clear decompressor BSS in native EFI entrypoint")
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      df7ecce8
    • x86/efistub: Call mixed mode boot services on the firmware's stack · cefcd4fe
      Ard Biesheuvel authored
      Normally, the EFI stub calls into the EFI boot services using the stack
      that was live when the stub was entered. According to the UEFI spec,
      this stack needs to be at least 128k in size - this might seem large but
      all asynchronous processing and event handling in EFI runs from the same
      stack and so quite a lot of space may be used in practice.
      
      In mixed mode, the situation is a bit different: the bootloader calls
      the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
      entry point, where the boot stack is set up, using a fixed allocation
      of 16k. This stack is still in use when the EFI stub is started in
      64-bit mode, and so all calls back into the EFI firmware will be using
      the decompressor's limited boot stack.
      
      Due to the placement of the boot stack right after the boot heap, any
      stack overruns have gone unnoticed. However, commit
      
        5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
      
      moved the definition of the boot heap into C code, and now the boot
      stack is placed right at the base of BSS, where any overruns will
      corrupt the end of the .data section.
      
      While it would be possible to work around this by increasing the size of
      the boot stack, doing so would affect all x86 systems, and mixed mode
      systems are a tiny (and shrinking) fraction of the x86 installed base.
      
      So instead, record the firmware stack pointer value when entering from
      the 32-bit firmware, and switch to this stack every time an EFI boot
      service call is made.
      
      Cc: <stable@kernel.org> # v6.1+
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      cefcd4fe
    • x86/boot/64: Move 5-level paging global variable assignments back · 9843231c
      Tom Lendacky authored
      Commit 63bed966 ("x86/startup_64: Defer assignment of 5-level paging
      global variables") moved assignment of 5-level global variables to later
      in the boot in order to avoid having to use RIP relative addressing in
      order to set them. However, when running with 5-level paging and SME
      active (mem_encrypt=on), the variables are needed as part of the page
      table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
      etc.). Since the variables haven't been set, the page table manipulation
      is done as if 4-level paging is active, causing the system to crash on
      boot.
      
      While only a subset of the assignments that were moved need to be set
      early, move all of the assignments back into check_la57_support() so that
      these assignments aren't spread between two locations. Instead of just
      reverting the fix, this uses the new RIP_REL_REF() macro when assigning
      the variables.
      
      Fixes: 63bed966 ("x86/startup_64: Defer assignment of 5-level paging global variables")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
      9843231c
    • x86/boot/64: Apply encryption mask to 5-level pagetable update · 4d0d7e78
      Tom Lendacky authored
      When running with 5-level page tables, the kernel mapping PGD entry is
      updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
      which, when SME is active (mem_encrypt=on), results in a page table
      entry without the encryption mask set, causing the system to crash on
      boot.
      
      Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
      that the encryption mask is set for the PGD entry.
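
      To illustrate why the choice of flags matters, here is a small model of
      building a page table entry with and without an encryption bit. All
      values are assumptions for the example: PAGE_TABLE_NOENC_MODEL and
      ENC_MASK_MODEL are made-up constants (the real encryption bit position
      is discovered at boot), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits only; not copied from the kernel headers. */
#define PAGE_TABLE_NOENC_MODEL	0x67ULL
/* Assumed position of the memory encryption bit for this example. */
#define ENC_MASK_MODEL		(1ULL << 47)
/* Flags with the encryption mask folded in. */
#define PAGE_TABLE_MODEL	(PAGE_TABLE_NOENC_MODEL | ENC_MASK_MODEL)

/* Combine the physical address of the next-level table with its flags. */
uint64_t make_pgd_entry(uint64_t p4d_phys, uint64_t flags)
{
	return p4d_phys | flags;
}

/* Report whether the entry carries the encryption bit. */
int entry_is_encrypted(uint64_t entry, uint64_t enc_mask)
{
	return (entry & enc_mask) != 0;
}
```

      An entry built from the NOENC flags points at the right table but tells
      the hardware to read it unencrypted, which is the mismatch the commit
      describes.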
      
      Fixes: 533568e0 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
      4d0d7e78
    • x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD · 10e4b516
      Adamos Ttofari authored
      Commit 67236547 ("x86/fpu: Update XFD state where required") and
      commit 8bf26758 ("x86/fpu: Add XFD state to fpstate") introduced a
      per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
      order to avoid unnecessary writes to the MSR.
      
      On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
      wipes out any stale state. But the per CPU cached xfd value is not
      reset, which brings them out of sync.
      
      As a consequence a subsequent xfd_update_state() might fail to update
      the MSR which in turn can result in XRSTOR raising a #NM in kernel
      space, which crashes the kernel.
      
      To fix this, introduce xfd_set_state() to write xfd_state together
      with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
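
      The failure mode and the fix can be modelled in userspace with a fake
      register and a cached copy. This is a sketch for a single CPU; every
      name ending in _model, and the two hotplug helpers, are hypothetical
      and only mimic the roles described above.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware register and its cached copy (single-CPU model). */
uint64_t msr_xfd_model;
uint64_t xfd_state_model;

/* Stand-in for the real MSR write. */
void wrmsr_model(uint64_t val)
{
	msr_xfd_model = val;
}

/* Write the register and the cache together so they cannot diverge. */
void xfd_set_state_model(uint64_t xfd)
{
	wrmsr_model(xfd);
	xfd_state_model = xfd;
}

/* Skip the register write when the cache says the value is already set. */
void xfd_update_state_model(uint64_t xfd)
{
	if (xfd_state_model != xfd)
		xfd_set_state_model(xfd);
}

/* Buggy hotplug path: the register is reset, the cache is left stale. */
void hotplug_reset_buggy(uint64_t init_xfd)
{
	wrmsr_model(init_xfd);
}

/* Fixed hotplug path: go through the helper that updates both. */
void hotplug_reset_fixed(uint64_t init_xfd)
{
	xfd_set_state_model(init_xfd);
}
```

      After hotplug_reset_buggy(), a later xfd_update_state_model() call with
      the old value is skipped because the stale cache still matches, so the
      register keeps the reset value. With hotplug_reset_fixed() the same
      call writes the register as intended.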
      
      Fixes: 67236547 ("x86/fpu: Update XFD state where required")
      Signed-off-by: Adamos Ttofari <attofari@amazon.de>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com
      
      Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
      10e4b516
    • Documentation/x86: Document that resctrl bandwidth control units are MiB · a8ed59a3
      Tony Luck authored
      The memory bandwidth software controller uses 2^20 units rather than
      10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
      Linux define for 0x00100000.
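
      The difference between the two units is easy to see with a small
      calculation. The helpers below are hypothetical and only demonstrate the
      arithmetic; SZ_1M_MODEL uses the 0x00100000 value quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M_MODEL	0x00100000ULL	/* 2^20 bytes, one MiB */
#define MB_DECIMAL	1000000ULL	/* 10^6 bytes, one MB */

/* Convert a byte count to whole MiB (binary megabytes). */
uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes / SZ_1M_MODEL;
}

/* Convert a byte count to whole MB (decimal megabytes). */
uint64_t bytes_to_mb(uint64_t bytes)
{
	return bytes / MB_DECIMAL;
}
```

      For 1048576000 bytes, bytes_to_mib() gives 1000 while bytes_to_mb()
      gives 1048, a difference of roughly 5%.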
      
      Update the documentation to use MiB when describing this feature.
      It's too late to fix the mount option "mba_MBps" as that is now an
      established user interface.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20240322182016.196544-1-tony.luck@intel.com
      a8ed59a3
  8. 23 Mar, 2024 1 commit
    • Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 70293240
      Linus Torvalds authored
      Pull timer fixes from Thomas Gleixner:
       "Two regression fixes for the timer and timer migration code:
      
         - Prevent endless timer requeuing which is caused by two CPUs racing
           out of idle. This happens when the last CPU goes idle and therefore
           has to ensure that the pending global timers are expired, while
           some other CPU comes out of idle at the same time, wins the race
           and expires the global queue. This causes the last CPU to chase
           ghost timers forever and to reprogram its clockevent device
           endlessly.
      
           Cure this by re-evaluating the wakeup time unconditionally.
      
         - The split into local (pinned) and global timers in the timer wheel
           caused a regression for NOHZ full as it broke the idle tracking of
           global timers. On NOHZ full this prevents a self IPI being sent,
           which in turn causes the timer not to be programmed and not to
           expire on time.
      
           Restore the idle tracking for the global timer base so that the
           self IPI condition for NOHZ full is working correctly again"
      
      * tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timers: Fix removed self-IPI on global timer's enqueue in nohz_full
        timers/migration: Fix endless timer requeue after idle interrupts
      70293240