1. 12 Oct, 2023 1 commit
    • Paul E. McKenney's avatar
      x86/nmi: Fix out-of-order NMI nesting checks & false positive warning · f44075ec
      Paul E. McKenney authored
      The ->idt_seq and ->recv_jiffies variables added by:
      
        1a3ea611 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
      
      ... place the exit-time check of the bottom bit of ->idt_seq after the
      this_cpu_dec_return() that re-enables NMI nesting.  This can result in
      the following sequence of events on a given CPU in kernels built with
      CONFIG_NMI_CHECK_CPU=y:
      
        o   An NMI arrives, and ->idt_seq is incremented to an odd number.
            In addition, nmi_state is set to NMI_EXECUTING==1.
      
        o   The NMI is processed.
      
        o   The this_cpu_dec_return(nmi_state) zeroes nmi_state and returns
            NMI_EXECUTING==1, thus opting out of the "goto nmi_restart".
      
        o   Another NMI arrives and ->idt_seq is incremented to an even
            number, triggering the warning.  But all is just fine, at least
            assuming we don't get so many closely spaced NMIs that the stack
            overflows or some such.
      
      Experience on the fleet indicates that the MTBF of this false positive
      is about 70 years.  Or, for those who are not quite that patient, the
      MTBF appears to be about one per week per 4,000 systems.
      
      Fix this false-positive warning by moving the "nmi_restart" label before
      the initial ->idt_seq increment/check and moving the this_cpu_dec_return()
      to follow the final ->idt_seq increment/check.  This way, all nested NMIs
      that get past the NMI_NOT_RUNNING check get a clean ->idt_seq slate.
      And if they don't get past that check, they will set nmi_state to
      NMI_LATCHED, which will cause the this_cpu_dec_return(nmi_state)
      to restart.
      
      Fixes: 1a3ea611 ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
      Reported-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lore.kernel.org/r/0cbff831-6e3d-431c-9830-ee65ee7787ff@paulmck-laptop
      f44075ec
  2. 08 Oct, 2023 4 commits
  3. 07 Oct, 2023 10 commits
  4. 06 Oct, 2023 22 commits
  5. 05 Oct, 2023 3 commits
    • Jeff Moyer's avatar
      io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls() · 0f8baa3c
      Jeff Moyer authored
      I received a bug report with the following signature:
      
      [ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
      [ 1759.944564] #PF: supervisor read access in kernel mode
      [ 1759.949732] #PF: error_code(0x0000) - not-present page
      [ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
      [ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI
      [ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1
      [ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
      [ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
      [ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
      [ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
      [ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
      [ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
      [ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
      [ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
      [ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
      [ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
      [ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
      [ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1760.086325] PKRU: 55555554
      [ 1760.089044] Call Trace:
      [ 1760.091501] <TASK>
      [ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
      [ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
      [ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
      [ 1760.106584] ? __die_body.cold+0x8/0xd
      [ 1760.110356] ? page_fault_oops+0x134/0x170
      [ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
      [ 1760.119298] ? exc_page_fault+0xa8/0x150
      [ 1760.123247] ? asm_exc_page_fault+0x22/0x30
      [ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
      [ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
      [ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
      [ 1760.142527] __io_wq_cpu_online+0x54/0xb0
      [ 1760.146558] cpuhp_invoke_callback+0x109/0x460
      [ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
      [ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
      [ 1760.160320] cpuhp_thread_fun+0x8d/0x140
      [ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
      [ 1760.168297] kthread+0xdd/0x100
      [ 1760.171457] ? __pfx_kthread+0x10/0x10
      [ 1760.175225] ret_from_fork+0x29/0x50
      [ 1760.178826] </TASK>
      [ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
      [ 1760.248623] CR2: ffffffffffffffe8
      
      A cpu hotplug callback was issued before wq->all_list was initialized.
      This results in a null pointer dereference.  The fix is to fully setup
      the io_wq before calling cpuhp_state_add_instance_nocalls().
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0f8baa3c
    • Xuewen Yan's avatar
      cpufreq: schedutil: Update next_freq when cpufreq_limits change · 9e0bc36a
      Xuewen Yan authored
      When cpufreq's policy is 'single', there is a scenario that will
      cause sg_policy's next_freq to be unable to update.
      
      When the CPU's util is always max, the cpufreq will be max,
      and then if we change the policy's scaling_max_freq to be a
      lower freq, indeed, the sg_policy's next_freq need change to
      be the lower freq, however, because the cpu_is_busy, the next_freq
      would keep the max_freq.
      
      For example:
      
      The cpu7 is a single CPU:
      
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
        pid 4737's current affinity mask: ff
        pid 4737's new affinity mask: 80
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
        2301000
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
        2301000
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
        unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
        2171000
      
      At this time, the sg_policy's next_freq would stay at 2301000, which
      is wrong.
      
      To fix this, add a check for the ->need_freq_update flag.
      
      [ mingo: Clarified the changelog. ]
      Co-developed-by: default avatarGuohua Yan <guohua.yan@unisoc.com>
      Signed-off-by: default avatarXuewen Yan <xuewen.yan@unisoc.com>
      Signed-off-by: default avatarGuohua Yan <guohua.yan@unisoc.com>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatar"Rafael J. Wysocki" <rafael@kernel.org>
      Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
      9e0bc36a
    • Renan Guilherme Lebre Ramos's avatar