1. 05 Sep, 2024 8 commits
    • Linus Torvalds's avatar
      Merge tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · e4b42053
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix adding a new fgraph callback after function graph tracing has
         already started.
      
         If the new caller does not initialize its hash before registering the
         fgraph_ops, it can cause a NULL pointer dereference. Fix this by
         adding a new parameter to ftrace_graph_enable_direct() passing in the
         newly added gops directly and not rely on using the fgraph_array[],
         as entries in the fgraph_array[] must be initialized.
      
         Assign the new gops to the fgraph_array[] after it goes through
         ftrace_startup_subops() as that will properly initialize the
         gops->ops and initialize its hashes.
      
       - Fix a memory leak in fgraph storage memory test.
      
         If the "multiple fgraph storage on a function" boot up selftest fails
         in the registering of the function graph tracer, it will not free the
         memory it allocated for the filter. Break the loop up into two where
         it allocates the filters first and then registers the functions where
         any errors will do the appropriate clean ups.
      
       - Only clear the timerlat timers if it has an associated kthread.
      
         In the rtla tool that uses timerlat, if it was killed just as it was
         shutting down, the signals can free the kthread and the timer. But
         the closing of the timerlat files could cause the hrtimer_cancel() to
         be called on the already freed timer. As the kthread variable is is
         set to NULL when the kthreads are stopped and the timers are freed it
         can be used to know not to call hrtimer_cancel() on the timer if the
         kthread variable is NULL.
      
       - Use a cpumask to keep track of osnoise/timerlat kthreads
      
         The timerlat tracer can use user space threads for its analysis. With
         the killing of the rtla tool, the kernel can get confused between if
         it is using a user space thread to analyze or one of its own kernel
         threads. When this confusion happens, kthread_stop() can be called on
         a user space thread and bad things happen. As the kernel threads are
         per-cpu, a bitmask can be used to know when a kernel thread is used
         or when a user space thread is used.
      
       - Add missing interface_lock to osnoise/timerlat stop_kthread()
      
         The stop_kthread() function in osnoise/timerlat clears the osnoise
         kthread variable, and if it was a user space thread does a put_task
         on it. But this can race with the closing of the timerlat files that
         also does a put_task on the kthread, and if the race happens the task
         will have put_task called on it twice and oops.
      
       - Add cond_resched() to the tracing_iter_reset() loop.
      
         The latency tracers keep writing to the ring buffer without resetting
         when it issues a new "start" event (like interrupts being disabled).
         When reading the buffer with an iterator, the tracing_iter_reset()
         sets its pointer to that start event by walking through all the
         events in the buffer until it gets to the time stamp of the start
         event. In the case of a very large buffer, the loop that looks for
         the start event has been reported taking a very long time with a non
         preempt kernel that it can trigger a soft lock up warning. Add a
         cond_resched() into that loop to make sure that doesn't happen.
      
       - Use list_del_rcu() for eventfs ei->list variable
      
         It was reported that running loops of creating and deleting kprobe
         events could cause a crash due to the eventfs list iteration hitting
         a LIST_POISON variable. This is because the list is protected by SRCU
         but when an item is deleted from the list, it was using list_del()
         which poisons the "next" pointer. This is what list_del_rcu() was to
         prevent.
      
      * tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
        tracing/timerlat: Only clear timer if a kthread exists
        tracing/osnoise: Use a cpumask to know what threads are kthreads
        eventfs: Use list_del_rcu() for SRCU protected list variable
        tracing: Avoid possible softlockup in tracing_iter_reset()
        tracing: Fix memory leak in fgraph storage selftest
        tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()
      e4b42053
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v6.11-6' of... · ad618736
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v6.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
      
      Pull x86 platform driver fixes from Ilpo Järvinen:
      
       - amd/pmf: ASUS GA403 quirk matching tweak
      
       - dell-smbios: Fix to the init function rollback path
      
      * tag 'platform-drivers-x86-v6.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
        platform/x86/amd: pmf: Make ASUS GA403 quirk generic
        platform/x86: dell-smbios: Fix error path in dell_smbios_init()
      ad618736
    • Linus Torvalds's avatar
      Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7' of... · 120434e5
      Linus Torvalds authored
      Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kunit fix fromShuah Khan:
       "One single fix to a use-after-free bug resulting from
        kunit_driver_create() failing to copy the driver name leaving it on
        the stack or freeing it"
      
      * tag 'linux_kselftest-kunit-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: Device wrappers should also manage driver name
      120434e5
    • Steven Rostedt's avatar
      tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread() · 5bfbcd1e
      Steven Rostedt authored
      The timerlat interface will get and put the task that is part of the
      "kthread" field of the osn_var to keep it around until all references are
      released. But here's a race in the "stop_kthread()" code that will call
      put_task_struct() on the kthread if it is not a kernel thread. This can
      race with the releasing of the references to that task struct and the
      put_task_struct() can be called twice when it should have been called just
      once.
      
      Take the interface_lock() in stop_kthread() to synchronize this change.
      But to do so, the function stop_per_cpu_kthreads() needs to change the
      loop from for_each_online_cpu() to for_each_possible_cpu() and remove the
      cpu_read_lock(), as the interface_lock can not be taken while the cpu
      locks are held. The only side effect of this change is that it may do some
      extra work, as the per_cpu variables of the offline CPUs would not be set
      anyway, and would simply be skipped in the loop.
      
      Remove unneeded "return;" in stop_kthread().
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Tomas Glozar <tglozar@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Link: https://lore.kernel.org/20240905113359.2b934242@gandalf.local.home
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      5bfbcd1e
    • Steven Rostedt's avatar
      tracing/timerlat: Only clear timer if a kthread exists · e6a53481
      Steven Rostedt authored
      The timerlat tracer can use user space threads to check for osnoise and
      timer latency. If the program using this is killed via a SIGTERM, the
      threads are shutdown one at a time and another tracing instance can start
      up resetting the threads before they are fully closed. That causes the
      hrtimer assigned to the kthread to be shutdown and freed twice when the
      dying thread finally closes the file descriptors, causing a use-after-free
      bug.
      
      Only cancel the hrtimer if the associated thread is still around. Also add
      the interface_lock around the resetting of the tlat_var->kthread.
      
      Note, this is just a quick fix that can be backported to stable. A real
      fix is to have a better synchronization between the shutdown of old
      threads and the starting of new ones.
      
      Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Link: https://lore.kernel.org/20240905085330.45985730@gandalf.local.home
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Reported-by: default avatarTomas Glozar <tglozar@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      e6a53481
    • Steven Rostedt's avatar
      tracing/osnoise: Use a cpumask to know what threads are kthreads · 177e1cc2
      Steven Rostedt authored
      The start_kthread() and stop_thread() code was not always called with the
      interface_lock held. This means that the kthread variable could be
      unexpectedly changed causing the kthread_stop() to be called on it when it
      should not have been, leading to:
      
       while true; do
         rtla timerlat top -u -q & PID=$!;
         sleep 5;
         kill -INT $PID;
         sleep 0.001;
         kill -TERM $PID;
         wait $PID;
        done
      
      Causing the following OOPS:
      
       Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
       KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
       CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc7-dirty #125 a533010b71dab205ad2f507188ce8c82203b0254
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
       RIP: 0010:hrtimer_active+0x58/0x300
       Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f
       RSP: 0018:ffff88811d97f940 EFLAGS: 00010202
       RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b
       RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28
       RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60
       R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d
       R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28
       FS:  0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0
       Call Trace:
        <TASK>
        ? die_addr+0x40/0xa0
        ? exc_general_protection+0x154/0x230
        ? asm_exc_general_protection+0x26/0x30
        ? hrtimer_active+0x58/0x300
        ? __pfx_mutex_lock+0x10/0x10
        ? __pfx_locks_remove_file+0x10/0x10
        hrtimer_cancel+0x15/0x40
        timerlat_fd_release+0x8e/0x1f0
        ? security_file_release+0x43/0x80
        __fput+0x372/0xb10
        task_work_run+0x11e/0x1f0
        ? _raw_spin_lock+0x85/0xe0
        ? __pfx_task_work_run+0x10/0x10
        ? poison_slab_object+0x109/0x170
        ? do_exit+0x7a0/0x24b0
        do_exit+0x7bd/0x24b0
        ? __pfx_migrate_enable+0x10/0x10
        ? __pfx_do_exit+0x10/0x10
        ? __pfx_read_tsc+0x10/0x10
        ? ktime_get+0x64/0x140
        ? _raw_spin_lock_irq+0x86/0xe0
        do_group_exit+0xb0/0x220
        get_signal+0x17ba/0x1b50
        ? vfs_read+0x179/0xa40
        ? timerlat_fd_read+0x30b/0x9d0
        ? __pfx_get_signal+0x10/0x10
        ? __pfx_timerlat_fd_read+0x10/0x10
        arch_do_signal_or_restart+0x8c/0x570
        ? __pfx_arch_do_signal_or_restart+0x10/0x10
        ? vfs_read+0x179/0xa40
        ? ksys_read+0xfe/0x1d0
        ? __pfx_ksys_read+0x10/0x10
        syscall_exit_to_user_mode+0xbc/0x130
        do_syscall_64+0x74/0x110
        ? __pfx___rseq_handle_notify_resume+0x10/0x10
        ? __pfx_ksys_read+0x10/0x10
        ? fpregs_restore_userregs+0xdb/0x1e0
        ? fpregs_restore_userregs+0xdb/0x1e0
        ? syscall_exit_to_user_mode+0x116/0x130
        ? do_syscall_64+0x74/0x110
        ? do_syscall_64+0x74/0x110
        ? do_syscall_64+0x74/0x110
        entry_SYSCALL_64_after_hwframe+0x71/0x79
       RIP: 0033:0x7ff0070eca9c
       Code: Unable to access opcode bytes at 0x7ff0070eca72.
       RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
       RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c
       RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003
       RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0
       R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003
       R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008
        </TASK>
       Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core
       ---[ end trace 0000000000000000 ]---
      
      This is because it would mistakenly call kthread_stop() on a user space
      thread making it "exit" before it actually exits.
      
      Since kthreads are created based on global behavior, use a cpumask to know
      when kthreads are running and that they need to be shutdown before
      proceeding to do new work.
      
      Link: https://lore.kernel.org/all/20240820130001.124768-1-tglozar@redhat.com/
      
      This was debugged by using the persistent ring buffer:
      
      Link: https://lore.kernel.org/all/20240823013902.135036960@goodmis.org/
      
      Note, locking was originally used to fix this, but that proved to cause too
      many deadlocks to work around:
      
        https://lore.kernel.org/linux-trace-kernel/20240823102816.5e55753b@gandalf.local.home/
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Link: https://lore.kernel.org/20240904103428.08efdf4c@gandalf.local.home
      Fixes: e88ed227 ("tracing/timerlat: Add user-space interface")
      Reported-by: default avatarTomas Glozar <tglozar@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      177e1cc2
    • Steven Rostedt's avatar
      eventfs: Use list_del_rcu() for SRCU protected list variable · d2603279
      Steven Rostedt authored
      Chi Zhiling reported:
      
        We found a null pointer accessing in tracefs[1], the reason is that the
        variable 'ei_child' is set to LIST_POISON1, that means the list was
        removed in eventfs_remove_rec. so when access the ei_child->is_freed, the
        panic triggered.
      
        by the way, the following script can reproduce this panic
      
        loop1 (){
            while true
            do
                echo "p:kp submit_bio" > /sys/kernel/debug/tracing/kprobe_events
                echo "" > /sys/kernel/debug/tracing/kprobe_events
            done
        }
        loop2 (){
            while true
            do
                tree /sys/kernel/debug/tracing/events/kprobes/
            done
        }
        loop1 &
        loop2
      
        [1]:
        [ 1147.959632][T17331] Unable to handle kernel paging request at virtual address dead000000000150
        [ 1147.968239][T17331] Mem abort info:
        [ 1147.971739][T17331]   ESR = 0x0000000096000004
        [ 1147.976172][T17331]   EC = 0x25: DABT (current EL), IL = 32 bits
        [ 1147.982171][T17331]   SET = 0, FnV = 0
        [ 1147.985906][T17331]   EA = 0, S1PTW = 0
        [ 1147.989734][T17331]   FSC = 0x04: level 0 translation fault
        [ 1147.995292][T17331] Data abort info:
        [ 1147.998858][T17331]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
        [ 1148.005023][T17331]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
        [ 1148.010759][T17331]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
        [ 1148.016752][T17331] [dead000000000150] address between user and kernel address ranges
        [ 1148.024571][T17331] Internal error: Oops: 0000000096000004 [#1] SMP
        [ 1148.030825][T17331] Modules linked in: team_mode_loadbalance team nlmon act_gact cls_flower sch_ingress bonding tls macvlan dummy ib_core bridge stp llc veth amdgpu amdxcp mfd_core gpu_sched drm_exec drm_buddy radeon crct10dif_ce video drm_suballoc_helper ghash_ce drm_ttm_helper sha2_ce ttm sha256_arm64 i2c_algo_bit sha1_ce sbsa_gwdt cp210x drm_display_helper cec sr_mod cdrom drm_kms_helper binfmt_misc sg loop fuse drm dm_mod nfnetlink ip_tables autofs4 [last unloaded: tls]
        [ 1148.072808][T17331] CPU: 3 PID: 17331 Comm: ls Tainted: G        W         ------- ----  6.6.43 #2
        [ 1148.081751][T17331] Source Version: 21b3b386e948bedd29369af66f3e98ab01b1c650
        [ 1148.088783][T17331] Hardware name: Greatwall GW-001M1A-FTF/GW-001M1A-FTF, BIOS KunLun BIOS V4.0 07/16/2020
        [ 1148.098419][T17331] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        [ 1148.106060][T17331] pc : eventfs_iterate+0x2c0/0x398
        [ 1148.111017][T17331] lr : eventfs_iterate+0x2fc/0x398
        [ 1148.115969][T17331] sp : ffff80008d56bbd0
        [ 1148.119964][T17331] x29: ffff80008d56bbf0 x28: ffff001ff5be2600 x27: 0000000000000000
        [ 1148.127781][T17331] x26: ffff001ff52ca4e0 x25: 0000000000009977 x24: dead000000000100
        [ 1148.135598][T17331] x23: 0000000000000000 x22: 000000000000000b x21: ffff800082645f10
        [ 1148.143415][T17331] x20: ffff001fddf87c70 x19: ffff80008d56bc90 x18: 0000000000000000
        [ 1148.151231][T17331] x17: 0000000000000000 x16: 0000000000000000 x15: ffff001ff52ca4e0
        [ 1148.159048][T17331] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
        [ 1148.166864][T17331] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000804391d0
        [ 1148.174680][T17331] x8 : 0000000180000000 x7 : 0000000000000018 x6 : 0000aaab04b92862
        [ 1148.182498][T17331] x5 : 0000aaab04b92862 x4 : 0000000080000000 x3 : 0000000000000068
        [ 1148.190314][T17331] x2 : 000000000000000f x1 : 0000000000007ea8 x0 : 0000000000000001
        [ 1148.198131][T17331] Call trace:
        [ 1148.201259][T17331]  eventfs_iterate+0x2c0/0x398
        [ 1148.205864][T17331]  iterate_dir+0x98/0x188
        [ 1148.210036][T17331]  __arm64_sys_getdents64+0x78/0x160
        [ 1148.215161][T17331]  invoke_syscall+0x78/0x108
        [ 1148.219593][T17331]  el0_svc_common.constprop.0+0x48/0xf0
        [ 1148.224977][T17331]  do_el0_svc+0x24/0x38
        [ 1148.228974][T17331]  el0_svc+0x40/0x168
        [ 1148.232798][T17331]  el0t_64_sync_handler+0x120/0x130
        [ 1148.237836][T17331]  el0t_64_sync+0x1a4/0x1a8
        [ 1148.242182][T17331] Code: 54ffff6c f9400676 910006d6 f9000676 (b9405300)
        [ 1148.248955][T17331] ---[ end trace 0000000000000000 ]---
      
      The issue is that list_del() is used on an SRCU protected list variable
      before the synchronization occurs. This can poison the list pointers while
      there is a reader iterating the list.
      
      This is simply fixed by using list_del_rcu() that is specifically made for
      this purpose.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240829085025.3600021-1-chizhiling@163.com/
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lore.kernel.org/20240904131605.640d42b1@gandalf.local.home
      Fixes: 43aa6f97 ("eventfs: Get rid of dentry pointers without refcounts")
      Reported-by: default avatarChi Zhiling <chizhiling@kylinos.cn>
      Tested-by: default avatarChi Zhiling <chizhiling@kylinos.cn>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      d2603279
    • Zheng Yejian's avatar
      tracing: Avoid possible softlockup in tracing_iter_reset() · 49aa8a1f
      Zheng Yejian authored
      In __tracing_open(), when max latency tracers took place on the cpu,
      the time start of its buffer would be updated, then event entries with
      timestamps being earlier than start of the buffer would be skipped
      (see tracing_iter_reset()).
      
      Softlockup will occur if the kernel is non-preemptible and too many
      entries were skipped in the loop that reset every cpu buffer, so add
      cond_resched() to avoid it.
      
      Cc: stable@vger.kernel.org
      Fixes: 2f26ebd5 ("tracing: use timestamp to determine start of latency traces")
      Link: https://lore.kernel.org/20240827124654.3817443-1-zhengyejian@huaweicloud.comSuggested-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarZheng Yejian <zhengyejian@huaweicloud.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      49aa8a1f
  2. 04 Sep, 2024 11 commits
  3. 03 Sep, 2024 4 commits
    • Linus Torvalds's avatar
      Merge tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 88fac175
      Linus Torvalds authored
      Pull fuse fixes from Miklos Szeredi:
      
       - Fix EIO if splice and page stealing are enabled on the fuse device
      
       - Disable problematic combination of passthrough and writeback-cache
      
       - Other bug fixes found by code review
      
      * tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
        fuse: disable the combination of passthrough and writeback cache
        fuse: update stats for pages in dropped aux writeback list
        fuse: clear PG_uptodate when using a stolen page
        fuse: fix memory leak in fuse_create_open
        fuse: check aborted connection before adding requests to pending list for resending
        fuse: use unsigned type for getxattr/listxattr size truncation
      88fac175
    • Filipe Manana's avatar
      btrfs: fix race between direct IO write and fsync when using same fd · cd9253c2
      Filipe Manana authored
      If we have 2 threads that are using the same file descriptor and one of
      them is doing direct IO writes while the other is doing fsync, we have a
      race where we can end up either:
      
      1) Attempt a fsync without holding the inode's lock, triggering an
         assertion failures when assertions are enabled;
      
      2) Do an invalid memory access from the fsync task because the file private
         points to memory allocated on stack by the direct IO task and it may be
         used by the fsync task after the stack was destroyed.
      
      The race happens like this:
      
      1) A user space program opens a file descriptor with O_DIRECT;
      
      2) The program spawns 2 threads using libpthread for example;
      
      3) One of the threads uses the file descriptor to do direct IO writes,
         while the other calls fsync using the same file descriptor.
      
      4) Call task A the thread doing direct IO writes and task B the thread
         doing fsyncs;
      
      5) Task A does a direct IO write, and at btrfs_direct_write() sets the
         file's private to an on stack allocated private with the member
         'fsync_skip_inode_lock' set to true;
      
      6) Task B enters btrfs_sync_file() and sees that there's a private
         structure associated to the file which has 'fsync_skip_inode_lock' set
         to true, so it skips locking the inode's VFS lock;
      
      7) Task A completes the direct IO write, and resets the file's private to
         NULL since it had no prior private and our private was stack allocated.
         Then it unlocks the inode's VFS lock;
      
      8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
         assertion that checks the inode's VFS lock is held fails, since task B
         never locked it and task A has already unlocked it.
      
      The stack trace produced is the following:
      
         assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/ordered-data.c:983!
         Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
         CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
         Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
         RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
         Code: 50 d6 86 c0 e8 (...)
         RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
         RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
         RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
         RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
         R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
         R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
         FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
         Call Trace:
          <TASK>
          ? __die_body.cold+0x14/0x24
          ? die+0x2e/0x50
          ? do_trap+0xca/0x110
          ? do_error_trap+0x6a/0x90
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? exc_invalid_op+0x50/0x70
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? asm_exc_invalid_op+0x1a/0x20
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? __seccomp_filter+0x31d/0x4f0
          __x64_sys_fdatasync+0x4f/0x90
          do_syscall_64+0x82/0x160
          ? do_futex+0xcb/0x190
          ? __x64_sys_futex+0x10e/0x1d0
          ? switch_fpu_return+0x4f/0xd0
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Another problem here is if task B grabs the private pointer and then uses
      it after task A has finished, since the private was allocated in the stack
      of task A, it results in some invalid memory access with a hard to predict
      result.
      
      This issue, triggering the assertion, was observed with QEMU workloads by
      two users in the Link tags below.
      
      Fix this by not relying on a file's private to pass information to fsync
      that it should skip locking the inode and instead pass this information
      through a special value stored in current->journal_info. This is safe
      because in the relevant section of the direct IO write path we are not
      holding a transaction handle, so current->journal_info is NULL.
      
      The following C program triggers the issue:
      
         $ cat repro.c
         /* Get the O_DIRECT definition. */
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <stdio.h>
         #include <stdlib.h>
         #include <unistd.h>
         #include <stdint.h>
         #include <fcntl.h>
         #include <errno.h>
         #include <string.h>
         #include <pthread.h>
      
         static int fd;
      
         static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
         {
             while (count > 0) {
                 ssize_t ret;
      
                 ret = pwrite(fd, buf, count, offset);
                 if (ret < 0) {
                     if (errno == EINTR)
                         continue;
                     return ret;
                 }
                 count -= ret;
                 buf += ret;
             }
             return 0;
         }
      
         static void *fsync_loop(void *arg)
         {
             while (1) {
                 int ret;
      
                 ret = fsync(fd);
                 if (ret != 0) {
                     perror("Fsync failed");
                     exit(6);
                 }
             }
         }
      
         int main(int argc, char *argv[])
         {
             long pagesize;
             void *write_buf;
             pthread_t fsyncer;
             int ret;
      
             if (argc != 2) {
                 fprintf(stderr, "Use: %s <file path>\n", argv[0]);
                 return 1;
             }
      
             fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
             if (fd == -1) {
                 perror("Failed to open/create file");
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             if (pagesize == -1) {
                 perror("Failed to get page size");
                 return 2;
             }
      
             ret = posix_memalign(&write_buf, pagesize, pagesize);
             if (ret) {
                 perror("Failed to allocate buffer");
                 return 3;
             }
      
             ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create writer thread: %d\n", ret);
                 return 4;
             }
      
             while (1) {
                 ret = do_write(fd, write_buf, pagesize, 0);
                 if (ret != 0) {
                     perror("Write failed");
                     exit(5);
                 }
             }
      
             return 0;
         }
      
         $ mkfs.btrfs -f /dev/sdi
         $ mount /dev/sdi /mnt/sdi
         $ timeout 10 ./repro /mnt/sdi/foo
      
      Usually the race is triggered within less than 1 second. A test case for
      fstests will follow soon.
      Reported-by: default avatarPaulo Dias <paulo.miguel.dias@gmail.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187Reported-by: default avatarAndreas Jahn <jahn-andi@web.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
      Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
      Fixes: 939b656b ("btrfs: fix corruption after buffer fault in during direct IO append write")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd9253c2
    • Luke D. Jones's avatar
      platform/x86/amd: pmf: Make ASUS GA403 quirk generic · d34af755
      Luke D. Jones authored
      The original quirk should match to GA403U so that the full
      range of GA403U models can benefit.
      Signed-off-by: default avatarLuke D. Jones <luke@ljones.dev>
      Link: https://lore.kernel.org/r/20240831003905.1060977-1-luke@ljones.devReviewed-by: default avatarIlpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Signed-off-by: default avatarIlpo Järvinen <ilpo.jarvinen@linux.intel.com>
      d34af755
    • Helge Deller's avatar
      parisc: Delay write-protection until mark_rodata_ro() call · 213aa670
      Helge Deller authored
      Do not write-protect the kernel read-only and __ro_after_init sections
      earlier than before mark_rodata_ro() is called.  This fixes a boot issue on
      parisc which is triggered by commit 91a1d97e ("jump_label,module: Don't
      alloc static_key_mod for __ro_after_init keys"). That commit may modify
      static key contents in the __ro_after_init section at bootup, so this
      section needs to be writable at least until mark_rodata_ro() is called.
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Reported-by: default avatarmatoro <matoro_mailinglist_kernel@matoro.tk>
      Reported-by: default avatarChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Tested-by: default avatarChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Link: https://lore.kernel.org/linux-parisc/096cad5aada514255cd7b0b9dbafc768@matoro.tk/#r
      Fixes: 91a1d97e ("jump_label,module: Don't alloc static_key_mod for __ro_after_init keys")
      Cc: stable@vger.kernel.org # v6.10+
      213aa670
  4. 02 Sep, 2024 17 commits
    • Naohiro Aota's avatar
      btrfs: zoned: handle broken write pointer on zones · b1934cd6
      Naohiro Aota authored
      Btrfs rejects to mount a FS if it finds a block group with a broken write
      pointer (e.g, unequal write pointers on two zones of RAID1 block group).
      Since such case can happen easily with a power-loss or crash of a system,
      we need to handle the case more gently.
      
      Handle such block group by making it unallocatable, so that there will be
      no writes into it. That can be done by setting the allocation pointer at
      the end of allocating region (= block_group->zone_capacity). Then, existing
      code handle zone_unusable properly.
      
      Having proper zone_capacity is necessary for the change. So, set it as fast
      as possible.
      
      We cannot handle RAID0 and RAID10 case like this. But, they are anyway
      unable to read because of a missing stripe.
      
      Fixes: 265f7237 ("btrfs: zoned: allow DUP on meta-data block groups")
      Fixes: 568220fa ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
      CC: stable@vger.kernel.org # 6.1+
      Reported-by: default avatarHAN Yuwei <hrx@bupt.moe>
      Cc: Xuefer <xuefer@gmail.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b1934cd6
    • Arnaldo Carvalho de Melo's avatar
      perf daemon: Fix the build on more 32-bit architectures · e162cb25
      Arnaldo Carvalho de Melo authored
      FYI: I'm carrying this on perf-tools-next.
      
      The previous attempt fixed the build on debian:experimental-x-mipsel,
      but when building on a larger set of containers I noticed it broke the
      build on some other 32-bit architectures such as:
      
        42     7.87 ubuntu:18.04-x-arm            : FAIL gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)
          builtin-daemon.c: In function 'cmd_session_list':
          builtin-daemon.c:692:16: error: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'long int' [-Werror=format=]
             fprintf(out, "%c%" PRIu64,
                          ^~~~~
          builtin-daemon.c:694:13:
              csv_sep, (curr - daemon->start) / 60);
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~
          In file included from builtin-daemon.c:3:0:
          /usr/arm-linux-gnueabihf/include/inttypes.h:105:34: note: format string is defined here
           # define PRIu64  __PRI64_PREFIX "u"
      
      So lets cast that time_t (32-bit/64-bit) to uint64_t to make sure it
      builds everywhere.
      
      Fixes: 4bbe6002 ("perf daemon: Fix the build on 32-bit architectures")
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/ZsPmldtJ0D9Cua9_@x1Signed-off-by: default avatarNamhyung Kim <namhyung@kernel.org>
      e162cb25
    • Xu Yang's avatar
      perf python: include "util/sample.h" · aee1d559
      Xu Yang authored
      The 32-bit arm build system will complain:
      
      tools/perf/util/python.c:75:28: error: field ‘sample’ has incomplete type
         75 |         struct perf_sample sample;
      
      However, arm64 build system doesn't complain this.
      
      The root cause is arm64 define "HAVE_KVM_STAT_SUPPORT := 1" in
      tools/perf/arch/arm64/Makefile, but arm arch doesn't define this.
      This will lead to kvm-stat.h include other header files on arm64 build
      system, especially "util/sample.h" for util/python.c.
      
      This will try to directly include "util/sample.h" for "util/python.c" to
      avoid such build issue on arm platform.
      Signed-off-by: default avatarXu Yang <xu.yang_2@nxp.com>
      Cc: imx@lists.linux.dev
      Link: https://lore.kernel.org/r/20240819023403.201324-1-xu.yang_2@nxp.comSigned-off-by: default avatarNamhyung Kim <namhyung@kernel.org>
      aee1d559
    • Namhyung Kim's avatar
      perf lock contention: Fix spinlock and rwlock accounting · 287bd5cf
      Namhyung Kim authored
      The spinlock and rwlock use a single-element per-cpu array to track
      current locks due to performance reason.  But this means the key is
      always available and it cannot simply account lock stats in the array
      because some of them are invalid.
      
      In fact, the contention_end() program in the BPF invalidates the entry
      by setting the 'lock' value to 0 instead of deleting the entry for the
      hashmap.  So it should skip entries with the lock value of 0 in the
      account_end_timestamp().
      
      Otherwise, it'd have spurious high contention on an idle machine:
      
        $ sudo perf lock con -ab -Y spinlock sleep 3
         contended   total wait     max wait     avg wait         type   caller
      
                 8      4.72 s       1.84 s     590.46 ms     spinlock   rcu_core+0xc7
                 8      1.87 s       1.87 s     233.48 ms     spinlock   process_one_work+0x1b5
                 2      1.87 s       1.87 s     933.92 ms     spinlock   worker_thread+0x1a2
                 3      1.81 s       1.81 s     603.93 ms     spinlock   tmigr_update_events+0x13c
                 2      1.72 s       1.72 s     861.98 ms     spinlock   tick_do_update_jiffies64+0x25
                 6     42.48 us     13.02 us      7.08 us     spinlock   futex_q_lock+0x2a
                 1     13.03 us     13.03 us     13.03 us     spinlock   futex_wake+0xce
                 1     11.61 us     11.61 us     11.61 us     spinlock   rcu_core+0xc7
      
      I don't believe it has contention on a spinlock longer than 1 second.
      After this change, it only reports some small contentions.
      
        $ sudo perf lock con -ab -Y spinlock sleep 3
         contended   total wait     max wait     avg wait         type   caller
      
                 4    133.51 us     43.29 us     33.38 us     spinlock   tick_do_update_jiffies64+0x25
                 4     69.06 us     31.82 us     17.27 us     spinlock   process_one_work+0x1b5
                 2     50.66 us     25.77 us     25.33 us     spinlock   rcu_core+0xc7
                 1     28.45 us     28.45 us     28.45 us     spinlock   rcu_core+0xc7
                 1     24.77 us     24.77 us     24.77 us     spinlock   tmigr_update_events+0x13c
                 1     23.34 us     23.34 us     23.34 us     spinlock   raw_spin_rq_lock_nested+0x15
      
      Fixes: b5711042 ("perf lock contention: Use per-cpu array map for spinlocks")
      Reported-by: default avatarXi Wang <xii@google.com>
      Cc: Song Liu <song@kernel.org>
      Cc: bpf@vger.kernel.org
      Link: https://lore.kernel.org/r/20240828052953.1445862-1-namhyung@kernel.orgSigned-off-by: default avatarNamhyung Kim <namhyung@kernel.org>
      287bd5cf
    • Veronika Molnarova's avatar
      perf test pmu: Set uninitialized PMU alias to null · 1c7fb536
      Veronika Molnarova authored
      Commit 3e0bf9fd ("perf pmu: Restore full PMU name wildcard
      support") adds a test case "PMU cmdline match" that covers PMU name
      wildcard support provided by function perf_pmu__match(). The test works
      with a wide range of supported combinations of PMU name matching but
      omits the case that if the perf_pmu__match() cannot match the PMU name
      to the wildcard, it tries to match its alias. However, this variable is
      not set up, causing the test case to fail when run with subprocesses or
      to segfault if run as a single process.
      
        ./perf test -vv 9
          9: Sysfs PMU tests                                                 :
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
          9.6: PMU cmdline match                                             : FAILED!
      
        ./perf test -F 9
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
        Segmentation fault (core dumped)
      
      Initialize the PMU alias to null for all tests of perf_pmu__match()
      as this functionality is not being tested and the alias matching works
      exactly the same as the matching of the PMU name.
      
        ./perf test -F 9
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
          9.6: PMU cmdline match                                             : Ok
      
      Fixes: 3e0bf9fd ("perf pmu: Restore full PMU name wildcard support")
      Signed-off-by: default avatarVeronika Molnarova <vmolnaro@redhat.com>
      Cc: james.clark@arm.com
      Cc: mpetlan@redhat.com
      Cc: rstoyano@redhat.com
      Link: https://lore.kernel.org/r/20240808103749.9356-1-vmolnaro@redhat.comSigned-off-by: default avatarNamhyung Kim <namhyung@kernel.org>
      1c7fb536
    • Fedor Pchelkin's avatar
      btrfs: qgroup: don't use extent changeset when not needed · c346c629
      Fedor Pchelkin authored
      The local extent changeset is passed to clear_record_extent_bits() where
      it may have some additional memory dynamically allocated for ulist. When
      qgroup is disabled, the memory is leaked because in this case the
      changeset is not released upon __btrfs_qgroup_release_data() return.
      
      Since the recorded contents of the changeset are not used thereafter, just
      don't pass it.
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com
      Fixes: af0e2aab ("btrfs: qgroup: flush reservations during quota disable")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c346c629
    • Armin Wolf's avatar
      hwmon: (hp-wmi-sensors) Check if WMI event data exists · a54da9df
      Armin Wolf authored
      The BIOS can choose to return no event data in response to a
      WMI event, so the ACPI object passed to the WMI notify handler
      can be NULL.
      
      Check for such a situation and ignore the event in such a case.
      
      Fixes: 23902f98 ("hwmon: add HP WMI Sensors driver")
      Signed-off-by: default avatarArmin Wolf <W_Armin@gmx.de>
      Reviewed-by: default avatarIlpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Message-ID: <20240901031055.3030-2-W_Armin@gmx.de>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      a54da9df
    • Linus Torvalds's avatar
      Merge tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux · 67784a74
      Linus Torvalds authored
      Pull ata fix from Damien Le Moal:
      
       - Fix a potential memory leak in the ata host initialization code (from
         Zheng)
      
      * tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
        ata: libata: Fix memory leak for error path in ata_host_alloc()
      67784a74
    • Suren Baghdasaryan's avatar
      alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n · 052a45c1
      Suren Baghdasaryan authored
      codetag_module_init() is used to initialize sections containing allocation
      tags.  This function is used to initialize module sections as well as core
      kernel sections, in which case the module parameter is set to NULL.  This
      function has to be called even when CONFIG_MODULES=n to initialize core
      kernel allocation tag sections.  When CONFIG_MODULES=n, this function is a
      NOP, which is wrong.  This leads to /proc/allocinfo reported as empty. 
      Fix this by making it independent of CONFIG_MODULES.
      
      Link: https://lkml.kernel.org/r/20240828231536.1770519-1-surenb@google.com
      Fixes: 916cc516 ("lib: code tagging framework")
      Signed-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[6.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      052a45c1
    • Adrian Huang's avatar
      mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area · 409faf8c
      Adrian Huang authored
      When running the vmalloc stress on a 448-core system, observe the average
      latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
      'funclatency.py' tool [1].
      
        # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
      
           usecs             : count    distribution
              0 -> 1         : 0       |                                        |
              2 -> 3         : 29      |                                        |
              4 -> 7         : 19      |                                        |
              8 -> 15        : 56      |                                        |
             16 -> 31        : 483     |****                                    |
             32 -> 63        : 1548    |************                            |
             64 -> 127       : 2634    |*********************                   |
            128 -> 255       : 2535    |*********************                   |
            256 -> 511       : 1776    |**************                          |
            512 -> 1023      : 1015    |********                                |
           1024 -> 2047      : 573     |****                                    |
           2048 -> 4095      : 488     |****                                    |
           4096 -> 8191      : 1091    |*********                               |
           8192 -> 16383     : 3078    |*************************               |
          16384 -> 32767     : 4821    |****************************************|
          32768 -> 65535     : 3318    |***************************             |
          65536 -> 131071    : 1718    |**************                          |
         131072 -> 262143    : 2220    |******************                      |
         262144 -> 524287    : 1147    |*********                               |
         524288 -> 1048575   : 1179    |*********                               |
        1048576 -> 2097151   : 822     |******                                  |
        2097152 -> 4194303   : 906     |*******                                 |
        4194304 -> 8388607   : 2148    |*****************                       |
        8388608 -> 16777215  : 4497    |*************************************   |
       16777216 -> 33554431  : 289     |**                                      |
      
        avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
      
        The worst case is over 16-33 seconds, so soft lockup is triggered [2].
      
      [Root Cause]
      1) Each purge_list has the long list. The following shows the number of
         vmap_area is purged.
      
         crash> p vmap_nodes
         vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
         crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
           nr_purged = 663070
           ...
           nr_purged = 821670
           nr_purged = 692214
           nr_purged = 726808
           ...
      
      2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
         operation when purging each vmap_area. However, the iteration is over
         600000 vmap_area (See 'nr_purged' above).
      
         Here is objdump output:
      
           $ objdump -D vmlinux
           ffffffff813e8c80 <purge_vmap_node>:
           ...
           ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
           ...
      
         Quote from "Instruction tables" pdf file [3]:
           Instructions with a LOCK prefix have a long latency that depends on
           cache organization and possibly RAM speed. If there are multiple
           processors or cores or direct memory access (DMA) devices, then all
           locked instructions will lock a cache line for exclusive access,
           which may involve RAM access. A LOCK prefix typically costs more
           than a hundred clock cycles, even on single-processor systems.
      
         That's why the latency of purge_vmap_node() dramatically increases
         on a many-core system: One core is busy on purging each vmap_area of
         the *long* purge_list and executing atomic_long_sub() for each
         vmap_area, while other cores free vmalloc allocations and execute
         atomic_long_add_return() in free_vmap_area_noflush().
      
      [Solution]
      Employ a local variable to record the total purged pages, and execute
      atomic_long_sub() after the traversal of the purge_list is done. The
      experiment result shows the latency improvement is 99%.
      
      [Experiment Result]
      1) System Configuration: Three servers (with HT-enabled) are tested.
           * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
           * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
           * 448-core server: AMD Zen 4 Processor*2
      
      2) Kernel Config
           * CONFIG_KASAN is disabled
      
      3) The data in column "w/o patch" and "w/ patch"
           * Unit: micro seconds (us)
           * Each data is the average of 3-time measurements
      
               System        w/o patch (us)   w/ patch (us)    Improvement (%)
           ---------------   --------------   -------------    -------------
           72-core server          2194              14            99.36%
           192-core server       143799            1139            99.21%
           448-core server      1992122            6883            99.65%
      
      [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
      [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
      [3] https://www.agner.org/optimize/instruction_tables.pdf
      
      Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.comSigned-off-by: default avatarAdrian Huang <ahuang12@lenovo.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      409faf8c
    • Jan Kuliga's avatar
      mailmap: update entry for Jan Kuliga · 4f295229
      Jan Kuliga authored
      Soon I won't be able to use my current email address.
      
      Link: https://lkml.kernel.org/r/20240830095658.1203198-1-jankul@alatek.krakow.plSigned-off-by: default avatarJan Kuliga <jankul@alatek.krakow.pl>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4f295229
    • Hao Ge's avatar
      codetag: debug: mark codetags for poisoned page as empty · 5e9784e9
      Hao Ge authored
      When PG_hwpoison pages are freed they are treated differently in
      free_pages_prepare() and instead of being released they are isolated.
      
      Page allocation tag counters are decremented at this point since the page
      is considered not in use.  Later on when such pages are released by
      unpoison_memory(), the allocation tag counters will be decremented again
      and the following warning gets reported:
      
      [  113.930443][ T3282] ------------[ cut here ]------------
      [  113.931105][ T3282] alloc_tag was not set
      [  113.931576][ T3282] WARNING: CPU: 2 PID: 3282 at ./include/linux/alloc_tag.h:130 pgalloc_tag_sub.part.66+0x154/0x164
      [  113.932866][ T3282] Modules linked in: hwpoison_inject fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_man4
      [  113.941638][ T3282] CPU: 2 UID: 0 PID: 3282 Comm: madvise11 Kdump: loaded Tainted: G        W          6.11.0-rc4-dirty #18
      [  113.943003][ T3282] Tainted: [W]=WARN
      [  113.943453][ T3282] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
      [  113.944378][ T3282] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  113.945319][ T3282] pc : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946016][ T3282] lr : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946706][ T3282] sp : ffff800087093a10
      [  113.947197][ T3282] x29: ffff800087093a10 x28: ffff0000d7a9d400 x27: ffff80008249f0a0
      [  113.948165][ T3282] x26: 0000000000000000 x25: ffff80008249f2b0 x24: 0000000000000000
      [  113.949134][ T3282] x23: 0000000000000001 x22: 0000000000000001 x21: 0000000000000000
      [  113.950597][ T3282] x20: ffff0000c08fcad8 x19: ffff80008251e000 x18: ffffffffffffffff
      [  113.952207][ T3282] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800081746210
      [  113.953161][ T3282] x14: 0000000000000000 x13: 205d323832335420 x12: 5b5d353031313339
      [  113.954120][ T3282] x11: ffff800087093500 x10: 000000000000005d x9 : 00000000ffffffd0
      [  113.955078][ T3282] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008236ba90 x6 : c0000000ffff7fff
      [  113.956036][ T3282] x5 : ffff000b34bf4dc8 x4 : ffff8000820aba90 x3 : 0000000000000001
      [  113.956994][ T3282] x2 : ffff800ab320f000 x1 : 841d1e35ac932e00 x0 : 0000000000000000
      [  113.957962][ T3282] Call trace:
      [  113.958350][ T3282]  pgalloc_tag_sub.part.66+0x154/0x164
      [  113.959000][ T3282]  pgalloc_tag_sub+0x14/0x1c
      [  113.959539][ T3282]  free_unref_page+0xf4/0x4b8
      [  113.960096][ T3282]  __folio_put+0xd4/0x120
      [  113.960614][ T3282]  folio_put+0x24/0x50
      [  113.961103][ T3282]  unpoison_memory+0x4f0/0x5b0
      [  113.961678][ T3282]  hwpoison_unpoison+0x30/0x48 [hwpoison_inject]
      [  113.962436][ T3282]  simple_attr_write_xsigned.isra.34+0xec/0x1cc
      [  113.963183][ T3282]  simple_attr_write+0x38/0x48
      [  113.963750][ T3282]  debugfs_attr_write+0x54/0x80
      [  113.964330][ T3282]  full_proxy_write+0x68/0x98
      [  113.964880][ T3282]  vfs_write+0xdc/0x4d0
      [  113.965372][ T3282]  ksys_write+0x78/0x100
      [  113.965875][ T3282]  __arm64_sys_write+0x24/0x30
      [  113.966440][ T3282]  invoke_syscall+0x7c/0x104
      [  113.966984][ T3282]  el0_svc_common.constprop.1+0x88/0x104
      [  113.967652][ T3282]  do_el0_svc+0x2c/0x38
      [  113.968893][ T3282]  el0_svc+0x3c/0x1b8
      [  113.969379][ T3282]  el0t_64_sync_handler+0x98/0xbc
      [  113.969980][ T3282]  el0t_64_sync+0x19c/0x1a0
      [  113.970511][ T3282] ---[ end trace 0000000000000000 ]---
      
      To fix this, clear the page tag reference after the page got isolated
      and accounted for.
      
      Link: https://lkml.kernel.org/r/20240825163649.33294-1-hao.ge@linux.dev
      Fixes: d224eb02 ("codetag: debug: mark codetags for reserved pages as empty")
      Signed-off-by: default avatarHao Ge <gehao@kylinos.cn>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hao Ge <gehao@kylinos.cn>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>	[6.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5e9784e9
    • Mike Yuan's avatar
      mm/memcontrol: respect zswap.writeback setting from parent cg too · e3992573
      Mike Yuan authored
      Currently, the behavior of zswap.writeback wrt.  the cgroup hierarchy
      seems a bit odd.  Unlike zswap.max, it doesn't honor the value from parent
      cgroups.  This surfaced when people tried to globally disable zswap
      writeback, i.e.  reserve physical swap space only for hibernation [1] -
      disabling zswap.writeback only for the root cgroup results in subcgroups
      with zswap.writeback=1 still performing writeback.
      
      The inconsistency became more noticeable after I introduced the
      MemoryZSwapWriteback= systemd unit setting [2] for controlling the knob.
      The patch assumed that the kernel would enforce the value of parent
      cgroups.  It could probably be workarounded from systemd's side, by going
      up the slice unit tree and inheriting the value.  Yet I think it's more
      sensible to make it behave consistently with zswap.max and friends.
      
      [1] https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate#Disable_zswap_writeback_to_use_the_swap_space_only_for_hibernation
      [2] https://github.com/systemd/systemd/pull/31734
      
      Link: https://lkml.kernel.org/r/20240823162506.12117-1-me@yhndnzj.com
      Fixes: 501a06fe ("zswap: memcontrol: implement zswap writeback disabling")
      Signed-off-by: default avatarMike Yuan <me@yhndnzj.com>
      Reviewed-by: default avatarNhat Pham <nphamcs@gmail.com>
      Acked-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e3992573
    • Marc Zyngier's avatar
      scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum · a3f6a89c
      Marc Zyngier authored
      Richard reports that since 772dd034 ("mm: enumerate all gfp flags"),
      gfp-translate is broken, as the bit numbers are implicit, leaving the
      shell script unable to extract them.  Even more, some bits are now at a
      variable location, making it double extra hard to parse using a simple
      shell script.
      
      Use a brute-force approach to the problem by generating a small C stub
      that will use the enum to dump the interesting bits.
      
      As an added bonus, we are now able to identify invalid bits for a given
      configuration.  As an added drawback, we cannot parse include files that
      predate this change anymore.  Tough luck.
      
      Link: https://lkml.kernel.org/r/20240823163850.3791201-1-maz@kernel.org
      Fixes: 772dd034 ("mm: enumerate all gfp flags")
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Reported-by: default avatarRichard Weinberger <richard@nod.at>
      Cc: Petr Tesařík <petr@tesarici.cz>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a3f6a89c
    • Usama Arif's avatar
      Revert "mm: skip CMA pages when they are not available" · bfe0857c
      Usama Arif authored
      This reverts commit 5da226db ("mm: skip CMA pages when they are not
      available") and b7108d66 ("Multi-gen LRU: skip CMA pages when they are
      not eligible").
      
      lruvec->lru_lock is highly contended and is held when calling
      isolate_lru_folios.  If the lru has a large number of CMA folios
      consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
      isolate_lru_folios can hold the lock for a very long time while it skips
      those.  For FIO workload, ~150million order=0 folios were skipped to
      isolate a few ZONE_DMA folios [1].  This can cause lockups [1] and high
      memory pressure for extended periods of time [2].
      
      Remove skipping CMA for MGLRU as well, as it was introduced in sort_folio
      for the same resaon as 5da226db.
      
      [1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
      [2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/
      
      [usamaarif642@gmail.com: also revert b7108d66, per Johannes]
        Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
        Link: https://lkml.kernel.org/r/357ac325-4c61-497a-92a3-bdbd230d5ec9@gmail.com
      Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
      Fixes: 5da226db ("mm: skip CMA pages when they are not available")
      Signed-off-by: default avatarUsama Arif <usamaarif642@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
      Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bfe0857c
    • Liam R. Howlett's avatar
      maple_tree: remove rcu_read_lock() from mt_validate() · f806de88
      Liam R. Howlett authored
      The write lock should be held when validating the tree to avoid updates
      racing with checks.  Holding the rcu read lock during a large tree
      validation may also cause a prolonged rcu read window and "rcu_preempt
      detected stalls" warnings.
      
      Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
      Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Reported-by: syzbot+036af2f0c7338a33b0cd@syzkaller.appspotmail.com
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f806de88
    • Petr Tesarik's avatar
      kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y · 6dacd79d
      Petr Tesarik authored
      Fix the condition to exclude the elfcorehdr segment from the SHA digest
      calculation.
      
      The j iterator is an index into the output sha_regions[] array, not into
      the input image->segment[] array.  Once it reaches
      image->elfcorehdr_index, all subsequent segments are excluded.  Besides,
      if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
      may be wrongly included in the calculation.
      
      Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
      Fixes: f7cc804a ("kexec: exclude elfcorehdr from the segment digest")
      Signed-off-by: default avatarPetr Tesarik <ptesarik@suse.com>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: Eric DeVolder <eric_devolder@yahoo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6dacd79d