1. 08 Aug, 2024 21 commits
    • Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 9466b6ae
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Have reading of event format files test if the metadata still exists.
      
         When an event is freed, a flag (EVENT_FILE_FL_FREED) is set in the
         metadata to prevent any new references to it from being taken while
         waiting for existing references to close. When the last reference
         closes, the metadata is freed. But the "format" file (along with
         some other files) was missing a check of this flag, which allowed
         new references to be taken and a use-after-free bug to occur.
      
       - Have the trace event metadata use the refcount infrastructure
         instead of relying on its own atomic counters.
      
       - Have tracefs inodes use alloc_inode_sb() for allocation instead of
         using kmem_cache_alloc() directly.
      
       - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
         callers expect a real object or an ERR_PTR.
      
       - Have release_ei() use call_srcu() and not call_rcu() as all the
         protection is on SRCU and not RCU.
      
       - Fix ftrace_graph_ret_addr() to use the task passed in and not
         current.
      
       - Fix overflow bug in get_free_elt() where the counter can overflow the
         integer and cause an infinite loop.
      
       - Remove unused function ring_buffer_nr_pages()
      
       - Have tracefs freeing use the inode RCU infrastructure instead of
         creating its own.
      
         When the kernel had structure layout randomization enabled, the rcu
         field of the tracefs_inode was overlapping the rcu field of the inode
         structure, and corrupting it. Instead, use the destroy_inode()
         callback to do the initial cleanup, and then have free_inode() free
         it.
      
      * tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracefs: Use generic inode RCU for synchronizing freeing
        ring-buffer: Remove unused function ring_buffer_nr_pages()
        tracing: Fix overflow in get_free_elt()
        function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
        eventfs: Use SRCU for freeing eventfs_inodes
        eventfs: Don't return NULL in eventfs_create_dir()
        tracefs: Fix inode allocation
        tracing: Use refcount for trace_event_file reference counter
        tracing: Have format file honor EVENT_FILE_FL_FREED
    • Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs · b3f5620f
      Linus Torvalds authored
      Pull bcachefs fixes from Kent Overstreet:
       "Assorted little stuff:
      
         - lockdep fixup for lockdep_set_notrack_class()
      
         - we can now remove a device when using erasure coding without
           deadlocking, though we still hit other issues
      
         - the 'allocator stuck' timeout is now configurable, and messages are
           ratelimited. The default timeout has been increased from 10 seconds
           to 30"
      
      * tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
        bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
        bcachefs: Make allocator stuck timeout configurable, ratelimit messages
        bcachefs: Add missing path_traverse() to btree_iter_next_node()
        bcachefs: ec should not allocate from ro devs
        bcachefs: Improved allocator debugging for ec
        bcachefs: Add missing bch2_trans_begin() call
        bcachefs: Add a comment for bucket helper types
        bcachefs: Don't rely on implicit unsigned -> signed integer conversion
        lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
        bcachefs: Fix double free of ca->buckets_nouse
    • module: warn about excessively long module waits · cb5b81bc
      Linus Torvalds authored
      Russell King reported that the arm cbc(aes) crypto module hangs when
      loaded, and Herbert Xu bisected it to commit 9b9879fc ("modules:
      catch concurrent module loads, treat them as idempotent"), and noted:
      
       "So what's happening here is that the first modprobe tries to load a
        fallback CBC implementation, in doing so it triggers a load of the
        exact same module due to module aliases.
      
        IOW we're loading aes-arm-bs which provides cbc(aes). However, this
        needs a fallback of cbc(aes) to operate, which is made out of the
        generic cbc module + any implementation of aes, or ecb(aes). The
        latter happens to also be provided by aes-arm-bs so that's why it
        tries to load the same module again"
      
      So loading the aes-arm-bs module ends up wanting to recursively load
      itself, and the recursive load then ends up waiting for the original
      module load to complete.
      
      This is a regression, in that it used to be that we just tried to load
      the module multiple times, and then as we went on to install it the
      second time we would instead just error out because the module name
      already existed.
      
      That is actually also exactly what the original "catch concurrent loads"
      patch did in commit 9828ed3f ("module: error out early on concurrent
      load of the same module file"), but it turns out that it ends up being
      racy, in that erroring out before the module has been fully initialized
      will cause failures in dependent module loading.
      
      See commit ac2263b5 (which was the revert of that "error out early"
      commit) for details about why erroring out before the module has been
      initialized is actually fundamentally racy.
      
      Now, for the actual recursive module load (as opposed to just
      concurrently loading the same module twice), the race is not an issue.
      
      At the same time it's hard for the kernel to see that this is recursion,
      because the module load is always done from a usermode helper, so the
      recursion is not some simple callchain within the kernel.
      
      End result: this is not the real fix, but this at least adds a warning
      for the situation (admittedly much too late for all the debugging pain
      that Russell and Herbert went through) and if we can come to a
      resolution on how to detect the recursion properly, this re-organizes
      the code to make that easier.
      
      Link: https://lore.kernel.org/all/ZrFHLqvFqhzykuYw@shell.armlinux.org.uk/
      Reported-by: Russell King <linux@armlinux.org.uk>
      Debugged-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'loongarch-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson · cf6d429e
      Linus Torvalds authored
      Pull LoongArch fixes from Huacai Chen:
       "Enable general EFI poweroff method to make poweroff usable on
        hardware which lacks ACPI S5, use accessors to page table entries
        instead of direct dereference to avoid potential problems, and two
        trivial kvm cleanups"
      
      * tag 'loongarch-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: KVM: Remove undefined a6 argument comment for kvm_hypercall()
        LoongArch: KVM: Remove unnecessary definition of KVM_PRIVATE_MEM_SLOTS
        LoongArch: Use accessors to page table entries instead of direct dereference
        LoongArch: Enable general EFI poweroff method
    • Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · 660e4b18
      Linus Torvalds authored
      Pull misc fixes from Andrew Morton:
       "Nine hotfixes. Five are cc:stable, the others either pertain to
        post-6.10 material or aren't considered necessary for earlier kernels.
      
        Five are MM and four are non-MM. No identifiable theme here - please
        see the individual changelogs"
      
      * tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        padata: Fix possible divide-by-0 panic in padata_mt_helper()
        mailmap: update entry for David Heidelberg
        memcg: protect concurrent access to mem_cgroup_idr
        mm: shmem: fix incorrect aligned index when checking conflicts
        mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem
        mm: list_lru: fix UAF for memory cgroup
        kcov: properly check for softirq context
        MAINTAINERS: Update LTP members and web
        selftests: mm: add s390 to ARCH check
    • padata: Fix possible divide-by-0 panic in padata_mt_helper() · 6d45e1c9
      Waiman Long authored
      We are hit with a not easily reproducible divide-by-0 panic in padata.c at
      bootup time.
      
        [   10.017908] Oops: divide error: 0000 [#1] PREEMPT SMP NOPTI
        [   10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1
        [   10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021
        [   10.017908] Workqueue: events_unbound padata_mt_helper
        [   10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0
          :
        [   10.017963] Call Trace:
        [   10.017968]  <TASK>
        [   10.018004]  ? padata_mt_helper+0x39/0xb0
        [   10.018084]  process_one_work+0x174/0x330
        [   10.018093]  worker_thread+0x266/0x3a0
        [   10.018111]  kthread+0xcf/0x100
        [   10.018124]  ret_from_fork+0x31/0x50
        [   10.018138]  ret_from_fork_asm+0x1a/0x30
        [   10.018147]  </TASK>
      
      Looking at the padata_mt_helper() function, the only way a divide-by-0
      panic can happen is when ps->chunk_size is 0.  The way that chunk_size is
      initialized in padata_do_multithreaded(), chunk_size can be 0 when the
      min_chunk in the passed-in padata_mt_job structure is 0.
      
      Fix this divide-by-0 panic by making sure that chunk_size will be at least
      1 no matter what the input parameters are.
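
The clamp described above can be sketched in plain userspace C. This is a minimal illustration, not the kernel code: `struct mt_job`, `compute_chunk_size()` and `chunks_needed()` are hypothetical names, and the real sizing formula in padata_do_multithreaded() has more terms than shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a multithreaded job description. */
struct mt_job {
    size_t size;       /* total amount of work */
    size_t min_chunk;  /* smallest chunk a worker should take; may be 0 */
};

static size_t max_size(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Compute a per-worker chunk size that can never be 0. */
static size_t compute_chunk_size(const struct mt_job *job, size_t nworks)
{
    assert(nworks > 0);

    size_t chunk = job->size / nworks;        /* 0 when size < nworks */
    chunk = max_size(chunk, job->min_chunk);  /* still 0 if min_chunk is 0 */
    return max_size(chunk, 1);                /* clamp: at least 1 */
}

/* Any later division by the chunk size is now safe. */
static size_t chunks_needed(const struct mt_job *job, size_t nworks)
{
    size_t chunk = compute_chunk_size(job, nworks);

    return (job->size + chunk - 1) / chunk;
}
```

With `size = 3`, `min_chunk = 0` and 8 workers, the unclamped value would be 0; the final `max_size(chunk, 1)` is what keeps the division in `chunks_needed()` well defined.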
      
      Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com
      Fixes: 004ed426 ("padata: add basic support for multithreaded jobs")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update entry for David Heidelberg · f2087995
      David Heidelberg authored
      Link my old gmail address to my active email.
      
      Link: https://lkml.kernel.org/r/20240804054704.859503-1-david@ixit.cz
      Signed-off-by: David Heidelberg <david@ixit.cz>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcg: protect concurrent access to mem_cgroup_idr · 9972605a
      Shakeel Butt authored
      Commit 73f576c0 ("mm: memcontrol: fix cgroup creation failure after
      many small jobs") decoupled the memcg IDs from the CSS ID space to fix the
      cgroup creation failures.  It introduced IDR to maintain the memcg ID
      space.  The IDR depends on external synchronization mechanisms for
      modifications.  For the mem_cgroup_idr, the idr_alloc() and idr_replace()
      happen within css callback and thus are protected through cgroup_mutex
      from concurrent modifications.  However, idr_remove() for mem_cgroup_idr
      was not protected against concurrency and can run concurrently for
      different memcgs when their refcnt drops to zero.  Fix that.
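
The locking rule described above (every modification of the ID space, including removal, goes through the same lock) can be sketched as a userspace analogy. Everything here is an assumption for illustration: the toy table, the spinlock built on C11 `atomic_flag`, and the names `id_alloc()` and `id_remove()` are not the kernel's IDR API, and the kernel patch itself is not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_IDS 32

/* Toy ID table standing in for an IDR; it has no internal locking. */
static bool id_used[MAX_IDS];
static atomic_flag id_lock = ATOMIC_FLAG_INIT;

static void ids_lock(void)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&id_lock, memory_order_acquire))
        ;
}

static void ids_unlock(void)
{
    atomic_flag_clear_explicit(&id_lock, memory_order_release);
}

/* Allocate the lowest free ID >= 1, or return -1 when the table is full. */
static int id_alloc(void)
{
    int id = -1;

    ids_lock();
    for (int i = 1; i < MAX_IDS; i++) {
        if (!id_used[i]) {
            id_used[i] = true;
            id = i;
            break;
        }
    }
    ids_unlock();
    return id;
}

/* Removal takes the same lock as allocation, so the two cannot interleave. */
static void id_remove(int id)
{
    assert(id > 0 && id < MAX_IDS);

    ids_lock();
    id_used[id] = false;
    ids_unlock();
}
```

Dropping the lock from `id_remove()` only would recreate the situation the commit describes: two removals, or a removal and an allocation, touching the table at the same time.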
      
      We have been seeing list_lru based kernel crashes at a low frequency in
      our fleet for a long time.  These crashes were in different parts of
      list_lru code including list_lru_add(), list_lru_del() and reparenting
      code.  Upon further inspection, it looked like for a given object (dentry
      and inode), the super_block's list_lru didn't have list_lru_one for the
      memcg of that object.  The initial suspicions were either the object is
      not allocated through kmem_cache_alloc_lru() or somehow
      memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but
      returned success.  No evidence was found for these cases.
      
      Looking more deeply, we started seeing situations where a valid memcg's id
      is not present in mem_cgroup_idr and in some cases multiple valid memcgs
      have the same id and mem_cgroup_idr is pointing to one of them.  So, the most
      reasonable explanation is that these situations can happen due to a race
      between multiple idr_remove() calls or a race between
      idr_alloc()/idr_replace() and idr_remove().  These races are causing
      multiple memcgs to acquire the same ID and then offlining of one of them
      would clean up list_lrus on the system for all of them.  Later access from
      other memcgs to the list_lru causes crashes due to the missing list_lru_one.
      
      Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev
      Fixes: 73f576c0 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shmem: fix incorrect aligned index when checking conflicts · 4cbf320b
      Baolin Wang authored
      In the shmem_suitable_orders() function, xa_find() is used to check for
      conflicts in the pagecache to select suitable huge orders.  However, when
      checking each huge order in every loop, the aligned index is calculated
      from the previous iteration, which may cause suitable huge orders to be
      missed.
      
      We should use the original index each time in the loop to calculate a new
      aligned index for checking conflicts to avoid this issue.
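
The difference between reusing the previous iteration's aligned index and recomputing it from the original index can be shown with a toy model. This is a sketch under stated assumptions: a fixed-size `occupied[]` array replaces the page cache and xa_find(), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 64

/* Toy page cache: occupied[i] is true when index i already holds a page. */
static bool range_has_conflict(const bool occupied[NSLOTS],
                               size_t start, size_t npages)
{
    for (size_t i = start; i < start + npages && i < NSLOTS; i++)
        if (occupied[i])
            return true;
    return false;
}

/*
 * Return the largest order in [0, max_order] whose naturally aligned range
 * around 'index' is free, or -1. The aligned index is derived from the
 * original index on every iteration.
 */
static int suitable_order(const bool occupied[NSLOTS], size_t index,
                          int max_order)
{
    for (int order = max_order; order >= 0; order--) {
        size_t npages = (size_t)1 << order;
        size_t aligned = index & ~(npages - 1);

        if (!range_has_conflict(occupied, aligned, npages))
            return order;
    }
    return -1;
}

/* Contrast: the index is overwritten, so each smaller order starts from
 * the previous iteration's aligned value and may test the wrong range. */
static int suitable_order_carried(const bool occupied[NSLOTS], size_t index,
                                  int max_order)
{
    for (int order = max_order; order >= 0; order--) {
        size_t npages = (size_t)1 << order;

        index &= ~(npages - 1);
        if (!range_has_conflict(occupied, index, npages))
            return order;
    }
    return -1;
}
```

With a single occupied slot at index 1 and a lookup at index 5, the carried variant keeps testing ranges that start at 0 and settles for order 0, while the recomputing variant finds that the order-2 range 4..7 is free.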
      
      Link: https://lkml.kernel.org/r/07433b0f16a152bffb8cee34934a5c040e8e2ad6.1722404078.git.baolin.wang@linux.alibaba.com
      Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem · b66b1b71
      Baolin Wang authored
      Similar to commit d659b715 ("mm/huge_memory: avoid PMD-size page
      cache if needed"), ARM64 can support 512MB PMD-sized THP when the base
      page size is 64KB, which is larger than the maximum supported page cache
      size MAX_PAGECACHE_ORDER.
      
      This is not expected.  To fix this issue, use THP_ORDERS_ALL_FILE_DEFAULT
      for shmem to filter allowable huge orders.
      
      [baolin.wang@linux.alibaba.com: remove comment, per Barry]
        Link: https://lkml.kernel.org/r/c55d7ef7-78aa-4ed6-b897-c3e03a3f3ab7@linux.alibaba.com
      [wangkefeng.wang@huawei.com: remove local `orders']
        Link: https://lkml.kernel.org/r/87769ae8-b6c6-4454-925d-1864364af9c8@huawei.com
      Link: https://lkml.kernel.org/r/117121665254442c3c7f585248296495e5e2b45c.1722404078.git.baolin.wang@linux.alibaba.com
      Fixes: e7a2ab7b ("mm: shmem: add mTHP support for anonymous shmem")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: list_lru: fix UAF for memory cgroup · 5161b487
      Muchun Song authored
      mem_cgroup_from_slab_obj() is supposed to be called under the RCU read
      lock, cgroup_mutex, or another lock that prevents the returned memcg from
      being freed.  Fix it by adding the missing RCU read lock.
      
      Found by code inspection.
      
      [songmuchun@bytedance.com: only grab rcu lock when necessary, per Vlastimil]
        Link: https://lkml.kernel.org/r/20240801024603.1865-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20240718083607.42068-1-songmuchun@bytedance.com
      Fixes: 0a97c01c ("list_lru: allow explicit memcg and NUMA node selection")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kcov: properly check for softirq context · 7d4df2da
      Andrey Konovalov authored
      When collecting coverage from softirqs, KCOV uses in_serving_softirq() to
      check whether the code is running in the softirq context.  Unfortunately,
      in_serving_softirq() is > 0 even when the code is running in the hardirq
      or NMI context for hardirqs and NMIs that happened during a softirq.
      
      As a result, if a softirq handler contains a remote coverage collection
      section and a hardirq with another remote coverage collection section
      happens during handling the softirq, KCOV incorrectly detects a nested
      softirq coverage collection section and prints a WARNING, as reported by
      syzbot.
      
      This issue was exposed by commit a7f3813e ("usb: gadget: dummy_hcd:
      Switch to hrtimer transfer scheduler"), which switched dummy_hcd to using
      hrtimer and made the timer's callback be executed in the hardirq context.
      
      Change the related checks in KCOV to account for this behavior of
      in_serving_softirq() and make KCOV ignore remote coverage collection
      sections in the hardirq and NMI contexts.
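
The stricter check described above can be modelled in userspace with a small bitmask. This is only a sketch: the `CTX_*` bits and all function names are illustrative assumptions and do not reflect the kernel's preempt-count layout or the helper actually added to KCOV.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative context bits; the layout is an assumption for this sketch. */
#define CTX_SOFTIRQ (1u << 0)
#define CTX_HARDIRQ (1u << 1)
#define CTX_NMI     (1u << 2)

/* Stays true while a hardirq or NMI interrupts a softirq, which mirrors
 * the behaviour of in_serving_softirq() described in the text. */
static bool serving_softirq(unsigned int ctx)
{
    return (ctx & CTX_SOFTIRQ) != 0;
}

static bool in_hardirq_ctx(unsigned int ctx)
{
    return (ctx & CTX_HARDIRQ) != 0;
}

static bool in_nmi_ctx(unsigned int ctx)
{
    return (ctx & CTX_NMI) != 0;
}

/* True only when the code runs in the softirq itself, not in a hardirq
 * or NMI that happened to arrive during the softirq. */
static bool really_in_softirq(unsigned int ctx)
{
    return serving_softirq(ctx) && !in_hardirq_ctx(ctx) && !in_nmi_ctx(ctx);
}
```

A hardirq nested in a softirq has both `CTX_SOFTIRQ` and `CTX_HARDIRQ` set, so `serving_softirq()` alone reports true while `really_in_softirq()` reports false.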
      
      This prevents the WARNING printed by syzbot but does not fix the inability
      of KCOV to collect coverage from the __usb_hcd_giveback_urb when dummy_hcd
      is in use (caused by a7f3813e); a separate patch is required for that.
      
      Link: https://lkml.kernel.org/r/20240729022158.92059-1-andrey.konovalov@linux.dev
      Fixes: 5ff3b30a ("kcov: collect coverage from interrupts")
      Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reported-by: syzbot+2388cdaeb6b10f0c13ac@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2388cdaeb6b10f0c13ac
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Marcello Sylvester Bauer <sylv@sylv.io>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: Update LTP members and web · 37bf7fbe
      Petr Vorel authored
      The LTP project now uses a readthedocs.org instance instead of the GitHub wiki.
      
      LTP maintainers are listed in alphabetical order.
      
      Link: https://lkml.kernel.org/r/20240726072009.1021599-1-pvorel@suse.cz
      Signed-off-by: Petr Vorel <pvorel@suse.cz>
      Reviewed-by: Li Wang <liwang@redhat.com>
      Reviewed-by: Cyril Hrubis <chrubis@suse.cz>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Xiao Yang <yangx.jy@fujitsu.com>
      Cc: Yang Xu <xuyang2018.jy@fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: add s390 to ARCH check · 30b651c8
      Nico Pache authored
      Commit 0518dbe9 ("selftests/mm: fix cross compilation with LLVM")
      changed the env variable for the architecture from MACHINE to ARCH.
      
      This prevents 3 required TEST_GEN_FILES from being included when
      cross compiling for s390x and causes errors when trying to run the test
      suite.  This is due to the ARCH variable already being set and the arch
      folder name being s390.
      
      Add "s390" to the filtered list to cover this case and have the 3 files
      included in the build.
      
      Link: https://lkml.kernel.org/r/20240724213517.23918-1-npache@redhat.com
      Fixes: 0518dbe9 ("selftests/mm: fix cross compilation with LLVM")
      Signed-off-by: Nico Pache <npache@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • bcachefs: Use bch2_wait_on_allocator() in btree node alloc path · 73dc1656
      Kent Overstreet authored
      If the allocator gets stuck, we need to know why.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Make allocator stuck timeout configurable, ratelimit messages · cecf7279
      Kent Overstreet authored
      Limit these messages to once every 2 minutes to avoid spamming logs;
      with multiple devices the output can be quite significant.
      
      Also, up the default timeout to 30 seconds from 10 seconds.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Add missing path_traverse() to btree_iter_next_node() · 6d496e02
      Kent Overstreet authored
      This fixes a bug exposed by the next patch - we pop an assert in
      path_set_should_be_locked().
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • tracefs: Use generic inode RCU for synchronizing freeing · 0b6743bd
      Steven Rostedt authored
      With structure layout randomization enabled for 'struct inode' we need to
      avoid overlapping any of the RCU-used / initialized-only-once members,
      e.g. i_lru or i_sb_list to not corrupt related list traversals when making
      use of the rcu_head.
      
      For an unlucky structure layout of 'struct inode' we may end up with the
      following splat when running the ftrace selftests:
      
      [<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object])
      [<...>] ------------[ cut here ]------------
      [<...>] kernel BUG at lib/list_debug.c:54!
      [<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G                 N  6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65
      [<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
      [<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0
      [<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f
      [<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283
      [<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000
      [<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001
      [<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25
      [<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d
      [<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000
      [<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object]
      [<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0
      [<...>] RSI: __func__.47+0x4340/0x4400
      [<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object]
      [<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550]
      [<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550]
      [<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550]
      [<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object]
      [<...>] FS:  00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000
      [<...>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0
      [<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [<...>] ASID: 0003
      [<...>] Stack:
      [<...>]  ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0
      [<...>]  ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f
      [<...>]  0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392
      [<...>] Call Trace:
      [<...>]  <TASK>
      [<...>]  [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0
      [<...>]  [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48
      [<...>]  [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88
      [<...>]  [<ffffffff8940fd19>] ? _atomic_dec_and_lock+0x119/0x200 fffffe80416afb90
      [<...>]  [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8
      [<...>]  [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8
      [<...>]  [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40
      [<...>]  [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48
      [<...>]  [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70
      [<...>]  [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78
      [<...>]  [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80
      [<...>]  [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8
      [<...>]  [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0
      [<...>]  [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8
      [<...>]  [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18
      [<...>]  [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38
      [<...>]  [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70
      [<...>]  [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90
      [<...>]  [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0
      [<...>]  [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28
      [<...>]  [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60
      [<...>]  [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88
      [<...>]  [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8
      [<...>]  [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8
      [<...>]  [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0
      [<...>]  [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30
      [<...>]  [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70
      [<...>]  [<ffffffff819f8818>] ? syscall_trace_enter+0x218/0x380 fffffe80416aff88
      [<...>]  [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8
      [<...>]  [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8
      [<...>]  [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8
      [<...>]  </TASK>
      [<...>]  <PTREGS>
      [<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
      [<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48
      [<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
      [<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a
      [<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70
      [<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0
      [<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000
      [<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0
      [<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
      [<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read|write|mayread|maywrite|growsdown|account)]+0x0/0xb8 [userland map]
      [<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
      [<...>]  </PTREGS>
      [<...>] Modules linked in:
      [<...>] ---[ end trace 0000000000000000 ]---
      
      The list debug message as well as RBX's symbolic value point out that the
      object in question was allocated from 'tracefs_inode_cache' and that the
      list's '->next' member is at offset 0. Dumping the layout of the relevant
      parts of 'struct tracefs_inode' gives the following:
      
        struct tracefs_inode {
          union {
            struct inode {
              struct list_head {
                struct list_head * next;                    /*     0     8 */
                struct list_head * prev;                    /*     8     8 */
              } i_lru;
              [...]
            } vfs_inode;
            struct callback_head {
              void (*func)(struct callback_head *);         /*     0     8 */
              struct callback_head * next;                  /*     8     8 */
            } rcu;
          };
          [...]
        };
      
      The above shows that 'vfs_inode.i_lru' overlaps with 'rcu' which will
      destroy the 'i_lru' list as soon as the 'rcu' member gets used, e.g. in
      call_rcu() or later when calling the RCU callback. This will disturb
      concurrent list traversals as well as object reuse which assumes these
      list heads will keep their integrity.
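
      The overlap can be illustrated outside the kernel. The following is
      a minimal user-space sketch, not kernel code: 'mini_inode',
      'mini_tracefs_inode', 'noop_cb' and 'rcu_use_corrupts_lru' are
      hypothetical names, and the member order simply mirrors the layout
      dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins for the kernel types; only the layout matters here. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

struct callback_head {
	void (*func)(struct callback_head *);	/* offset 0 in the dump */
	struct callback_head *next;		/* offset 8 in the dump */
};

/* Hypothetical miniature of the randomized layout: i_lru landed first. */
struct mini_inode {
	struct list_head i_lru;
	int i_mode;
};

struct mini_tracefs_inode {
	union {
		struct mini_inode vfs_inode;
		struct callback_head rcu;
	};
};

void noop_cb(struct callback_head *head)
{
	(void)head;
}

/* Returns 1 if using the rcu member changed the bytes of i_lru. */
int rcu_use_corrupts_lru(void)
{
	struct mini_tracefs_inode ti;
	struct list_head before;

	/* Equivalent of INIT_LIST_HEAD(): an empty list points at itself. */
	ti.vfs_inode.i_lru.next = &ti.vfs_inode.i_lru;
	ti.vfs_inode.i_lru.prev = &ti.vfs_inode.i_lru;
	before = ti.vfs_inode.i_lru;

	/* Roughly what queuing an RCU callback does to the head. */
	ti.rcu.next = NULL;
	ti.rcu.func = noop_cb;

	/* Compare raw bytes: both list pointers have been overwritten. */
	return memcmp(&before, &ti.vfs_inode.i_lru, sizeof(before)) != 0;
}
```

      In this model rcu_use_corrupts_lru() reports the corruption as soon
      as the 'rcu' member is written, which matches the failure mode
      described above.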
      
      To reproduce, the following diff manually overlays 'i_lru' with
      'rcu'; otherwise one would need a good deal of luck to hit a
      RANDSTRUCT seed that produces this layout:
      
        --- a/include/linux/fs.h
        +++ b/include/linux/fs.h
        @@ -629,6 +629,7 @@ struct inode {
         	umode_t			i_mode;
         	unsigned short		i_opflags;
         	kuid_t			i_uid;
        +	struct list_head	i_lru;		/* inode LRU list */
         	kgid_t			i_gid;
         	unsigned int		i_flags;
      
        @@ -690,7 +691,6 @@ struct inode {
         	u16			i_wb_frn_avg_time;
         	u16			i_wb_frn_history;
         #endif
        -	struct list_head	i_lru;		/* inode LRU list */
         	struct list_head	i_sb_list;
         	struct list_head	i_wb_list;	/* backing dev writeback list */
         	union {
      
      The tracefs inode does not need to supply its own RCU delayed destruction
      of its inode. The inode code itself offers both a "destroy_inode()"
      callback that gets called when the last reference of the inode is
      released, and the "free_inode()" callback, which is called after an RCU
      synchronization period following "destroy_inode()".
      
      The tracefs code can unlink the inode from its list in the destroy_inode()
      callback, and then simply free it from the free_inode() callback. This
      should provide the same protection.
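
      The two-phase teardown described above can be modelled in user
      space. This is only a sketch under stated assumptions: the
      'model_*' names are hypothetical, the grace period is simulated by
      an explicit call, and the deferred-free link is a separate member
      so that it never overlaps the list head.

```c
#include <assert.h>
#include <stdlib.h>

struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

int list_len(const struct list_head *h)
{
	int n = 0;

	for (const struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/* Hypothetical model object: an inode on a filesystem-private list. */
struct model_inode {
	struct list_head list;
	struct model_inode *next_deferred;	/* not shared with 'list' */
};

struct list_head model_inodes = { &model_inodes, &model_inodes };
struct model_inode *deferred;	/* objects waiting for the grace period */

struct model_inode *model_alloc_inode(void)
{
	struct model_inode *ti = calloc(1, sizeof(*ti));

	if (ti)
		list_add(&ti->list, &model_inodes);
	return ti;
}

/* Phase 1, modelled on destroy_inode(): unlink now, defer the free. */
void model_destroy_inode(struct model_inode *ti)
{
	list_del(&ti->list);
	ti->next_deferred = deferred;
	deferred = ti;
}

/* Phase 2, modelled on free_inode(): runs after the grace period. */
int model_run_deferred_frees(void)
{
	int freed = 0;

	while (deferred) {
		struct model_inode *ti = deferred;

		deferred = ti->next_deferred;
		free(ti);
		freed++;
	}
	return freed;
}
```

      The point of the split is that the object leaves the list
      immediately, while its memory stays valid until every reader that
      might still hold a pointer has finished.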
      
      Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Ajay Kaher <ajay.kaher@broadcom.com>
      Cc: Ilkka Naulapää <digirigawa@gmail.com>
      Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home
      Fixes: baa23a8d ("tracefs: Reset permissions on remount if permissions are options")
      Reported-by: Mathias Krause <minipli@grsecurity.net>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      0b6743bd
    • ring-buffer: Remove unused function ring_buffer_nr_pages() · 58f7e4d7
      Jianhui Zhou authored
      ring_buffer_nr_pages() is not an inline function, and its users access
      buffer->buffers[cpu]->nr_pages directly, so nothing calls it anymore.
      Remove the function.
      Signed-off-by: Jianhui Zhou <912460177@qq.com>
      Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      58f7e4d7
    • tracing: Fix overflow in get_free_elt() · bcf86c01
      Tze-nan Wu authored
      "tracing_map->next_elt" in get_free_elt() is at risk of overflowing.
      
      Once it overflows, new elements can still be inserted into the tracing_map
      even though the maximum number of elements (`max_elts`) has been reached.
      Continuing to insert elements after the overflow could result in the
      tracing_map containing "tracing_map->map_size" elements, leaving no empty
      entries.
      If any attempt is made to insert an element into a full tracing_map using
      `__tracing_map_insert()`, it will cause an infinite loop with preemption
      disabled, leading to a CPU hang problem.
      
      Fix this by preventing any further increments to "tracing_map->next_elt"
      once it reaches "tracing_map->max_elts".
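
      The difference between an unbounded and a saturating counter can be
      sketched in user space with C11 atomics. This is an illustration
      under stated assumptions, not the kernel change itself: 'mini_map',
      'claim_elt_naive' and 'claim_elt_saturating' are hypothetical
      names, and the counter is unsigned so that the wrap-around is well
      defined.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical cut-down map: only the two fields the text talks about. */
struct mini_map {
	atomic_uint next_elt;	/* index of the next free element */
	unsigned int max_elts;	/* capacity of the element pool */
};

void mini_map_init(struct mini_map *map, unsigned int start,
		   unsigned int max_elts)
{
	atomic_init(&map->next_elt, start);
	map->max_elts = max_elts;
}

/*
 * Naive claim: the counter keeps growing on a full map, so after enough
 * failed attempts it wraps and starts handing out used indexes again.
 */
int claim_elt_naive(struct mini_map *map)
{
	unsigned int idx = atomic_fetch_add(&map->next_elt, 1);

	return idx < map->max_elts ? (int)idx : -1;
}

/* Saturating claim: stop incrementing once the counter hits max_elts. */
int claim_elt_saturating(struct mini_map *map)
{
	unsigned int idx = atomic_load(&map->next_elt);

	while (idx < map->max_elts) {
		/* On failure, idx is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&map->next_elt, &idx,
						 idx + 1))
			return (int)idx;
	}
	return -1;	/* full, and the counter was left untouched */
}
```

      With the saturating variant the counter can never exceed
      'max_elts', so a full map stays full no matter how many further
      insert attempts are made.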
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
      Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
      Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
      Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      bcf86c01
    • function_graph: Fix the ret_stack used by ftrace_graph_ret_addr() · 604b72b3
      Petr Pavlu authored
      When ftrace_graph_ret_addr() is invoked to convert a found stack return
      address to its original value, the function can end up producing the
      following crash:
      
      [   95.442712] BUG: kernel NULL pointer dereference, address: 0000000000000028
      [   95.442720] #PF: supervisor read access in kernel mode
      [   95.442724] #PF: error_code(0x0000) - not-present page
      [   95.442727] PGD 0 P4D 0-
      [   95.442731] Oops: Oops: 0000 [#1] PREEMPT SMP PTI
      [   95.442736] CPU: 1 UID: 0 PID: 2214 Comm: insmod Kdump: loaded Tainted: G           OE K    6.11.0-rc1-default #1 67c62a3b3720562f7e7db5f11c1fdb40b7a2857c
      [   95.442747] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [K]=LIVEPATCH
      [   95.442750] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
      [   95.442754] RIP: 0010:ftrace_graph_ret_addr+0x42/0xc0
      [   95.442766] Code: [...]
      [   95.442773] RSP: 0018:ffff979b80ff7718 EFLAGS: 00010006
      [   95.442776] RAX: ffffffff8ca99b10 RBX: ffff979b80ff7760 RCX: ffff979b80167dc0
      [   95.442780] RDX: ffffffff8ca99b10 RSI: ffff979b80ff7790 RDI: 0000000000000005
      [   95.442783] RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000
      [   95.442786] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff8e9491e0
      [   95.442790] R13: ffffffff8d6f70f0 R14: ffff979b80167da8 R15: ffff979b80167dc8
      [   95.442793] FS:  00007fbf83895740(0000) GS:ffff8a0afdd00000(0000) knlGS:0000000000000000
      [   95.442797] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   95.442800] CR2: 0000000000000028 CR3: 0000000005070002 CR4: 0000000000370ef0
      [   95.442806] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   95.442809] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   95.442816] Call Trace:
      [   95.442823]  <TASK>
      [   95.442896]  unwind_next_frame+0x20d/0x830
      [   95.442905]  arch_stack_walk_reliable+0x94/0xe0
      [   95.442917]  stack_trace_save_tsk_reliable+0x7d/0xe0
      [   95.442922]  klp_check_and_switch_task+0x55/0x1a0
      [   95.442931]  task_call_func+0xd3/0xe0
      [   95.442938]  klp_try_switch_task.part.5+0x37/0x150
      [   95.442942]  klp_try_complete_transition+0x79/0x2d0
      [   95.442947]  klp_enable_patch+0x4db/0x890
      [   95.442960]  do_one_initcall+0x41/0x2e0
      [   95.442968]  do_init_module+0x60/0x220
      [   95.442975]  load_module+0x1ebf/0x1fb0
      [   95.443004]  init_module_from_file+0x88/0xc0
      [   95.443010]  idempotent_init_module+0x190/0x240
      [   95.443015]  __x64_sys_finit_module+0x5b/0xc0
      [   95.443019]  do_syscall_64+0x74/0x160
      [   95.443232]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [   95.443236] RIP: 0033:0x7fbf82f2c709
      [   95.443241] Code: [...]
      [   95.443247] RSP: 002b:00007fffd5ea3b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      [   95.443253] RAX: ffffffffffffffda RBX: 000056359c48e750 RCX: 00007fbf82f2c709
      [   95.443257] RDX: 0000000000000000 RSI: 000056356ed4efc5 RDI: 0000000000000003
      [   95.443260] RBP: 000056356ed4efc5 R08: 0000000000000000 R09: 00007fffd5ea3c10
      [   95.443263] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
      [   95.443267] R13: 000056359c48e6f0 R14: 0000000000000000 R15: 0000000000000000
      [   95.443272]  </TASK>
      [   95.443274] Modules linked in: [...]
      [   95.443385] Unloaded tainted modules: intel_uncore_frequency(E):1 isst_if_common(E):1 skx_edac(E):1
      [   95.443414] CR2: 0000000000000028
      
      The bug can be reproduced with kselftests:
      
       cd linux/tools/testing/selftests
       make TARGETS='ftrace livepatch'
       (cd ftrace; ./ftracetest test.d/ftrace/fgraph-filter.tc)
       (cd livepatch; ./test-livepatch.sh)
      
      The problem is that ftrace_graph_ret_addr() is supposed to operate on the
      ret_stack of a selected task but wrongly accesses the ret_stack of the
      current task. Specifically, the above NULL dereference occurs when
      task->curr_ret_stack is non-zero, but current->ret_stack is NULL.
      
      Correct ftrace_graph_ret_addr() to work with the right ret_stack.
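
      The distinction between the walked task and the current task can be
      shown with a small user-space model. Everything here is a
      hypothetical simplification: 'mini_task', 'current_task',
      'TRAMPOLINE' and 'orig_ret_addr' are illustrative names and do not
      reflect the real ret_stack layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for task_struct. */
struct mini_task {
	int curr_ret_stack;		/* number of saved entries, 0 if none */
	const unsigned long *ret_stack;	/* saved original return addresses */
};

/* Stand-in for the kernel's 'current'; may have no ret_stack at all. */
struct mini_task *current_task;

/* Placeholder that replaces a real return address in this model. */
#define TRAMPOLINE 0xdeadUL

/*
 * Translate a return address found on the stack of 'task' back to its
 * original value. Only the task passed in is consulted; looking at
 * current_task instead would dereference the wrong (possibly NULL) stack.
 */
unsigned long orig_ret_addr(const struct mini_task *task, int idx,
			    unsigned long ret)
{
	if (ret != TRAMPOLINE)
		return ret;	/* not hijacked, nothing to translate */
	if (!task->ret_stack || idx < 0 || idx >= task->curr_ret_stack)
		return ret;	/* no saved entry for this slot */
	return task->ret_stack[idx];
}
```

      A stack walker that inspects another task, as the livepatch
      transition code in the trace above does, gets the right answer
      even when the current task has no ret_stack allocated.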
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Link: https://lore.kernel.org/20240803131211.17255-1-petr.pavlu@suse.com
      Fixes: 7aa1eaef ("function_graph: Allow multiple users to attach to function graph")
      Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      604b72b3
  2. 07 Aug, 2024 17 commits
  3. 06 Aug, 2024 2 commits