  1. 09 Jul, 2024 2 commits
    • arch/xtensa: always_inline get_current() and current_thread_info() · 86e50ab6
      Suren Baghdasaryan authored
      Mark get_current() and current_thread_info() functions as always_inline to
      fix the following modpost warning:
      
      WARNING: modpost: vmlinux: section mismatch in reference: get_current+0xc (section: .text.unlikely) -> initcall_level_names (section: .init.data)
      
      The warning happens when these functions are called from an __init
      function and they don't get inlined (remain in the .text section) while
      the value they return points into .init.data section.  Assuming
      get_current() always returns a valid address, this situation can happen
      only during init stage and accessing .init.data from .text section during
      that stage should pose no issues.
      
      Link: https://lkml.kernel.org/r/20240704132506.1011978-2-surenb@google.com
      Fixes: 22d407b1 ("lib: add allocation tagging support for memory allocation profiling")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings · 5a5aa3c3
      Suren Baghdasaryan authored
      Mark alloc_tag_{save|restore} as always_inline to fix the following
      modpost warnings:
      
      WARNING: modpost: vmlinux: section mismatch in reference: alloc_tag_save+0x1c (section: .text.unlikely) -> initcall_level_names (section: .init.data)
      WARNING: modpost: vmlinux: section mismatch in reference: alloc_tag_restore+0x3c (section: .text.unlikely) -> initcall_level_names (section: .init.data)
      
      The warnings happen when these functions are called from an __init
      function and they don't get inlined (remain in the .text section) while
      the value returned by get_current() points into .init.data section. 
      Assuming get_current() always returns a valid address, this situation can
      happen only during init stage and accessing .init.data from .text section
      during that stage should pose no issues.
      
      Link: https://lkml.kernel.org/r/20240704132506.1011978-1-surenb@google.com
      Fixes: 22d407b1 ("lib: add allocation tagging support for memory allocation profiling")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202407032306.gi9nZsBi-lkp@intel.com/
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
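The two commits above both force small accessors to be inlined so that no out-of-line copy is left in a regular text section. The mechanism can be sketched in plain C, assuming GCC or Clang. The macro `example_always_inline`, `struct example_task`, and the functions below are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a force-inline macro (GCC/Clang attribute). */
#define example_always_inline inline __attribute__((__always_inline__))

struct example_task {
	int id;
};

static struct example_task example_init_task = { .id = 1 };

/*
 * Because inlining is forced, the body is emitted inside each caller,
 * so it ends up in whatever section the caller lives in.
 */
static example_always_inline struct example_task *example_get_current(void)
{
	return &example_init_task;
}

/* A caller; the accessor's code is expanded here, not called out of line. */
int example_current_id(void)
{
	struct example_task *t = example_get_current();

	assert(t != NULL);
	return t->id;
}
```

With a plain `inline`, the compiler is free to keep an out-of-line copy; the attribute removes that choice, which is the property the commits depend on.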
  2. 06 Jul, 2024 4 commits
    • mm: fix crashes from deferred split racing folio migration · be9581ea
      Hugh Dickins authored
      Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
      flags when freeing, yet the flags shown are not bad: PG_locked had been
      set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
      deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
      symptoms implying double free by deferred split and large folio migration.
      
      6.7 commit 9bcef597 ("mm: memcg: fix split queue list crash when large
      folio migration") was right to fix the memcg-dependent locking broken in
      85ce2c51 ("memcontrol: only transfer the memcg data for migration"),
      but missed a subtlety of deferred_split_scan(): it moves folios to its own
      local list to work on them without split_queue_lock, during which time
      folio->_deferred_list is not empty, but even the "right" lock does nothing
      to secure the folio and the list it is on.
      
      Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
      folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
      while the old folio's reference count is temporarily frozen to 0 - adding
      such a freeze in the !mapping case too (originally, folio lock and
      unmapping and no swap cache left an anon folio unreachable, so no freezing
      was needed there: but the deferred split queue offers a way to reach it).
      
      Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
      Fixes: 9bcef597 ("mm: memcg: fix split queue list crash when large folio migration")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
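The commit above leans on one property: a "try get" that only succeeds on a non-zero count cannot take a reference while the count is frozen at 0. A minimal userspace sketch of that interaction with C11 atomics follows. `struct example_obj` and all function names are hypothetical; this is not the kernel's folio code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct example_obj {
	atomic_int refcount;
};

/* Take one reference, but only if the count is currently non-zero. */
bool example_try_get(struct example_obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the zero check is repeated. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Freeze: atomically replace the expected count with 0, or fail. */
bool example_ref_freeze(struct example_obj *o, int expected)
{
	assert(expected > 0);
	return atomic_compare_exchange_strong(&o->refcount, &expected, 0);
}

/* Unfreeze: publish the count again once the protected work is done. */
void example_ref_unfreeze(struct example_obj *o, int count)
{
	assert(count > 0);
	atomic_store(&o->refcount, count);
}
```

While the count is frozen, a concurrent scanner that uses `example_try_get()` simply skips the object, so the owner can safely take it off any list the scanner might reach.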
    • lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat · 2fe29fe9
      Paul Menzel authored
      On a system with Perl 5.12.1, commit 5ef6dc08
      ("lib/build_OID_registry: don't mention the full path of the script in
      output") causes the build to fail with the error below.
      
           Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
           syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
           Execution of ./lib/build_OID_registry aborted due to compilation errors.
           make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255
      
      Ahmad Fatoum analyzed that non-destructive substitution is only supported since
      Perl 5.13.2. Instead of dropping `r` and having the side effect of modifying
      `$0`, introduce a dedicated variable to support older Perl versions.
      
      Link: https://lkml.kernel.org/r/20240702223512.8329-2-pmenzel@molgen.mpg.de
      Link: https://lkml.kernel.org/r/20240701155802.75152-1-pmenzel@molgen.mpg.de
      Fixes: 5ef6dc08 ("lib/build_OID_registry: don't mention the full path of the script in output")
      Link: https://lore.kernel.org/all/259f7a87-2692-480e-9073-1c1c35b52f67@molgen.mpg.de/
      Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Suggested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Cc: Nicolas Schier <nicolas@fjasle.eu>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: gup: stop abusing try_grab_folio · f442fa61
      Yang Shi authored
      A kernel warning was reported when pinning a folio in CMA memory while
      launching an SEV virtual machine.  The splat looks like:
      
      [  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
      [  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
      [  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
      [  464.325515] Call Trace:
      [  464.325520]  <TASK>
      [  464.325523]  ? __get_user_pages+0x423/0x520
      [  464.325528]  ? __warn+0x81/0x130
      [  464.325536]  ? __get_user_pages+0x423/0x520
      [  464.325541]  ? report_bug+0x171/0x1a0
      [  464.325549]  ? handle_bug+0x3c/0x70
      [  464.325554]  ? exc_invalid_op+0x17/0x70
      [  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
      [  464.325567]  ? __get_user_pages+0x423/0x520
      [  464.325575]  __gup_longterm_locked+0x212/0x7a0
      [  464.325583]  internal_get_user_pages_fast+0xfb/0x190
      [  464.325590]  pin_user_pages_fast+0x47/0x60
      [  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
      [  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
      
      Per the analysis done by yangge [1], when starting the SEV virtual
      machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin
      the memory.  But the page is in the CMA area, so fast GUP will fail and
      then fall back to the slow path due to the longterm pinnable check in
      try_grab_folio().

      The slow path will try to pin the pages and then migrate them out of the
      CMA area.  But the slow path also uses try_grab_folio() to pin the page,
      so it will also fail due to the same check, and then the above warning
      is triggered.

      In addition, try_grab_folio() is supposed to be used in the fast path,
      and it elevates the folio refcount by using add-ref-unless-zero.  We are
      guaranteed to have at least one stable reference in the slow path, so a
      simple atomic add could be used.  The performance difference should be
      trivial, but the misuse may be confusing and misleading.

      Rename try_grab_folio() to try_grab_folio_fast(), and try_grab_page()
      to try_grab_folio(), and use them in the proper paths.  This solves both
      the abuse and the kernel warning.

      The proper naming makes their use cases clearer and should prevent
      abuse in the future.
      
      peterx said:
      
      : The user will see the pin fails, for gup-slow it further triggers the WARN
      : right below that failure (as in the original report):
      : 
      :         folio = try_grab_folio(page, page_increm - 1,
      :                                 foll_flags);
      :         if (WARN_ON_ONCE(!folio)) { <------------------------ here
      :                 /*
      :                         * Release the 1st page ref if the
      :                         * folio is problematic, fail hard.
      :                         */
      :                 gup_put_folio(page_folio(page), 1,
      :                                 foll_flags);
      :                 ret = -EFAULT;
      :                 goto out;
      :         }
      
      [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/
      
      [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast]
        Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com
      Fixes: 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: yangge <yangge1116@126.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
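The commit above contrasts two ways of elevating a reference count: "add unless zero" for a fast path that holds no reference yet, and a plain atomic add for a slow path that already holds one. A minimal sketch with C11 atomics follows. The function names are hypothetical and this is not the GUP implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Fast-path style: add 'refs' only if the count is currently non-zero. */
bool example_ref_add_unless_zero(atomic_int *ref, int refs)
{
	int old = atomic_load(ref);

	assert(refs > 0);
	do {
		if (old == 0)
			return false;	/* nobody holds it; it may be going away */
	} while (!atomic_compare_exchange_weak(ref, &old, old + refs));
	return true;
}

/* Slow-path style: the caller already holds a reference, so just add. */
void example_ref_add(atomic_int *ref, int refs)
{
	assert(refs > 0);
	assert(atomic_load(ref) > 0);	/* precondition: a stable reference exists */
	atomic_fetch_add(ref, refs);
}
```

The conditional variant has to loop on a compare-and-exchange and can fail; the unconditional variant cannot fail, which is why it fits a path where a reference is already guaranteed.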
  3. 04 Jul, 2024 12 commits
    • nilfs2: fix kernel bug on rename operation of broken directory · a9e1ddc0
      Ryusuke Konishi authored
      Syzbot reported that, in a directory rename operation on a broken
      directory on nilfs2, __block_write_begin_int(), called to prepare a block
      write, may fail the BUG_ON check for accesses exceeding the folio/page
      size.

      This is because nilfs_dotdot(), which gets the parent directory reference
      entry ("..") of the directory to be moved or renamed, does not check
      consistency enough, and may return a location exceeding the folio/page
      size for broken directories.
      
      Fix this issue by checking required directory entries ("." and "..") in
      the first chunk of the directory in nilfs_dotdot().
      
      Link: https://lkml.kernel.org/r/20240628165107.9006-1-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+d3abed1ad3d367fa2627@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=d3abed1ad3d367fa2627
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb_vmemmap: fix race with speculative PFN walkers · bd225530
      Yu Zhao authored
      While investigating HVO for THPs [1], it turns out that speculative PFN
      walkers like compaction can race with vmemmap modifications, e.g.,
      
        CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
        -------------------------------  ------------------------------
        Allocates an LRU folio page1
                                         Sees page1
        Frees page1
      
        Allocates a hugeTLB folio page2
        (page1 being a tail of page2)
      
        Updates vmemmap mapping page1
                                         get_page_unless_zero(page1)
      
      Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
      still try to modify this read-only field, resulting in a crash.
      
      An independent report [2] confirmed this race.
      
      There are two discussed approaches to fix this race:
      1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
         triggering a PF.
      2. Use RCU to make sure get_page_unless_zero() either sees zero
         page->_refcount through the old vmemmap or non-zero page->_refcount
         through the new one.
      
      The second approach is preferred here because:
      1. It can prevent illegal modifications to struct page[] that has been
         HVO'ed;
      2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
         similar races in other places, e.g., arch_remove_memory() on x86
         [3], which frees vmemmap mapping offlined struct page[].
      
      While adding synchronize_rcu(), the goal is to be surgical, rather than
      optimized.  Specifically, calls to synchronize_rcu() on the error handling
      paths can be coalesced, but this is not done for the sake of simplicity:
      notably, this fix removes ~50% more lines than it adds.
      
      According to the hugetlb_optimize_vmemmap section in
      Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
      freeing hugeTLB pages "~2x slower than before".  Having synchronize_rcu()
      on top makes those operations even worse, and this also affects the user
      interface /proc/sys/vm/nr_overcommit_hugepages.
      
      This is *very* hard to trigger:
      
      1. Most hugeTLB use cases I know of are static, i.e., reserved at
         boot time, because allocating at runtime is not reliable at all.
      
      2. On top of that, someone has to be very unlucky to get tripped
         over above, because the race window is so small -- I wasn't able to
         trigger it with a stress test that does nothing but that (with
         THPs though).
      
      [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
      [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
      [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/
      
      Link: https://lkml.kernel.org/r/20240627222705.2974207-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Yang Shi <yang@os.amperecomputing.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • cachestat: do not flush stats in recency check · 5a4d8944
      Nhat Pham authored
      syzbot detects that cachestat() is flushing stats, which can sleep, in its
      RCU read section (see [1]).  This is done in the workingset_test_recent()
      step (which checks if the folio's eviction is recent).
      
      Move the stat flushing step to before the RCU read section of cachestat,
      and skip stat flushing during the recency check.
      
      [1]: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/
      
      Link: https://lkml.kernel.org/r/20240627201737.3506959-1-nphamcs@gmail.com
      Fixes: b0068472 ("mm: workingset: move the stats flush into workingset_test_recent()")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Reported-by: syzbot+b7f13b2d0cc156edf61a@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/
      Debugged-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: <stable@vger.kernel.org>	[6.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: disable PMD-sized page cache if needed · 9fd154ba
      Gavin Shan authored
      For shmem files, it's possible that PMD-sized page cache can't be
      supported by xarray.  For example, 512MB page cache on ARM64 when the base
      page size is 64KB can't be supported by xarray.  Splitting this sort of
      xarray entry leads to errors, as the following messages indicate.
      
      WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6   \
      nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject        \
      nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4  \
      ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs  \
      libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \
      net_failover virtio_console virtio_blk failover dimlib virtio_mmio
      CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff8000882af5f0
      x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768
      x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858
      x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000
      x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c
      x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0
      x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       shmem_undo_range+0x2bc/0x6a8
       shmem_fallocate+0x134/0x430
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is larger
      than MAX_PAGECACHE_ORDER.  As Matthew Wilcox pointed out, the page cache
      in a shmem file wasn't represented by a multi-index entry, and so didn't
      have this limitation when the xarray entry is split, until commit
      6b24ca4a ("mm: Use multi-index entries in the page cache").
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-5-gshan@redhat.com
      Fixes: 6b24ca4a ("mm: Use multi-index entries in the page cache")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/filemap: skip to create PMD-sized page cache if needed · 3390916a
      Gavin Shan authored
      On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB.  The
      PMD-sized page cache can't be supported by xarray as the following error
      messages indicate.
      
      ------------[ cut here ]------------
      WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
      nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
      nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
      ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
      fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
      sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
      dimlib virtio_mmio
      CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff800087a4f6c0
      x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
      x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
      x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
      x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
      x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
      x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       truncate_inode_pages_range+0x1b4/0x4a8
       truncate_pagecache_range+0x84/0xa0
       xfs_flush_unmap_range+0x70/0x90 [xfs]
       xfs_file_fallocate+0xfc/0x4d8 [xfs]
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      Fix it by not allocating PMD-sized page cache when its order is larger
      than MAX_PAGECACHE_ORDER.  In that case, we fall back to the regular
      path, where the readahead window is determined by the BDI's sysfs file
      (read_ahead_kb).
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-4-gshan@redhat.com
      Fixes: 4687fdbb ("mm/filemap: Support VM_HUGEPAGE for file mappings")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/readahead: limit page cache size in page_cache_ra_order() · 1f789a45
      Gavin Shan authored
      In page_cache_ra_order(), the maximal order of the page cache to be
      allocated shouldn't be larger than MAX_PAGECACHE_ORDER.  Otherwise, it's
      possible the large page cache can't be supported by xarray when the
      corresponding xarray entry is split.
      
      For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is
      64KB.  The PMD-sized page cache can't be supported by xarray.
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-3-gshan@redhat.com
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
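The readahead commit above and its two neighbours all turn on the same arithmetic: a 512MB PMD-sized folio on 64KB base pages has order 13, and the fix caps the order used for the page cache. A small sketch of that calculation follows. The function names are hypothetical, and `max_order` is a caller-supplied placeholder for the cap rather than the kernel's actual MAX_PAGECACHE_ORDER value.

```c
#include <assert.h>
#include <stddef.h>

/* Order n means a folio spans 2^n base pages; return the smallest such n. */
unsigned int example_size_to_order(size_t folio_size, size_t base_page_size)
{
	unsigned int order = 0;

	assert(base_page_size != 0 && folio_size >= base_page_size);
	while ((base_page_size << order) < folio_size)
		order++;
	return order;
}

/* Cap a requested order at the largest order the index structure accepts. */
unsigned int example_clamp_order(unsigned int order, unsigned int max_order)
{
	return order > max_order ? max_order : order;
}
```

For the configuration in the commit messages, `example_size_to_order(0x20000000, 0x10000)` evaluates to 13, matching the HPAGE_PMD_ORDER value quoted there; with 4KB pages and a 2MB huge page the same function gives 9.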
    • mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray · 099d9064
      Gavin Shan authored
      Patch series "mm/filemap: Limit page cache size to that supported by
      xarray", v2.
      
      Currently, xarray can't support arbitrary page cache size.  More details
      can be found in the WARN_ON() statement in xas_split_alloc().  In our
      test, whose code is attached below, we hit the WARN_ON() on an ARM64 system
      where the base page size is 64KB and the huge page size is 512MB.  The
      issue was reported a long time ago, and some discussion of it can be found
      here [1].
      
      [1] https://www.spinics.net/lists/linux-xfs/msg75404.html
      
      In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one
      supported by xarray and avoid PMD-sized page cache if needed.  The code
      changes are suggested by David Hildenbrand.
      
      PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray
      PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path
      PATCH[4] avoids PMD-sized page cache for shmem files if needed
      
      Test program
      ============
      # cat test.c
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <string.h>
      #include <fcntl.h>
      #include <errno.h>
      #include <sys/syscall.h>
      #include <sys/mman.h>
      
      #define TEST_XFS_FILENAME	"/tmp/data"
      #define TEST_SHMEM_FILENAME	"/dev/shm/data"
      #define TEST_MEM_SIZE		0x20000000
      
      int main(int argc, char **argv)
      {
      	const char *filename;
      	int fd = 0;
      	void *buf = (void *)-1, *p;
      	int pgsize = getpagesize();
      	int ret;
      
      	if (pgsize != 0x10000) {
      		fprintf(stderr, "64KB base page size is required\n");
      		return -EPERM;
      	}
      
      	system("echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled");
      	system("rm -fr /tmp/data");
      	system("rm -fr /dev/shm/data");
      	system("echo 1 > /proc/sys/vm/drop_caches");
      
      	/* Open xfs or shmem file */
      	filename = TEST_XFS_FILENAME;
      	if (argc > 1 && !strcmp(argv[1], "shmem"))
      		filename = TEST_SHMEM_FILENAME;
      
      	fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0600);
      	if (fd < 0) {
      		fprintf(stderr, "Unable to open <%s>\n", filename);
      		return -EIO;
      	}
      
      	/* Extend file size */
      	ret = ftruncate(fd, TEST_MEM_SIZE);
      	if (ret) {
      		fprintf(stderr, "Error %d to ftruncate()\n", ret);
      		goto cleanup;
      	}
      
      	/* Create VMA */
      	buf = mmap(NULL, TEST_MEM_SIZE,
      		   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      	if (buf == (void *)-1) {
      		fprintf(stderr, "Unable to mmap <%s>\n", filename);
      		goto cleanup;
      	}
      
      	fprintf(stdout, "mapped buffer at %p\n", buf);
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
      	if (ret) {
      		fprintf(stderr, "Unable to madvise(MADV_HUGEPAGE)\n");
      		goto cleanup;
      	}
      
      	/* Populate VMA */
      	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_WRITE);
      	if (ret) {
      		fprintf(stderr, "Error %d to madvise(MADV_POPULATE_WRITE)\n", ret);
      		goto cleanup;
      	}
      
      	/* Punch the file to enforce xarray split */
      	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
      			TEST_MEM_SIZE - pgsize, pgsize);
      	if (ret)
      		fprintf(stderr, "Error %d to fallocate()\n", ret);
      
      cleanup:
      	if (buf != (void *)-1)
      		munmap(buf, TEST_MEM_SIZE);
      	if (fd > 0)
      		close(fd);
      
      	return 0;
      }
      
      # gcc test.c -o test
      # cat /proc/1/smaps | grep KernelPageSize | head -n 1
      KernelPageSize:       64 kB
      # ./test shmem
         :
      ------------[ cut here ]------------
      WARNING: CPU: 17 PID: 5253 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
      nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
      nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
      ip_set nf_tables rfkill nfnetlink vfat fat virtio_balloon          \
      drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64  \
      virtio_net sha1_ce net_failover failover virtio_console virtio_blk \
      dimlib virtio_mmio
      CPU: 17 PID: 5253 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #12
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff80008a92f5b0
      x29: ffff80008a92f5b0 x28: ffff80008a92f610 x27: ffff80008a92f728
      x26: 0000000000000cc0 x25: 000000000000000d x24: ffff0000cf00c858
      x23: ffff80008a92f610 x22: ffffffdfc0600000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0600000 x18: 0000000000000000
      x17: 0000000000000000 x16: 0000018000000000 x15: 3374004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 3374000000000000 x10: 3374e1c0ffff6000 x9 : ffffb463a84c681c
      x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00011c976ce0
      x5 : ffffb463aa47e378 x4 : 0000000000000000 x3 : 0000000000000cc0
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       shmem_undo_range+0x2bc/0x6a8
       shmem_fallocate+0x134/0x430
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      
      This patch (of 4):
      
      The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64 with
      64KB base page size.  The xarray entry with this order can't be split as
      the following error messages indicate.
      
      ------------[ cut here ]------------
      WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
      nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
      nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
      ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
      fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
      sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
      dimlib virtio_mmio
      CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff800087a4f6c0
      x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
      x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
      x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
      x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
      x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
      x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       truncate_inode_pages_range+0x1b4/0x4a8
       truncate_pagecache_range+0x84/0xa0
       xfs_flush_unmap_range+0x70/0x90 [xfs]
       xfs_file_fallocate+0xfc/0x4d8 [xfs]
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      Fix it by decreasing MAX_PAGECACHE_ORDER to the largest supported order
      by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from
      13 to 11 when CONFIG_BASE_SMALL is disabled.
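
The arithmetic behind "13 to 11" can be sketched as below. This is an illustration only: the SKETCH_* names are made up for this example, and the xarray node shift of 6 and the "two levels minus one" bound are assumptions about the non-CONFIG_BASE_SMALL configuration, not a copy of the patch.

```c
/* Values for arm64 with a 64KB base page, as in the report above. */
#define SKETCH_PAGE_SHIFT      16   /* 64KB base page */
#define SKETCH_PMD_SHIFT       29   /* 512MB PMD-mapped huge page */
#define SKETCH_HPAGE_PMD_ORDER (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT)   /* 13 */

/* Assumption: 64 slots per xarray node when CONFIG_BASE_SMALL is disabled. */
#define SKETCH_XA_CHUNK_SHIFT  6
/*
 * Assumption: the largest order the xarray split path supports is one
 * less than two node levels' worth of index bits.
 */
#define SKETCH_MAX_XAS_ORDER   (SKETCH_XA_CHUNK_SHIFT * 2 - 1)          /* 11 */

/* Largest page cache order that is both preferred and still splittable. */
int sketch_max_pagecache_order(void)
{
	return SKETCH_MAX_XAS_ORDER < SKETCH_HPAGE_PMD_ORDER ?
	       SKETCH_MAX_XAS_ORDER : SKETCH_HPAGE_PMD_ORDER;
}
```

With these assumed values the preferred order (13) exceeds what the xarray can split, so the smaller bound (11) wins.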
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-1-gshan@redhat.com
      Link: https://lkml.kernel.org/r/20240627003953.1262512-2-gshan@redhat.com
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      099d9064
    • SeongJae Park's avatar
      mm/damon/core: merge regions aggressively when max_nr_regions is unmet · 310d6c15
      SeongJae Park authored
      DAMON keeps the number of regions under max_nr_regions by skipping region
      split operations when doing so would push the number above the limit.
      This works well for preventing violations of the limit.  But if a
      violation does happen, DAMON may be unable to recover from it, depending
      on the situation.  In detail, if the real number of regions having
      different access patterns is higher than the limit, the mechanism cannot
      reduce the number below the limit.  In such a case, the system could
      suffer from high DAMON monitoring overhead.
      
      The violation can actually happen.  For example, the user could reduce
      max_nr_regions while DAMON is running, to be lower than the current number
      of regions.  Fix the problem by repeating the merge operations with
      increasing aggressiveness in kdamond_merge_regions() in that case, until
      the limit is met.
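
The idea of "repeating the merge with increasing aggressiveness" can be sketched in user space as follows. This is a simplified model, not DAMON's code: regions are reduced to a plain array of access counts, merged regions take an unweighted average, and the function names and the doubling policy are assumptions made for illustration.

```c
#include <stddef.h>

/*
 * One merge pass: fold each entry into its left neighbour when their
 * access counts differ by at most `threshold`. Compacts `acc` in place
 * and returns the new number of regions.
 */
static size_t sketch_merge_pass(unsigned int *acc, size_t n,
				unsigned int threshold)
{
	size_t out = 0;

	for (size_t i = 1; i < n; i++) {
		unsigned int a = acc[out], b = acc[i];
		unsigned int diff = a > b ? a - b : b - a;

		if (diff <= threshold)
			acc[out] = (a + b) / 2;	/* merged region: plain average */
		else
			acc[++out] = b;		/* keep as a separate region */
	}
	return n ? out + 1 : 0;
}

/*
 * Repeat the pass with a doubled threshold until the region count fits
 * under max_nr_regions, or the threshold has reached max_threshold.
 */
size_t sketch_merge_until_limit(unsigned int *acc, size_t n,
				unsigned int threshold,
				unsigned int max_threshold,
				size_t max_nr_regions)
{
	for (;;) {
		n = sketch_merge_pass(acc, n, threshold);
		if (n <= max_nr_regions || threshold >= max_threshold)
			return n;
		threshold = threshold ? threshold * 2 : 1;
		if (threshold > max_threshold)
			threshold = max_threshold;
	}
}
```

Capping the threshold guarantees termination even when the limit cannot be met.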
      
      [sj@kernel.org: increase regions merge aggressiveness while respecting min_nr_regions]
        Link: https://lkml.kernel.org/r/20240626164753.46270-1-sj@kernel.org
      [sj@kernel.org: ensure max threshold attempt for max_nr_regions violation]
        Link: https://lkml.kernel.org/r/20240627163153.75969-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240624175814.89611-1-sj@kernel.org
      Fixes: b9a6ac4e ("mm/damon: adaptively adjust regions")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      310d6c15
    • Audra Mitchell's avatar
      Fix userfaultfd_api to return EINVAL as expected · 1723f04c
      Audra Mitchell authored
      Currently, if we request a feature that is not enabled in the kernel
      config, we fail silently and return all the available features.  However,
      the man page indicates we should return EINVAL.

      We need to fix this issue since we can end up with a kernel warning should
      a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel
      built without the config option for this feature.
      
       [  200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660
       [  200.820738] Modules linked in:
       [  200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8
       [  200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022
       [  200.885052] RIP: 0010:zap_pte_range+0x43d/0x660
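
The behaviour the man page asks for can be sketched as a plain mask check. The SKETCH_FEAT_* bits and the function name are hypothetical stand-ins, not the real UFFD_FEATURE_* values or the kernel's handler.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits standing in for the real UFFD_FEATURE_* flags. */
#define SKETCH_FEAT_MISSING        (1u << 0)
#define SKETCH_FEAT_WP             (1u << 1)
#define SKETCH_FEAT_WP_UNPOPULATED (1u << 2)

/*
 * Reject any request containing a bit this build does not support.
 * On success, report the full supported set back to the caller.
 */
int sketch_negotiate_features(uint64_t requested, uint64_t supported,
			      uint64_t *out)
{
	if (requested & ~supported)
		return -EINVAL;	/* do not fail silently */
	*out = supported;
	return 0;
}
```

Failing early means an unsupported feature can never be recorded as enabled and later reach code paths that assume it is compiled in.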
      
      Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com
      Fixes: e06f1e1d ("userfaultfd: wp: enabled write protection in userfaultfd API")
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1723f04c
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: check if a hash-index is in cpu_possible_mask · a34acf30
      Uladzislau Rezki (Sony) authored
      The problem is that there are systems where cpu_possible_mask has gaps
      between set CPUs, for example SPARC.  In this scenario the addr_to_vb_xa()
      hash function can return an index that makes per_cpu() access the area of
      a CPU which is not possible and was never set up.  This results in an oops
      on SPARC.

      A per-cpu vmap_block_queue is also used as a hash table, incorrectly
      assuming that cpu_possible_mask has no gaps.  Fix it by adjusting the
      index to the next possible CPU.
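
A minimal user-space model of that adjustment is shown below. The possible-CPU mask is modelled as a 32-bit word, SKETCH_NR_CPU_IDS and all function names are made up for the example, and it assumes the highest CPU id (SKETCH_NR_CPU_IDS - 1) is always set in the mask so that a next possible CPU always exists.

```c
#include <stdint.h>

#define SKETCH_NR_CPU_IDS 8	/* hypothetical: highest possible CPU id + 1 */

static int sketch_cpu_possible(uint32_t mask, int cpu)
{
	return (mask >> cpu) & 1u;
}

/* Next possible CPU strictly above `cpu`, or SKETCH_NR_CPU_IDS if none. */
static int sketch_next_possible(uint32_t mask, int cpu)
{
	for (int i = cpu + 1; i < SKETCH_NR_CPU_IDS; i++)
		if (sketch_cpu_possible(mask, i))
			return i;
	return SKETCH_NR_CPU_IDS;
}

/* Hash an address to a per-CPU slot, skipping gaps in the possible mask. */
int sketch_addr_to_cpu(uint32_t possible_mask, unsigned long addr,
		       unsigned long block_size)
{
	int index = (int)((addr / block_size) % SKETCH_NR_CPU_IDS);

	if (!sketch_cpu_possible(possible_mask, index))
		index = sketch_next_possible(possible_mask, index);
	return index;
}
```

With possible CPUs {0, 2, 7}, a hash of 1 is redirected to CPU 2 and a hash of 3 to CPU 7, so every lookup lands on an initialized per-CPU area.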
      
      Link: https://lkml.kernel.org/r/20240626140330.89836-1-urezki@gmail.com
      Fixes: 062eacf5 ("mm: vmalloc: remove a global vmap_blocks xarray")
      Reported-by: Nick Bowler <nbowler@draconx.ca>
      Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hailong.Liu <hailong.liu@oppo.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a34acf30
    • Waiman Long's avatar
      mm: prevent derefencing NULL ptr in pfn_section_valid() · 82f0b6f0
      Waiman Long authored
      Commit 5ec8e8ea ("mm/sparsemem: fix race in accessing
      memory_section->usage") changed pfn_section_valid() to add a READ_ONCE()
      call around "ms->usage" to fix a race with section_deactivate() where
      ms->usage can be cleared.  The READ_ONCE() call, by itself, is not enough
      to prevent NULL pointer dereference.  We need to check its value before
      dereferencing it.
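
The pattern described here, a single load of the pointer followed by a NULL check before use, can be sketched as follows. The struct layouts, the SKETCH_READ_ONCE_PTR macro and the function name are simplified stand-ins for illustration, not the kernel definitions.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct sketch_usage {
	unsigned long subsection_map;
};

struct sketch_mem_section {
	struct sketch_usage *usage;	/* may be cleared concurrently */
};

/* Stand-in for READ_ONCE(): exactly one volatile load of the pointer. */
#define SKETCH_READ_ONCE_PTR(p) (*(struct sketch_usage * volatile *)&(p))

/* idx must be smaller than the number of bits in unsigned long. */
int sketch_section_valid(struct sketch_mem_section *ms, unsigned int idx)
{
	struct sketch_usage *usage = SKETCH_READ_ONCE_PTR(ms->usage);

	/* The single load alone is not enough: the value may be NULL. */
	if (!usage)
		return 0;
	return (usage->subsection_map >> idx) & 1UL;
}
```

Loading the pointer once into a local and testing that local avoids both a second, possibly different, read and a dereference of NULL.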
      
      Link: https://lkml.kernel.org/r/20240626001639.1350646-1-longman@redhat.com
      Fixes: 5ec8e8ea ("mm/sparsemem: fix race in accessing memory_section->usage")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      82f0b6f0
    • Yang Shi's avatar
      mm: page_ref: remove folio_try_get_rcu() · fa2690af
      Yang Shi authored
      The below bug was reported on a non-SMP kernel:
      
      [  275.267158][ T4335] ------------[ cut here ]------------
      [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
      [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
      [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e #1
      [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
      [  275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [  275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
      [  275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
      [  275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
      [  275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
      [  275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
      [  275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
      [  275.276790][ T4335] FS:  00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
      [  275.277570][ T4335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
      [  275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  275.280201][ T4335] Call Trace:
      [  275.280499][ T4335]  <TASK>
      [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
      [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
      [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174)
      [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212)
      [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264)
      [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
      [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148)
      [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1))
      [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188)
      [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825)
      [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1))
      [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209)
      [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204)
      [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722)
      [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106)
      [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350)
      [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350)
      [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1))
      [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573)
      [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360
      [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10
      [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573)
      [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0
      [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
      [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10
      [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284)
      [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259)
      [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
      [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291)
      [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112)
      [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1))
      [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
      [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
      [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
      [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      
      This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in
      folio_ref_try_add_rcu() on non-SMP kernels.

      process_vm_readv() calls GUP to pin the THP.  An optimization for
      pinning THP introduced by commit 57edfcfd ("mm/gup: accelerate thp
      gup even for "pages != NULL"") calls try_grab_folio() to pin the THP,
      but on non-SMP kernels try_grab_folio() is supposed to be called in
      atomic context, for example with irqs or preemption disabled, due to
      the optimization introduced by commit e286781d ("mm: speculative
      page references").

      Commit efa7df3e ("mm: align larger anonymous mappings on THP
      boundaries") is not actually the root cause, although the bisection
      pointed to it.  It just makes the problem more likely to be exposed.

      The follow-up discussion suggested that the optimization for non-SMP
      kernels may be outdated and no longer worth it [1].  So remove the
      optimization to silence the BUG.

      However, calling try_grab_folio() in the GUP slow path is actually
      unnecessary, so the following patch will clean this up.
      
      [1] https://lore.kernel.org/linux-mm/821cf1d6-92b9-4ac4-bacc-d8f2364ac14f@paulmck-laptop/
      
      Link: https://lkml.kernel.org/r/20240625205350.1777481-1-yang@os.amperecomputing.com
      Fixes: 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa2690af
  4. 03 Jul, 2024 6 commits
    • Ryusuke Konishi's avatar
      nilfs2: fix incorrect inode allocation from reserved inodes · 93aef9ed
      Ryusuke Konishi authored
      If the bitmap block that manages the inode allocation status is corrupted,
      nilfs_ifile_create_inode() may allocate a new inode from the reserved
      inode area where it should not be allocated.
      
      A previous fix, commit d325dc6e ("nilfs2: fix use-after-free bug of
      struct nilfs_root"), addressed the problem that reserved inodes with inode
      numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
      bitmap corruption.  However, the start number of non-reserved inodes is
      read from the super block and may change, in which case inode allocation
      may still occur from the extended reserved inode area.
      
      If that happens, access to that inode will cause an IO error, causing the
      file system to degrade to an error state.
      
      Fix this potential issue by adding a wraparound option to the common
      metadata object allocation routine and by modifying
      nilfs_ifile_create_inode() to disable the option so that it only allocates
      inodes with inode numbers greater than or equal to the inode number read
      in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
      inodes.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      93aef9ed
    • Ryusuke Konishi's avatar
      nilfs2: add missing check for inode numbers on directory entries · bb76c6c2
      Ryusuke Konishi authored
      Syzbot reported that mounting and unmounting a specific pattern of
      corrupted nilfs2 filesystem images causes a use-after-free of metadata
      file inodes, which triggers a kernel bug in lru_add_fn().
      
      As Jan Kara pointed out, this is because the link count of a metadata file
      gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
      tries to delete that inode (ifile inode in this case).
      
      The inconsistency occurs because directories containing the inode numbers
      of these metadata files that should not be visible in the namespace are
      read without checking.
      
      Fix this issue by treating the inode numbers of these internal files as
      errors in the sanity check helper when reading directory folios/pages.
      
      Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
      analysis.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
      Reported-by: Jan Kara <jack@suse.cz>
      Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bb76c6c2
    • Ryusuke Konishi's avatar
      nilfs2: fix inode number range checks · e2fec219
      Ryusuke Konishi authored
      Patch series "nilfs2: fix potential issues related to reserved inodes".
      
      This series fixes one use-after-free issue reported by syzbot, caused by
      nilfs2's internal inode being exposed in the namespace on a corrupted
      filesystem, and a couple of flaws that cause problems if the starting
      number of non-reserved inodes written in the on-disk super block is
      intentionally (or corruptly) changed from its default value.  
      
      
      This patch (of 3):
      
      In the current implementation of nilfs2, "nilfs->ns_first_ino", which
      gives the first non-reserved inode number, is read from the superblock,
      but its lower limit is not checked.
      
      As a result, if a number that overlaps with the inode number range of
      reserved inodes such as the root directory or metadata files is set in the
      super block parameter, the inode number test macros (NILFS_MDT_INODE and
      NILFS_VALID_INODE) will not function properly.
      
      In addition, these test macros use left bit-shift calculations with
      the inode number as the shift count via the BIT macro, but the result of a
      shift calculation that exceeds the bit width of an integer is undefined in
      the C specification, so if "ns_first_ino" is set to a large value other
      than the default value NILFS_USER_INO (=11), the macros may potentially
      malfunction depending on the environment.
      
      Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
      by preventing bit shifts equal to or greater than the NILFS_USER_INO
      constant in the inode number test macros.
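
The two checks can be sketched as below. SKETCH_USER_INO mirrors the default of 11 quoted above; the metadata bitmap and the function names are hypothetical and only serve to show the guard ordering.

```c
/* Default first non-reserved inode number, as quoted in the text. */
#define SKETCH_USER_INO     11u
/* Hypothetical bitmap of metadata-file inode numbers (3, 4, 5 and 6). */
#define SKETCH_MDT_INO_BITS ((1u << 3) | (1u << 4) | (1u << 5) | (1u << 6))

/*
 * Range-check first, shift second: the && short-circuits, so the shift
 * count is always below SKETCH_USER_INO and never exceeds the bit width.
 */
int sketch_is_mdt_inode(unsigned long ino)
{
	return ino < SKETCH_USER_INO &&
	       (SKETCH_MDT_INO_BITS & (1u << ino)) != 0;
}

/* Lower-bound check for the first-inode value read from the super block. */
int sketch_first_ino_ok(unsigned long first_ino)
{
	return first_ino >= SKETCH_USER_INO;
}
```

A large inode number such as 100 is rejected by the range check before any shift is evaluated, so the undefined over-wide shift never happens.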
      
      Also, change the type of "ns_first_ino" from signed integer to unsigned
      integer to avoid the need for type casting in comparisons such as the
      lower bound check introduced this time.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
      Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e2fec219
    • Jan Kara's avatar
      mm: avoid overflows in dirty throttling logic · 385d838d
      Jan Kara authored
      The dirty throttling logic is interspersed with assumptions that dirty
      limits in PAGE_SIZE units fit into 32 bits (so that various multiplications
      fit into 64 bits).  If limits end up being larger, we will hit overflows,
      possible divisions by 0 etc.  Fix these problems by never allowing such
      large dirty limits, as they have dubious practical value anyway.  For the
      dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
      such large limits.  For dirty_ratio / dirty_background_ratio it isn't so
      simple, as the dirty limit is computed from the amount of available memory,
      which can change due to memory hotplug etc.  So when converting dirty
      limits from ratios to numbers of pages, we just don't allow the result to
      exceed UINT_MAX.

      This is a root-only triggerable problem which occurs when the operator
      sets dirty limits to >16 TB.
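
The 16 TB figure follows from the 32-bit page-count bound: 2^32 pages of 4KB each is 2^44 bytes. The sketch below shows both sides of the fix under that assumed 4KB page size, with a 32-bit unsigned int; the function names and the -1 error value are illustrative, not the kernel's.

```c
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4KB pages */

/* Byte-based limit: refuse values whose page count needs more than 32 bits. */
int sketch_check_dirty_bytes(uint64_t bytes)
{
	uint64_t pages = bytes / SKETCH_PAGE_SIZE +
			 (bytes % SKETCH_PAGE_SIZE != 0);	/* round up */

	return pages > UINT_MAX ? -1 : 0;
}

/* Ratio-based limit: compute the page count, then clamp it to UINT_MAX. */
uint64_t sketch_ratio_to_pages(uint64_t available_pages, unsigned int pct)
{
	uint64_t pages = available_pages / 100 * pct;

	return pages > UINT_MAX ? UINT_MAX : pages;
}
```

The byte interface can reject bad input up front, while the ratio interface has to clamp at conversion time because the amount of available memory can change afterwards.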
      
      Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-By: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      385d838d
    • Jan Kara's avatar
      Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again" · 30139c70
      Jan Kara authored
      Patch series "mm: Avoid possible overflows in dirty throttling".
      
      Dirty throttling logic assumes dirty limits in page units fit into
      32 bits.  This patch series makes sure this is true (see patch 2/2 for
      more details).
      
      
      This patch (of 2):
      
      This reverts commit 9319b647.
      
      The commit is broken in several ways.  Firstly, the removed (u64) cast
      from the multiplication will introduce a multiplication overflow on 32-bit
      archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
      default settings with 4GB of RAM will trigger this).  Secondly, the
      div64_u64() is unnecessarily expensive on 32-bit archs.  We have
      div64_ul() in case we want to be safe & cheap.  Thirdly, if dirty
      thresholds are larger than 1<<32 pages, then dirty balancing is going to
      blow up in many other spectacular ways anyway so trying to fix one
      possible overflow is just moot.
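
The first point, losing the (u64) cast, can be demonstrated with fixed-width types. The threshold values used in the note below are made-up page counts in the range a machine with 4GB of RAM might produce, not measured values, and the function names are illustrative.

```c
#include <stdint.h>

/* Multiply in 32 bits: the product wraps modulo 2^32. */
uint32_t sketch_mul_32(uint32_t wb_thresh, uint32_t bg_thresh)
{
	return wb_thresh * bg_thresh;
}

/* Widen one operand first, as the (u64) cast did: no wraparound. */
uint64_t sketch_mul_64(uint32_t wb_thresh, uint32_t bg_thresh)
{
	return (uint64_t)wb_thresh * bg_thresh;
}
```

For example, 200000 * 100000 is 20000000000, which exceeds 2^32; the 32-bit version returns 2820130816 while the widened version keeps the full product.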
      
      Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
      Fixes: 9319b647 ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-By: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30139c70
    • Jinliang Zheng's avatar
      mm: optimize the redundant loop of mm_update_owner_next() · cf3f9a59
      Jinliang Zheng authored
      When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
      /proc or ptrace or page migration (get_task_mm()), it is impossible to
      find an appropriate task_struct in the loop whose mm_struct is the same as
      the target mm_struct.
      
      If the above race condition is combined with the stress-ng-zombie and
      stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
      write_lock_irq() for tasklist_lock.
      
      Recognize this situation in advance and exit early.
      
      Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
      Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tycho Andersen <tandersen@netflix.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cf3f9a59
  5. 30 Jun, 2024 16 commits
    • Linus Torvalds's avatar
      Linux 6.10-rc6 · 22a40d14
      Linus Torvalds authored
      22a40d14
    • Linus Torvalds's avatar
      Merge tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux · aca7c377
      Linus Torvalds authored
      Pull ata fixes from Niklas Cassel:
      
       - Add NOLPM quirk for all Crucial BX SSD1 models.
      
         Considering that we now have had bug reports for 3 different BX SSD1
         variants from Crucial with the same product name, make the quirk more
         inclusive, to catch more device models from the same generation.
      
       - Fix a trivial NULL pointer dereference in the error path for
         ata_host_release().
      
       - Create an ata_port_free(), so that we don't miss freeing ata_port
         struct members when freeing a struct ata_port.
      
       - Fix a trivial double free in the error path for ata_host_alloc().
      
       - Ensure that we remove the libata "remapped NVMe device count" sysfs
         entry on .probe() error.
      
      * tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
        ata: ahci: Clean up sysfs file on error
        ata: libata-core: Fix double free on error
        ata,scsi: libata-core: Do not leak memory for ata_port struct members
        ata: libata-core: Fix null pointer dereference on error
        ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
      aca7c377
    • Niklas Cassel's avatar
      ata: ahci: Clean up sysfs file on error · eeb25a09
      Niklas Cassel authored
      .probe() (ahci_init_one()) calls sysfs_add_file_to_group(), however,
      if probe() fails after this call, we currently never call
      sysfs_remove_file_from_group().
      
      (The sysfs_remove_file_from_group() call in .remove() (ahci_remove_one())
      does not help, as .remove() is not called on .probe() error.)
      
      Thus, if probe() fails after the sysfs_add_file_to_group() call, the next
      time we insmod the module we will get:
      
      sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:04.0/remapped_nvme'
      CPU: 11 PID: 954 Comm: modprobe Not tainted 6.10.0-rc5 #43
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x5d/0x80
       sysfs_warn_dup.cold+0x17/0x23
       sysfs_add_file_mode_ns+0x11a/0x130
       sysfs_add_file_to_group+0x7e/0xc0
       ahci_init_one+0x31f/0xd40 [ahci]
      
      Fixes: 894fba7f ("ata: ahci: Add sysfs attribute to show remapped NVMe device count")
      Cc: stable@vger.kernel.org
      Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Link: https://lore.kernel.org/r/20240629124210.181537-10-cassel@kernel.org
      Signed-off-by: Niklas Cassel <cassel@kernel.org>
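The cleanup pattern described in this commit can be illustrated with a small userspace sketch. This is not the ahci driver code: `fake_dev`, `fake_add_attr()`, `fake_remove_attr()` and `fake_probe()` are hypothetical stand-ins that only model the idea that a failing probe must undo the attribute it created, because the remove callback is not run after a probe error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
    bool attr_present;   /* models the "remapped_nvme" sysfs entry */
};

/* Models sysfs_add_file_to_group(): fails if the entry already exists. */
static int fake_add_attr(struct fake_dev *dev)
{
    if (dev->attr_present)
        return -1;       /* corresponds to "cannot create duplicate filename" */
    dev->attr_present = true;
    return 0;
}

/* Models sysfs_remove_file_from_group(). */
static void fake_remove_attr(struct fake_dev *dev)
{
    dev->attr_present = false;
}

/* Probe with cleanup: a later failure undoes the earlier attribute creation. */
static int fake_probe(struct fake_dev *dev, bool later_step_fails)
{
    int rc;

    assert(dev != NULL);

    rc = fake_add_attr(dev);
    if (rc)
        return rc;

    if (later_step_fails) {
        rc = -2;         /* arbitrary error code for the sketch */
        goto err_remove_attr;
    }
    return 0;

err_remove_attr:
    fake_remove_attr(dev);
    return rc;
}
```

With the `err_remove_attr` label in place, a failed probe followed by a second probe on the same device succeeds. Without it, the second `fake_add_attr()` call would hit the duplicate-entry case shown in the warning above.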
    • ata: libata-core: Fix double free on error · ab9e0c52
      Niklas Cassel authored
      If e.g. the ata_port_alloc() call in ata_host_alloc() fails, we will jump
      to the err_out label, which will call devres_release_group().
      devres_release_group() will trigger a call to ata_host_release().
      ata_host_release() calls kfree(host), so executing the kfree(host) in
      ata_host_alloc() will lead to a double free:
      
      kernel BUG at mm/slub.c:553!
      Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 11 PID: 599 Comm: (udev-worker) Not tainted 6.10.0-rc5 #47
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
      RIP: 0010:kfree+0x2cf/0x2f0
      Code: 5d 41 5e 41 5f 5d e9 80 d6 ff ff 4d 89 f1 41 b8 01 00 00 00 48 89 d9 48 89 da
      RSP: 0018:ffffc90000f377f0 EFLAGS: 00010246
      RAX: ffff888112b1f2c0 RBX: ffff888112b1f2c0 RCX: ffff888112b1f320
      RDX: 000000000000400b RSI: ffffffffc02c9de5 RDI: ffff888112b1f2c0
      RBP: ffffc90000f37830 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffc90000f37610 R11: 617461203a736b6e R12: ffffea00044ac780
      R13: ffff888100046400 R14: ffffffffc02c9de5 R15: 0000000000000006
      FS:  00007f2f1cabe980(0000) GS:ffff88813b380000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f2f1c3acf75 CR3: 0000000111724000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body.cold+0x19/0x27
       ? die+0x2e/0x50
       ? do_trap+0xca/0x110
       ? do_error_trap+0x6a/0x90
       ? kfree+0x2cf/0x2f0
       ? exc_invalid_op+0x50/0x70
       ? kfree+0x2cf/0x2f0
       ? asm_exc_invalid_op+0x1a/0x20
       ? ata_host_alloc+0xf5/0x120 [libata]
       ? ata_host_alloc+0xf5/0x120 [libata]
       ? kfree+0x2cf/0x2f0
       ata_host_alloc+0xf5/0x120 [libata]
       ata_host_alloc_pinfo+0x14/0xa0 [libata]
       ahci_init_one+0x6c9/0xd20 [ahci]
      
      Ensure that we will not call kfree(host) twice, by performing the kfree()
      only if the devres_open_group() call failed.
      
      Fixes: dafd6c49 ("libata: ensure host is free'd on error exit paths")
      Cc: stable@vger.kernel.org
      Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Link: https://lore.kernel.org/r/20240629124210.181537-9-cassel@kernel.org
      Signed-off-by: Niklas Cassel <cassel@kernel.org>
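A minimal userspace sketch of the ownership rule this fix relies on: once a release callback that frees the host is armed, the error path must not free the host a second time. All names here (`fake_host`, `fake_release_group()`, `fake_host_alloc()`, `free_count`) are hypothetical and only model the control flow described above; they are not the libata implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int free_count;   /* counts frees of a host, for illustration only */

struct fake_host { int n_ports; };

static void fake_host_free(struct fake_host *host)
{
    free(host);
    free_count++;
}

/* Models devres_release_group() triggering a release callback that frees the host. */
static void fake_release_group(struct fake_host *host)
{
    assert(host != NULL);
    fake_host_free(host);
}

static struct fake_host *fake_host_alloc(bool group_open_fails, bool port_alloc_fails)
{
    struct fake_host *host = calloc(1, sizeof(*host));

    if (!host)
        return NULL;

    if (group_open_fails) {
        /* No release callback is armed yet, so free directly. */
        fake_host_free(host);
        return NULL;
    }

    if (port_alloc_fails) {
        /* The release callback frees the host; do not free it again here. */
        fake_release_group(host);
        return NULL;
    }
    return host;
}
```

In both failure cases the host is freed exactly once. The bug described above corresponds to calling `fake_host_free()` again after `fake_release_group()` has already run.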
    • ata,scsi: libata-core: Do not leak memory for ata_port struct members · f6549f53
      Niklas Cassel authored
      libsas is currently not freeing all the struct ata_port struct members,
      e.g. ncq_sense_buf for a driver supporting Command Duration Limits (CDL).
      
      Add a function, ata_port_free(), that is used to free an ata_port,
      including its struct members. Keeping the code that frees an ata_port
      in its own function ensures that every caller also frees all the
      struct members of struct ata_port.
      
      Fixes: 18bd7718 ("scsi: ata: libata: Handle completion of CDL commands using policy 0xD")
      Reviewed-by: John Garry <john.g.garry@oracle.com>
      Link: https://lore.kernel.org/r/20240629124210.181537-8-cassel@kernel.org
      Signed-off-by: Niklas Cassel <cassel@kernel.org>
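The idea of a single free function that owns the whole teardown can be sketched in userspace C. `fake_port`, its members, and the `tracked_alloc()` / `tracked_free()` helpers are assumptions made for this illustration; only the `sense_buf` member is loosely modeled on the ncq_sense_buf member named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int live_allocs;   /* outstanding allocations, to make leaks visible */

static void *tracked_alloc(size_t size)
{
    void *p = calloc(1, size);

    if (p)
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if (!p)
        return;
    assert(live_allocs > 0);
    free(p);
    live_allocs--;
}

/* Hypothetical port object that owns two heap-allocated members. */
struct fake_port {
    char *sense_buf;   /* loosely models ncq_sense_buf */
    int  *stats;       /* a second, purely illustrative member */
};

/* The one place that frees a port together with everything it owns. */
static void fake_port_free(struct fake_port *port)
{
    if (!port)
        return;
    tracked_free(port->sense_buf);
    tracked_free(port->stats);
    tracked_free(port);
}

static struct fake_port *fake_port_alloc(void)
{
    struct fake_port *port = tracked_alloc(sizeof(*port));

    if (!port)
        return NULL;
    port->sense_buf = tracked_alloc(64);
    port->stats = tracked_alloc(4 * sizeof(int));
    if (!port->sense_buf || !port->stats) {
        fake_port_free(port);   /* partial allocation is torn down here too */
        return NULL;
    }
    return port;
}
```

Because every caller goes through `fake_port_free()`, adding a new member later only requires touching one function, which is the motivation given in the commit message.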
    • ata: libata-core: Fix null pointer dereference on error · 5d92c7c5
      Niklas Cassel authored
      If the ata_port_alloc() call in ata_host_alloc() fails,
      ata_host_release() will get called.
      
      However, the code in ata_host_release() tries to free ata_port struct
      members unconditionally, which can lead to the following:
      
      BUG: unable to handle page fault for address: 0000000000003990
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 10 PID: 594 Comm: (udev-worker) Not tainted 6.10.0-rc5 #44
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
      RIP: 0010:ata_host_release.cold+0x2f/0x6e [libata]
      Code: e4 4d 63 f4 44 89 e2 48 c7 c6 90 ad 32 c0 48 c7 c7 d0 70 33 c0 49 83 c6 0e 41
      RSP: 0018:ffffc90000ebb968 EFLAGS: 00010246
      RAX: 0000000000000041 RBX: ffff88810fb52e78 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88813b3218c0 RDI: ffff88813b3218c0
      RBP: ffff88810fb52e40 R08: 0000000000000000 R09: 6c65725f74736f68
      R10: ffffc90000ebb738 R11: 73692033203a746e R12: 0000000000000004
      R13: 0000000000000000 R14: 0000000000000011 R15: 0000000000000006
      FS:  00007f6cc55b9980(0000) GS:ffff88813b300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000003990 CR3: 00000001122a2000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body.cold+0x19/0x27
       ? page_fault_oops+0x15a/0x2f0
       ? exc_page_fault+0x7e/0x180
       ? asm_exc_page_fault+0x26/0x30
       ? ata_host_release.cold+0x2f/0x6e [libata]
       ? ata_host_release.cold+0x2f/0x6e [libata]
       release_nodes+0x35/0xb0
       devres_release_group+0x113/0x140
       ata_host_alloc+0xed/0x120 [libata]
       ata_host_alloc_pinfo+0x14/0xa0 [libata]
       ahci_init_one+0x6c9/0xd20 [ahci]
      
      Do not access ata_port struct members unconditionally.
      
      Fixes: 633273a3 ("libata-pmp: hook PMP support and enable it")
      Cc: stable@vger.kernel.org
      Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: John Garry <john.g.garry@oracle.com>
      Link: https://lore.kernel.org/r/20240629124210.181537-7-cassel@kernel.org
      Signed-off-by: Niklas Cassel <cassel@kernel.org>
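As a rough illustration of "do not access ata_port struct members unconditionally", the following userspace sketch releases a host whose port array is only partially populated. The types and the `fake_host_release()` function are hypothetical; the point is only that the loop checks each slot before touching the port's members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FAKE_MAX_PORTS 4

struct fake_port { char *buf; };

struct fake_host {
    int n_ports;
    struct fake_port *ports[FAKE_MAX_PORTS];   /* slots may be NULL after a failed alloc */
};

/* Frees every port that exists and returns how many were released. */
static int fake_host_release(struct fake_host *host)
{
    int released = 0;

    assert(host->n_ports <= FAKE_MAX_PORTS);

    for (int i = 0; i < host->n_ports; i++) {
        struct fake_port *port = host->ports[i];

        if (!port)
            continue;          /* allocation never happened for this slot */

        free(port->buf);       /* members are only touched for a real port */
        free(port);
        host->ports[i] = NULL;
        released++;
    }
    return released;
}
```

Dropping the `if (!port)` check would dereference a NULL pointer for the unpopulated slots, which is the class of fault shown in the trace above.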
    • Merge tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · e0b668b0
      Linus Torvalds authored
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Remove the executable bit from installed DTB files
      
       - Escape $ in subshell execution in the debian-orig target
      
       - Fix RPM builds with CONFIG_MODULES=n
      
       - Fix xconfig with the O= option
      
       - Fix scripts_gdb with the O= option
      
      * tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kbuild: scripts/gdb: bring the "abspath" back
        kbuild: Use $(obj)/%.cc to fix host C++ module builds
        kbuild: rpm-pkg: fix build error with CONFIG_MODULES=n
        kbuild: Fix build target deb-pkg: ln: failed to create hard link
        kbuild: doc: Update default INSTALL_MOD_DIR from extra to updates
        kbuild: Install dtb files as 0644 in Makefile.dtbinst
    • x86-32: fix cmpxchg8b_emu build error with clang · 76932725
      Linus Torvalds authored
      The kernel test robot reported that clang no longer compiles the 32-bit
      x86 kernel in some configurations due to commit 95ece481
      ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
      functions").
      
      The build fails with
      
        arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
      
      and the reason seems to be that not only does the cmpxchg8b instruction
      need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
      fallback the inline asm also wants a fifth fixed register for the
      address (it uses %esi for that, but that's just a software convention
      with cmpxchg8b_emu).
      
      Avoiding using another pointer input to the asm (and just forcing it to
      use the "0(%esi)" addressing that we end up requiring for the sw
      fallback) seems to fix the issue.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/
      Fixes: 95ece481 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions")
      Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/
      Suggested-by: Uros Bizjak <ubizjak@gmail.com>
      Reviewed-and-Tested-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'char-misc-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 84dd4373
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small driver fixes for 6.10-rc6. Included in here are:
      
         - IIO driver fixes for reported issues
      
         - Counter driver fix for a reported problem.
      
        All of these have been in linux-next this week with no reported
        issues"
      
      * tag 'char-misc-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        counter: ti-eqep: enable clock at probe
        iio: chemical: bme680: Fix sensor data read operation
        iio: chemical: bme680: Fix overflows in compensate() functions
        iio: chemical: bme680: Fix calibration data variable
        iio: chemical: bme680: Fix pressure value output
        iio: humidity: hdc3020: fix hysteresis representation
        iio: dac: fix ad9739a random config compile error
        iio: accel: fxls8962af: select IIO_BUFFER & IIO_KFIFO_BUF
        iio: adc: ad7266: Fix variable checking bug
        iio: xilinx-ams: Don't include ams_ctrl_channels in scan_mask
    • Merge tag 'staging-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 12529aa1
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are two small staging driver fixes for 6.10-rc6, both for the
        vc04_services drivers:
      
         - build fix if CONFIG_DEBUG_FS was not set
      
         - initialization check fix that was much reported.
      
        Both of these have been in linux-next this week with no reported
        issues"
      
      * tag 'staging-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vchiq_debugfs: Fix build if CONFIG_DEBUG_FS is not set
        staging: vc04_services: vchiq_arm: Fix initialisation check
    • Merge tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 3e334486
      Linus Torvalds authored
      Pull tty / serial / console fixes from Greg KH:
       "Here are a bunch of fixes/reverts for 6.10-rc6.  Included in here are:
      
         - revert the bunch of tty/serial/console changes that landed in -rc1
           that didn't quite work properly yet.
      
           Everyone agreed to just revert them for now and will work on making
           them better for a future release instead of trying to quick fix the
           existing changes this late in the release cycle
      
         - 8250 driver port count bugfix
      
         - Other tiny serial port bugfixes for reported issues
      
        All of these have been in linux-next this week with no reported
        issues"
      
      * tag 'tty-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        Revert "printk: Save console options for add_preferred_console_match()"
        Revert "printk: Don't try to parse DEVNAME:0.0 console options"
        Revert "printk: Flag register_console() if console is set on command line"
        Revert "serial: core: Add support for DEVNAME:0.0 style naming for kernel console"
        Revert "serial: core: Handle serial console options"
        Revert "serial: 8250: Add preferred console in serial8250_isa_init_ports()"
        Revert "Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial ports"
        Revert "serial: 8250: Fix add preferred console for serial8250_isa_init_ports()"
        Revert "serial: core: Fix ifdef for serial base console functions"
        serial: bcm63xx-uart: fix tx after conversion to uart_port_tx_limited()
        serial: core: introduce uart_port_tx_limited_flags()
        Revert "serial: core: only stop transmit when HW fifo is empty"
        serial: imx: set receiver level before starting uart
        tty: mcf: MCF54418 has 10 UARTS
        serial: 8250_omap: Implementation of Errata i2310
        tty: serial: 8250: Fix port count mismatch with the device
    • Merge tag 'usb-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 2c01c3d5
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a handful of small USB driver fixes for 6.10-rc6 to resolve
        some reported issues. Included in here are:
      
         - typec driver bugfixes
      
         - usb gadget driver reverts for commits that were reported to have
           problems
      
         - resource leak bugfix
      
         - gadget driver bugfixes
      
         - dwc3 driver bugfixes
      
         - usb atm driver bugfix for when syzbot got loose on it
      
        All of these have been in linux-next this week with no reported issues"
      
      * tag 'usb-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: dwc3: core: Workaround for CSR read timeout
        Revert "usb: gadget: u_ether: Replace netif_stop_queue with netif_device_detach"
        Revert "usb: gadget: u_ether: Re-attach netif device to mirror detachment"
        usb: gadget: aspeed_udc: fix device address configuration
        usb: dwc3: core: remove lock of otg mode during gadget suspend/resume to avoid deadlock
        usb: typec: ucsi: glink: fix child node release in probe function
        usb: musb: da8xx: fix a resource leak in probe()
        usb: typec: ucsi_acpi: Add LG Gram quirk
        usb: ucsi: stm32: fix command completion handling
        usb: atm: cxacru: fix endpoint checking in cxacru_bind()
        usb: gadget: printer: fix races against disable
        usb: gadget: printer: SS+ support
    • Merge tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3ffea9a7
      Linus Torvalds authored
      Pull smp fixes from Borislav Petkov:
      
       - Fix "nosmp" and "maxcpus=0" after the parallel CPU bringup work went
         in and broke them
      
       - Make sure CPU hotplug dynamic prepare states are actually executed
      
      * tag 'smp_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        cpu: Fix broken cmdline "nosmp" and "maxcpus=0"
        cpu/hotplug: Fix dynstate assignment in __cpuhp_setup_state_cpuslocked()
    • Merge tag 'irq_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4e412160
      Linus Torvalds authored
      Pull irq fixes from Borislav Petkov:
      
       - Make sure multi-bridge machines get all eiointc interrupt controllers
         initialized even if the number of CPUs has been limited by a cmdline
         param
      
       - Make sure interrupt lines on liointc hw are configured properly even
         when interrupt routing changes
      
       - Avoid use-after-free in the error path of the MSI init code
      
      * tag 'irq_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        PCI/MSI: Fix UAF in msi_capability_init
        irqchip/loongson-liointc: Set different ISRs for different cores
        irqchip/loongson-eiointc: Use early_cpu_to_node() instead of cpu_to_node()
    • Merge tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 03c8b0bd
      Linus Torvalds authored
      Pull timer fix from Borislav Petkov:
      
       - Warn when an hrtimer doesn't get a callback supplied
      
      * tag 'timers_urgent_for_v6.10_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        hrtimer: Prevent queuing of hrtimer without a function callback
    • Merge tag 'linux-watchdog-6.10-rc-fixes' of git://www.linux-watchdog.org/linux-watchdog · 327fceff
      Linus Torvalds authored
      Pull watchdog fixes from Wim Van Sebroeck:
      
       - lenovo_se10_wdt: add HAS_IOPORT dependency
      
       - add missing MODULE_DESCRIPTION() macros
      
      * tag 'linux-watchdog-6.10-rc-fixes' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: add missing MODULE_DESCRIPTION() macros
        watchdog: lenovo_se10_wdt: add HAS_IOPORT dependency