1. 24 Feb, 2024 7 commits
    • Aneesh Kumar K.V (IBM)
      mm/debug_vm_pgtable: fix BUG_ON with pud advanced test · 720da1e5
      Aneesh Kumar K.V (IBM) authored
      Architectures like powerpc add debug checks to ensure we find only devmap
      PUD pte entries.  These debug checks are only done with CONFIG_DEBUG_VM.
      This patch marks the ptes used for the PUD advanced test as devmap pte
      entries so that we don't hit the debug checks on architectures like ppc64,
      as below.
      
      WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
      ....
      NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
      LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
      Call Trace:
      [c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
      [c000000004a2f980] [000d34c100000000] 0xd34c100000000
      [c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
      [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
      [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Also
      
       kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
       ....
      
       NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
       LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       Call Trace:
       [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
       [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
       [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
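
      A minimal sketch of the kind of change described (helper names are from
      the generic devmap/huge-PUD API; the exact hunk in mm/debug_vm_pgtable.c
      may differ):

        /* Mark the PUD used by the advanced test as a devmap entry so that
         * powerpc's CONFIG_DEBUG_VM checks accept it. */
        pud = pfn_pud(args->pud_pfn, args->page_prot);
        pud = pud_mkdevmap(pud);
        set_pud_at(args->mm, args->vaddr, args->pudp, pud);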
      
      Link: https://lkml.kernel.org/r/20240129060022.68044-1-aneesh.kumar@kernel.org
      Fixes: 27af67f3 ("powerpc/book3s64/mm: enable transparent pud hugepage")
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      720da1e5
    • Nhat Pham
      mm: cachestat: fix folio read-after-free in cache walk · 3a75cb05
      Nhat Pham authored
      In cachestat, we access the folio from the page cache's xarray to compute
      its page offset, and check for its dirty and writeback flags.  However, we
      do not hold a reference to the folio before performing these actions,
      which means the folio can concurrently be released and reused as another
      folio/page/slab.
      
      Get around this altogether by just using xarray's existing machinery for
      the folio page offsets and dirty/writeback states.
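
      As an illustration of the approach (a hedged sketch, not the exact
      patch; first_index/last_index are placeholders), the walk can take the
      page offset and the dirty/writeback state from the xarray itself,
      without ever dereferencing the folio:

        XA_STATE(xas, &mapping->i_pages, first_index);
        void *entry;

        rcu_read_lock();
        xas_for_each(&xas, entry, last_index) {
                /* no folio reference (or dereference) needed */
                pgoff_t index = xas.xa_index;
                bool dirty = xas_get_mark(&xas, PAGECACHE_TAG_DIRTY);
                bool writeback = xas_get_mark(&xas, PAGECACHE_TAG_WRITEBACK);

                /* ... accumulate the cachestat counters ... */
        }
        rcu_read_unlock();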
      
      This changes behavior for tmpfs files to now always report zeroes in their
      dirty and writeback counters.  This is okay as tmpfs doesn't follow
      conventional writeback cache behavior: its pages get "cleaned" during
      swapout, after which they're no longer resident etc.
      
      Link: https://lkml.kernel.org/r/20240220153409.GA216065@cmpxchg.org
      Fixes: cf264e13 ("cachestat: implement cachestat syscall")
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Jann Horn <jannh@google.com>
      Cc: <stable@vger.kernel.org>	[6.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a75cb05
    • Lorenzo Stoakes
      MAINTAINERS: add memory mapping entry with reviewers · 00130266
      Lorenzo Stoakes authored
      Recently there have been a number of patches which have affected various
      aspects of the memory mapping logic as implemented in mm/mmap.c, where it
      would have been useful for regular contributors to have been notified.
      
      Add an entry for this part of mm in particular with regular contributors
      tagged as reviewers.
      
      Link: https://lkml.kernel.org/r/20240220064410.4639-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      00130266
    • Byungchul Park
      mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index · 2774f256
      Byungchul Park authored
      With NUMA balancing on, when a NUMA system is running where a NUMA node
      doesn't have its local memory, so it has no managed zones, the following
      oops has been observed.  It happens because wakeup_kswapd() is called
      with a wrong zone index, -1.  Fix it by checking the index before calling
      wakeup_kswapd().
      
      > BUG: unable to handle page fault for address: 00000000000033f3
      > #PF: supervisor read access in kernel mode
      > #PF: error_code(0x0000) - not-present page
      > PGD 0 P4D 0
      > Oops: 0000 [#1] PREEMPT SMP NOPTI
      > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
      > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
      > Code: (omitted)
      > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
      > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
      > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
      > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
      > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > PKRU: 55555554
      > Call Trace:
      >  <TASK>
      > ? __die
      > ? page_fault_oops
      > ? __pte_offset_map_lock
      > ? exc_page_fault
      > ? asm_exc_page_fault
      > ? wakeup_kswapd
      > migrate_misplaced_page
      > __handle_mm_fault
      > handle_mm_fault
      > do_user_addr_fault
      > exc_page_fault
      > asm_exc_page_fault
      > RIP: 0033:0x55b897ba0808
      > Code: (omitted)
      > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
      > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
      > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
      > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
      > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
      >  </TASK>
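
      A sketch of the guard described above (shaped after the node-zone scan
      in the NUMA migration path; details hedged):

        for (z = pgdat->nr_zones - 1; z >= 0; z--) {
                if (managed_zone(pgdat->node_zones + z))
                        break;
        }
        if (z < 0)
                return 0;       /* no managed zone: nothing for kswapd to do */
        wakeup_kswapd(pgdat->node_zones + z, 0, folio_order(folio), ZONE_MOVABLE);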
      
      Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
      Signed-off-by: Byungchul Park <byungchul@sk.com>
      Reported-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Fixes: c574bbe9 ("NUMA balancing: optimize page placement for memory tiering system")
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2774f256
    • Marco Elver
      kasan: revert eviction of stack traces in generic mode · 711d3491
      Marco Elver authored
      This partially reverts commits cc478e0b, 63b85ac5, 08d7c94d,
      a414d428, and 773688a6 to make use of variable-sized stack depot
      records, since eviction of stack entries from stack depot forces fixed-
      sized stack records.  Care was taken to retain the code cleanups by the
      above commits.
      
      Eviction was added to generic KASAN to alleviate the additional memory
      usage from fixed-sized stack records, but this still uses more memory
      than before.
      
      With the re-introduction of variable-sized records for stack depot, we can
      just switch back to non-evictable stack records again, and return back to
      the previous performance and memory usage baseline.
      
      Before (observed after a KASAN kernel boot):
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      After:
      
        pools: 319
        refcounted_allocations: 0
        refcounted_frees: 0
        refcounted_in_use: 0
        freelist_size: 0
        persistent_count: 29397
        persistent_bytes: 5183536
      
      As can be seen from the counters, with a generic KASAN config, refcounted
      allocations and evictions are no longer used.  Due to using variable-sized
      records, I observe a reduction of 278 stack depot pools (saving 4448 KiB)
      with my test setup.
      
      Link: https://lkml.kernel.org/r/20240129100708.39460-2-elver@google.com
      Fixes: cc478e0b ("kasan: avoid resetting aux_lock")
      Fixes: 63b85ac5 ("kasan: stop leaking stack trace handles")
      Fixes: 08d7c94d ("kasan: memset free track in qlink_free")
      Fixes: a414d428 ("kasan: handle concurrent kasan_record_aux_stack calls")
      Fixes: 773688a6 ("kasan: use stack_depot_put for Generic mode")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      711d3491
    • Marco Elver
      stackdepot: use variable size records for non-evictable entries · 31639fd6
      Marco Elver authored
      With the introduction of stack depot evictions, each stack record is now
      fixed size, so that future reuse after an eviction can safely store
      differently sized stack traces.  In all cases that do not make use of
      evictions, this wastes lots of space.
      
      Fix it by re-introducing variable size stack records (up to the max
      allowed size) for entries that will never be evicted.  We know an entry
      will never be evicted if the flag STACK_DEPOT_FLAG_GET is not provided,
      since a later stack_depot_put() attempt would be undefined behavior.
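
      For illustration, a hedged sketch of the two caller flavours (the GFP
      flag choice here is illustrative):

        /* Non-evictable: no STACK_DEPOT_FLAG_GET, so the depot may store the
         * trace as a compact, variable-sized record. */
        handle = stack_depot_save_flags(entries, nr_entries, GFP_NOWAIT,
                                        STACK_DEPOT_FLAG_CAN_ALLOC);

        /* Evictable: the caller takes a reference that must later be dropped
         * with stack_depot_put(); such records stay fixed-size. */
        handle = stack_depot_save_flags(entries, nr_entries, GFP_NOWAIT,
                                        STACK_DEPOT_FLAG_CAN_ALLOC |
                                        STACK_DEPOT_FLAG_GET);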
      
      With my current kernel config that enables KASAN and also SLUB owner
      tracking, I observe (after a kernel boot) a whopping reduction of 296
      stack depot pools, which translates into 4736 KiB saved.  The savings here
      are from SLUB owner tracking only, because KASAN generic mode still uses
      refcounting.
      
      Before:
      
        pools: 893
        allocations: 29841
        frees: 6524
        in_use: 23317
        freelist_size: 3454
      
      After:
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      [elver@google.com: fix -Wstringop-overflow warning]
        Link: https://lore.kernel.org/all/20240201135747.18eca98e@canb.auug.org.au/
        Link: https://lkml.kernel.org/r/20240201090434.1762340-1-elver@google.com
        Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20240129100708.39460-1-elver@google.com
      Fixes: 108be8de ("lib/stackdepot: allow users to evict stack traces")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      31639fd6
  2. 22 Feb, 2024 33 commits
    • SeongJae Park
      Docs/admin-guide/mm/damon/usage: fix wrong quotas diabling condition · 7d8cebb9
      SeongJae Park authored
      After the introduction of DAMOS quota goals, DAMOS quotas are not
      disabled if both the size and time quotas are zero but a quota goal is
      set.  The new rule is also applied to the DAMON sysfs interface, but the
      usage doc was not updated.  Update it.
      
      Link: https://lkml.kernel.org/r/20240217005842.87348-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d8cebb9
    • SeongJae Park
      Docs/mm/damon: move monitoring target regions setup detail from the usage to the design document · 2d89957c
      SeongJae Park authored
      The design doc is meant to hold all concept-level details, while the
      usage doc focuses only on how the features can be used.  Some details
      about monitoring target regions construction are in the usage doc.  Move
      the details about how monitoring target regions construction differs
      between DAMON operations sets from the usage doc to the design doc.
      
      Link: https://lkml.kernel.org/r/20240217005842.87348-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d89957c
    • SeongJae Park
      Docs/mm/damon: move DAMON operation sets list from the usage to the design document · 669971b4
      SeongJae Park authored
      The list of DAMON operation sets and their explanations, which would be
      better placed in the design document, is written in the usage document.
      Move the detail to the design document and make the usage document only
      reference the design document.
      
      [sj@kernel.org: fix a typo on a reference link]
        Link: https://lkml.kernel.org/r/20240221170852.55529-2-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240217005842.87348-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      669971b4
    • SeongJae Park
      Docs/mm/damon: move the list of DAMOS actions to design doc · 5b7708e6
      SeongJae Park authored
      DAMOS operation actions are explained nearly twice in the DAMON usage
      document, once for the sysfs interface and again for the debugfs
      interface.  Duplication is bad.  Also, it would be better to keep this
      kind of concept-level detail in the design document and keep the usage
      document small and focused only on usage.  Move the list to the design
      document and update the usage document to reference it.
      
      Link: https://lkml.kernel.org/r/20240217005842.87348-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b7708e6
    • SeongJae Park
      Docs/mm/damon/maintainer-profile: fix reference links for mm-[un]stable tree · 0a1ebc17
      SeongJae Park authored
      Patch series "Docs/mm/damon: misc readability improvements".
      
      Fix trivial mistakes and improve layout of information on different
      documents for DAMON.
      
      
      This patch (of 5):
      
      A couple of sentences in maintainer-profile.rst have reference links
      for the mm-unstable and mm-stable trees with wrong rst markup.  Fix those.
      
      Link: https://lkml.kernel.org/r/20240217005842.87348-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240217005842.87348-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a1ebc17
    • Lokesh Gidra
      userfaultfd: use per-vma locks in userfaultfd operations · 867a43a3
      Lokesh Gidra authored
      All userfaultfd operations, except write-protect, opportunistically use
      per-vma locks to lock vmas.  On failure, the operation is retried inside
      the mmap_lock critical section.
      
      Write-protect operation requires mmap_lock as it iterates over multiple
      vmas.
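
      A minimal sketch of the lock-then-fallback pattern, using the generic
      per-VMA lock helpers (the uffd-specific wrappers in this series add
      more validation; dst_start is a placeholder):

        struct vm_area_struct *vma;

        vma = lock_vma_under_rcu(mm, dst_start);        /* per-VMA read lock */
        if (!vma) {
                mmap_read_lock(mm);                     /* fall back to mmap_lock */
                vma = vma_lookup(mm, dst_start);
                /* ... operate, then mmap_read_unlock(mm) ... */
        } else {
                /* ... operate, then vma_end_read(vma) ... */
        }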
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      867a43a3
    • Lokesh Gidra
      mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK · 32af81af
      Lokesh Gidra authored
      vma_assert_locked() is needed to replace mmap_assert_locked() once we
      start using per-vma locks in userfaultfd operations.
      
      In !CONFIG_PER_VMA_LOCK case when mm is locked, it implies that the given
      VMA is locked.
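
      A sketch of the !CONFIG_PER_VMA_LOCK stub implied by the above (hedged;
      the actual patch may differ in detail):

        static inline void vma_assert_locked(struct vm_area_struct *vma)
        {
                /* without per-VMA locks, holding mmap_lock is what locks a VMA */
                mmap_assert_locked(vma->vm_mm);
        }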
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-4-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32af81af
    • Lokesh Gidra
      userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx · 5e4c24a5
      Lokesh Gidra authored
      Increments and loads of mmap_changing are always done within the
      mmap_lock critical section.  This ensures that if userspace requests
      event notification for non-cooperative operations (e.g.  mremap),
      userfaultfd operations don't occur concurrently.
      
      This can be achieved by using a separate read-write semaphore in
      userfaultfd_ctx such that increments are done in write-mode and loads in
      read-mode, thereby eliminating the dependency on mmap_lock for this
      purpose.
      
      This is a preparatory step before we replace mmap_lock usage with per-vma
      locks in fill/move ioctls.
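
      A hedged sketch of the scheme (field and helper names follow this
      description, not necessarily the final code):

        /* writer side: non-cooperative paths that generate events */
        down_write(&ctx->map_changing_lock);
        atomic_inc(&ctx->mmap_changing);
        up_write(&ctx->map_changing_lock);

        /* reader side: fill/move ioctls checking whether the mapping is changing */
        down_read(&ctx->map_changing_lock);
        changing = atomic_read(&ctx->mmap_changing);
        up_read(&ctx->map_changing_lock);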
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e4c24a5
    • Lokesh Gidra
      userfaultfd: move userfaultfd_ctx struct to header file · f91e6b41
      Lokesh Gidra authored
      Patch series "per-vma locks in userfaultfd", v7.
      
      Performing userfaultfd operations (like copy/move etc.) in the
      mmap_lock (read-mode) critical section causes significant contention on
      the lock when operations requiring the lock in write mode are taking
      place concurrently.  We can use per-vma locks instead to significantly
      reduce the contention.
      
      Android runtime's Garbage Collector uses userfaultfd for concurrent
      compaction.  mmap-lock contention during compaction potentially causes
      a jittery experience for the user.  During one such reproducible scenario,
      we observed the following improvements with this patch-set:
      
      - Wall clock time of compaction phase came down from ~3s to <500ms
      - Uninterruptible sleep time (across all threads in the process) was
        ~10ms (none in mmap_lock) during compaction, instead of >20s
      
      
      This patch (of 4):
      
      Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c.
      There are no other changes in the struct.
      
      This is required to prepare for using per-vma locks in userfaultfd
      operations.
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com
      Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f91e6b41
    • Juntong Deng
      kasan: increase the number of bits to shift when recording extra timestamps · 952237b5
      Juntong Deng authored
      In 5d4c6ac9 ("kasan: record and report more information") I thought
      that printk only displays a maximum of 99999 seconds, but actually printk
      can display a larger number of seconds.
      
      So increase the number of bits to shift when recording the extra
      timestamp (stored in 44 bits): shift it right by 9 bits, which does not
      affect the displayed precision since the discarded bits only matter for
      the nanosecond part (nanoseconds will not be shown).
      
      Currently the maximum time that can be displayed is 9007199.254740s,
      because
      
      11111111111111111111111111111111111111111111 (44 bits) << 9
      = 11111111111111111111111111111111111111111111000000000
      = 9007199.254740
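
      A hedged sketch of the recording scheme (struct and field names are
      illustrative, not the exact KASAN code):

        struct kasan_extra_track {
                u64 cpu : 20;
                u64 timestamp : 44;     /* nanosecond clock >> 9 */
        };

        static void kasan_record_ts(struct kasan_extra_track *track)
        {
                u64 ts_nsec = local_clock();

                /* drop the low 9 bits: sub-microsecond detail is never printed */
                track->timestamp = ts_nsec >> 9;
        }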
      
      Link: https://lkml.kernel.org/r/AM6PR03MB58481629F2F28CE007412139994D2@AM6PR03MB5848.eurprd03.prod.outlook.com
      Fixes: 5d4c6ac9 ("kasan: record and report more information")
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      952237b5
    • Matthew Wilcox (Oracle)
      rmap: replace two calls to compound_order with folio_order · 059ab7be
      Matthew Wilcox (Oracle) authored
      Removes two unnecessary conversions from folio to page.  Should be no
      difference in behaviour.
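
      For illustration, the shape of the cleanup (a sketch, not the exact
      hunks):

        order = compound_order(&folio->page);   /* before: folio -> page -> order */
        order = folio_order(folio);             /* after:  ask the folio directly */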
      
      Link: https://lkml.kernel.org/r/20240215205307.674707-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      059ab7be
    • Mathieu Desnoyers
      dax: fix incorrect list of data cache aliasing architectures · 902ccb86
      Mathieu Desnoyers authored
      commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      prevents DAX from building on architectures with virtually aliased
      dcache with:
      
        depends on !(ARM || MIPS || SPARC)
      
      This check is too broad (e.g. recent ARMv7 CPUs don't have virtually
      aliased dcaches), and it also misses many other architectures with
      virtually aliased data caches.
      
      This is a regression introduced in the v4.0 Linux kernel: the dax mount
      option was removed for 32-bit ARMv7 boards which have no data cache
      aliasing, and therefore should work fine with FS_DAX.
      
      This was turned into the following check in alloc_dax() by a preparatory
      change:
      
              if (ops && (IS_ENABLED(CONFIG_ARM) ||
                  IS_ENABLED(CONFIG_MIPS) ||
                  IS_ENABLED(CONFIG_SPARC)))
                      return NULL;
      
      Use cpu_dcache_is_aliasing() instead to figure out whether the environment
      has aliasing data caches.
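
      A hedged sketch of the replacement check in alloc_dax(), building on
      cpu_dcache_is_aliasing() introduced earlier in this series:

        if (ops && cpu_dcache_is_aliasing())
                return ERR_PTR(-EOPNOTSUPP);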
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-10-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      902ccb86
    • Mathieu Desnoyers
      Introduce cpu_dcache_is_aliasing() across all architectures · 8690bbcf
      Mathieu Desnoyers authored
      Introduce a generic way to query whether the data cache is virtually
      aliased on all architectures. Its purpose is to ensure that subsystems
      which are incompatible with virtually aliased data caches (e.g. FS_DAX)
      can reliably query this.
      
      For data cache aliasing, there are three scenarios depending on the
      architecture. Here is a breakdown based on my understanding:
      
      A) The data cache is always aliasing:
      
      * arc
      * csky
      * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
      * sh
      * parisc
      
      B) The data cache aliasing is statically known or depends on querying CPU
         state at runtime:
      
      * arm (cache_is_vivt() || cache_is_vipt_aliasing())
      * mips (cpu_has_dc_aliases)
      * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
      * sparc32 (vac_cache_size > PAGE_SIZE)
      * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
      * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)
      
      C) The data cache is never aliasing:
      
      * alpha
      * arm64 (aarch64)
      * hexagon
      * loongarch (but with incoherent write buffers, which are disabled since
                   commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
      * microblaze
      * openrisc
      * powerpc
      * riscv
      * s390
      * um
      * x86
      
      Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
      implement "cpu_dcache_is_aliasing()".
      
      Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
      cpu_dcache_is_aliasing() simply evaluates to "false".
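
      A sketch of the generic plumbing this describes (header and macro
      spellings hedged):

        #ifdef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
        #include <asm/cachetype.h>      /* arch provides cpu_dcache_is_aliasing() */
        #else
        #define cpu_dcache_is_aliasing()        false
        #endif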
      
      Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
      work. This would be useful to gate features like XIP on architectures
      which have aliasing CPU dcache-icache but not CPU dcache-dcache.
      
      Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
      to clarify that we really mean "CPU data cache" and "CPU cache" to
      eliminate any possible confusion with VFS "dentry cache" and "page
      cache".
      
      Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
      Link: https://lkml.kernel.org/r/20240215144633.96437-9-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8690bbcf
    • Mathieu Desnoyers
      dax: check for data cache aliasing at runtime · 1df4ca01
      Mathieu Desnoyers authored
      Replace the following fs/Kconfig:FS_DAX dependency:
      
        depends on !(ARM || MIPS || SPARC)
      
      with a runtime check within alloc_dax().  This runtime check returns
      ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
      the kernel is using an aliased mapping) on an architecture which
      has data cache aliasing.
      
      Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
      CONFIG_DAX=n for consistency.
      
      This is done in preparation for using cpu_dcache_is_aliasing() in a
      following change which will properly support architectures which detect
      data cache aliasing at runtime.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-8-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1df4ca01
    • Mathieu Desnoyers
      virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal · 562ce828
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of virtio
      virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
      non-fatal.
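
      A hedged sketch of the non-fatal handling (surrounding names are
      illustrative, not the exact patch):

        dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
        if (IS_ERR(dax_dev)) {
                int rc = PTR_ERR(dax_dev);

                /* no DAX support in this configuration: carry on without DAX */
                return rc == -EOPNOTSUPP ? 0 : rc;
        }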
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-7-mathieu.desnoyers@efficios.com
      Co-developed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      562ce828
    • Mathieu Desnoyers
      dcssblk: handle alloc_dax() -EOPNOTSUPP failure · cf7fe690
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of dcssblk
      dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.
      
      Considering that s390 is not a data cache aliasing architecture,
      and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
      from alloc_dax() should make dcssblk_add_store() fail.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-6-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cf7fe690
    • Mathieu Desnoyers
      dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal · c2929072
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
      to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-5-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2929072
    • Mathieu Desnoyers
      nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal · f4d373dd
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
      pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
      
      [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-4-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f4d373dd
    • Mathieu Desnoyers
      dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n · 6d439c18
      Mathieu Desnoyers authored
      Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
      CONFIG_DAX=n, to be consistent with the fact that the CONFIG_DAX=y
      implementation never returns NULL.
      
      This is done in preparation for using cpu_dcache_is_aliasing() in a
      following change which will properly support architectures which detect
      data cache aliasing at runtime.
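
      A sketch of the resulting CONFIG_DAX=n stub (hedged):

        static inline struct dax_device *alloc_dax(void *private,
                                                   const struct dax_operations *ops)
        {
                return ERR_PTR(-EOPNOTSUPP);
        }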
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-3-mathieu.desnoyers@efficios.com
      Fixes: 4e4ced93 ("dax: Move mandatory ->zero_page_range() check in alloc_dax()")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d439c18
    • Mathieu Desnoyers
      dax: add empty static inline for CONFIG_DAX=n · 2807c54b
      Mathieu Desnoyers authored
      Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression",
      v6.
      
      The following commit, introduced in v4.0, prevents building FS_DAX on
      32-bit ARM, even on ARMv7 which does not have virtually aliased data
      caches, even though it used to work fine before:
      
      commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      
      The root of the issue here is the fact that DAX was never designed to
      handle virtually aliasing data caches (VIVT and VIPT with aliasing data
      cache). It touches the pages through their linear mapping, which is not
      consistent with the userspace mappings with virtually aliasing data
      caches.
      
      This patch series introduces cpu_dcache_is_aliasing() with the new
      Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
      architectures. The implementation of cpu_dcache_is_aliasing() is either
      evaluated to a constant at compile-time or a runtime check, which is
      what is needed on ARM.
      
      With this we can basically narrow down the list of architectures which
      are unsupported by DAX to those which are really affected.
      
      
      This patch (of 9):
      
      When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache()
      and set_dax_nomc() need to be either within regions of code or compile
      units which are explicitly not compiled, or they need to rely on compiler
      optimizations to eliminate calls to those undefined symbols.
      
      It appears that at least the openrisc and loongarch architectures don't
      end up eliminating those undefined symbols even if they are provably
      within code which is eliminated due to conditional branches depending on
      constants.
      
      Implement empty static inline functions for set_dax_nocache() and
      set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are
      removed.
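
      A sketch of the stubs described above (hedged, though by construction
      they are simple no-ops):

        static inline void set_dax_nocache(struct dax_device *dax_dev)
        {
        }
        static inline void set_dax_nomc(struct dax_device *dax_dev)
        {
        }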
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240215144633.96437-2-mathieu.desnoyers@efficios.com
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402140037.wGfA1kqX-lkp@intel.com/
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402131351.a0FZOgEG-lkp@intel.com/
      Fixes: 7ac5360c ("dax: remove the copy_from_iter and copy_to_iter methods")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2807c54b
    • Mathieu Desnoyers
      nvdimm/pmem: fix leak on dax_add_host() failure · f6932a27
      Mathieu Desnoyers authored
      Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
      before setting pmem->dax_dev, which therefore issues the two following
      calls on NULL pointers:
      
      out_cleanup_dax:
              kill_dax(pmem->dax_dev);
              put_dax(pmem->dax_dev);
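
      One possible shape of the fix (a hedged sketch; the real error handling
      in pmem_attach_disk() is more involved): populate pmem->dax_dev before
      any path that can reach out_cleanup_dax.

        dax_dev = alloc_dax(pmem, &pmem_dax_ops);
        if (IS_ERR(dax_dev))
                return PTR_ERR(dax_dev);
        pmem->dax_dev = dax_dev;        /* set before dax_add_host() can fail */
        rc = dax_add_host(dax_dev, disk);
        if (rc)
                goto out_cleanup_dax;   /* cleanup now sees a valid dax device */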
      
      Link: https://lkml.kernel.org/r/20240208184913.484340-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240208184913.484340-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Fan Ni <fan.ni@samsung.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f6932a27
    • Ryan Roberts
      arm64/mm: automatically fold contpte mappings · f0c22649
      Ryan Roberts authored
      There are situations where a change to a single PTE could cause the
      contpte block in which it resides to become foldable (i.e.  could be
      repainted with the contiguous bit).  Such situations arise, for example,
      when user space temporarily changes protections, via mprotect, for
      individual pages, as can be the case for certain garbage collectors.
      
      We would like to detect when such a PTE change occurs.  However this can
      be expensive due to the amount of checking required.  Therefore only
      perform the checks when an individual PTE is modified via mprotect
      (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
      when we are setting the final PTE in a contpte-aligned block.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-19-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f0c22649
    • Ryan Roberts
      arm64/mm: __always_inline to improve fork() perf · b972fc6a
      Ryan Roberts authored
      As set_ptes() and wrprotect_ptes() become a bit more complex, the compiler
      may choose not to inline them.  But this is critical for fork()
      performance.  So mark the functions, along with contpte_try_unfold() which
      is called by them, as __always_inline.  This is worth ~1% on the fork()
      microbenchmark with order-0 folios (the common case).
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-18-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b972fc6a
    • Ryan Roberts
      arm64/mm: implement pte_batch_hint() · fb5451e5
      Ryan Roberts authored
      When core code iterates over a range of ptes and calls ptep_get() for each
      of them, if the range happens to cover contpte mappings, the number of pte
      reads becomes amplified by a factor of the number of PTEs in a contpte
      block.  This is because for each call to ptep_get(), the implementation
      must read all of the ptes in the contpte block to which it belongs to
      gather the access and dirty bits.
      
      This causes a hotspot for fork(), as well as operations that unmap memory
      such as munmap(), exit and madvise(MADV_DONTNEED).  Fortunately we can fix
      this by implementing pte_batch_hint() which allows their iterators to skip
      getting the contpte tail ptes when gathering the batch of ptes to operate
      on.  This results in the number of PTE reads returning to 1 per pte.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-17-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb5451e5
    • Ryan Roberts
      mm: add pte_batch_hint() to reduce scanning in folio_pte_batch() · c6ec76a2
      Ryan Roberts authored
      Some architectures (e.g.  arm64) can tell from looking at a pte whether
      some follow-on ptes also map contiguous physical memory with the same
      pgprot (for arm64, these are contpte mappings).
      
      Take advantage of this knowledge to optimize folio_pte_batch() so that it
      can skip these ptes when scanning to create a batch.  By default, if an
      arch does not opt-in, folio_pte_batch() returns a compile-time 1, so the
      changes are optimized out and the behaviour is as before.
      
      arm64 will opt-in to providing this hint in the next patch, which will
      greatly reduce the cost of ptep_get() when scanning a range of contptes.
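
      A sketch of the generic fallback this describes (hedged):

        #ifndef pte_batch_hint
        /* without an arch opt-in the hint is a compile-time 1, so
         * folio_pte_batch() behaves exactly as before */
        static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
        {
                return 1;
        }
        #endif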
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-16-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6ec76a2
    • Ryan Roberts
      arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs · 6b1e4efb
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the
      exit/munmap/dontneed performance regression introduced by the initial
      contpte commit.  Subsequent patches will solve it entirely.
      
      During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
      cleared.  Previously this was done 1 PTE at a time.  But the core-mm
      supports batched clear via the new [get_and_]clear_full_ptes() APIs.  So
      let's implement those APIs and for fully covered contpte mappings, we no
      longer need to unfold the contpte.  This significantly reduces unfolding
      operations, reducing the number of tlbis that must be issued.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-15-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6b1e4efb
    • Ryan Roberts
      arm64/mm: implement new wrprotect_ptes() batch API · 311a6cf2
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the fork performance
      regression introduced by the initial contpte commit.  Subsequent patches
      will solve it entirely.
      
      During fork(), any private memory in the parent must be write-protected. 
      Previously this was done 1 PTE at a time.  But the core-mm supports
      batched wrprotect via the new wrprotect_ptes() API.  So let's implement
      that API and for fully covered contpte mappings, we no longer need to
      unfold the contpte.  This has 2 benefits:
      
        - reduced unfolding, reduces the number of tlbis that must be issued.
        - The memory remains contpte-mapped ("folded") in the parent, so it
          continues to benefit from the more efficient use of the TLB after
          the fork.
      
      The optimization to wrprotect a whole contpte block without unfolding is
      possible thanks to the tightening of the Arm ARM with respect to the
      definition and behaviour when 'Misprogramming the Contiguous bit'.  See
      section D21194 at https://developer.arm.com/documentation/102105/ja-07/
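
      For reference, the generic core-mm fallback for wrprotect_ptes() is just
      a loop over the range; the arm64 version added here can instead
      wrprotect a fully covered contpte block without unfolding it.  (Sketch
      only, assuming the signature follows the existing ptep_set_wrprotect()
      pattern.)

          static inline void wrprotect_ptes(struct mm_struct *mm,
                  unsigned long addr, pte_t *ptep, unsigned int nr)
          {
                  /* Write-protect nr consecutive ptes, one at a time. */
                  for (;;) {
                          ptep_set_wrprotect(mm, addr, ptep);
                          if (--nr == 0)
                                  break;
                          ptep++;
                          addr += PAGE_SIZE;
                  }
          }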
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-14-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      311a6cf2
    • Ryan Roberts's avatar
      arm64/mm: wire up PTE_CONT for user mappings · 4602e575
      Ryan Roberts authored
      With the ptep API sufficiently refactored, we can now introduce a new
      "contpte" API layer, which transparently manages the PTE_CONT bit for user
      mappings.
      
      In this initial implementation, only suitable batches of PTEs, set via
      set_ptes(), are mapped with the PTE_CONT bit.  Any subsequent modification
      of individual PTEs will cause an "unfold" operation to repaint the contpte
      block as individual PTEs before performing the requested operation. 
      While a modification of a single PTE could make the block of PTEs to
      which it belongs eligible for "folding" into a contpte entry, "folding"
      is not performed in this initial implementation due to the cost of
      checking that the requirements are met.  Because of this, contpte
      mappings will degrade back to normal pte mappings over time if/when
      protections are changed.  This will be solved in a future patch.
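
      To illustrate the unfold-on-modify behaviour, the public single-PTE
      helpers are expected to follow the pattern sketched below: if the
      existing entry is part of a contpte block, repaint the block as
      individual ptes first, then perform the requested operation via the
      arch-private API.  (Helper names follow the contpte_* convention of
      this series; details may differ.)

          static inline void pte_clear(struct mm_struct *mm,
                                       unsigned long addr, pte_t *ptep)
          {
                  /* Unfold the surrounding contpte block, if there is one. */
                  contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
                  /* Then operate on the now-individual pte. */
                  __pte_clear(mm, addr, ptep);
          }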
      
      Since a contpte block only has a single access and dirty bit, the semantic
      here changes slightly; when getting a pte (e.g.  ptep_get()) that is part
      of a contpte mapping, the access and dirty information are pulled from the
      block (so all ptes in the block return the same access/dirty info).  When
      changing the access/dirty info on a pte (e.g.  ptep_set_access_flags())
      that is part of a contpte mapping, this change will affect the whole
      contpte block.  This works fine in practice since we guarantee that
      only a single folio is mapped by a contpte block, and the core-mm tracks
      access/dirty information per folio.
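
      A sketch of how that per-block semantic falls out of ptep_get(): when
      the entry carries PTE_CONT, the slow path gathers the access/dirty bits
      from every pte in the block and reports them on the returned value.
      (CONT_PTES and contpte_align_down() are the block geometry helpers used
      by this series; exact details may differ.)

          pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
          {
                  pte_t pte;
                  int i;

                  /* Walk the whole contiguous block this pte belongs to. */
                  ptep = contpte_align_down(ptep);

                  for (i = 0; i < CONT_PTES; i++, ptep++) {
                          pte = __ptep_get(ptep);

                          /* Accumulate a/d bits onto the returned pte. */
                          if (pte_dirty(pte))
                                  orig_pte = pte_mkdirty(orig_pte);
                          if (pte_young(pte))
                                  orig_pte = pte_mkyoung(orig_pte);
                  }

                  return orig_pte;
          }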
      
      In order for the public functions, which used to be pure inline, to
      continue to be callable by modules, export all the contpte_* symbols that
      are now called by those public inline functions.
      
      The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
      at build time.  It defaults to enabled as long as its dependency,
      TRANSPARENT_HUGEPAGE, is also enabled.  The core-mm depends upon
      TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if that is
      not enabled, there is no chance of meeting the physical contiguity
      requirement for contpte mappings.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-13-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Acked-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4602e575
    • Ryan Roberts's avatar
      arm64/mm: split __flush_tlb_range() to elide trailing DSB · d9d8dc2b
      Ryan Roberts authored
      Split __flush_tlb_range() into __flush_tlb_range_nosync() +
      __flush_tlb_range(), in the same way as the existing flush_tlb_page()
      arrangement.  This allows calling __flush_tlb_range_nosync() to elide the
      trailing DSB.  Forthcoming "contpte" code will take advantage of this when
      clearing the young bit from a contiguous range of ptes.
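
      The split follows the usual _nosync pattern, roughly as sketched below
      (argument list abbreviated; the real helpers also take stride and level
      parameters):

          /* Issue the range TLBIs but elide the trailing DSB. */
          static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
                                                      unsigned long start,
                                                      unsigned long end)
          {
                  /* ... dsb(ishst); TLBI loop; mmu_notifier callback ... */
          }

          static inline void __flush_tlb_range(struct vm_area_struct *vma,
                                               unsigned long start,
                                               unsigned long end)
          {
                  __flush_tlb_range_nosync(vma, start, end);
                  dsb(ish);       /* callers needing the sync still get it */
          }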
      
      Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs() has
      changed, but now aligns with the ordering of __flush_tlb_page().  It has
      been discussed that __flush_tlb_page() may be wrong though.  Regardless,
      both will be resolved separately if needed.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-12-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d9d8dc2b
    • Ryan Roberts's avatar
      arm64/mm: new ptep layer to manage contig bit · 5a00bfd6
      Ryan Roberts authored
      Create a new layer for the in-table PTE manipulation APIs.  For now, the
      existing API is prefixed with double underscore to become the arch-private
      API and the public API is just a simple wrapper that calls the private
      API.
      
      The public API implementation will subsequently be used to transparently
      manipulate the contiguous bit where appropriate.  But since there are
      already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
      first ensure those users use the private API directly so that the future
      contig-bit manipulations in the public API do not interfere with those
      existing uses.
      
      The following APIs are treated this way:
      
       - ptep_get
       - set_pte
       - set_ptes
       - pte_clear
       - ptep_get_and_clear
       - ptep_test_and_clear_young
       - ptep_clear_flush_young
       - ptep_set_wrprotect
       - ptep_set_access_flags
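
      For each of these, the pattern is the same: the existing implementation
      is renamed with a double-underscore prefix and the public name becomes a
      trivial wrapper, e.g. (sketch):

          /* Arch-private implementation (previously the public ptep_get()). */
          static inline pte_t __ptep_get(pte_t *ptep)
          {
                  return READ_ONCE(*ptep);
          }

          /* Public API: for now just forwards; later manages PTE_CONT. */
          static inline pte_t ptep_get(pte_t *ptep)
          {
                  return __ptep_get(ptep);
          }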
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5a00bfd6
    • Ryan Roberts's avatar
      arm64/mm: convert ptep_clear() to ptep_get_and_clear() · cbb0294f
      Ryan Roberts authored
      ptep_clear() is a generic wrapper around the arch-implemented
      ptep_get_and_clear().  We are about to convert ptep_get_and_clear() into a
      public version and a private version (__ptep_get_and_clear()) to support
      the transparent contpte work.  We won't have a private version of
      ptep_clear(), so let's convert it to call ptep_get_and_clear() directly.
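
      The resulting generic wrapper is expected to be roughly (simplified
      sketch; the real helpers also hook into page table checking):

          static inline void ptep_clear(struct mm_struct *mm,
                                        unsigned long addr, pte_t *ptep)
          {
                  /* Clear the entry; the returned pte is simply discarded. */
                  ptep_get_and_clear(mm, addr, ptep);
          }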
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-10-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cbb0294f
    • Ryan Roberts's avatar
      arm64/mm: convert set_pte_at() to set_ptes(..., 1) · 659e1930
      Ryan Roberts authored
      Since set_ptes() was introduced, set_pte_at() has been implemented as a
      generic macro around set_ptes(..., 1).  So this change should continue to
      generate the same code.  However, making this change prepares us for the
      transparent contpte support.  It means we can reroute set_ptes() to
      __set_ptes().  Since set_pte_at() is a generic macro, there will be no
      equivalent __set_pte_at() to reroute to.
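
      For reference, the generic definition is along the lines of:

          /* set_pte_at() is just the single-entry case of set_ptes(). */
          #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)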
      
      Note that a couple of calls to set_pte_at() remain in the arch code.  This
      is intentional, since those call sites are acting on behalf of core-mm and
      should continue to call into the public set_ptes() rather than the
      arch-private __set_ptes().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      659e1930
    • Ryan Roberts's avatar
      arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep) · 53273655
      Ryan Roberts authored
      There are a number of places in the arch code that read a pte by using the
      READ_ONCE() macro.  Refactor these call sites to instead use the
      ptep_get() helper, which itself is a READ_ONCE().  Generated code should
      be the same.
      
      This will benefit us when we shortly introduce the transparent contpte
      support.  In this case, ptep_get() will become more complex so we now have
      all the code abstracted through it.
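
      The conversion itself is mechanical; a representative before/after
      (sketch):

          pte_t pte;

          pte = READ_ONCE(*ptep);     /* before: raw read of the pte        */
          pte = ptep_get(ptep);       /* after: same READ_ONCE() under the  */
                                      /* hood, but abstracted for contpte   */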
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com
      Signed-off-by: default avatarRyan Roberts <ryan.roberts@arm.com>
      Tested-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      53273655