1. 12 Dec, 2022 14 commits
    • zpool: clean out dead code · 6a05aa30
      Johannes Weiner authored
      There is a lot of provision for flexibility that isn't actually needed or
      used.  Zswap (the only zpool user) always passes zpool_ops with an .evict
      method set.  The backends that reclaim only do so for zswap, so they can
      also directly call zpool_ops without indirection or checks.
      
      Finally, there is no need to check the retries parameter and bail with
      -EINVAL in the reclaim function, when that's called just a few lines below
      with a hard-coded 8.  There is no need to duplicate the evictable and
      sleep_mapped attrs from the driver in zpool_ops.
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-3-nphamcs@gmail.com
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zswap: fix writeback lock ordering for zsmalloc · 6b3379e8
      Johannes Weiner authored
      Patch series "Implement writeback for zsmalloc", v7.
      
      Unlike other zswap allocators such as zbud or z3fold, zsmalloc currently
      lacks the writeback mechanism.  This means that when the zswap pool is
      full, it will simply reject further allocations, and the pages will be
      written directly to swap.
      
      This series of patches implements writeback for zsmalloc. When the zswap
      pool becomes full, zsmalloc will attempt to evict all the compressed
      objects in the least-recently used zspages.
      
      
      This patch (of 6):
      
      zswap's customary lock order is tree->lock before pool->lock, because the
      tree->lock protects the entries' refcount, and the free callbacks in the
      backends acquire their respective pool locks to dispatch the backing
      object.  zsmalloc's map callback takes the pool lock, so zswap must not
      grab the tree->lock while a handle is mapped.  This currently only happens
      during writeback, which isn't implemented for zsmalloc.  In preparation
      for it, move the tree->lock section out of the mapped entry section.
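
      The ordering rule described above can be illustrated with a small
      userspace model.  This is only a sketch under stated assumptions: the
      rank values and the helpers lock_acquire() and lock_release() are
      hypothetical and are not zswap code.  The model simply rejects taking a
      lower-ranked lock (standing in for tree->lock) while a higher-ranked one
      (standing in for the pool lock taken by the map callback) is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: a lock may only be taken while no lock of equal
 * or higher rank is held.  Models "tree->lock before pool->lock". */
enum lock_rank { RANK_TREE = 1, RANK_POOL = 2 };

struct lock_state {
	unsigned int held;	/* bitmask of ranks currently held */
};

/* Returns true and records the lock if taking 'rank' respects the order. */
static bool lock_acquire(struct lock_state *s, enum lock_rank rank)
{
	unsigned int bit = 1u << rank;

	/* Any held bit at or above this rank is an ordering violation. */
	if (s->held >= bit)
		return false;
	s->held |= bit;
	return true;
}

/* Drops a lock that must currently be held. */
static void lock_release(struct lock_state *s, enum lock_rank rank)
{
	unsigned int bit = 1u << rank;

	assert(s->held & bit);
	s->held &= ~bit;
}
```

      In this model, acquiring RANK_TREE and then RANK_POOL succeeds, while
      acquiring RANK_TREE with RANK_POOL already held (the "handle is mapped"
      case) is reported as a violation.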
      
      Link: https://lkml.kernel.org/r/20221128191616.1261026-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20221128191616.1261026-2-nphamcs@gmail.com
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: fix madvise_pageout for private file mappings · fd3b1bc3
      Pavankumar Kondeti authored
      When MADV_PAGEOUT is called on a private file mapping VMA region, we bail
      out early if the process is neither the owner of the file nor able to
      write to it.  However, this VMA may have both private/shared clean pages
      and private dirty pages.  The opportunity to page out the private dirty
      pages (Anon pages) is missed.  Fix this behavior by letting pageout
      proceed for private file mappings and performing the file access check
      along with PageAnon() during the page walk.
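
      The per-page decision described above can be sketched as a small
      predicate.  This is a toy model, not the kernel code: struct page_info
      and should_pageout() are hypothetical names, and may_pageout_file
      stands in for the file access check (owner of the file or able to
      write to it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy description of one page found while walking a private file mapping. */
struct page_info {
	bool is_anon;	/* private dirty (COWed) page, i.e. PageAnon() is true */
};

/*
 * Decide per page during the walk instead of bailing out for the whole
 * VMA.  'may_pageout_file' models the result of the file access check.
 */
static bool should_pageout(const struct page_info *page, bool may_pageout_file)
{
	assert(page != NULL);
	if (page->is_anon)
		return true;		/* anon pages belong to the process */
	return may_pageout_file;	/* file pages need the access check */
}
```

      With this shape, a process that fails the file access check can still
      page out its anonymous pages, while file-backed pages stay untouched.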
      
      We observe ~10% improvement in zram usage, thus leaving more available
      memory on a 4GB RAM system running Android.
      
      [quic_pkondeti@quicinc.com: v2]
        Link: https://lkml.kernel.org/r/1669962597-27724-1-git-send-email-quic_pkondeti@quicinc.com
      Link: https://lkml.kernel.org/r/1667971116-12900-1-git-send-email-quic_pkondeti@quicinc.com
      Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: add tracepoint to collapse_file() · 4c9473e8
      Gautam Menghani authored
      Currently, is_shmem is not being captured. Capturing is_shmem is useful
      as it can indicate if tmpfs is being used as a backing store instead of
      persistent storage. Add the tracepoint in collapse_file() named
      "mm_khugepaged_collapse_file" for capturing is_shmem.
      
      [gautammenghani201@gmail.com: swap is_shmem and addr to save space, per Steven Rostedt]
        Link: https://lkml.kernel.org/r/20221202201807.182829-1-gautammenghani201@gmail.com
      Link: https://lkml.kernel.org/r/20221026052218.148234-1-gautammenghani201@gmail.com
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	[tracing]
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: remove FOLL_MIGRATION · f7355e99
      David Hildenbrand authored
      Fortunately, the last user (KSM) is gone, so let's just remove this rather
      special code from generic GUP handling -- especially because KSM never
      required the PMD handling as KSM only deals with individual base pages.
      
      [akpm@linux-foundation.org: fix merge snafu]
      Link: https://lkml.kernel.org/r/20221021101141.84170-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: convert break_ksm() to use walk_page_range_vma() · d7c0e68d
      David Hildenbrand authored
      FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually,
      there is not even a need to wait for the migration to finish; we only
      want to know if we're dealing with a KSM page.
      
      Using follow_page() just to identify a KSM page overcomplicates GUP code. 
      Let's use walk_page_range_vma() instead, because we don't actually care
      about the page itself, we only need to know a single property -- no need
      to even grab a reference.
      
      So, get rid of follow_page() usage such that we can get rid of
      FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
      the future.
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s).  I
      don't think we particularly care for now.
      
      Interestingly, the benchmark reduction is due to the single callback. 
      Adding a second callback (e.g., pud_entry()) reduces the benchmark by
      another 100-200 MiB/s.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagewalk: add walk_page_range_vma() · e07cda5f
      David Hildenbrand authored
      Let's add walk_page_range_vma(), which is similar to walk_page_vma() but
      is only interested in a subset of the VMA range.
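
      The "subset of the VMA range" idea can be sketched as a range check in
      a toy model.  Everything here is an assumption made for illustration:
      struct toy_vma and check_walk_range() are hypothetical, and whether
      the kernel helper validates its arguments in exactly this way is not
      stated in this message.

```c
#include <assert.h>
#include <errno.h>

/* Toy VMA covering the half-open address range [start, end). */
struct toy_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Accept [start, end) only if it is non-empty and lies fully inside the
 * VMA.  Returns 0 on success and -EINVAL otherwise.
 */
static int check_walk_range(const struct toy_vma *vma,
			    unsigned long start, unsigned long end)
{
	assert(vma->start < vma->end);
	if (start >= end)
		return -EINVAL;
	if (start < vma->start || end > vma->end)
		return -EINVAL;
	return 0;
}
```

      A walk over the whole VMA is then just the special case where start
      and end equal the VMA boundaries.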
      
      To be used in KSM code to stop using follow_page() next.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE · 6cce3314
      David Hildenbrand authored
      Let's stop breaking COW via a fake write fault and let's use
      FAULT_FLAG_UNSHARE instead.  This avoids any wrong side effects of the
      fake write fault, such as mapping the PTE writable and marking the pte
      dirty/softdirty.
      
      Consequently, we will no longer trigger a fake write fault and break COW
      without any such side-effects.
      
      Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
      page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
      will fail with VM_FAULT_SIGBUS and will simply return in break_ksm() with
      0 instead of actually breaking COW.
      
      For now, the KSM unmerge tests can trigger that:
          $ sudo ./ksm_functional_tests
          TAP version 13
          1..3
          # [RUN] test_unmerge
          ok 1 Pages were unmerged
          # [RUN] test_unmerge_discarded
          ok 2 Pages were unmerged
          # [RUN] test_unmerge_uffd_wp
          not ok 3 Pages were unmerged
          Bail out! 1 out of 3 tests failed
          # Planned tests != run tests (2 != 3)
          # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
      
      The warning in dmesg also indicates this wrong handling:
          [  230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
          [  230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
          [  230.110124] Hardware name: [...]
          [  230.117775] Call Trace:
          [  230.120227]  <TASK>
          [  230.122334]  dump_stack_lvl+0x44/0x5c
          [  230.126010]  handle_userfault.cold+0x14/0x19
          [  230.130281]  ? tlb_finish_mmu+0x65/0x170
          [  230.134207]  ? uffd_wp_range+0x65/0xa0
          [  230.137959]  ? _raw_spin_unlock+0x15/0x30
          [  230.141972]  ? do_wp_page+0x50/0x590
          [  230.145551]  __handle_mm_fault+0x9f5/0xf50
          [  230.149652]  ? mmput+0x1f/0x40
          [  230.152712]  handle_mm_fault+0xb9/0x2a0
          [  230.156550]  break_ksm+0x141/0x180
          [  230.159964]  unmerge_ksm_pages+0x60/0x90
          [  230.163890]  ksm_madvise+0x3c/0xb0
          [  230.167295]  do_madvise.part.0+0x10c/0xeb0
          [  230.171396]  ? do_syscall_64+0x67/0x80
          [  230.175157]  __x64_sys_madvise+0x5a/0x70
          [  230.179082]  do_syscall_64+0x58/0x80
          [  230.182661]  ? do_syscall_64+0x67/0x80
          [  230.186413]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This is primarily a fix for KSM+userfaultfd-wp; however, the fake write
      fault was always questionable.  As this fix is not easy to backport and
      it's not very critical, let's not cc stable.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-6-david@redhat.com
      Fixes: 529b930b ("userfaultfd: wp: hook userfault handler to write protection fault")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove VM_FAULT_WRITE · cb8d8633
      David Hildenbrand authored
      All users -- GUP and KSM -- are gone, let's just remove it.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE · 58f595c6
      David Hildenbrand authored
      Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
      remaining user of VM_FAULT_WRITE.  As we also want to stop triggering a
      fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
      GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
      (including KSM pages), let's stop relying on VM_FAULT_WRITE.
      
      Let's rework break_ksm() to not rely on the return value of
      handle_mm_fault() anymore to figure out whether COW-breaking was
      successful.  Simply perform another follow_page() lookup to verify the
      result.
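
      The "look up again instead of trusting the fault return value" loop
      can be sketched with a toy userspace model.  None of this is kernel
      code: struct toy_page, lookup_is_ksm(), unshare_fault() and
      toy_break_ksm() are hypothetical stand-ins for the page lookup and the
      fault handler, and the error handling is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state used to model repeated unshare attempts. */
struct toy_page {
	bool is_ksm;		/* still a KSM page? */
	int faults_needed;	/* unshare attempts until COW is broken */
	bool fault_fails;	/* models a fatal fault error */
};

/* Stand-in for the page lookup: report only the one property we need. */
static bool lookup_is_ksm(const struct toy_page *p)
{
	return p->is_ksm;
}

/* Stand-in for an unshare fault; returns false on a fatal error. */
static bool unshare_fault(struct toy_page *p)
{
	assert(p->is_ksm);
	if (p->fault_fails)
		return false;
	if (--p->faults_needed <= 0)
		p->is_ksm = false;
	return true;
}

/* Returns 0 once the page is no longer KSM, -1 on a fatal fault error. */
static int toy_break_ksm(struct toy_page *p)
{
	for (;;) {
		/* Verify the result by looking the page up again. */
		if (!lookup_is_ksm(p))
			return 0;
		if (!unshare_fault(p))
			return -1;
	}
}
```

      The extra lookup per iteration is what makes this shape slightly less
      efficient than inspecting the fault's return value directly.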
      
      While this makes break_ksm() slightly less efficient, we can simplify
      handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE without
      introducing similar KSM-specific behavior for FAULT_FLAG_UNSHARE.
      
      In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
      performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
      a performance degradation of ~4% -- 5% (old: ~5250 MiB/s, new: ~5010
      MiB/s).
      
      I don't think that we particularly care about that performance drop when
      unmerging.  If it ever turns out to be an actual performance issue, we can
      think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
      it simple for now.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add test to measure MADV_UNMERGEABLE performance · 5036880e
      David Hildenbrand authored
      Let's add a test to measure performance of KSM breaking not triggered via
      COW, but triggered by disabling KSM on an area filled with KSM pages via
      MADV_UNMERGEABLE.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagewalk: don't trigger test_walk() in walk_page_vma() · c31783ee
      David Hildenbrand authored
      As Peter points out, the caller passes a single VMA and can just do that
      check itself.
      
      And in fact, no existing users rely on test_walk() getting called.  So
      let's just remove it and make the implementation slightly more efficient.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add KSM unmerge tests · 93fb70aa
      David Hildenbrand authored
      Patch series "mm/ksm: break_ksm() cleanups and fixes", v2.
      
      This series cleans up and fixes break_ksm().  In summary, we no longer use
      fake write faults to break COW but instead FAULT_FLAG_UNSHARE.  Further,
      we move away from using follow_page() --- that we can hopefully remove
      completely at one point --- and use new walk_page_range_vma() instead.
      
      Fortunately, we can get rid of VM_FAULT_WRITE and FOLL_MIGRATION in common
      code now.
      
      Extend the existing ksm tests by an unmerge benchmark and some new
      unmerge tests.
      
      Also, add a selftest to measure MADV_UNMERGEABLE performance.  In my setup
      (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance
      on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in a
      performance degradation of ~6% -- 7% (old: ~5250 MiB/s, new: ~4900 MiB/s).
      I don't think we particularly care for now, but it's good to be aware of
      the implication.
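
      The quoted percentage follows directly from the two throughput
      figures.  As a quick check, a minimal sketch (percent_drop() is a
      hypothetical helper, not part of the selftests):

```c
#include <assert.h>

/* Relative throughput drop in percent when going from old to new MiB/s. */
static double percent_drop(double old_mibs, double new_mibs)
{
	assert(old_mibs > 0.0);
	return (old_mibs - new_mibs) / old_mibs * 100.0;
}
```

      For the figures above, percent_drop(5250, 4900) is about 6.7, which
      falls in the stated ~6% -- 7% range.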
      
      
      This patch (of 9):
      
      Let's add three unmerge tests (MADV_UNMERGEABLE unmerging all pages in the
      range).
      
      test_unmerge(): basic unmerge tests
      test_unmerge_discarded(): have some pte_none() entries in the range
      test_unmerge_uffd_wp(): protect the merged pages using uffd-wp
      
      ksm_tests.c currently contains a mixture of benchmarks and tests, whereby
      each test is carried out by executing the ksm_tests binary with specific
      parameters.  Let's add new ksm_functional_tests.c that performs multiple,
      smaller functional tests all at once.
      
      Link: https://lkml.kernel.org/r/20221021101141.84170-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20221021101141.84170-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: enable running select groups of tests · 85463321
      Joel Savitz authored
      Our memory management kernel CI testing at Red Hat uses the VM
      selftests and we have run into two problems:
      
      First, our LTP tests overlap with the VM selftests.
      
      We want to avoid unhelpful redundancy in our testing practices.
      
      Second, we have observed the current run_vmtests.sh to report overall
      failure/ambiguous results in the case that a machine lacks the necessary
      hardware to perform one or more of the tests. E.g. ksm tests that
      require more than one numa node.
      
      We want to be able to run the vm selftests suitable to particular hardware.
      
      Add the ability to run one or more groups of vm tests via run_vmtests.sh
      instead of simply all-or-none in order to solve these problems.
      
      Preserve existing default behavior of running all tests when the script
      is invoked with no arguments.
      
      Documentation of test groups is included in the patch as follows:
      
          # ./run_vmtests.sh [ -h || --help ]
      
          usage: ./tools/testing/selftests/vm/run_vmtests.sh [ -h | -t "<categories>"]
            -t: specify specific categories to tests to run
            -h: display this message
      
          The default behavior is to run all tests.
      
          Alternatively, specific groups tests can be run by passing a string
          to the -t argument containing one or more of the following categories
          separated by spaces:
          - mmap
      	    tests for mmap(2)
          - gup_test
      	    tests for gup using gup_test interface
          - userfaultfd
      	    tests for  userfaultfd(2)
          - compaction
      	    a test for the patch "Allow compaction of unevictable pages"
          - mlock
      	    tests for mlock(2)
          - mremap
      	    tests for mremap(2)
          - hugevm
      	    tests for very large virtual address space
          - vmalloc
      	    vmalloc smoke tests
          - hmm
      	    hmm smoke tests
          - madv_populate
      	    test memadvise(2) MADV_POPULATE_{READ,WRITE} options
          - memfd_secret
      	    test memfd_secret(2)
          - process_mrelease
      	    test process_mrelease(2)
          - ksm
      	    ksm tests that do not require >=2 NUMA nodes
          - ksm_numa
      	    ksm tests that require >=2 NUMA nodes
          - pkey
      	    memory protection key tests
          - soft_dirty
      	    test soft dirty page bit semantics
          - anon_cow
      	    test anonymous copy-on-write semantics
          example: ./run_vmtests.sh -t "hmm mmap ksm"
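
      The selection rule behind -t can be modeled as a whole-word membership
      test on a space-separated list.  The sketch below is written in C
      purely for illustration; it is not how run_vmtests.sh is implemented,
      category_selected() is a hypothetical name, and the default "run all
      tests" case is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * True if 'category' appears as a whole word in the space-separated
 * list 'selected', so that "ksm" does not match "ksm_numa".
 */
static bool category_selected(const char *selected, const char *category)
{
	size_t len = strlen(category);
	const char *p = selected;

	assert(len > 0);
	while ((p = strstr(p, category)) != NULL) {
		bool starts = (p == selected) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += 1;	/* keep searching after a partial match */
	}
	return false;
}
```

      With the list "hmm mmap ksm" from the example, the categories hmm,
      mmap and ksm are selected, while ksm_numa and mlock are not.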
      
      Link: https://lkml.kernel.org/r/20221018231222.1884715-1-jsavitz@redhat.com
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 10 Dec, 2022 10 commits
    • Andrew Morton's avatar
      3b910105
    • memcg: fix possible use-after-free in memcg_write_event_control() · 4a7ba45b
      Tejun Heo authored
      memcg_write_event_control() accesses the dentry->d_name of the specified
      control fd to route the write call.  As a cgroup interface file can't be
      renamed, it's safe to access d_name as long as the specified file is a
      regular cgroup file.  Also, as these cgroup interface files can't be
      removed before the directory, it's safe to access the parent too.
      
      Prior to 347c4a87 ("memcg: remove cgroup_event->cft"), there was a
      call to __file_cft() which verified that the specified file is a regular
      cgroupfs file before further accesses.  The cftype pointer returned from
      __file_cft() was no longer necessary and the commit inadvertently dropped
      the file type check with it, allowing any file to slip through.  With the
      invariants broken, the d_name and parent accesses can now race against
      renames and removals of arbitrary files and cause use-after-frees.
      
      Fix the bug by resurrecting the file type check in __file_cft().  Now that
      cgroupfs is implemented through kernfs, checking the file operations needs
      to go through a layer of indirection.  Instead, let's check the superblock
      and dentry type.
      
      Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
      Fixes: 347c4a87 ("memcg: remove cgroup_event->cft")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jann Horn <jannh@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>	[3.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update Muchun Song's email · a501788a
      Muchun Song authored
      I'm moving to the @linux.dev account.  Map my old addresses and update it
      to my new address.
      
      Link: https://lkml.kernel.org/r/20221208115548.85244-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: fix gup_pud_range() for dax · fcd0ccd8
      John Starks authored
      For dax pud, pud_huge() returns true on x86. So the function works as long
      as hugetlb is configured. However, dax doesn't depend on hugetlb.
      Commit 414fd080 ("mm/gup: fix gup_pmd_range() for dax") fixed
      devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as
      well.
      
      This fixes the below kernel panic:
      
      general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP
      	< snip >
      Call Trace:
      <TASK>
      get_user_pages_fast+0x1f/0x40
      iov_iter_get_pages+0xc6/0x3b0
      ? mempool_alloc+0x5d/0x170
      bio_iov_iter_get_pages+0x82/0x4e0
      ? bvec_alloc+0x91/0xc0
      ? bio_alloc_bioset+0x19a/0x2a0
      blkdev_direct_IO+0x282/0x480
      ? __io_complete_rw_common+0xc0/0xc0
      ? filemap_range_has_page+0x82/0xc0
      generic_file_direct_write+0x9d/0x1a0
      ? inode_update_time+0x24/0x30
      __generic_file_write_iter+0xbd/0x1e0
      blkdev_write_iter+0xb4/0x150
      ? io_import_iovec+0x8d/0x340
      io_write+0xf9/0x300
      io_issue_sqe+0x3c3/0x1d30
      ? sysvec_reschedule_ipi+0x6c/0x80
      __io_queue_sqe+0x33/0x240
      ? fget+0x76/0xa0
      io_submit_sqes+0xe6a/0x18d0
      ? __fget_light+0xd1/0x100
      __x64_sys_io_uring_enter+0x199/0x880
      ? __context_tracking_enter+0x1f/0x70
      ? irqentry_exit_to_user_mode+0x24/0x30
      ? irqentry_exit+0x1d/0x30
      ? __context_tracking_exit+0xe/0x70
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x61/0xcb
      RIP: 0033:0x7fc97c11a7be
      	< snip >
      </TASK>
      ---[ end trace 48b2e0e67debcaeb ]---
      RIP: 0010:internal_get_user_pages_fast+0x340/0x990
      	< snip >
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      
      Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com
      Fixes: 414fd080 ("mm/gup: fix gup_pmd_range() for dax")
      Signed-off-by: John Starks <jostarks@microsoft.com>
      Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mmap: fix do_brk_flags() modifying obviously incorrect VMAs · 6c28ca64
      Liam Howlett authored
      Add more sanity checks to the VMA that do_brk_flags() will expand.  Ensure
      the VMA matches basic merge requirements within the function before
      calling can_vma_merge_after().
      
      Drop the duplicate checks from vm_brk_flags() since they will be enforced
      later.
      
      The old code would expand file VMAs on brk(), which is functionally
      wrong and also dangerous in terms of locking because the brk() path
      isn't designed for file VMAs and therefore doesn't lock the file
      mapping.  Checking can_vma_merge_after() ensures that new anonymous
      VMAs can't be merged into file VMAs.
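
      The kind of sanity check described above can be sketched with a toy
      model.  This is an illustration only: struct toy_vma and
      brk_can_expand() are hypothetical, and the real can_vma_merge_after()
      takes more parameters and checks more conditions than the three shown
      here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy VMA covering [start, end) with flags and an optional backing file. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
	const void *file;	/* non-NULL for file-backed mappings */
};

/*
 * Allow brk() to expand 'vma' only if the new range starts right at its
 * end, the VMA is anonymous, and the flags of old and new range match.
 */
static bool brk_can_expand(const struct toy_vma *vma, unsigned long addr,
			   unsigned long flags)
{
	if (vma == NULL)
		return false;
	assert(vma->start < vma->end);
	return vma->end == addr && vma->file == NULL && vma->flags == flags;
}
```

      In this model a file-backed VMA is never expanded, which mirrors the
      point that new anonymous VMAs must not be merged into file VMAs.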
      
      See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com
      Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Suggested-by: Jann Horn <jannh@google.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit · 630dc25e
      David Hildenbrand authored
      We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
      store a physical address.
      
      On a 64bit system, both are 64bit wide.  However, on a 32bit system, the
      latter might be 64bit wide.  This is, for example, the case on x86 with
      PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
      32bit.
      
      The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
      that case, and assumes that the maximum PFN is limited by a 32bit
      phys_addr_t.  This implies that SWP_PFN_BITS will currently only be able
      to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.
      
      Let's rely on the number of bits in phys_addr_t instead, but make sure to
      not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
      is_pfn_swap_entry() unhappy.  Note that swp_entry_t is effectively an
      unsigned long and the maximum swap offset shares that value with the swap
      type.
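
      The arithmetic behind this can be sketched as follows.  The helper
      swp_pfn_bits() and its parameter names are hypothetical, chosen for
      illustration only; the kernel uses a macro, and the number of bits
      available for the swap offset depends on the configuration.

```c
#include <assert.h>

/*
 * Number of PFN bits a swap entry can carry: the PFN width implied by
 * the physical address width, capped by the bits available for the swap
 * offset.
 */
static int swp_pfn_bits(int phys_addr_bits, int page_shift,
			int swap_offset_bits)
{
	int pfn_bits;

	assert(phys_addr_bits > page_shift);
	pfn_bits = phys_addr_bits - page_shift;
	return pfn_bits < swap_offset_bits ? pfn_bits : swap_offset_bits;
}
```

      With 4k pages (a page shift of 12), assuming a 32bit physical address
      leaves 32 - 12 = 20 PFN bits, i.e. 2^20 pages of 4 KiB, which is the
      4 GiB limit mentioned above.  With a 64bit phys_addr_t the result is
      bounded by the swap offset width instead; the value 26 used in the
      test below is an assumed example, not a figure from this message.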
      
      For example, on an 8 GiB x86 PAE system with a kernel config based on
      Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
      fail removing migration entries (remove_migration_ptes()), because
      mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
      swp_offset_pfn() wrongly masks off PFN bits.  For example,
      split_huge_page_to_list()->...->remap_page() will leave migration entries
      in place and continue to unlock the page.
      
      Later, when we stumble over these migration entries (e.g., via
      /proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
      migration entries shouldn't exist anymore and the page was unlocked.
      
      [   33.067591] kernel BUG at include/linux/swapops.h:497!
      [   33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      [   33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G            E      6.1.0-rc8+ #16
      [   33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
      [   33.067606] EIP: pagemap_pmd_range+0x644/0x650
      [   33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
      [   33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
      [   33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
      [   33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
      [   33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
      [   33.067625] Call Trace:
      [   33.067628]  ? madvise_free_pte_range+0x720/0x720
      [   33.067632]  ? smaps_pte_range+0x4b0/0x4b0
      [   33.067634]  walk_pgd_range+0x325/0x720
      [   33.067637]  ? mt_find+0x1d6/0x3a0
      [   33.067641]  ? mt_find+0x1d6/0x3a0
      [   33.067643]  __walk_page_range+0x164/0x170
      [   33.067646]  walk_page_range+0xf9/0x170
      [   33.067648]  ? __kmem_cache_alloc_node+0x2a8/0x340
      [   33.067653]  pagemap_read+0x124/0x280
      [   33.067658]  ? default_llseek+0x101/0x160
      [   33.067662]  ? smaps_account+0x1d0/0x1d0
      [   33.067664]  vfs_read+0x90/0x290
      [   33.067667]  ? do_madvise.part.0+0x24b/0x390
      [   33.067669]  ? debug_smp_processor_id+0x12/0x20
      [   33.067673]  ksys_pread64+0x58/0x90
      [   33.067675]  __ia32_sys_ia32_pread64+0x1b/0x20
      [   33.067680]  __do_fast_syscall_32+0x4c/0xc0
      [   33.067683]  do_fast_syscall_32+0x29/0x60
      [   33.067686]  do_SYSENTER_32+0x15/0x20
      [   33.067689]  entry_SYSENTER_32+0x98/0xf1
      
      Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
      readable and consistent.
      
      [david@redhat.com: rely on sizeof(phys_addr_t) and min_t() instead]
        Link: https://lkml.kernel.org/r/20221206105737.69478-1-david@redhat.com
      [david@redhat.com: use "int" for comparison, as we're only comparing numbers < 64]
        Link: https://lkml.kernel.org/r/1f157500-2676-7cef-a84e-9224ed64e540@redhat.com
      Link: https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com
      Fixes: 0d206b5d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Hugh Dickins's avatar
      tmpfs: fix data loss from failed fallocate · 44bcabd7
      Hugh Dickins authored
      Fix tmpfs data loss when the fallocate system call is interrupted by a
      signal, or fails for some other reason.  The partial folio handling in
      shmem_undo_range() forgot to consider this unfalloc case, and was liable
      to erase or truncate out data which had already been committed earlier.
      
      It turns out that none of the partial folio handling there is appropriate
      for the unfalloc case, which just wants to proceed to removal of whole
      folios: which find_get_entries() provides, even when partially covered.
      
      Original patch by Rui Wang.
      
      Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/
      Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com
      Fixes: b9a8a419 ("truncate,shmem: Handle truncates that split large folios")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Guoqi Chen <chenguoqic@163.com>
        Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/
      Cc: Rui Wang <kernel@hev.cc>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Michal Hocko's avatar
      kselftests: cgroup: update kmem test precision tolerance · de16d6e4
      Michal Hocko authored
      1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64") changed the batch
      size, but this test case was left behind.  This has led to a test
      failure reported by the test bot:
      not ok 2 selftests: cgroup: test_kmem # exit=1
      
      Update the tolerance for the pcp charges to reflect the
      MEMCG_CHARGE_BATCH change to fix this.
      
      [akpm@linux-foundation.org: update comments, per Roman]
      Link: https://lkml.kernel.org/r/Y4m8Unt6FhWKC6IH@dhcp22.suse.cz
      Fixes: 1813e51e ("memcg: increase MEMCG_CHARGE_BATCH to 64")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: kernel test robot <yujie.liu@intel.com>
        Link: https://lore.kernel.org/oe-lkp/202212010958.c1053bd3-yujie.liu@intel.com
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Tested-by: Yujie Liu <yujie.liu@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Jason A. Donenfeld's avatar
      mm: do not BUG_ON missing brk mapping, because userspace can unmap it · f5ad5083
      Jason A. Donenfeld authored
      The following program will trigger the BUG_ON that this patch removes,
      because the user can munmap() mm->brk:
      
        #include <sys/syscall.h>
        #include <sys/mman.h>
        #include <assert.h>
        #include <unistd.h>
      
        static void *brk_now(void)
        {
          return (void *)syscall(SYS_brk, 0);
        }
      
        static void brk_set(void *b)
        {
          assert(syscall(SYS_brk, b) != -1);
        }
      
        int main(int argc, char *argv[])
        {
          void *b = brk_now();
          brk_set(b + 4096);
          assert(munmap(b - 4096, 4096 * 2) == 0);
          brk_set(b);
          return 0;
        }
      
      Compile that with musl, since glibc actually uses brk(), and then
      execute it, and it'll hit this splat:
      
        kernel BUG at mm/mmap.c:229!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 12 PID: 1379 Comm: a.out Tainted: G S   U             6.1.0-rc7+ #419
        RIP: 0010:__do_sys_brk+0x2fc/0x340
        Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
        RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
        RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
        RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
        R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
        R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
        FS:  0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
        PKRU: 55555554
        Call Trace:
         <TASK>
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        RIP: 0033:0x400678
        Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
        RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
        RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
        RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
        RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
        R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
        R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
         </TASK>
      
      Instead, just return the old brk value if the original mapping has been
      removed.
      
      [akpm@linux-foundation.org: fix changelog, per Liam]
      Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
      Fixes: 2e7ce7d3 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Matti Vaittinen's avatar
      mailmap: update Matti Vaittinen's email address · 38f1d4ae
      Matti Vaittinen authored
      The email backend used by ROHM keeps labeling patches as spam.  This can
      result in missing the patches.
      
      Switch my mail address from a company mail to a personal one.
      
      Link: https://lkml.kernel.org/r/8f4498b66fedcbded37b3b87e0c516e659f8f583.1669912977.git.mazziesaccount@gmail.com
      Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com>
      Suggested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Ben Widawsky <bwidawsk@kernel.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Colin Ian King <colin.i.king@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Vasily Averin <vasily.averin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 30 Nov, 2022 16 commits