1. 18 Oct, 2023 40 commits
    • mm: handle read faults under the VMA lock · 12214eba
      Matthew Wilcox (Oracle) authored
      Most file-backed faults are already handled through ->map_pages(), but if
      we need to do I/O we'll come this way.  Since filemap_fault() is now safe
      to be called under the VMA lock, we can handle these faults under the VMA
      lock now.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle COW faults under the VMA lock · 4de8c93a
      Matthew Wilcox (Oracle) authored
      If the page is not currently present in the page tables, we need to call
      the page fault handler to find out which page we're supposed to COW, so we
      need to both check that there is already an anon_vma and that the fault
      handler doesn't need the mmap_lock.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle shared faults under the VMA lock · 4ed43798
      Matthew Wilcox (Oracle) authored
      There are many implementations of ->fault and some of them depend on
      mmap_lock being held.  All vm_ops that implement ->map_pages() end up
      calling filemap_fault(), which I have audited to be sure it does not rely
      on mmap_lock.  So (for now) key off ->map_pages existing as a flag to
      indicate that it's safe to call ->fault while only holding the vma lock.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
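The rule in the shared-faults commit above (treat the presence of ->map_pages as the hint that ->fault is safe under the VMA lock) can be modelled in user space. This is a minimal sketch, not kernel code: `vm_ops_model`, `example_handler` and `must_retry_under_mmap_lock` are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's vm_operations_struct. */
struct vm_ops_model {
    int (*fault)(void *ctx);
    int (*map_pages)(void *ctx);
};

/* Placeholder handler so the model has something to point at. */
int example_handler(void *ctx)
{
    (void)ctx;
    return 0;
}

/*
 * Returns true when a fault taken under the per-VMA lock should be
 * retried with mmap_lock held.  Having ->map_pages is used as the hint
 * that ->fault ends up in filemap_fault() and so does not need mmap_lock.
 */
bool must_retry_under_mmap_lock(const struct vm_ops_model *ops,
                                bool holding_vma_lock)
{
    if (!holding_vma_lock)
        return false;   /* mmap_lock is already held: nothing to retry */
    return ops == NULL || ops->map_pages == NULL;
}
```

A driver whose vm_ops lack ->map_pages would, in this model, always be sent back to the mmap_lock path.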
    • mm: call wp_page_copy() under the VMA lock · 164b06f2
      Matthew Wilcox (Oracle) authored
      It is usually safe to call wp_page_copy() under the VMA lock.  The only
      unsafe situation is when no anon_vma has been allocated for this VMA, and
      we have to look at adjacent VMAs to determine if their anon_vma can be
      shared.  Since this happens only for the first COW of a page in this VMA,
      the majority of calls to wp_page_copy() do not need to fall back to the
      mmap_sem.
      
      Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will
      return RETRY if we currently hold the VMA lock and need to allocate an
      anon_vma.  This lets us drop the check in do_wp_page().
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
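The behaviour described for vmf_anon_prepare() in the commit above can be sketched as a small user-space model. The types and the function `anon_prepare_model` are hypothetical; the real helper works on kernel structures and performs a real allocation, so treat this only as an outline of the decision it makes.

```c
#include <stdbool.h>
#include <stddef.h>

enum fault_result { FAULT_OK = 0, FAULT_RETRY = 1 };

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct vma_model {
    void *anon_vma;             /* NULL until the first COW in this VMA */
};

struct fault_model {
    struct vma_model *vma;
    bool holding_vma_lock;      /* false means mmap_lock is held instead */
};

enum fault_result anon_prepare_model(struct fault_model *vmf)
{
    static int placeholder_anon_vma;   /* stands in for a real allocation */

    if (vmf->vma->anon_vma)
        return FAULT_OK;               /* common case: already set up */
    if (vmf->holding_vma_lock)
        return FAULT_RETRY;            /* adjacent VMAs must be examined */
    /* With mmap_lock held it is safe to allocate or share an anon_vma. */
    vmf->vma->anon_vma = &placeholder_anon_vma;
    return FAULT_OK;
}
```

Only the first COW in a VMA takes the retry path in this model; every later call returns FAULT_OK straight away, which matches the commit's point that most calls need no fallback.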
    • mm: make lock_folio_maybe_drop_mmap() VMA lock aware · 5d74b2ab
      Matthew Wilcox (Oracle) authored
      Patch series "Handle more faults under the VMA lock", v2.
      
      At this point, we're handling the majority of file-backed page faults
      under the VMA lock, using the ->map_pages entry point.  This patch set
      attempts to expand that to the following situations:
      
       - We have to do a read.  This could be because we've hit the point in
         the readahead window where we need to kick off the next readahead,
         or because the page is simply not present in cache.
       - We're handling a write fault.  Most applications don't do I/O by writes
         to shared mmaps for very good reasons, but some do, and it'd be nice
         to not make that slow unnecessarily.
       - We're doing a COW of a private mapping (both PTE already present
         and PTE not-present).  These are two different codepaths and I handle
         both of them in this patch set.
      
      There is no support in this patch set for drivers to mark themselves as
      being VMA lock friendly; they could implement the ->map_pages
      vm_operation, but if they do, they would be the first.  This is probably
      something we want to change at some point in the future, and I've marked
      where to make that change in the code.
      
      There is very little performance change in the benchmarks we've run;
      mostly because the vast majority of page faults are handled through the
      other paths.  I still think this patch series is useful for workloads that
      may take these paths more often, and just for cleaning up the fault path
      in general (it's now clearer why we have to retry in these cases).
      
      
      This patch (of 6):
      
      Drop the VMA lock instead of the mmap_lock if that's the one which
      is held.
      
      Link: https://lkml.kernel.org/r/20231006195318.4087158-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20231006195318.4087158-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • percpu_counter: extend _limited_add() to negative amounts · 1431996b
      Hugh Dickins authored
      Though tmpfs does not need it, percpu_counter_limited_add() can be twice
      as useful if it works sensibly with negative amounts (subs) - typically
      decrements towards a limit of 0 or nearby: as suggested by Dave Chinner.
      
      And in the course of that reworking, skip the percpu counter sum if it is
      already obvious that the limit would be passed: as suggested by Tim Chen.
      
      Extend the comment above __percpu_counter_limited_add(), defining the
      behaviour with positive and negative amounts, allowing negative limits,
      but not bothering about overflow beyond S64_MAX.
      
      Link: https://lkml.kernel.org/r/8f86083b-c452-95d4-365b-f16a2e4ebcd4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
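The sign-dependent limit check described in the commit above might look roughly like this in a single-threaded model. A plain `int64_t` stands in for the percpu counter, and `limited_add_model` is a hypothetical name. The interpretation (a positive amount must not push the total above the limit, a negative amount must not push it below) is an assumption drawn from the commit text, not a copy of the kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Add amount to *count only if the result stays on the permitted side
 * of limit.  Like the commit, this does not try to handle overflow
 * beyond the 64-bit signed range.
 */
bool limited_add_model(int64_t *count, int64_t limit, int64_t amount)
{
    int64_t total;

    if (amount == 0)
        return true;

    total = *count + amount;
    if (amount > 0 ? total > limit : total < limit)
        return false;            /* the limit would be passed: no change */

    *count = total;
    return true;
}
```

With a limit of 0 and negative amounts, the same function acts as a "decrement but never go below zero" helper.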
    • shmem,percpu_counter: add _limited_add(fbc, limit, amount) · beb98686
      Hugh Dickins authored
      Percpu counter's compare and add are separate functions: without locking
      around them (which would defeat their purpose), it has been possible to
      overflow the intended limit.  Imagine all the other CPUs fallocating tmpfs
      huge pages to the limit, in between this CPU's compare and its add.
      
      I have not seen reports of that happening; but tmpfs's recent addition of
      dquot_alloc_block_nodirty() in between the compare and the add makes it
      even more likely, and I'd be uncomfortable to leave it unfixed.
      
      Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
      
      I believe this implementation is correct, and slightly more efficient than
      the combination of compare and add (taking the lock once rather than twice
      when nearing full - the last 128MiB of a tmpfs volume on a machine with
      128 CPUs and 4KiB pages); but it does beg for a better design - when
      nearing full, there is no new batching, but the costly percpu counter sum
      across CPUs still has to be done, while locked.
      
      Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well
      as cpu_online_mask: but shouldn't __percpu_counter_compare() and
      __percpu_counter_limited_add() then be adding a num_dying_cpus() to
      num_online_cpus(), when they calculate the maximum which could be held
      across CPUs?  But the times when it matters would be vanishingly rare.
      
      Link: https://lkml.kernel.org/r/bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
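To illustrate why a separate compare followed by an add can overshoot, and why a combined operation cannot, here is a sketch that swaps in a single shared C11 atomic in place of a percpu counter. It is deliberately not the percpu_counter implementation (no per-CPU batching, no lock); `atomic_limited_add` is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Check-and-add as one step: the add only succeeds if the value that
 * was checked against the limit is still the current value.  If another
 * thread changed it in between, the check is redone on the fresh value.
 */
bool atomic_limited_add(_Atomic long *used, long limit, long amount)
{
    long old = atomic_load(used);

    do {
        if (old + amount > limit)
            return false;       /* would pass the limit: leave unchanged */
    } while (!atomic_compare_exchange_weak(used, &old, old + amount));

    return true;
}
```

On a failed exchange, `atomic_compare_exchange_weak` reloads `old` with the current value, so the limit check is repeated before the next attempt.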
    • shmem: _add_to_page_cache() before shmem_inode_acct_blocks() · 3022fd7a
      Hugh Dickins authored
      There has been a recurring problem, that when a tmpfs volume is being
      filled by racing threads, some fail with ENOSPC (or consequent SIGBUS or
      EFAULT) even though all allocations were within the permitted size.
      
      This was a problem since early days, but magnified and complicated by the
      addition of huge pages.  We have often worked around it by adding some
      slop to the tmpfs size, but it's hard to say how much is needed, and some
      users prefer not to do that: e.g. keeping sparse files in a tightly
      tailored tmpfs helps to prevent accidental writing to holes.
      
      This comes from the allocation sequence:
      1. check page cache for existing folio
      2. check and reserve from vm_enough_memory
      3. check and account from size of tmpfs
      4. if huge, check page cache for overlapping folio
      5. allocate physical folio, huge or small
      6. check and charge from mem cgroup limit
      7. add to page cache (but maybe another folio already got in).
      
      Concurrent tasks allocating at the same position could deplete the size
      allowance and fail.  Doing vm_enough_memory and size checks before the
      folio allocation was intentional (to limit the load on the page allocator
      from this source) and still has some virtue; but memory cgroup never did
      that, so I think it's better reordered to favour predictable behaviour.
      
      1. check page cache for existing folio
      2. if huge, check page cache for overlapping folio
      3. allocate physical folio, huge or small
      4. check and charge from mem cgroup limit
      5. add to page cache (but maybe another folio already got in)
      6. check and reserve from vm_enough_memory
      7. check and account from size of tmpfs.
      
      The folio lock held from allocation onwards ensures that the !uptodate
      folio cannot be used by others, and can safely be deleted from the cache
      if checks 6 or 7 subsequently fail (and those waiting on folio lock
      already check that the folio was not truncated once they get the lock);
      and the early addition to page cache ensures that racers find it before
      they try to duplicate the accounting.
      
      Seize the opportunity to tidy up shmem_get_folio_gfp()'s ENOSPC retrying,
      which can be combined inside the new shmem_alloc_and_add_folio(): doing 2
      splits twice (once huge, once nonhuge) is not exactly equivalent to trying
      5 splits (and giving up early on huge), but let's keep it simple unless
      more complication proves necessary.
      
      Userfaultfd is a foreign country: they do things differently there, and
      for good reason - to avoid mmap_lock deadlock.  Leave ordering in
      shmem_mfill_atomic_pte() untouched for now, but I would rather like to
      mesh it better with shmem_get_folio_gfp() in the future.
      
      Link: https://lkml.kernel.org/r/22ddd06-d919-33b-1219-56335c1bf28e@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
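The reordered sequence in the commit above (add to the page cache first, account afterwards, and delete from the cache if accounting fails) can be outlined with a toy model. Everything here is hypothetical: `tmpfs_model`, `alloc_and_add_model` and the fixed-size `in_cache` array are illustrative stand-ins, and folio allocation, locking and the memcg charge are elided.

```c
#include <stdbool.h>

#define MODEL_SLOTS 8

enum alloc_result { ALLOC_FOUND_EXISTING, ALLOC_ADDED, ALLOC_NO_SPACE };

struct tmpfs_model {
    bool in_cache[MODEL_SLOTS];  /* stands in for the page cache */
    long used_blocks;            /* blocks accounted so far */
    long max_blocks;             /* the "size=" limit, in blocks */
};

enum alloc_result alloc_and_add_model(struct tmpfs_model *fs, int index)
{
    /* Step 1: an existing folio means no new accounting is needed. */
    if (fs->in_cache[index])
        return ALLOC_FOUND_EXISTING;

    /* Steps 3-5 (allocation and memcg charge elided): publish early. */
    fs->in_cache[index] = true;

    /* Steps 6-7: account against the size limit only now. */
    if (fs->used_blocks + 1 > fs->max_blocks) {
        fs->in_cache[index] = false;   /* undo: delete from the cache */
        return ALLOC_NO_SPACE;
    }
    fs->used_blocks++;
    return ALLOC_ADDED;
}
```

Because the slot is published before accounting, a second caller at the same index takes the "found existing" branch instead of consuming another block of the allowance.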
    • shmem: move memcg charge out of shmem_add_to_page_cache() · 054a9f7c
      Hugh Dickins authored
      Extract shmem's memcg charging out of shmem_add_to_page_cache(): it's
      misleadingly done there, because many calls are dealing with a swapcache
      page, whose memcg is nowadays always remembered while swapped out, then
      the charge re-levied when it's brought back into swapcache.
      
      Temporarily move it back up to the shmem_get_folio_gfp() level, where the
      memcg was charged before v5.8; but the next commit goes on to move it back
      down to a new home.
      
      In making this change, it becomes clear that shmem_swapin_folio() does not
      need to know the vma, just the fault mm (if any): call it fault_mm rather
      than charge_mm - let mem_cgroup_charge() decide whom to charge.
      
      Link: https://lkml.kernel.org/r/4b2143c5-bf32-64f0-841-81a81158dac@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() · 4199f51a
      Hugh Dickins authored
      By historical accident, shmem_acct_block() and shmem_inode_acct_block()
      were never pluralized when the pages argument was added, despite their
      complements being shmem_unacct_blocks() and shmem_inode_unacct_blocks()
      all along.  It has been an irritation: fix their naming at last.
      
      Link: https://lkml.kernel.org/r/9124094-e4ab-8be7-ef80-9a87bdc2e4fc@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: trivial tidyups, removing extra blank lines, etc · 9be7d5b0
      Hugh Dickins authored
      Mostly removing a few superfluous blank lines, joining short arglines,
      imposing some 80-column observance, correcting a couple of comments.  None
      of it more interesting than deleting a repeated INIT_LIST_HEAD().
      
      Link: https://lkml.kernel.org/r/b3983d28-5d3f-8649-36af-b819285d7a9e@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: factor shmem_falloc_wait() out of shmem_fault() · f0a9ad1d
      Hugh Dickins authored
      That Trinity livelock shmem_falloc avoidance block is unlikely, and a
      distraction from the proper business of shmem_fault(): separate it out. 
      (This used to help compilers save stack on the fault path too, but both
      gcc and clang nowadays seem to make better choices anyway.)
      
      Link: https://lkml.kernel.org/r/6fe379a4-6176-9225-9263-fe60d2633c0@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: remove vma arg from shmem_get_folio_gfp() · e3e1a506
      Hugh Dickins authored
      The vma is already there in vmf->vma, so no need for a separate arg.
      
      Link: https://lkml.kernel.org/r/d9ce6f65-a2ed-48f4-4299-fdb0544875c5@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • shmem: shrink shmem_inode_info: dir_offsets in a union · ee615d45
      Hugh Dickins authored
      Patch series "shmem,tmpfs: general maintenance".
      
      Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the
      "size=" limit better.  8/8 goes into percpu counter territory, and could
      stand alone.
      
      
      This patch (of 8):
      
      Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
      pahole after stop_eviction, better filled by fsflags.  And the 24-byte
      dir_offsets can only be used by directories, whereas shrinklist and
      swaplist only by shmem_mapping() inodes (regular files or long symlinks):
      so put those into a union.  No change in mm/shmem.c is required for this.
      
      Link: https://lkml.kernel.org/r/c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com
      Link: https://lkml.kernel.org/r/86ebb4b-c571-b9e8-27f5-cb82ec50357e@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
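The union trick in the commit above relies on fields that are never used by the same kind of inode sharing storage. A reduced illustration follows. The `_model` types are hypothetical stand-ins, with `dir_offsets_model` sized at the 24 bytes the commit message mentions, and the real shmem_inode_info has many more members.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel types. */
struct list_head_model { void *next, *prev; };
struct dir_offsets_model { unsigned char opaque[24]; };

/* Before: every inode pays for all three members. */
struct inode_info_separate {
    struct list_head_model shrinklist;   /* regular files / long symlinks */
    struct list_head_model swaplist;     /* regular files / long symlinks */
    struct dir_offsets_model dir_offsets; /* directories only */
};

/* After: directory-only and file-only members overlap. */
struct inode_info_union {
    union {
        struct {
            struct list_head_model shrinklist;
            struct list_head_model swaplist;
        };
        struct dir_offsets_model dir_offsets;
    };
};
```

Comparing `sizeof(struct inode_info_union)` with `sizeof(struct inode_info_separate)` shows the saving for this reduced layout. The exact figure depends on pointer size and on the members omitted here.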
    • mm/filemap: clarify filemap_fault() comments for not uptodate case · 6facf36e
      Lorenzo Stoakes authored
      The existing comments in filemap_fault() suggest that, after either a
      minor fault has occurred and filemap_get_folio() found a folio in the page
      cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did
      the job (having relied on do_sync_mmap_readahead() or filemap_read_folio()
      to read in the folio), the only possible reason it could not be uptodate
      is because of an error.
      
      This is not so, as if, for instance, the fault occurred within a VMA which
      had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag
      specified), this would cause even synchronous readahead to fail to read in
      the folio.
      
      I confirmed this by dropping page caches and faulting in memory
      madvise()'d this way, observing that this code path was reached on each
      occasion.
      
      Clarify the comments to include this case, and additionally update the
      comment recently added around the invalidate lock logic to make it clear
      the comment explicitly refers to the minor fault case.
      
      In addition, while we're here, refer to folios rather than pages.
      
      [lstoakes@gmail.com: correct indentation as per Christopher's feedback]
        Link: https://lkml.kernel.org/r/2c7014c0-6343-4e76-8697-3f84f54350bd@lucifer.local
      Link: https://lkml.kernel.org/r/20230930231029.88196-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: fix allocation calculation in kmem_cache_alloc_bulk() · 7771dcf0
      Liam R. Howlett authored
      The bulk allocation is iterating through an array and storing enough
      memory for the entire bulk allocation instead of a single array entry. 
      Only allocate an array element of the size set in the kmem_cache.
      
      Link: https://lkml.kernel.org/r/20230929201359.2857583-1-Liam.Howlett@oracle.com
      Fixes: cc86e0c2 ("radix tree test suite: add support for slab bulk APIs")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
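The fix described in the commit above amounts to sizing each array slot by the cache's object size rather than by the size of the whole bulk request. A user-space sketch under that reading follows. `cache_model`, `alloc_bulk_model` and `free_bulk_model` are hypothetical names, and the real test-suite function differs in its bookkeeping and failure handling.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the test suite's kmem_cache. */
struct cache_model {
    size_t size;                 /* size of one object in this cache */
};

/* Fill out[0..count) with zeroed objects; returns how many were allocated. */
size_t alloc_bulk_model(const struct cache_model *cache, size_t count,
                        void **out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        /* One object of cache->size bytes per slot, not size * count. */
        out[i] = calloc(1, cache->size);
        if (!out[i])
            break;
    }
    return i;
}

/* Release objects previously returned by alloc_bulk_model(). */
void free_bulk_model(size_t count, void **objs)
{
    size_t i;

    for (i = 0; i < count; i++)
        free(objs[i]);
}
```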
    • selftests: mm: add pagemap ioctl tests · 46fd75d4
      Muhammad Usama Anjum authored
      Add pagemap ioctl tests. Add several different types of tests to judge
      the correctness of the interface.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-7-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL · 18825b8a
      Muhammad Usama Anjum authored
      Add some explanation of, and a method for using, write-protection and
      written-to tracking on a memory range.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-6-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools headers UAPI: update linux/fs.h with the kernel sources · b58aa0f4
      Muhammad Usama Anjum authored
      A new IOCTL and macros have been added in the kernel sources. Update the
      tools header file as well.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-5-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag · 12f6b01a
      Muhammad Usama Anjum authored
      Adding fast code paths that handle only the get and/or clear operations
      of PAGE_IS_WRITTEN improves performance by 0-35%.  The results of some
      test cases are given below:
      
      Test-case-1
      t1 = (Get + WP) time
      t2 = WP time
                             t1            t2
      Without this patch:    140-170mcs    90-115mcs
      With this patch:       110mcs        80mcs
      Worst case diff:       35% faster    30% faster
      
      Test-case-2
      t3 = atomic Get and WP
                            t3
      Without this patch:   120-140mcs
      With this patch:      100-110mcs
      Worst case diff:      21% faster
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-4-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
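The "Worst case diff" rows in the timing table of the commit above seem to compare the slowest time without the patch against the slowest time with it. That reading is an inference from the numbers rather than something the commit states. Under that assumption the arithmetic is as follows; `worst_case_speedup_pct` is a hypothetical helper name.

```c
/*
 * Percentage by which the slowest patched run beats the slowest
 * unpatched run.  Both arguments are times in the same unit.
 */
double worst_case_speedup_pct(double slowest_before, double slowest_after)
{
    return (slowest_before - slowest_after) / slowest_before * 100.0;
}
```

For t1 this gives (170 - 110) / 170, about 35%. For t2, (115 - 80) / 115, about 30%. For t3, (140 - 110) / 140, about 21%. These agree with the table.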
    • fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs · 52526ca7
      Muhammad Usama Anjum authored
      The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
      clear the info about page table entries. The following operations are
      supported in this IOCTL:
      - Scan the address range and get the memory ranges matching the provided
        criteria. This is performed when the output buffer is specified.
      - Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect
        the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if
        non-Async Write Protected pages are found. The PM_SCAN_WP_MATCHING
        can be used with or without PM_SCAN_CHECK_WPASYNC.
      - Both of those operations can be combined into one atomic operation where
        we can get and write protect the pages as well.
      
      The following page flags are currently supported:
      - PAGE_IS_WPALLOWED - Page has async-write-protection enabled
      - PAGE_IS_WRITTEN - Page has been written to since it was write-protected
      - PAGE_IS_FILE - Page is file backed
      - PAGE_IS_PRESENT - Page is present in memory
      - PAGE_IS_SWAPPED - Page is swapped out
      - PAGE_IS_PFNZERO - Page has zero PFN
      - PAGE_IS_HUGE - Page is THP or Hugetlb backed
      
      This IOCTL can be extended to get information about more PTE bits. The
      entire address range passed by user [start, end) is scanned until either
      the user provided buffer is full or max_pages have been found.
      
      [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"]
      [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning]
      [arnd@arndb.de: hide unused pagemap_scan_backout_range() function]
        Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org
      [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"]
        Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au
      Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andrei Vagin <avagin@gmail.com>
      Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52526ca7
    • Peter Xu's avatar
      userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
      Peter Xu authored
      Patch series "Implement IOCTL to get and optionally clear info about
      PTEs", v33.
      
      *Motivation*
      The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
      Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
      GetWriteWatch() retrieves the addresses of the pages that have been
      written to in a region of virtual memory.
      
      This syscall is used in Windows applications, games, etc.  It is
      currently emulated in userspace in a pretty slow manner.  Our purpose is
      to enhance the kernel so that it can be emulated efficiently.  Currently
      some out-of-tree hack patches are being used to emulate it efficiently
      in some kernels.  We intend to replace those with these patches, so that
      gaming on Linux as a whole can benefit from this.  It means there would
      be tons of users of this code.
      
      CRIU use case [2] was mentioned by Andrei and Danylo:
      > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
      > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
      > shadow memory [4]. Being able to migrate such binaries allows to highly
      > reduce the amount of work needed to identify and fix post-migration
      > crashes, which happen constantly.
      
      Andrei defines the following uses of this code:
      * it is more granular and allows us to track changed pages more
        effectively. The current interface can clear dirty bits for the entire
        process only. In addition, reading info about pages is a separate
        operation. It means we must freeze the process to read information
        about all its pages, reset dirty bits, only then we can start dumping
        pages. The information about pages becomes more and more outdated,
        while we are processing pages. The new interface solves both these
        downsides. First, it allows us to read pte bits and clear the
        soft-dirty bit atomically. It means that CRIU will not need to freeze
        processes to pre-dump their memory. Second, it clears soft-dirty bits
        for a specified region of memory. It means CRIU will have actual info
        about pages to the moment of dumping them.
      * The new interface has to be much faster because basic page filtering
        is happening in the kernel. With the old interface, we have to read
        pagemap for each page.
      
      *Implementation Evolution (Short Summary)*
      From the definition of GetWriteWatch(), we felt that the kernel's
      soft-dirty feature could be used under the hood with some additions:
      * reset the soft-dirty flag for only a specific region of memory instead
      of clearing the flag for the entire process
      * get and clear the soft-dirty flag for a specific region atomically
      
      So we decided to use an ioctl on the pagemap file to read and/or reset
      the soft-dirty flag. But with the soft-dirty flag, we sometimes get extra
      pages which weren't even written. They had become soft-dirty because of
      VMA merging and the VM_SOFTDIRTY flag. This breaks the definition of
      GetWriteWatch(). We were able to bypass this shortcoming by ignoring
      VM_SOFTDIRTY until David reported that mprotect etc. messes up the
      soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
      until [6] got introduced. We discussed whether we could revert these
      patches, but we could not reach any conclusion. So at this point, I made
      a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
      the soft-dirty implementation:
      * [7] Correct the bug fixed wrongly back in 2014. It had the potential to
      cause regressions. We left it behind.
      * [8] Keep a list of the soft-dirty parts of a VMA across splits and
      merges. The reply was not to increase the size of the VMA by 8 bytes.
      
      At this point, we left soft-dirty behind, considering it too delicate,
      and userfaultfd [9] seemed like the only way forward. From there onward,
      we have been basing soft-dirty emulation on the userfaultfd wp feature,
      where the kernel resolves the faults itself when the WP_ASYNC feature is
      used. It was straightforward to add the WP_ASYNC feature to userfaultfd.
      Now only those pages which have really been written are reported as
      dirty or written-to. (PS: another userfaultfd feature, WP_UNPOPULATED,
      is required to avoid pre-faulting memory before write-protecting [9].)
      
      All the different masks were added at the request of the CRIU devs to
      make the interface more generic and better.
      
      [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
      [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
      [3] https://github.com/google/sanitizers
      [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
      [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
      [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
      [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
      [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
      [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
      [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
      
      
      This patch (of 6):
      
      Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
      userfaultfd wr-protect faults to be resolved by the kernel directly.
      
      It can be used like a high accuracy version of soft-dirty, without vma
      modifications during tracking, and with ranged support by default rather
      than for a whole mm when resetting the protections, thanks to the
      existence of ioctl(UFFDIO_WRITEPROTECT).
      
      Several goals of such a dirty tracking interface:
      
      1. All types of memory should be supported and traceable. This is natural
         for soft-dirty but should be mentioned when the context is
         userfaultfd, because it used to only support anon/shmem/hugetlb. The
         problem is that for dirty tracking purposes these three types may not
         be enough, and it's legal to track anything, e.g. any page cache
         writes from mmap.
      
      2. Protections can be applied to part of a memory range, without vma
         split/merge fuss.  The hope is that the tracking itself should not
         cause any vma layout change.  It also helps when a reset happens,
         because the reset will not need the mmap write lock, which can block
         the tracee.
      
      3. Accuracy needs to be maintained.  This means we need pte markers to work
         on any type of VMA.
      
      One could object that the whole concept of async dirty tracking is not
      really close to what userfaultfd fundamentally used to be: it's not "a
      fault to be serviced by userspace" anymore. However, using userfaultfd-wp
      here as a framework is convenient for us in at least these ways:
      
      1. The VM_UFFD_WP vma flag, which has a very good name to suit something
         like this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
         feature bit to distinguish it from a sync version of uffd-wp
         registration.
      
      2. The PTE markers logic can be leveraged across the whole kernel to
         maintain the uffd-wp bit as long as an arch supports it. This also
         applies to this case, where the uffd-wp bit will be a hint to dirty
         information and will not get lost easily (e.g. when some page cache
         ptes get zapped).
      
      3. Reuse of the ioctl(UFFDIO_WRITEPROTECT) interface for either starting
         or resetting tracking on a range of memory. There's no counterpart in
         the old soft-dirty world, so a new design would otherwise need a new
         interface for this.
      
      That commonality is understandable, because uffd-wp was fundamentally a
      similar idea to soft-dirty: write-protecting pages.
      
      This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
      far WP_ASYNC does not seem to be usable without WP_UNPOPULATED.  This
      also gives us the chance to modify the implementation of WP_ASYNC in case
      it no longer needs to depend on WP_UNPOPULATED in future kernels.  It's
      also fine to imply that because both features rely on the
      PTE_MARKER_UFFD_WP config option, so they'll show up together (or both be
      missing) in a UFFDIO_API probe.
      
      vma_can_userfault() now allows any VMA if the userfaultfd registration is
      only about async uffd-wp.  So we can track dirty for all kinds of memory
      including generic file systems (like XFS, EXT4 or BTRFS).
      
      One trick worth mentioning in do_wp_page() is that we need to manually
      update vmf->orig_pte here because it can be used later with a pte_same()
      check - this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
      
      The major defect of this approach to dirty tracking is that we need to
      populate the pgtables when tracking starts.  Soft-dirty doesn't do it
      like that.  It's unwanted in the case where the range of memory to track
      is huge and unpopulated (e.g., tracking updates on a 10G file with
      mmap() on top, without having any page cache installed yet).  One way to
      improve this is to allow pte markers to exist above the PTE level, for
      PMD+.  That will not change the interface if implemented, so we can
      leave that for later.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d61ea1cb
    • Yosry Ahmed's avatar
      mm: memcg: normalize the value passed into memcg_rstat_updated() · 7bd5bc3c
      Yosry Ahmed authored
      memcg_rstat_updated() uses the value of the state update to keep track of
      the magnitude of pending updates, so that we only do a stats flush when
      it's worth the work.  Most values passed into memcg_rstat_updated() are in
      pages, however, a few of them are actually in bytes or KBs.
      
      To put this into perspective, a 512 byte slab allocation today would look
      the same as allocating 512 pages.  This may result in premature flushes,
      which means unnecessary work and latency.
      
      Normalize all the state values passed into memcg_rstat_updated() to pages.
      Round up non-zero sub-page to 1 page, because memcg_rstat_updated()
      ignores 0 page updates.
      
      Link: https://lkml.kernel.org/r/20230922175741.635002-3-yosryahmed@google.com
      Fixes: 5b3be698 ("memcg: better bounds on the memcg stats updates")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7bd5bc3c
    • Yosry Ahmed's avatar
      mm: memcg: refactor page state unit helpers · ff841a06
      Yosry Ahmed authored
      Patch series "mm: memcg: fix tracking of pending stats updates values", v2.
      
      While working on adjacent code [1], I realized that the values passed into
      memcg_rstat_updated() to keep track of the magnitude of pending updates are
      inconsistent.  They are mostly in pages, but sometimes they can be in bytes
      or KBs.  Fix that.
      
      Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch 2
      to check and normalize the units of state updates.
      
      [1] https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
      
      
      This patch (of 2):
      
      memcg_page_state_unit() is currently used to identify the unit of a memcg
      state item so that all stats in memory.stat are in bytes.  However, it
      lies about the units of WORKINGSET_* stats.  These stats actually
      represent pages, but we present them to userspace as a scalar number of
      events.  In retrospect, maybe those stats should have been memcg "events"
      rather than memcg "state".
      
      In preparation for using memcg_page_state_unit() for other purposes that
      need to know the truthful units of different stat items, break it down
      into two helpers:
      - memcg_page_state_unit() returns the actual unit of the item.
      - memcg_page_state_output_unit() returns the unit used for output.
      
      Use the latter instead of the former in memcg_page_state_output() and
      lruvec_page_state_output().  While we are at it, let's show cgroup v1 some
      love and add memcg_page_state_local_output() for consistency.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230922175741.635002-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230922175741.635002-2-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ff841a06
    • Kees Cook's avatar
      mm/memcg: annotate struct mem_cgroup_threshold_ary with __counted_by · b7c67206
      Kees Cook authored
      Prepare for the coming implementation by GCC and Clang of the __counted_by
      attribute.  Flexible array members annotated with __counted_by can have
      their accesses bounds-checked at run time via CONFIG_UBSAN_BOUNDS (for
      array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
      functions).
      
      As found with Coccinelle[1], add __counted_by for struct
      mem_cgroup_threshold_ary.
      
      [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
      
      Link: https://lkml.kernel.org/r/20230922175327.work.985-kees@kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b7c67206
    • Mike Kravetz's avatar
      hugetlb: check for hugetlb folio before vmemmap_restore · 30a89adf
      Mike Kravetz authored
      In commit d8f5f7e4 ("hugetlb: set hugetlb page flag before
      optimizing vmemmap") checks were added to print a warning if
      hugetlb_vmemmap_restore was called on a non-hugetlb page.
      
      This was mostly due to ordering issues in the hugetlb page set up and tear
      down sequences.  One place missed was the routine
      dissolve_free_huge_page.
      
      Naoya Horiguchi noted: "I saw that VM_WARN_ON_ONCE() in
      hugetlb_vmemmap_restore is triggered when memory_failure() is called on a
      free hugetlb page with vmemmap optimization disabled (the warning is not
      triggered if vmemmap optimization is enabled).  I think that we need check
      folio_test_hugetlb() before dissolve_free_huge_page() calls
      hugetlb_vmemmap_restore_folio()."
      
      Perform the check as suggested by Naoya.
      
      Link: https://lkml.kernel.org/r/20231017032140.GA3680@monkey
      Fixes: d8f5f7e4 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Tested-by: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30a89adf
    • Tiezhu Yang's avatar
      selftests/clone3: Fix broken test under !CONFIG_TIME_NS · fc7f04dc
      Tiezhu Yang authored
      When executing the following command to test clone3 under !CONFIG_TIME_NS:
      
        # make headers && cd tools/testing/selftests/clone3 && make && ./clone3
      
      we can see the following error info:
      
        # [7538] Trying clone3() with flags 0x80 (size 0)
        # Invalid argument - Failed to create new process
        # [7538] clone3() with flags says: -22 expected 0
        not ok 18 [7538] Result (-22) is different than expected (0)
        ...
        # Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
      
      This is because if CONFIG_TIME_NS is not set, but the flag
      CLONE_NEWTIME (0x80) is used to clone a time namespace, it
      will return -EINVAL in copy_time_ns().
      
      If the kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
      will not exist, and then we should skip the clone3() test with
      CLONE_NEWTIME.
      
      With this patch under !CONFIG_TIME_NS:
      
        # make headers && cd tools/testing/selftests/clone3 && make && ./clone3
        ...
        # Time namespaces are not supported
        ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
        ...
        # Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
      
      Link: https://lkml.kernel.org/r/1689066814-13295-1-git-send-email-yangtiezhu@loongson.cn
      Fixes: 515bddf0 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fc7f04dc
    • Liam R. Howlett's avatar
      maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() · 099d7439
      Liam R. Howlett authored
      Users complained about OOM errors during fork without triggering
      compaction.  This can be fixed by modifying the flags used in
      mas_expected_entries() so that the compaction will be triggered in low
      memory situations.  Since mas_expected_entries() is only used during fork,
      the extra argument does not need to be passed through.
      
      Additionally, the two test_maple_tree test cases and one benchmark test
      were altered to use the correct locking type so that allocations would not
      trigger sleeping and thus fail.  Testing was completed with lockdep atomic
      sleep detection.
      
      The additional locking change requires rwsem support additions to the
      tools/ directory through the use of pthreads pthread_rwlock_t.  With this
      change test_maple_tree works in userspace, as a module, and in-kernel.
      
      Users may notice that the system gave up early on attempting to start new
      processes instead of attempting to reclaim memory.
      
      Link: https://lkml.kernel.org/r/20230915093243epcms1p46fa00bbac1ab7b7dca94acb66c44c456@epcms1p4
      Link: https://lkml.kernel.org/r/20231012155233.2272446-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com>
      Cc: <jason.sim@samsung.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      099d7439
    • Samasth Norway Ananda's avatar
      selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier · e2de156b
      Samasth Norway Ananda authored
      The definition of MREMAP_DONTUNMAP is not present in glibc older than
      2.32, which causes an undeclared identifier error when running make on
      mm.  Including linux/mman.h solves the build error for people with older
      glibc.
      
      Link: https://lkml.kernel.org/r/20231012155257.891776-1-samasth.norway.ananda@oracle.com
      Fixes: 0183d777 ("selftests: mm: remove duplicate unneeded defines")
      Signed-off-by: Samasth Norway Ananda <samasth.norway.ananda@oracle.com>
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Closes: https://lore.kernel.org/linux-mm/CA+G9fYvV-71XqpCr_jhdDfEtN701fBdG3q+=bafaZiGwUXy_aA@mail.gmail.com/
      Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e2de156b
    • Oleksij Rempel's avatar
      mailmap: correct email aliasing for Oleksij Rempel · d2313c77
      Oleksij Rempel authored
      Ensure the current work email addresses for Oleksij Rempel are preserved
      and not overridden by private address.  Alias the alternate work email to
      the primary work email address.
      
      Link: https://lkml.kernel.org/r/20231011112519.1427077-1-o.rempel@pengutronix.de
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2313c77
    • Bartosz Golaszewski's avatar
      mailmap: map Bartosz's old address to the current one · 002e39e9
      Bartosz Golaszewski authored
      I no longer work for BayLibre but many DT bindings have my BL address in
      the maintainers entries.  Map it to the email address I use for kernel
      development.
      
      Link: https://lkml.kernel.org/r/20231011150104.73863-1-brgl@bgdev.pl
      Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Suggested-by: Conor Dooley <conor@kernel.org>
      Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      002e39e9
    • SeongJae Park's avatar
      mm/damon/sysfs: check DAMOS regions update progress from before_terminate() · 76b7069b
      SeongJae Park authored
      DAMON_SYSFS can receive DAMOS tried regions update request while kdamond
      is already out of the main loop and before_terminate callback
      (damon_sysfs_before_terminate() in this case) is not yet called.  And
      damon_sysfs_handle_cmd() can further be finished before the callback is
      invoked.  Then, damon_sysfs_before_terminate() unlocks damon_sysfs_lock,
      which is not locked by anyone.  This happens because the callback function
      assumes damon_sysfs_cmd_request_callback() should be called before it. 
      Check if the assumption was true before doing the unlock, to avoid this
      problem.
      
      Link: https://lkml.kernel.org/r/20231007200432.3110-1-sj@kernel.org
      Fixes: f1d13cac ("mm/damon/sysfs: implement DAMOS tried regions update command")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.2.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      76b7069b
    • Ondrej Jirman's avatar
      MAINTAINERS: Ondrej has moved · c5155d4e
      Ondrej Jirman authored
      Update my email-address in MAINTAINERS to <megi@xff.cz>.  Also add
      .mailmap entries to map my old, now blocked, email address.
      
      Link: https://lkml.kernel.org/r/20231008105812.1084226-1-megi@xff.cz
      Signed-off-by: Ondrej Jirman <megi@xff.cz>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c5155d4e
    • Arnd Bergmann's avatar
      kasan: disable kasan_non_canonical_hook() for HW tags · 17c17567
      Arnd Bergmann authored
      On arm64, building with CONFIG_KASAN_HW_TAGS now causes a compile-time
      error:
      
      mm/kasan/report.c: In function 'kasan_non_canonical_hook':
      mm/kasan/report.c:637:20: error: 'KASAN_SHADOW_OFFSET' undeclared (first use in this function)
        637 |         if (addr < KASAN_SHADOW_OFFSET)
            |                    ^~~~~~~~~~~~~~~~~~~
      mm/kasan/report.c:637:20: note: each undeclared identifier is reported only once for each function it appears in
      mm/kasan/report.c:640:77: error: expected expression before ';' token
        640 |         orig_addr = (addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
      
      This was caused by removing the dependency on CONFIG_KASAN_INLINE that
      used to prevent this from happening. Use the more specific dependency
      on KASAN_SW_TAGS || KASAN_GENERIC to only ignore the function for hwasan
      mode.
      
      Link: https://lkml.kernel.org/r/20231016200925.984439-1-arnd@kernel.org
      Fixes: 12ec6a919b0f ("kasan: print the original fault addr when access invalid shadow")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Haibo Li <haibo.li@mediatek.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      17c17567
    • Haibo Li's avatar
      kasan: print the original fault addr when access invalid shadow · babddbfb
      Haibo Li authored
      When the checked address is illegal, the corresponding shadow address
      from kasan_mem_to_shadow may have no mapping in the mmu table.  Accessing
      such a shadow address causes a kernel oops.  Here is a sample oops on
      arm64 (VA 39bit) with KASAN_SW_TAGS and KASAN_OUTLINE on:
      
      [ffffffb80aaaaaaa] pgd=000000005d3ce003, p4d=000000005d3ce003,
          pud=000000005d3ce003, pmd=0000000000000000
      Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 3 PID: 100 Comm: sh Not tainted 6.6.0-rc1-dirty #43
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : __hwasan_load8_noabort+0x5c/0x90
      lr : do_ib_ob+0xf4/0x110
      ffffffb80aaaaaaa is the shadow address for efffff80aaaaaaaa.
      The problem is reading an invalid shadow address in kasan_check_range.
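
      The address pair quoted above can be checked by hand.  The following is a
      minimal sketch, not the kernel's code: the helper names are hypothetical,
      and the scale shift (4) and shadow offset (0xefffffc000000000) are assumed
      values for arm64 with a 39-bit VA and KASAN_SW_TAGS; neither is stated in
      this commit message.  It also assumes the pointer tag in the top byte is
      reset to 0xff before the shadow lookup.

```c
#include <stdint.h>

/* Assumed values for arm64, 39-bit VA, KASAN_SW_TAGS (not from the commit). */
#define MODEL_SHADOW_SCALE_SHIFT 4
#define MODEL_SHADOW_OFFSET      UINT64_C(0xefffffc000000000)

/* Software tags live in the top byte; model resetting the tag to 0xff. */
static uint64_t model_reset_tag(uint64_t addr)
{
    return addr | (UINT64_C(0xff) << 56);
}

/* Memory address -> shadow address. */
static uint64_t model_mem_to_shadow(uint64_t addr)
{
    return (addr >> MODEL_SHADOW_SCALE_SHIFT) + MODEL_SHADOW_OFFSET;
}

/* Shadow address -> start of the memory granule it describes. */
static uint64_t model_shadow_to_mem(uint64_t shadow)
{
    return (shadow - MODEL_SHADOW_OFFSET) << MODEL_SHADOW_SCALE_SHIFT;
}
```

      Under these assumptions, efffff80aaaaaaaa maps to ffffffb80aaaaaaa, which
      matches the faulting address in the oops.  The reverse mapping recovers
      only the 16-byte granule with the tag reset, so a report derived from the
      shadow address is approximate.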
      
      Generic KASAN hits a similar oops.
      
      The oops reports only the shadow address that faulted, not the original
      address.
      
      Commit 2f004eea ("x86/kasan: Print original address on #GP")
      introduced kasan_non_canonical_hook but limited it to KASAN_INLINE.
      
      This patch extends it to KASAN_OUTLINE mode.
      
      Link: https://lkml.kernel.org/r/20231009073748.159228-1-haibo.li@mediatek.com
      Fixes: 2f004eea ("x86/kasan: Print original address on #GP")
      Signed-off-by: Haibo Li <haibo.li@mediatek.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Haibo Li <haibo.li@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      babddbfb
    • Rik van Riel's avatar
      hugetlbfs: close race between MADV_DONTNEED and page fault · 2820b0f0
      Rik van Riel authored
      Malloc libraries, like jemalloc and tcmalloc, decide when to call madvise
      independently of the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs, systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by pulling the locking from __unmap_hugepage_final_range
      into helper functions called from zap_page_range_single.  This ensures
      page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
      have actually been freed.
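
      The ordering problem can be illustrated with a small deterministic model.
      This is a sketch under stated assumptions, not kernel code: every name
      below is hypothetical, the lock is reduced to a boolean, and a fault that
      would sleep on the lock is reported as FAULT_BLOCKED instead of actually
      waiting.

```c
#include <stdbool.h>

enum fault_result { FAULT_OK, FAULT_SIGBUS, FAULT_BLOCKED };

struct hugetlb_model {
    int  free_pages;   /* huge pages currently in the pool */
    bool page_mapped;  /* is the single huge page mapped? */
    bool vma_locked;   /* models the hugetlb VMA lock held for write */
};

/* MADV_DONTNEED, step 1: unmap.  The page is not back in the pool yet. */
static void zap_begin(struct hugetlb_model *m, bool hold_lock)
{
    m->vma_locked = hold_lock;
    m->page_mapped = false;
}

/* MADV_DONTNEED, step 2: the batched free completes, then the lock drops. */
static void zap_end(struct hugetlb_model *m)
{
    m->free_pages++;
    m->vma_locked = false;
}

/* Fault path: wait for the lock, otherwise allocate from the pool. */
static enum fault_result page_fault(struct hugetlb_model *m)
{
    if (m->vma_locked)
        return FAULT_BLOCKED;   /* a real fault would sleep here */
    if (m->page_mapped)
        return FAULT_OK;
    if (m->free_pages == 0)
        return FAULT_SIGBUS;    /* no huge page available */
    m->free_pages--;
    m->page_mapped = true;
    return FAULT_OK;
}
```

      With hold_lock set to false, a fault that lands between zap_begin() and
      zap_end() finds an empty pool and gets SIGBUS.  With hold_lock set to
      true, the same fault waits and succeeds once zap_end() has returned the
      page.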
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2820b0f0
    • Rik van Riel's avatar
      hugetlbfs: extend hugetlb_vma_lock to private VMAs · bf491692
      Rik van Riel authored
      Extend the locking scheme used to protect shared hugetlb mappings from
      truncate vs page fault races, in order to protect private hugetlb mappings
      (with resv_map) against MADV_DONTNEED.
      
      Add a read-write semaphore to the resv_map data structure, and use that
      from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
      race between MADV_DONTNEED and page faults.
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bf491692
    • Rik van Riel's avatar
      hugetlbfs: clear resv_map pointer if mmap fails · 92fe9dcb
      Rik van Riel authored
      Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
      
      Malloc libraries, like jemalloc and tcmalloc, decide when to call madvise
      independently of the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs, systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by extending the hugetlb_vma_lock locking scheme to also
      cover private hugetlb mappings (with resv_map), and pulling the locking
      from __unmap_hugepage_final_range into helper functions called from
      zap_page_range_single.  This ensures page faults stay locked out of the
      MADV_DONTNEED VMA until the huge pages have actually been freed.
      
      
      This patch (of 3):
      
      Hugetlbfs leaves a dangling pointer in the VMA if mmap fails.  This has
      not been a problem so far, but other code in this patch series tries to
      follow that pointer.
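
      The general pattern is to clear the pointer on the failure path so that
      later code sees NULL instead of freed memory.  The sketch below is a
      userspace illustration only: the struct and function names are
      hypothetical and do not reflect the actual hugetlbfs data structures or
      the exact change in this patch.

```c
#include <stdlib.h>

struct resv_map_model { int regions; };

struct vma_model {
    struct resv_map_model *resv;  /* stands in for the VMA's private pointer */
};

/* Set up the reservation map; `fail` simulates a later step of mmap failing. */
static int model_mmap_setup(struct vma_model *vma, int fail)
{
    vma->resv = malloc(sizeof(*vma->resv));
    if (!vma->resv)
        return -1;
    vma->resv->regions = 0;

    if (fail) {
        free(vma->resv);
        vma->resv = NULL;  /* without this, the VMA keeps a dangling pointer */
        return -1;
    }
    return 0;
}

/* Later code can now test the pointer safely before following it. */
static int model_has_resv(const struct vma_model *vma)
{
    return vma->resv != NULL;
}

static void model_teardown(struct vma_model *vma)
{
    free(vma->resv);
    vma->resv = NULL;
}
```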
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      92fe9dcb
    • Johannes Weiner's avatar
      mm: zswap: fix pool refcount bug around shrink_worker() · 969d63e1
      Johannes Weiner authored
      When a zswap store fails due to the limit, it acquires a pool reference
      and queues the shrinker.  When the shrinker runs, it drops the reference. 
      However, there can be multiple store attempts before the shrinker wakes up
      and runs once.  This results in reference leaks and eventual saturation
      warnings for the pool refcount.
      
      Fix this by dropping the reference again when the shrinker is already
      queued.  This ensures one reference per shrinker run.
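
      The accounting can be modelled in a few lines.  This is a sketch with
      hypothetical names, not the zswap code: the refcount is a plain int, and
      queue_shrink() imitates a work-queueing call that returns false when the
      work item is already pending.

```c
#include <stdbool.h>

struct pool_model {
    int  refcount;       /* pool references */
    bool shrink_queued;  /* is the shrinker work item pending? */
};

/* Returns false if the work item was already queued, true otherwise. */
static bool queue_shrink(struct pool_model *p)
{
    if (p->shrink_queued)
        return false;
    p->shrink_queued = true;
    return true;
}

/* A store that fails because the pool limit was hit. */
static void store_hit_limit(struct pool_model *p)
{
    p->refcount++;             /* reference taken on behalf of the shrinker */
    if (!queue_shrink(p))
        p->refcount--;         /* already queued: drop the extra reference */
}

/* One shrinker run consumes exactly one reference. */
static void shrink_worker(struct pool_model *p)
{
    p->shrink_queued = false;
    /* ... reclaim would happen here ... */
    p->refcount--;
}
```

      Starting from a refcount of 1, three failed stores followed by one
      shrinker run leave the refcount at 1.  Without the decrement in
      store_hit_limit(), the same sequence would leave it at 3.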
      
      Link: https://lkml.kernel.org/r/20231006160024.170748-1-hannes@cmpxchg.org
      Fixes: 45190f01 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chris Mason <clm@fb.com>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      969d63e1