1. 30 Nov, 2022 40 commits
    • Docs/ABI/zram: document zram recompress sysfs knobs · c959a0e8
      Sergey Senozhatsky authored
      Document zram re-compression sysfs knobs.
      
      Link: https://lkml.kernel.org/r/20221115020314.386235-1-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add incompressible flag to read_block_state() · 77db7bb5
      Sergey Senozhatsky authored
      Add a new flag to zram block state that shows if the page is
      incompressible, that is, none of the algorithms (including secondary
      ones) could compress it.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-14-senozhatsky@chromium.org
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add incompressible writeback · b46f9ea3
      Sergey Senozhatsky authored
      Add support for incompressible pages writeback:
      
        echo incompressible > /sys/block/zramX/writeback
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-13-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • documentation: add zram recompression documentation · 443dd798
      Sergey Senozhatsky authored
      Document user-space visible device attributes that are enabled by
      ZRAM_MULTI_COMP.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-12-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add algo parameter support to zram_recompress() · a55cf964
      Sergey Senozhatsky authored
      Recompression iterates through all the registered secondary compression
      algorithms in order of their priorities so that we have higher chances of
      finding the algorithm that compresses a particular page.  This, however,
      may not always be the best approach and sometimes we may want to limit
      recompression to only one particular algorithm.  For instance, when a
      higher priority algorithm uses too much power and the device has a
      relatively low battery level we may want to limit recompression to use
      only a lower priority algorithm, which uses less power.

      Introduce algo= parameter support to recompression sysfs knob so that
      user-space can request recompression with a particular algorithm only:
      
        echo "type=idle algo=zstd" > /sys/block/zramX/recompress
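
      The selection logic described above can be sketched in user-space C.
      This is only an illustration, not the driver code: recomp_table,
      try_compress_fn, only_deflate_helps() and recompress_page() are
      hypothetical names, and the sketch assumes that priority 1 is tried
      first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SECONDARY 3 /* the series supports up to 3 secondary algorithms */

/* Slot i holds the algorithm registered with priority i + 1 (NULL if unset). */
struct recomp_table {
	const char *algo[MAX_SECONDARY];
};

/* Hypothetical callback: nonzero if 'algo' managed to shrink the page. */
typedef int (*try_compress_fn)(const char *algo);

/* Toy model of a page that only deflate can shrink. */
static int only_deflate_helps(const char *algo)
{
	return strcmp(algo, "deflate") == 0;
}

/*
 * Walk the registered algorithms in priority order.  When 'only' is
 * non-NULL (the algo= parameter), every other algorithm is skipped.
 * Returns the priority that succeeded, or 0 if nothing helped.
 */
static int recompress_page(const struct recomp_table *t, const char *only,
			   try_compress_fn try_compress)
{
	assert(t && try_compress);

	for (int i = 0; i < MAX_SECONDARY; i++) {
		if (!t->algo[i])
			continue;
		if (only && strcmp(only, t->algo[i]) != 0)
			continue;
		if (try_compress(t->algo[i]))
			return i + 1;
	}
	return 0;
}
```

      With zstd at priority 1 and deflate at priority 2, passing NULL for
      'only' falls through to deflate for this toy page, while restricting
      the walk to "zstd" leaves the page as it is.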
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-11-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: remove redundant checks from zram_recompress() · 4942cf6a
      Sergey Senozhatsky authored
      Size class index comparison is powerful enough so we can remove object
      size comparisons.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-10-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add size class equals check into recompression · 7c2af309
      Alexey Romanov authored
      It makes no sense for us to recompress the object if it will be in the
      same size class.  We don't get any memory gain anyway.  But, at the same
      time, we get a CPU time overhead when inserting this object into zspage
      and decompressing it afterwards.
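
      A rough user-space sketch of the check, under the assumption that
      size classes are evenly spaced: CLASS_DELTA, size_class_index() and
      worth_recompressing() are hypothetical names for illustration, and
      the real allocator's class layout may differ.

```c
#include <assert.h>

/* Assumed spacing between neighbouring size classes, in bytes. */
#define CLASS_DELTA 16u

/* Map an object size to the index of the size class that would hold it. */
static unsigned int size_class_index(unsigned int size)
{
	return (size + CLASS_DELTA - 1) / CLASS_DELTA;
}

/*
 * Recompression only pays off when the new object lands in a smaller
 * size class; a few saved bytes inside the same class free no memory.
 */
static int worth_recompressing(unsigned int old_size, unsigned int new_size)
{
	return size_class_index(new_size) < size_class_index(old_size);
}
```

      Because a larger object can never map to a smaller class index in
      this model, the class comparison also rejects results that grew.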
      
      [senozhatsky: rebased and fixed conflicts]
      Link: https://lkml.kernel.org/r/20221109115047.2921851-9-senozhatsky@chromium.org
      Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: use IS_ERR_VALUE() to check for zs_malloc() errors · f24ee92c
      Sergey Senozhatsky authored
      Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE()
      instead.
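
      For readers unfamiliar with the idiom, here is a user-space sketch
      of how an errno value can be folded into an integer handle and
      detected without pointer casts.  The macro is modelled on the
      kernel's definition but simplified; fake_zs_malloc() is a
      hypothetical stand-in, not the real allocator.

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO 4095

/* True when 'x' falls in the top MAX_ERRNO values of unsigned long,
 * which is where negative errno codes land after conversion. */
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Hypothetical allocator: returns an opaque handle, or an encoded
 * -EINVAL for sizes it cannot serve. */
static unsigned long fake_zs_malloc(unsigned long size)
{
	if (size == 0 || size > 4096)
		return (unsigned long)-EINVAL;
	return 0x1000ul + size;
}
```

      Since the handle is already an unsigned long, the caller can test
      it directly with IS_ERR_VALUE(handle) instead of casting it to a
      pointer first.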
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-8-senozhatsky@chromium.org
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: clarify writeback_store() comment · 9fda785d
      Sergey Senozhatsky authored
      Re-phrase writeback BIO error comment.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-7-senozhatsky@chromium.org
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add recompress flag to read_block_state() · 60e9b39e
      Sergey Senozhatsky authored
      Add a new flag to zram block state that shows if the page was recompressed
      (using alternative compression algorithm).
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-6-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: introduce recompress sysfs knob · 84b33bf7
      Sergey Senozhatsky authored
      Allow zram to recompress pages using secondary compression streams.
      
      Re-compression algorithms (we support up to 3 at this stage)
      are selected via recomp_algorithm:
      
        echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
      
      Please read documentation for more details.
      
      We support several recompression modes:
      
      1) IDLE pages recompression is activated by `idle` mode
      
        echo "type=idle" > /sys/block/zram0/recompress
      
      2) Since there may be many idle pages, user-space may pass a size
      threshold value (in bytes) so that only pages of equal or greater
      size are recompressed:
      
        echo "threshold=888" > /sys/block/zram0/recompress
      
      3) HUGE pages recompression is activated by `huge` mode
      
        echo "type=huge" > /sys/block/zram0/recompress
      
      4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
      
        echo "type=huge_idle" > /sys/block/zram0/recompress
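
      The echo commands above can also be issued from a program.  The
      helper below is a minimal sketch: write_knob() is a hypothetical
      name, and the device path in the usage note assumes a zram0 device
      exists.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write one command string to a sysfs-style attribute file.
 * Returns 0 on success and -1 on failure.
 */
static int write_knob(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	/* Buffered data reaches the kernel on close, so a rejected value
	 * is typically reported by fclose() rather than by fputs(). */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

      For example, write_knob("/sys/block/zram0/recompress", "type=huge")
      would request huge pages recompression on a kernel that provides
      the knob.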
      
      [senozhatsky@chromium.org: we should always zero out err variable in recompress loop]
        Link: https://lkml.kernel.org/r/20221110143423.3250790-1-senozhatsky@chromium.org
      Link: https://lkml.kernel.org/r/20221109115047.2921851-5-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: factor out WB and non-WB zram read functions · 5561347a
      Sergey Senozhatsky authored
      We will use non-WB variant in ZRAM page recompression path.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-4-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: add recompression algorithm sysfs knob · 001d9273
      Sergey Senozhatsky authored
      Introduce recomp_algorithm sysfs knob that controls secondary algorithm
      selection used for recompression.
      
      We will support up to 3 secondary compression algorithms which are sorted
      in order of their priority.  To select an algorithm the user has to provide
      its name and priority:
      
        echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
        echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
      
      During recompression zram iterates through the list of registered
      secondary algorithms in order of their priorities.
      
      We also have a short version for cases when there is only
      one secondary compression algorithm:
      
        echo "algo=zstd" > /sys/block/zramX/recomp_algorithm
      
      This will register zstd as the secondary algorithm with priority 1.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-3-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zram: preparation for multi-zcomp support · 7ac07a26
      Sergey Senozhatsky authored
      Patch series "zram: Support multiple compression streams", v5.
      
      This series adds support for multiple compression streams.  The main idea
      is that different compression algorithms have different characteristics
      and zram may benefit when it uses a combination of algorithms: a default
      algorithm that is faster but has a lower compression rate and a secondary
      algorithm that achieves a higher compression rate at the price of slower
      compression/decompression.

      There are several use cases for this functionality:
      
      - huge pages re-compression: zstd or deflate can successfully compress
        huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW
        pages that lzo was not able to compress.
      
      - idle pages re-compression: idle/cold pages sit in the memory and we
        may reduce zsmalloc memory usage if we recompress those idle pages.
      
      Userspace has a number of ways to control the behavior and impact of zram
      recompression: what type of pages should be recompressed, size watermarks,
      etc.  Please refer to documentation patch.
      
      
      This patch (of 13):
      			
      The patch turns compression streams and compressor algorithm name struct
      zram members into arrays, so that we can have multiple compression streams
      support (in the next patches).
      
      The patch uses a rather explicit API for compressor selection:
      
      - Get primary (default) compression stream
      	zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
      - Get secondary compression stream
      	zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])
      
      We use a similar API for compression streams put().
      
      At this point we always have just one compression stream,
      since CONFIG_ZRAM_MULTI_COMP is not yet defined.
      
      Link: https://lkml.kernel.org/r/20221109115047.2921851-1-senozhatsky@chromium.org
      Link: https://lkml.kernel.org/r/20221109115047.2921851-2-senozhatsky@chromium.org
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mmu_gather: do not expose delayed_rmap flag · f036c818
      Alexander Gordeev authored
      Flag delayed_rmap of 'struct mmu_gather' is rather a private member, but
      it is still accessed directly.  Instead, let the TLB gather code access
      the flag.
      
      Link: https://lkml.kernel.org/r/Y3SWCu6NRaMQ5dbD@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: delay page_remove_rmap() until after the TLB has been flushed · 5df397de
      Linus Torvalds authored
      When we remove a page table entry, we are very careful to only free the
      page after we have flushed the TLB, because other CPUs could still be
      using the page through stale TLB entries until after the flush.
      
      However, we have removed the rmap entry for that page early, which means
      that functions like folio_mkclean() would end up not serializing with the
      page table lock because the page had already been made invisible to rmap.
      
      And that is a problem, because while the TLB entry exists, we could end up
      with the following situation:
      
       (a) one CPU could come in and clean it, never seeing our mapping of the
           page
      
       (b) another CPU could continue to use the stale and dirty TLB entry and
           continue to write to said page
      
      resulting in a page that has been dirtied, but then marked clean again,
      all while another CPU might have dirtied it some more.
      
      End result: possibly lost dirty data.
      
      This extends our current TLB gather infrastructure to optionally track a
      "should I do a delayed page_remove_rmap() for this page after flushing the
      TLB".  It uses the newly introduced 'encoded page pointer' to do that
      without having to keep separate data around.
      
      Note, this is complicated by a couple of issues:
      
       - we want to delay the rmap removal, but not past the page table lock,
         because that simplifies the memcg accounting
      
       - only SMP configurations want to delay TLB flushing, since on UP
         there are obviously no remote TLBs to worry about, and the page
         table lock means there are no preemption issues either
      
       - s390 has its own mmu_gather model that doesn't delay TLB flushing,
         and as a result also does not want the delayed rmap. As such, we can
         treat S390 like the UP case and use a common fallback for the "no
         delays" case.
      
       - we can track an enormous number of pages in our mmu_gather structure,
         with MAX_GATHER_BATCH_COUNT batches of MAX_TABLE_BATCH pages each,
         all set up to be approximately 10k pending pages.
      
         We do not want to have a huge number of batched pages that we then
         need to check for delayed rmap handling inside the page table lock.
      
      Particularly that last point results in a noteworthy detail, where the
      normal page batch gathering is limited once we have delayed rmaps pending,
      in such a way that only the last batch (the so-called "active batch") in
      the mmu_gather structure can have any delayed entries.
      
      NOTE!  While the "possibly lost dirty data" sounds catastrophic, for this
      all to happen you need to have a user thread doing either madvise() with
      MADV_DONTNEED or a full re-mmap() of the area concurrently with another
      thread continuing to use said mapping.
      
      So arguably this is about user space doing crazy things, but from a VM
      consistency standpoint it's better if we track the dirty bit properly even
      when user space goes off the rails.
      
      [akpm@linux-foundation.org: fix UP build, per Linus]
      Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
      Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Tested-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mmu_gather: prepare to gather encoded page pointers with flags · 7cc8f9c7
      Linus Torvalds authored
      This is purely a preparatory patch that makes all the data structures
      ready for encoding flags with the mmu_gather page pointers.
      
      The code currently always sets the flag to zero and doesn't use it yet,
      but now it's tracking the type state along.  The next step will be to
      actually start using it.
      
      Link: https://lkml.kernel.org/r/20221109203051.1835763-3-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: teach release_pages() to take an array of encoded page pointers too · 449c7967
      Linus Torvalds authored
      release_pages() already could take either an array of page pointers, or an
      array of folio pointers.  Expand it to also accept an array of encoded
      page pointers, which is what both the existing mlock() use and the
      upcoming mmu_gather use of encoded page pointers wants.
      
      Note that release_pages() won't actually use, or react to, any extra
      encoded bits.  Instead, this is very much a case of "I have walked the
      array of encoded pages and done everything the extra bits tell me to do,
      now release it all".
      
      Also, while the "either page or folio pointers" dual use was handled with
      a cast of the pointer in "release_folios()", this takes a slightly
      different approach and uses the "transparent union" attribute to describe
      the set of arguments to the function:
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html
      
      which has been supported by gcc forever, but the kernel hasn't used
      before.
      
      That allows us to avoid using various wrappers with casts, and just use
      the same function regardless of use.
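
      As an illustration of the attribute, the following user-space
      sketch (GCC or Clang, C only) lets one function accept either kind
      of pointer array without a cast at the call site.  The struct
      bodies and count_entries() are hypothetical stand-ins, and the
      sketch assumes all object pointers share one size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types. */
struct page { int refcount; };
struct folio { int refcount; };

/* A caller may pass any member type directly; the compiler treats the
 * argument as if it had been declared with that member's type. */
typedef union {
	struct page **pages;
	struct folio **folios;
} release_pages_arg __attribute__((__transparent_union__));

/* Count the non-NULL entries of either kind of array. */
static size_t count_entries(release_pages_arg arg, size_t nr)
{
	size_t used = 0;

	for (size_t i = 0; i < nr; i++) {
		void *entry;

		/* memcpy() reads the slot without caring which pointer
		 * type the caller actually stored there. */
		memcpy(&entry, (const unsigned char *)arg.pages +
			       i * sizeof(entry), sizeof(entry));
		if (entry)
			used++;
	}
	return used;
}
```

      Both count_entries(pages, n) with a struct page ** and
      count_entries(folios, n) with a struct folio ** compile against the
      same prototype, which is the property the commit relies on.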
      
      Link: https://lkml.kernel.org/r/20221109203051.1835763-2-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce 'encoded' page pointers with embedded extra bits · 70fb4fdf
      Linus Torvalds authored
      We already have this notion in parts of the MM code (see the mlock code
      with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
      case, and I refuse to do the same thing we've done before where we just
      put bits in the raw pointer and say it's still a normal pointer.
      
      So this introduces a 'struct encoded_page' pointer that cannot be used for
      anything else than to encode a real page pointer and a couple of extra
      bits in the low bits.  That way the compiler can trivially track the state
      of the pointer and you just explicitly encode and decode the extra bits.
      
      Note that this makes the alignment of 'struct page' explicit even for the
      case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set.  That is entirely
      redundant in almost all cases, since the page structure already contains
      several word-sized entries.
      
      However, on m68k, the alignment of even 32-bit data is just 16 bits, and
      as such in theory the alignment of 'struct page' could be too.  So let's
      just make it very very explicit that the alignment needs to be at least 32
      bits, giving us a guarantee of two unused low bits in the pointer.
      
      Now, in practice, our page struct array is aligned much more than that
      anyway, even on m68k, and our existing code in mm/mlock.c obviously
      already depended on that.  But since the whole point of this change is to
      be careful about the type system when hiding extra bits in the pointer,
      let's also be explicit about the assumptions we make.
      
      NOTE!  This is being very careful in another way too: it has a build-time
      assertion that the 'flags' added to the page pointer actually fit in the
      two bits.  That means that this helper must be inlined, and can only be
      used in contexts where the compiler can statically determine that the
      value fits in the available bits.
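
      The idea can be sketched in user-space C as follows.  This is a
      simplified illustration rather than the kernel code: the helper
      names, the ENCODE_PAGE_BITS mask and the one-field struct page are
      assumptions, and a runtime assert() stands in for the build-time
      assertion described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's 'struct page'.  The explicit
 * 4-byte alignment guarantees two zero low bits in every pointer. */
struct page {
	unsigned long flags;
} __attribute__((aligned(4)));

/* Never defined, so an encoded pointer cannot be dereferenced by accident. */
struct encoded_page;

#define ENCODE_PAGE_BITS 3ul /* mask covering the two low bits */

static inline struct encoded_page *encode_page(struct page *page,
					       unsigned long flags)
{
	/* The extra bits must fit into the unused low bits. */
	assert(flags <= ENCODE_PAGE_BITS);
	return (struct encoded_page *)((uintptr_t)page | flags);
}

static inline unsigned long encoded_page_flags(struct encoded_page *ep)
{
	return (uintptr_t)ep & ENCODE_PAGE_BITS;
}

static inline struct page *encoded_page_ptr(struct encoded_page *ep)
{
	return (struct page *)((uintptr_t)ep & ~ENCODE_PAGE_BITS);
}
```

      Because struct encoded_page is a distinct incomplete type, passing
      an encoded pointer where a struct page * is expected draws a
      compiler diagnostic, so callers have to decode explicitly.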
      
      [akpm@linux-foundation.org: kerneldoc on a forward-declared struct confuses htmldocs]
      Link: https://lore.kernel.org/all/Y2tKixpO4RO6DgW5@tuxmaker.boeblingen.de.ibm.com/
      Link: https://lkml.kernel.org/r/20221109203051.1835763-1-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com> [s390]
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: anon_cow: add mprotect() optimization tests · 07f8bac4
      David Hildenbrand authored
      Let's extend the test to cover the possible mprotect() optimization when
      removing write-protection. mprotect() must not allow write-access to a
      COW-shared page by accident.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove unused savedwrite infrastructure · d6379159
      David Hildenbrand authored
      NUMA hinting no longer uses savedwrite, let's rip it out.
      
      ... and while at it, drop __pte_write() and __pmd_write() on ppc64.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite · 6a56ccbc
      David Hildenbrand authored
      commit b191f9b1 ("mm: numa: preserve PTE write permissions across a
      NUMA hinting fault") added remembering write permissions using ordinary
      pte_write() for PROT_NONE mapped pages to avoid write faults when
      remapping the page !PROT_NONE on NUMA hinting faults.
      
      That commit noted:
      
          The patch looks hacky but the alternatives looked worse. The tidest was
          to rewalk the page tables after a hinting fault but it was more complex
          than this approach and the performance was worse. It's not generally
          safe to just mark the page writable during the fault if it's a write
          fault as it may have been read-only for COW so that approach was
          discarded.
      
      Later, commit 288bc549 ("mm/autonuma: let architecture override how
      the write bit should be stashed in a protnone pte.") introduced a family
      of savedwrite PTE functions that didn't necessarily improve the whole
      situation.
      
      One confusing thing is that nowadays, if a page is pte_protnone()
      and pte_savedwrite() then also pte_write() is true. Another source of
      confusion is that there is only a single pte_mk_savedwrite() call in the
      kernel. All other write-protection code seems to silently rely on
      pte_wrprotect().
      
      Ever since PageAnonExclusive was introduced and we started using it in
      mprotect context via commit 64fe24a3 ("mm/mprotect: try avoiding write
      faults for exclusive anonymous pages when changing protection"), we do
      have machinery in place to avoid write faults when changing protection,
      which is exactly what we want to do here.
      
      Let's similarly do what ordinary mprotect() does nowadays when upgrading
      write permissions and reuse can_change_pte_writable() and
      can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
      writable.
      
      For anonymous pages there should be absolutely no change: if an
      anonymous page is not exclusive, it could not have been mapped writable --
      because only exclusive anonymous pages can be mapped writable.
      
      However, there *might* be a change for writable shared mappings that
      require writenotify: if they are not dirty, we cannot map them writable.
      While it might not matter in practice, we'd need a different way to
      identify whether writenotify is actually required -- and ordinary mprotect
      would benefit from that as well.
      
      Note that we don't optimize for the actual migration case:
      (1) When migration succeeds the new PTE will not be writable because the
          source PTE was not writable (protnone); in the future we
          might just optimize that case similarly by reusing
          can_change_pte_writable()/can_change_pmd_writable() when removing
          migration PTEs.
      (2) When migration fails, we'd have to recalculate the "writable" flag
          because we temporarily dropped the PT lock; for now keep it simple and
          set "writable=false".
      
      We'll remove all savedwrite leftovers next.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: factor out check whether manual PTE write upgrades are required · eb309ec8
      David Hildenbrand authored
      Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be
      reused in NUMA hinting fault context soon.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eb309ec8
    • David Hildenbrand's avatar
      mm/huge_memory: try avoiding write faults when changing PMD protection · c27f479e
      David Hildenbrand authored
      Let's replicate what we have for PTEs in can_change_pte_writable() also
      for PMDs.
      
      While this might look like a pure performance improvement, we'll use this to
      get rid of savedwrite handling in do_huge_pmd_numa_page() next. Position
      do_huge_pmd_numa_page() strategically for that purpose.
      
      Note that MM_CP_TRY_CHANGE_WRITABLE is currently only set when we come
      via mprotect_fixup().
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c27f479e
    • David Hildenbrand's avatar
      mm/mprotect: minor can_change_pte_writable() cleanups · 7ea7e333
      David Hildenbrand authored
      We want to replicate this code for handling PMDs soon.
      
      (1) No need to crash the kernel, warning and rejecting is good enough. As
          this will no longer get optimized out, drop the pte_write() check: no
          harm would be done.
      
      (2) Add a comment why PROT_NONE mapped pages are excluded.
      
      (3) Add a comment regarding MAP_SHARED handling and why we rely on the
          dirty bit in the PTE.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7ea7e333
    • Nadav Amit's avatar
      mm/mprotect: allow clean exclusive anon pages to be writable · d8488773
      Nadav Amit authored
      Patch series "mm/autonuma: replace savedwrite infrastructure", v2.
      
      As discussed in my talk at LPC, we can reuse the same mechanism for
      deciding whether to map a pte writable when upgrading permissions via
      mprotect() -- e.g., PROT_READ -> PROT_READ|PROT_WRITE -- to replace the
      savedwrite infrastructure used for NUMA hinting faults (e.g., PROT_NONE ->
      PROT_READ|PROT_WRITE).
      
      Instead of maintaining previous write permissions for a pte/pmd, we
      re-determine if the pte/pmd can be writable.  The big benefit is that we
      have a common logic for deciding whether we can map a pte/pmd writable on
      protection changes.
      
      For private mappings, there should be no difference -- from what I
      understand, that is what autonuma benchmarks care about.
      
      I ran autonumabench for v1 on a system with 2 NUMA nodes, 96 GiB each via:
      	perf stat --null --repeat 10
      The numa01 benchmark is quite noisy in my environment and I failed to
      reduce the noise so far.
      
      numa01:
      	mm-unstable:   146.88 +- 6.54 seconds time elapsed  ( +-  4.45% )
      	mm-unstable++: 147.45 +- 13.39 seconds time elapsed  ( +-  9.08% )
      
      numa02:
      	mm-unstable:   16.0300 +- 0.0624 seconds time elapsed  ( +-  0.39% )
      	mm-unstable++: 16.1281 +- 0.0945 seconds time elapsed  ( +-  0.59% )
      
      It is worth noting that for shared writable mappings that require
      writenotify, we will only avoid write faults if the pte/pmd is dirty
      (inherited from the older mprotect logic).  If we ever care about
      optimizing that further, we'd need a different mechanism to identify
      whether the FS still needs to get notified on the next write access.
      
      In any case, such an optimization will then not be autonuma-specific, but
      mprotect() permission upgrades would similarly benefit from it.
      
      
      This patch (of 7):
      
      Anonymous pages might have the dirty bit clear, but this should not
      prevent mprotect from making them writable if they are exclusive. 
      Therefore, skip the test whether the page is dirty in this case.
      
      Note that there are already other ways to get a writable PTE mapping an
      anonymous page that is clean: for example, via MADV_FREE.  In an ideal
      world, we'd have a different indication from the FS whether writenotify is
      still required.
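
The decision described in this series can be modelled roughly as follows. This is a minimal sketch, not the kernel code: the `Pte` class, its fields, and `can_map_writable()` are hypothetical names for illustration, and it covers only the conditions named in these commit messages (the real check may perform further tests not described here).

```python
from dataclasses import dataclass


@dataclass
class Pte:
    """Toy stand-in for the PTE state the checks look at (hypothetical fields)."""
    protnone: bool = False        # PROT_NONE-mapped page
    dirty: bool = False           # dirty bit set in the PTE
    anon: bool = False            # maps an anonymous page
    anon_exclusive: bool = False  # that anonymous page is exclusive


def can_map_writable(vma_write: bool, vma_shared: bool, pte: Pte) -> bool:
    """Decide whether a PTE may be mapped writable without a write fault."""
    if not vma_write:
        return False  # the mapping does not permit writes: reject
    if pte.protnone:
        return False  # PROT_NONE-mapped pages are excluded
    if not vma_shared:
        # Private mapping: exclusive anonymous pages qualify, clean or dirty.
        return pte.anon and pte.anon_exclusive
    # Shared mapping: a clean PTE may still need a write fault for writenotify.
    return pte.dirty


clean_exclusive = Pte(anon=True, anon_exclusive=True, dirty=False)
print(can_map_writable(True, False, clean_exclusive))
```

The point of the patch is visible in the private-mapping branch: the dirty bit is not consulted for exclusive anonymous pages, while shared mappings still depend on it.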
      
      [david@redhat.com: return directly; update description]
      Link: https://lkml.kernel.org/r/20221108174652.198904-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20221108174652.198904-2-david@redhat.com
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d8488773
    • Rong Tao's avatar
      tools/vm/page_owner: ignore page_owner_sort binary · 1a1af17e
      Rong Tao authored
      page_owner_sort was introduced in commit 48c96a36 ("mm/page_owner:
      keep track of page owners"), and we should ignore it.
      
      Link: https://lkml.kernel.org/r/tencent_F6CAC0ABE16839E2B2419BD07316DA65BB06@qq.com
      Signed-off-by: Rong Tao <rongtao@cestc.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1a1af17e
    • Hugh Dickins's avatar
      mm,thp,rmap: clean up the end of __split_huge_pmd_locked() · 96d82deb
      Hugh Dickins authored
      It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
      HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
      HPAGE_PMD_NR page_remove_rmap() loop below it.
      
      It's just a mistake to add rmaps in the "freeze" (insert migration entries
      prior to splitting huge page) case: the pmd_migration case already avoids
      doing that, so just follow its lead.  page_ref_add() versus put_page()
      likewise.  But why is one more put_page() needed in the "freeze" case? 
      Because it's removing the pmd rmap, already removed when pmd_migration
      (and freeze and pmd_migration are mutually exclusive cases).
      
      Link: https://lkml.kernel.org/r/d43748aa-fece-e0b9-c4ab-f23c9ebc9011@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96d82deb
    • Hugh Dickins's avatar
      mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped · 4b51634c
      Hugh Dickins authored
      Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? 
      Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
      but if we slightly abuse subpages_mapcount by additionally demanding that
      one bit be set there when the compound page is PMD-mapped, then a cascade
      of two atomic ops is able to maintain the stats without bit_spin_lock.
      
      This is harder to reason about than when bit_spin_locked, but I believe
      safe; and no drift in stats detected when testing.  When there are racing
      removes and adds, of course the sequence of operations is less well-
      defined; but each operation on subpages_mapcount is atomically good.  What
      might be disastrous, is if subpages_mapcount could ever fleetingly appear
      negative: but the pte lock (or pmd lock) these rmap functions are called
      under, ensures that a last remove cannot race ahead of a first add.
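
As a rough illustration of the idea, the sketch below models one counter that holds both the number of PTE-mapped subpages and a flag bit for "PMD-mapped". It is a single-threaded toy, so it says nothing about the races discussed above; the constant values, the `ToyTHP` class, and its method names are assumptions made for this example and are not taken from the commit text.

```python
COMPOUND_MAPPED = 0x800000             # assumed flag bit, above any subpage count
SUBPAGES_MAPPED = COMPOUND_MAPPED - 1  # mask for the PTE-mapped subpage count
HPAGE_PMD_NR = 512                     # subpages in a 2M THP with 4K base pages


class ToyTHP:
    def __init__(self):
        self.subpages_mapcount = 0          # PTE-mapped subpage count | flag
        self.pte_maps = [0] * HPAGE_PMD_NR  # per-subpage PTE map counts
        self.pmd_maps = 0                   # compound (PMD) map count

    def add_pte_map(self, i):
        """Map subpage i by PTE; return how many pages became newly mapped."""
        self.pte_maps[i] += 1               # first op of the cascade
        if self.pte_maps[i] != 1:
            return 0                        # subpage was already PTE-mapped
        self.subpages_mapcount += 1         # second op of the cascade
        # If the flag is set, the page was already counted via the PMD map.
        return 1 if self.subpages_mapcount < COMPOUND_MAPPED else 0

    def add_pmd_map(self):
        """Map the whole THP by PMD; return how many pages became newly mapped."""
        self.pmd_maps += 1                  # first op of the cascade
        if self.pmd_maps != 1:
            return 0                        # already PMD-mapped
        self.subpages_mapcount += COMPOUND_MAPPED  # second op of the cascade
        # Subpages that are already PTE-mapped were counted earlier.
        return HPAGE_PMD_NR - (self.subpages_mapcount & SUBPAGES_MAPPED)


thp = ToyTHP()
newly_mapped = thp.add_pte_map(0) + thp.add_pte_map(1) + thp.add_pmd_map()
print(newly_mapped, hex(thp.subpages_mapcount))
```

In this model the statistics stay exact without scanning subpages: each operation learns from the value it gets back from the counter whether the pages it touched were already accounted for.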
      
      Continue to make an exception for hugetlb (PageHuge) pages, though that
      exception can be easily removed by a further commit if necessary: leave
      subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
      carry on checking compound_mapcount too in folio_mapped(), page_mapped().
      
      Evidence is that this way goes slightly faster than the previous
      implementation in all cases (pmds after ptes now taking around 103ms); and
      relieves us of worrying about contention on the bit_spin_lock.
      
      Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b51634c
    • Hugh Dickins's avatar
      mm,thp,rmap: subpages_mapcount of PTE-mapped subpages · be5ef2d9
      Hugh Dickins authored
      Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.
      
      
      This patch (of 3):
      
      Following suggestion from Linus, instead of counting every PTE map of a
      compound page in subpages_mapcount, just count how many of its subpages
      are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
      NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
      requires updating the count less often.
      
      This does then revert total_mapcount() and folio_mapcount() to needing a
      scan of subpages; but they are inherently racy, and need no locking, so
      Linus is right that the scans are much better done there.  Plus (unlike in
      6.1 and previous) subpages_mapcount lets us avoid the scan in the common
      case of no PTE maps.  And page_mapped() and folio_mapped() remain scanless
      and just as efficient with the new meaning of subpages_mapcount: those are
      the functions which I most wanted to remove the scan from.
      
      The updated page_dup_compound_rmap() is no longer suitable for use by anon
      THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
      that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
      
      Evidence is that this way goes slightly faster than the previous
      implementation for most cases; but significantly faster in the (now
      scanless) pmds after ptes case, which started out at 870ms and was brought
      down to 495ms by the previous series, now takes around 105ms.
      
      Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
      Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be5ef2d9
    • Joao Martins's avatar
      mm/hugetlb_vmemmap: remap head page to newly allocated page · 11aad263
      Joao Martins authored
      Today, with `hugetlb_free_vmemmap=on`, the struct page memory that is freed
      back to the page allocator is as follows: for a 2M hugetlb page it will
      reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and
      for a 1G hugetlb page it will remap the remaining 4095 vmemmap pages.
      Essentially, that means that it breaks the first 4K of a potentially
      contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G
      hugetlb pages). For this reason the memory that is freed back to the page
      allocator cannot be used by hugetlb to allocate huge pages of the same
      size, but rather only of a smaller huge page size:
      
      Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
      having 64G):
      
      * Before allocation:
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      ...
      Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
      
      $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
       31987
      
      * After:
      
      Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
      
      Notice how the memory freed back is put into the 4K / 8K / 16K page
      pools. It allocates a total of 31987 pages (63974M).
      
      To fix this behaviour, rather than remapping the second vmemmap page (thus
      breaking the contiguous block of memory backing the struct pages),
      repopulate the first vmemmap page with a new one. We allocate and copy
      from the currently mapped vmemmap page, and then remap it later on.
      The same algorithm works if there's a pre-initialized walk::reuse_page
      and the head page doesn't need to be skipped; instead we remap it
      when the @addr being changed is the @reuse_addr.
      
      The new head page is allocated in vmemmap_remap_free() given that on
      restore there's no need for functional change. Note that, because right
      now one hugepage is remapped at a time, only one free 4K page at a time
      is needed to remap the head page. Should it fail to allocate said new
      page, it reuses the one that's already mapped just like before. As a
      result, for every 64G of contiguous hugepages it can give back 1G more
      of contiguous memory, while needing in total 128M of new 4K pages
      (for 2M hugetlb) or 256k (for 1G hugetlb).
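
The sizes quoted above can be checked with a short calculation. This is only a sketch: it assumes 4K base pages and a 64-byte struct page (the value that reproduces the 32K and 16M figures in the text), and the function names are made up for this example.

```python
PAGE_SIZE = 4096       # base page size in bytes
STRUCT_PAGE_SIZE = 64  # assumed sizeof(struct page)
MIB = 1 << 20
GIB = 1 << 30


def vmemmap_pages(hugepage_bytes):
    """Number of 4K vmemmap pages holding the struct pages of one huge page."""
    base_pages = hugepage_bytes // PAGE_SIZE
    return base_pages * STRUCT_PAGE_SIZE // PAGE_SIZE


def new_head_page_bytes(total_bytes, hugepage_bytes):
    """Memory needed for one freshly allocated head vmemmap page per huge page."""
    return (total_bytes // hugepage_bytes) * PAGE_SIZE


for name, size in (("2M", 2 * MIB), ("1G", GIB)):
    pages = vmemmap_pages(size)
    print(name,
          "vmemmap pages:", pages,
          "remapped:", pages - 1,
          "vmemmap bytes:", pages * PAGE_SIZE,
          "new head pages per 64G:", new_head_page_bytes(64 * GIB, size))
```

Under these assumptions a 2M huge page has 8 vmemmap pages (7 remapped) and a 1G huge page has 4096 (4095 remapped), and 64G of 2M huge pages carries 1G of vmemmap in total, which matches the figures in the paragraph above.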
      
      After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
      guest, each node with 64G):
      
      * Before allocation
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      ...
      Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
      
      $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      32394
      
      * After:
      
      Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0
      
      In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M
      more, out of the 32394 (64788M) allocated. So the memory freed back is
      indeed being reused by hugetlb, and no massive number of order-0..order-2
      pages accumulates unused.
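
The numbers in the two runs can be cross-checked with a small script. This is a sketch for illustration only: the rows are copied from the Node 0 Movable "After" lines above, 4K base pages are assumed, and `free_kib()` is a helper invented for this example.

```python
HUGEPAGE_MIB = 2  # size of one hugetlb page in MiB


def free_kib(counts, base_kib=4):
    """Total free memory in KiB for per-order free block counts (order 0 first)."""
    return sum(n * (base_kib << order) for order, n in enumerate(counts))


# Free block counts for orders 0..10 after allocating the huge pages.
after_old = [30893, 32006, 31515, 7, 0, 0, 0, 0, 0, 0, 0]
after_new = [0, 50, 97, 108, 96, 81, 70, 46, 18, 0, 0]

extra_pages = 32394 - 31987
print("extra huge pages:", extra_pages, "=", extra_pages * HUGEPAGE_MIB, "MiB")
print("left on free lists before the change:", free_kib(after_old), "KiB")
print("left on free lists after the change: ", free_kib(after_new), "KiB")
```

Before the change, nearly all of the leftover free memory sits in order-0 to order-2 blocks; after it, far less memory is left over and it is spread across higher orders.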
      
      [joao.m.martins@oracle.com: v3]
        Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
      [joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
        Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      11aad263
    • SeongJae Park's avatar
      selftests/damon: test non-context inputs to rm_contexts file · d7ec8f42
      SeongJae Park authored
      There was a bug[1] that was triggered by writing non-context DAMON debugfs
      file names to the 'rm_contexts' DAMON debugfs file.  Add a selftest for
      the bug to prevent it from happening again.
      
      [1] https://lore.kernel.org/damon/000000000000ede3ac05ec4abf8e@google.com/
      
      Link: https://lkml.kernel.org/r/20221107165001.5717-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d7ec8f42
    • Hugh Dickins's avatar
      mm,thp,rmap: handle the normal !PageCompound case first · d8dd5e97
      Hugh Dickins authored
      Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
      propagated the "if (compound) {lock} else if (PageCompound) {lock} else
      {atomic}" pattern throughout; but Linus hated the way that gives primacy
      to the uncommon case: switch to "if (!PageCompound) {atomic} else if
      (compound) {lock} else {lock}" throughout.  Linus has a bigger idea for
      how to improve it all, but here just make that rearrangement.
      
      Link: https://lkml.kernel.org/r/fca2f694-2098-b0ef-d4e-f1d8b94d318c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d8dd5e97
    • Hugh Dickins's avatar
      mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts · 9bd3155e
      Hugh Dickins authored
      Fix the races in maintaining compound_mapcount, subpages_mapcount and
      subpage _mapcount by using PG_locked in the first tail of any compound
      page for a bit_spin_lock() on such modifications; skipping the usual
      atomic operations on those fields in this case.
      
      Bring page_remove_file_rmap() and page_remove_anon_compound_rmap() back
      into page_remove_rmap() itself.  Rearrange page_add_anon_rmap() and
      page_add_file_rmap() and page_remove_rmap() to follow the same "if
      (compound) {lock} else if (PageCompound) {lock} else {atomic}" pattern
      (with a PageTransHuge in the compound test, like before, to avoid BUG_ONs
      and optimize away that block when THP is not configured).  Move all the
      stats updates outside, after the bit_spin_locked section, so that it is
      sure to be a leaf lock.
      
      Add page_dup_compound_rmap() to manage compound locking versus atomics in
      sync with the rest.  In particular, hugetlb pages are still using the
      atomics: to avoid unnecessary interference there, and because they never
      have subpage mappings; but this exception can easily be changed. 
      Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's
      __split_huge_pmd_locked() too.
      
      bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT
      sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure is to
      the non-hugetlb non-THP pte-mapped compound pages (with large folios being
      currently dependent on TRANSPARENT_HUGEPAGE).  There is never any scan of
      subpages in this case; but we have chosen to use PageCompound tests rather
      than PageTransCompound tests to gate the use of lock_compound_mapcounts(),
      so that page_mapped() is correct on all compound pages, whether or not
      TRANSPARENT_HUGEPAGE is enabled: could that be a problem for PREEMPT_RT,
      when there is contention on the lock - under heavy concurrent forking for
      example?  If so, then it can be turned into a sleeping lock (like
      folio_lock()) when PREEMPT_RT.
      
      A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
      18 seconds on small pages, and used to take 1 second on huge pages, but
      now takes 115 milliseconds on huge pages.  Mapping by pmds a second time
      used to take 860ms and now takes 86ms; mapping by pmds after mapping by
      ptes (when the scan is needed) used to take 870ms and now takes 495ms. 
      Mapping huge pages by ptes is largely unaffected but variable: between 5%
      faster and 5% slower in what I've recorded.  Contention on the lock is
      likely to behave worse than contention on the atomics behaved.
      
      Link: https://lkml.kernel.org/r/1b42bd1a-8223-e827-602f-d466c2db7d3c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9bd3155e
    • Hugh Dickins's avatar
      mm,thp,rmap: simplify compound page mapcount handling · cb67f428
      Hugh Dickins authored
      Compound page (folio) mapcount calculations have been different for anon
      and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. 
      And each huge mapping and unmapping of a file (or shmem) THP involved
      atomically incrementing and decrementing the mapcount of every subpage of
      that huge page, dirtying many struct page cachelines.
      
      Add subpages_mapcount field to the struct folio and first tail page, so
      that the total of subpage mapcounts is available in one place near the
      head: then page_mapcount() and total_mapcount() and page_mapped(), and
      their folio equivalents, are so quick that anon and file and hugetlb don't
      need to be optimized differently.  Delete the unloved PageDoubleMap.
      
      page_add and page_remove rmap functions must now maintain the
      subpages_mapcount as well as the subpage _mapcount, when dealing with pte
      mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
      NR_FILE_MAPPED statistics still needs reading through the subpages, using
      nr_subpages_unmapped() - but only when first or last pmd mapping finds
      subpages_mapcount raised (double-map case, not the common case).
      
      But are those counts (used to decide when to split an anon THP, and in
      vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
      quite: since page_remove_rmap() (and also split_huge_pmd()) is often
      called without page lock, there can be races when a subpage pte mapcount
      0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
      previous implementation had prevented.  The statistics might become
      inaccurate, and even drift down until they underflow through 0.  That is
      not good enough, but is better dealt with in a followup patch.
      
      Update a few comments on first and second tail page overlaid fields. 
      hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
      subpages_mapcount and compound_pincount are already correctly at 0, so
      delete its reinitialization of compound_pincount.
      
      A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
      18 seconds on small pages, and used to take 1 second on huge pages, but
      now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
      used to take 860ms and now takes 92ms; mapping by pmds after mapping by
      ptes (when the scan is needed) used to take 870ms and now takes 495ms. 
      But there might be some benchmarks which would show a slowdown, because
      tail struct pages now fall out of cache until final freeing checks them.
      
      Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cb67f428
    • Hugh Dickins's avatar
      mm,hugetlb: use folio fields in second tail page · dad6a5eb
      Hugh Dickins authored
      Patch series "mm,huge,rmap: unify and speed up compound mapcounts".
      
      
      This patch (of 3):
      
      We want to declare one more int in the first tail of a compound page: that
      first tail page being valuable property, since every compound page has a
      first tail, but perhaps no more than that.
      
      No problem on 64-bit: there is already space for it.  No problem with
      32-bit THPs: 5.18 commit 5232c63f ("mm: Make compound_pincount always
      available") kindly cleared the space for it, apparently not realizing that
      only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail
      page->private might conflict) - but make sure of that in its Kconfig.
      
      But hugetlb pages use tail page->private of the first tail page for a
      subpool pointer, which will conflict; and they also use page->private of
      the 2nd, 3rd and 4th tails.
      
      Undo "mm: add private field of first tail to struct page and struct
      folio"'s recent addition of private_1 to the folio tail: instead add
      hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison to
      a second tail page of the folio: THP has long been using several fields of
      that tail, so make better use of it for hugetlb too.  This is not how a
      generic folio should be declared in future, but it is an effective
      transitional way to make use of it.
      
      Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.
      
      [hughd@google.com: prefix folio's page_1 and page_2 with double underscore,
        give folio's _flags_2 and _head_2 a line documentation each]
        Link: https://lkml.kernel.org/r/9e2cb6b-5b58-d3f2-b5ee-5f8a14e8f10@google.com
      Link: https://lkml.kernel.org/r/5f52de70-975-e94f-f141-543765736181@google.com
      Link: https://lkml.kernel.org/r/3818cc9a-9999-d064-d778-9c94c5911e6@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dad6a5eb
    • Mike Kravetz's avatar
      selftests/vm: update hugetlb madvise · 634ba645
      Mike Kravetz authored
      Commit 8ebe0a5e ("mm,madvise,hugetlb: fix unexpected data loss with
      MADV_DONTNEED on hugetlbfs") changed how the passed length was interpreted
      for hugetlb mappings.  It was changed from align up to align down.  The
      hugetlb-madvise test explicitly tests this behavior.  Change the test to
      expect the new behavior.
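
As a rough illustration of the two rounding rules mentioned above, the following C sketch rounds a length up and down to a huge page boundary. The 2 MiB huge page size and the helper names `align_up` and `align_down` are assumptions made for this example; they are not taken from the test or from the kernel.

```c
#include <assert.h>

/* Hypothetical huge page size for illustration: 2 MiB. */
#define SKETCH_HPAGE_SIZE (2UL * 1024 * 1024)

/* Round len down to a multiple of align (align must be a power of two). */
static unsigned long align_down(unsigned long len, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return len & ~(align - 1);
}

/* Round len up to a multiple of align (align must be a power of two). */
static unsigned long align_up(unsigned long len, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}
```

For a length of one huge page plus 4096 bytes, `align_up` yields two huge pages while `align_down` yields one, so under the align-down rule the partially covered trailing huge page is left untouched.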
      
      Link: https://lkml.kernel.org/r/20221104011632.357049-1-mike.kravetz@oracle.com
      Link: https://lore.kernel.org/oe-lkp/202211040619.2ec447d7-oliver.sang@intel.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      634ba645
    • Deming Wang's avatar
      zsmalloc: replace IS_ERR() with IS_ERR_VALUE() · 65917b53
      Deming Wang authored
      Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE()
      instead.
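
The difference can be sketched in user space as follows. This is a simplified re-creation, not the kernel's macros: `sketch_is_err()` and `sketch_is_err_value()` are hypothetical stand-ins for `IS_ERR()` and `IS_ERR_VALUE()`, and the example assumes the value being checked is an integer handle that may encode a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of the errno range reserved at the top of the address space. */
#define SKETCH_MAX_ERRNO 4095

/* Stand-in for IS_ERR_VALUE(): checks an integer value directly. */
static bool sketch_is_err_value(uintptr_t value)
{
	return value >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Stand-in for IS_ERR(): takes a pointer, so integer callers must cast. */
static bool sketch_is_err(const void *ptr)
{
	return sketch_is_err_value((uintptr_t)ptr);
}

/* Hypothetical allocator returning an integer handle or an encoded errno. */
static uintptr_t sketch_alloc_handle(bool fail)
{
	return fail ? (uintptr_t)-ENOMEM : (uintptr_t)0x1000;
}
```

With an integer handle, `sketch_is_err((void *)handle)` needs a cast at every call site, whereas `sketch_is_err_value(handle)` gives the same answer without one.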
      
      Link: https://lkml.kernel.org/r/20221104023818.1728-1-wangdeming@inspur.com
      Signed-off-by: Deming Wang <wangdeming@inspur.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65917b53
    • Peter Xu's avatar
      mm: use pte markers for swap errors · 15520a3f
      Peter Xu authored
      PTE markers are an ideal mechanism for things like SWP_SWAPIN_ERROR.
      Using a whole swap entry type for this purpose can be overkill,
      especially if we already have PTE markers.  Define a new PTE marker bit
      for swapin errors and use it in place of SWP_SWAPIN_ERROR.  Then we can
      safely drop SWP_SWAPIN_ERROR and give one device slot back to swap.
      
      SWP_SWAPIN_ERROR used to carry the page pfn as part of the swap entry,
      but the pfn is never used.  Nor do I see how it could be useful, because
      a swapin failure is normally caused by a bad swap device rather than a
      bad page.  Drop it as well.
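
The trade described above, one marker bit instead of a whole swap entry type, can be sketched as plain bit flags. Everything here is an assumption for illustration: the `SKETCH_*` names, the bit positions, the presence of an earlier marker bit, and the type counts passed to `device_slots()` are all hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long sketch_marker_t;	/* hypothetical marker word */

#define SKETCH_MARKER_EXISTING     (1UL << 0)	/* some earlier marker bit (assumed) */
#define SKETCH_MARKER_SWAPIN_ERROR (1UL << 1)	/* the newly defined bit (assumed position) */
#define SKETCH_MARKER_MASK         ((1UL << 2) - 1)

/* True if the marker records a failed swapin. */
static bool marker_has_swapin_error(sketch_marker_t marker)
{
	return (marker & SKETCH_MARKER_MASK & SKETCH_MARKER_SWAPIN_ERROR) != 0;
}

/* Swap types left for real devices once special types are reserved. */
static unsigned int device_slots(unsigned int total_types,
				 unsigned int special_types)
{
	assert(special_types <= total_types);
	return total_types - special_types;
}
```

Because several marker bits can share one special swap entry type, retiring a dedicated error type reduces `special_types` by one, and `device_slots()` grows by one for the same `total_types`.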
      
      Link: https://lkml.kernel.org/r/20221030214151.402274-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15520a3f
    • Peter Xu's avatar
      mm: always compile in pte markers · ca92ea3d
      Peter Xu authored
      Patch series "mm: Use pte marker for swapin errors".
      
      This series uses a pte marker to replace the swapin error swap entry,
      which frees one more swap entry slot for swap devices.  A new pte marker
      bit is defined.
      
      
      This patch (of 2):
      
      The PTE markers code is tiny and is now enabled on most distributions.
      It would be fine to keep it as-is, but to make broader use of it (e.g.
      replacing the read error swap entry) it needs to always be there;
      otherwise we need a special code path to take care of the !PTE_MARKER
      case.

      It'll be easier to just make pte markers always exist.  Use this chance
      to extend their usage to anonymous memory too by simply touching up some
      of the old comments, because they'll be used for anonymous pages in the
      follow-up patches.
      
      Link: https://lkml.kernel.org/r/20221030214151.402274-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221030214151.402274-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ca92ea3d