1. 12 Sep, 2024 8 commits
    • David Howells's avatar
      netfs: Speed up buffered reading · ee4cdf7b
      David Howells authored
      Improve the efficiency of buffered reads in a number of ways:
      
       (1) Overhaul the algorithm in general so that it's a lot more compact and
           split the read submission code between buffered and unbuffered
           versions.  The unbuffered version can be vastly simplified.
      
       (2) Read-result collection is handed off to a work queue rather than being
           done in the I/O thread.  Multiple subrequests can be processes
           simultaneously.
      
       (3) When a subrequest is collected, any folios it fully spans are
           collected and "spare" data on either side is donated to either the
           previous or the next subrequest in the sequence.
      
      Notes:
      
       (*) Readahead expansion is massively slows down fio, presumably because it
           causes a load of extra allocations, both folio and xarray, up front
           before RPC requests can be transmitted.
      
       (*) RDMA with cifs does appear to work, both with SIW and RXE.
      
       (*) PG_private_2-based reading and copy-to-cache is split out into its own
           file and altered to use folio_queue.  Note that the copy to the cache
           now creates a new write transaction against the cache and adds the
           folios to be copied into it.  This allows it to use part of the
           writeback I/O code.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      ee4cdf7b
    • David Howells's avatar
      afs: Make read subreqs async · 2e45b922
      David Howells authored
      Perform AFS read subrequests in a work item rather than in the calling
      thread.  For normal buffered reads, this will allow the calling thread to
      copy data from the pagecache to the application at the same time as the
      demarshalling thread is shovelling data from skbuffs into the pagecache.
      
      This will also allow the RA mark to trigger a new read before we've
      finished shovelling the data from the current one.
      
      Note: This would be a bit safer if the FS.FetchData RPC ops returned the
      metadata (including the data version number) before returning the data.
      This would allow me to flush the pagecache before installing the new data.
      
      In future, it may be possible to asynchronously flush the pagecache either
      side of the region being read.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-afs@lists.infradead.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-19-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      2e45b922
    • David Howells's avatar
      netfs: Simplify the writeback code · 983cdcf8
      David Howells authored
      Use the new folio_queue structures to simplify the writeback code.  The
      problem with referring to the i_pages xarray directly is that we may have
      gaps in the sequence of folios we're writing from that we need to skip when
      we're removing the writeback mark from the folios we're writing back from.
      
      At the moment the code tries to deal with this by carefully tracking the
      gaps in each writeback stream (eg. write to server and write to cache) and
      divining when there's a gap that spans folios (something that's not helped
      by folios not being a consistent size).
      
      Instead, the folio_queue buffer contains pointers only the folios we're
      dealing with, has them in ascending order and indicates a gap by placing
      non-consequitive folios next to each other.  This makes it possible to
      track where we need to clean up to by just keeping track of where we've
      processed to on each stream and taking the minimum.
      
      Note that the I/O iterator is always rounded up to the end of the folio,
      even if that is beyond the EOF position, so that the cache can do DIO from
      the page.  The excess space is cleared, though mmapped writes clobber it.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      983cdcf8
    • David Howells's avatar
      netfs: Provide an iterator-reset function · bfaa33b8
      David Howells authored
      Provide a function to reset the iterator on a subrequest.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-17-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      bfaa33b8
    • David Howells's avatar
      netfs: Use new folio_queue data type and iterator instead of xarray iter · cd0277ed
      David Howells authored
      Make the netfs write-side routines use the new folio_queue struct to hold a
      rolling buffer of folios, with the issuer adding folios at the tail and the
      collector removing them from the head as they're processed instead of using
      an xarray.
      
      This will allow a subsequent patch to simplify the write collector.
      
      The primary mark (as tested by folioq_is_marked()) is used to note if the
      corresponding folio needs putting.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      cd0277ed
    • David Howells's avatar
      cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs · c45ebd63
      David Howells authored
      Make smb_extract_iter_to_rdma() extract page fragments from an ITER_FOLIOQ
      iterator into RDMA SGEs.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Steve French <sfrench@samba.org>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Tom Talpey <tom@talpey.com>
      cc: Enzo Matsumiya <ematsumiya@suse.de>
      cc: linux-cifs@vger.kernel.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-15-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      c45ebd63
    • David Howells's avatar
      iov_iter: Provide copy_folio_from_iter() · 197a3de6
      David Howells authored
      Provide a copy_folio_from_iter() wrapper.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Alexander Viro <viro@zeniv.linux.org.uk>
      cc: Christian Brauner <christian@brauner.io>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-14-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      197a3de6
    • David Howells's avatar
      mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios · db0aa2e9
      David Howells authored
      Define a data structure, struct folio_queue, to represent a sequence of
      folios and a kernel-internal I/O iterator type, ITER_FOLIOQ, to allow a
      list of folio_queue structures to be used to provide a buffer to
      iov_iter-taking functions, such as sendmsg and recvmsg.
      
      The folio_queue structure looks like:
      
      	struct folio_queue {
      		struct folio_batch	vec;
      		u8			orders[PAGEVEC_SIZE];
      		struct folio_queue	*next;
      		struct folio_queue	*prev;
      		unsigned long		marks;
      		unsigned long		marks2;
      	};
      
      It does not use a list_head so that next and/or prev can be set to NULL at
      the ends of the list, allowing iov_iter-handling routines to determine that
      they *are* the ends without needing to store a head pointer in the iov_iter
      struct.
      
      A folio_batch struct is used to hold the folio pointers which allows the
      batch to be passed to batch handling functions.  Two mark bits are
      available per slot.  The intention is to use at least one of them to mark
      folios that need putting, but that might not be ultimately necessary.
      Accessor functions are used to access the slots to do the masking and an
      additional accessor function is used to indicate the size of the array.
      
      The order of each folio is also stored in the structure to avoid the need
      for iov_iter_advance() and iov_iter_revert() to have to query each folio to
      find its size.
      
      With careful barriering, this can be used as an extending buffer with new
      folios inserted and new folio_queue structs added without the need for a
      lock.  Further, provided we always keep at least one struct in the buffer,
      we can also remove consumed folios and consumed structs from the head end
      as we without the need for locks.
      
      [Questions/thoughts]
      
       (1) To manage this, I need a head pointer, a tail pointer, a tail slot
           number (assuming insertion happens at the tail end and the next
           pointers point from head to tail).  Should I put these into a struct
           of their own, say "folio_queue_head" or "rolling_buffer"?
      
           I will end up with two of these in netfs_io_request eventually, one
           keeping track of the pagecache I'm dealing with for buffered I/O and
           the other to hold a bounce buffer when we need one.
      
       (2) Should I make the slots {folio,off,len} or bio_vec?
      
       (3) This is intended to replace ITER_XARRAY eventually.  Using an xarray
           in I/O iteration requires the taking of the RCU read lock, doing
           copying under the RCU read lock, walking the xarray (which may change
           under us), handling retries and dealing with special values.
      
           The advantage of ITER_XARRAY is that when we're dealing with the
           pagecache directly, we don't need any allocation - but if we're doing
           encrypted comms, there's a good chance we'd be using a bounce buffer
           anyway.
      
           This will require afs, erofs, cifs, orangefs and fscache to be
           converted to not use this.  afs still uses it for dirs and symlinks;
           some of erofs usages should be easy to change, but there's one which
           won't be so easy; ceph's use via fscache can be fixed by porting ceph
           to netfslib; cifs is using xarray as a bounce buffer - that can be
           moved to use sheaves instead; and orangefs has a similar problem to
           erofs - maybe orangefs could use netfslib?
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Steve French <sfrench@samba.org>
      cc: Ilya Dryomov <idryomov@gmail.com>
      cc: Gao Xiang <xiang@kernel.org>
      cc: Mike Marshall <hubcap@omnibond.com>
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      cc: linux-afs@lists.infradead.org
      cc: linux-cifs@vger.kernel.org
      cc: ceph-devel@vger.kernel.org
      cc: linux-erofs@lists.ozlabs.org
      cc: devel@lists.orangefs.org
      Link: https://lore.kernel.org/r/20240814203850.2240469-13-dhowells@redhat.com/ # v2
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      db0aa2e9
  2. 05 Sep, 2024 10 commits
  3. 04 Sep, 2024 5 commits
  4. 03 Sep, 2024 2 commits
  5. 02 Sep, 2024 15 commits
    • Linus Torvalds's avatar
      Merge tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux · 67784a74
      Linus Torvalds authored
      Pull ata fix from Damien Le Moal:
      
       - Fix a potential memory leak in the ata host initialization code (from
         Zheng)
      
      * tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
        ata: libata: Fix memory leak for error path in ata_host_alloc()
      67784a74
    • Suren Baghdasaryan's avatar
      alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n · 052a45c1
      Suren Baghdasaryan authored
      codetag_module_init() is used to initialize sections containing allocation
      tags.  This function is used to initialize module sections as well as core
      kernel sections, in which case the module parameter is set to NULL.  This
      function has to be called even when CONFIG_MODULES=n to initialize core
      kernel allocation tag sections.  When CONFIG_MODULES=n, this function is a
      NOP, which is wrong.  This leads to /proc/allocinfo reported as empty. 
      Fix this by making it independent of CONFIG_MODULES.
      
      Link: https://lkml.kernel.org/r/20240828231536.1770519-1-surenb@google.com
      Fixes: 916cc516 ("lib: code tagging framework")
      Signed-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[6.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      052a45c1
    • Adrian Huang's avatar
      mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area · 409faf8c
      Adrian Huang authored
      When running the vmalloc stress on a 448-core system, observe the average
      latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
      'funclatency.py' tool [1].
      
        # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
      
           usecs             : count    distribution
              0 -> 1         : 0       |                                        |
              2 -> 3         : 29      |                                        |
              4 -> 7         : 19      |                                        |
              8 -> 15        : 56      |                                        |
             16 -> 31        : 483     |****                                    |
             32 -> 63        : 1548    |************                            |
             64 -> 127       : 2634    |*********************                   |
            128 -> 255       : 2535    |*********************                   |
            256 -> 511       : 1776    |**************                          |
            512 -> 1023      : 1015    |********                                |
           1024 -> 2047      : 573     |****                                    |
           2048 -> 4095      : 488     |****                                    |
           4096 -> 8191      : 1091    |*********                               |
           8192 -> 16383     : 3078    |*************************               |
          16384 -> 32767     : 4821    |****************************************|
          32768 -> 65535     : 3318    |***************************             |
          65536 -> 131071    : 1718    |**************                          |
         131072 -> 262143    : 2220    |******************                      |
         262144 -> 524287    : 1147    |*********                               |
         524288 -> 1048575   : 1179    |*********                               |
        1048576 -> 2097151   : 822     |******                                  |
        2097152 -> 4194303   : 906     |*******                                 |
        4194304 -> 8388607   : 2148    |*****************                       |
        8388608 -> 16777215  : 4497    |*************************************   |
       16777216 -> 33554431  : 289     |**                                      |
      
        avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
      
        The worst case is over 16-33 seconds, so soft lockup is triggered [2].
      
      [Root Cause]
      1) Each purge_list has the long list. The following shows the number of
         vmap_area is purged.
      
         crash> p vmap_nodes
         vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
         crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
           nr_purged = 663070
           ...
           nr_purged = 821670
           nr_purged = 692214
           nr_purged = 726808
           ...
      
      2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
         operation when purging each vmap_area. However, the iteration is over
         600000 vmap_area (See 'nr_purged' above).
      
         Here is objdump output:
      
           $ objdump -D vmlinux
           ffffffff813e8c80 <purge_vmap_node>:
           ...
           ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
           ...
      
         Quote from "Instruction tables" pdf file [3]:
           Instructions with a LOCK prefix have a long latency that depends on
           cache organization and possibly RAM speed. If there are multiple
           processors or cores or direct memory access (DMA) devices, then all
           locked instructions will lock a cache line for exclusive access,
           which may involve RAM access. A LOCK prefix typically costs more
           than a hundred clock cycles, even on single-processor systems.
      
         That's why the latency of purge_vmap_node() dramatically increases
         on a many-core system: One core is busy on purging each vmap_area of
         the *long* purge_list and executing atomic_long_sub() for each
         vmap_area, while other cores free vmalloc allocations and execute
         atomic_long_add_return() in free_vmap_area_noflush().
      
      [Solution]
      Employ a local variable to record the total purged pages, and execute
      atomic_long_sub() after the traversal of the purge_list is done. The
      experiment result shows the latency improvement is 99%.
      
      [Experiment Result]
      1) System Configuration: Three servers (with HT-enabled) are tested.
           * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
           * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
           * 448-core server: AMD Zen 4 Processor*2
      
      2) Kernel Config
           * CONFIG_KASAN is disabled
      
      3) The data in column "w/o patch" and "w/ patch"
           * Unit: micro seconds (us)
           * Each data is the average of 3-time measurements
      
               System        w/o patch (us)   w/ patch (us)    Improvement (%)
           ---------------   --------------   -------------    -------------
           72-core server          2194              14            99.36%
           192-core server       143799            1139            99.21%
           448-core server      1992122            6883            99.65%
      
      [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
      [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
      [3] https://www.agner.org/optimize/instruction_tables.pdf
      
      Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.comSigned-off-by: default avatarAdrian Huang <ahuang12@lenovo.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      409faf8c
    • Jan Kuliga's avatar
      mailmap: update entry for Jan Kuliga · 4f295229
      Jan Kuliga authored
      Soon I won't be able to use my current email address.
      
      Link: https://lkml.kernel.org/r/20240830095658.1203198-1-jankul@alatek.krakow.plSigned-off-by: default avatarJan Kuliga <jankul@alatek.krakow.pl>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4f295229
    • Hao Ge's avatar
      codetag: debug: mark codetags for poisoned page as empty · 5e9784e9
      Hao Ge authored
      When PG_hwpoison pages are freed they are treated differently in
      free_pages_prepare() and instead of being released they are isolated.
      
      Page allocation tag counters are decremented at this point since the page
      is considered not in use.  Later on when such pages are released by
      unpoison_memory(), the allocation tag counters will be decremented again
      and the following warning gets reported:
      
      [  113.930443][ T3282] ------------[ cut here ]------------
      [  113.931105][ T3282] alloc_tag was not set
      [  113.931576][ T3282] WARNING: CPU: 2 PID: 3282 at ./include/linux/alloc_tag.h:130 pgalloc_tag_sub.part.66+0x154/0x164
      [  113.932866][ T3282] Modules linked in: hwpoison_inject fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_man4
      [  113.941638][ T3282] CPU: 2 UID: 0 PID: 3282 Comm: madvise11 Kdump: loaded Tainted: G        W          6.11.0-rc4-dirty #18
      [  113.943003][ T3282] Tainted: [W]=WARN
      [  113.943453][ T3282] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
      [  113.944378][ T3282] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  113.945319][ T3282] pc : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946016][ T3282] lr : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946706][ T3282] sp : ffff800087093a10
      [  113.947197][ T3282] x29: ffff800087093a10 x28: ffff0000d7a9d400 x27: ffff80008249f0a0
      [  113.948165][ T3282] x26: 0000000000000000 x25: ffff80008249f2b0 x24: 0000000000000000
      [  113.949134][ T3282] x23: 0000000000000001 x22: 0000000000000001 x21: 0000000000000000
      [  113.950597][ T3282] x20: ffff0000c08fcad8 x19: ffff80008251e000 x18: ffffffffffffffff
      [  113.952207][ T3282] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800081746210
      [  113.953161][ T3282] x14: 0000000000000000 x13: 205d323832335420 x12: 5b5d353031313339
      [  113.954120][ T3282] x11: ffff800087093500 x10: 000000000000005d x9 : 00000000ffffffd0
      [  113.955078][ T3282] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008236ba90 x6 : c0000000ffff7fff
      [  113.956036][ T3282] x5 : ffff000b34bf4dc8 x4 : ffff8000820aba90 x3 : 0000000000000001
      [  113.956994][ T3282] x2 : ffff800ab320f000 x1 : 841d1e35ac932e00 x0 : 0000000000000000
      [  113.957962][ T3282] Call trace:
      [  113.958350][ T3282]  pgalloc_tag_sub.part.66+0x154/0x164
      [  113.959000][ T3282]  pgalloc_tag_sub+0x14/0x1c
      [  113.959539][ T3282]  free_unref_page+0xf4/0x4b8
      [  113.960096][ T3282]  __folio_put+0xd4/0x120
      [  113.960614][ T3282]  folio_put+0x24/0x50
      [  113.961103][ T3282]  unpoison_memory+0x4f0/0x5b0
      [  113.961678][ T3282]  hwpoison_unpoison+0x30/0x48 [hwpoison_inject]
      [  113.962436][ T3282]  simple_attr_write_xsigned.isra.34+0xec/0x1cc
      [  113.963183][ T3282]  simple_attr_write+0x38/0x48
      [  113.963750][ T3282]  debugfs_attr_write+0x54/0x80
      [  113.964330][ T3282]  full_proxy_write+0x68/0x98
      [  113.964880][ T3282]  vfs_write+0xdc/0x4d0
      [  113.965372][ T3282]  ksys_write+0x78/0x100
      [  113.965875][ T3282]  __arm64_sys_write+0x24/0x30
      [  113.966440][ T3282]  invoke_syscall+0x7c/0x104
      [  113.966984][ T3282]  el0_svc_common.constprop.1+0x88/0x104
      [  113.967652][ T3282]  do_el0_svc+0x2c/0x38
      [  113.968893][ T3282]  el0_svc+0x3c/0x1b8
      [  113.969379][ T3282]  el0t_64_sync_handler+0x98/0xbc
      [  113.969980][ T3282]  el0t_64_sync+0x19c/0x1a0
      [  113.970511][ T3282] ---[ end trace 0000000000000000 ]---
      
      To fix this, clear the page tag reference after the page got isolated
      and accounted for.
      
      Link: https://lkml.kernel.org/r/20240825163649.33294-1-hao.ge@linux.dev
      Fixes: d224eb02 ("codetag: debug: mark codetags for reserved pages as empty")
      Signed-off-by: default avatarHao Ge <gehao@kylinos.cn>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hao Ge <gehao@kylinos.cn>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>	[6.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5e9784e9
    • Mike Yuan's avatar
      mm/memcontrol: respect zswap.writeback setting from parent cg too · e3992573
      Mike Yuan authored
      Currently, the behavior of zswap.writeback wrt.  the cgroup hierarchy
      seems a bit odd.  Unlike zswap.max, it doesn't honor the value from parent
      cgroups.  This surfaced when people tried to globally disable zswap
      writeback, i.e.  reserve physical swap space only for hibernation [1] -
      disabling zswap.writeback only for the root cgroup results in subcgroups
      with zswap.writeback=1 still performing writeback.
      
      The inconsistency became more noticeable after I introduced the
      MemoryZSwapWriteback= systemd unit setting [2] for controlling the knob.
      The patch assumed that the kernel would enforce the value of parent
      cgroups.  It could probably be workarounded from systemd's side, by going
      up the slice unit tree and inheriting the value.  Yet I think it's more
      sensible to make it behave consistently with zswap.max and friends.
      
      [1] https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate#Disable_zswap_writeback_to_use_the_swap_space_only_for_hibernation
      [2] https://github.com/systemd/systemd/pull/31734
      
      Link: https://lkml.kernel.org/r/20240823162506.12117-1-me@yhndnzj.com
      Fixes: 501a06fe ("zswap: memcontrol: implement zswap writeback disabling")
      Signed-off-by: default avatarMike Yuan <me@yhndnzj.com>
      Reviewed-by: default avatarNhat Pham <nphamcs@gmail.com>
      Acked-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e3992573
    • Marc Zyngier's avatar
      scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum · a3f6a89c
      Marc Zyngier authored
      Richard reports that since 772dd034 ("mm: enumerate all gfp flags"),
      gfp-translate is broken, as the bit numbers are implicit, leaving the
      shell script unable to extract them.  Even more, some bits are now at a
      variable location, making it double extra hard to parse using a simple
      shell script.
      
      Use a brute-force approach to the problem by generating a small C stub
      that will use the enum to dump the interesting bits.
      
      As an added bonus, we are now able to identify invalid bits for a given
      configuration.  As an added drawback, we cannot parse include files that
      predate this change anymore.  Tough luck.
      
      Link: https://lkml.kernel.org/r/20240823163850.3791201-1-maz@kernel.org
      Fixes: 772dd034 ("mm: enumerate all gfp flags")
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Reported-by: default avatarRichard Weinberger <richard@nod.at>
      Cc: Petr Tesařík <petr@tesarici.cz>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a3f6a89c
    • Usama Arif's avatar
      Revert "mm: skip CMA pages when they are not available" · bfe0857c
      Usama Arif authored
      This reverts commit 5da226db ("mm: skip CMA pages when they are not
      available") and b7108d66 ("Multi-gen LRU: skip CMA pages when they are
      not eligible").
      
      lruvec->lru_lock is highly contended and is held when calling
      isolate_lru_folios.  If the lru has a large number of CMA folios
      consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
      isolate_lru_folios can hold the lock for a very long time while it skips
      those.  For FIO workload, ~150million order=0 folios were skipped to
      isolate a few ZONE_DMA folios [1].  This can cause lockups [1] and high
      memory pressure for extended periods of time [2].
      
      Remove skipping CMA for MGLRU as well, as it was introduced in sort_folio
      for the same resaon as 5da226db.
      
      [1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
      [2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/
      
      [usamaarif642@gmail.com: also revert b7108d66, per Johannes]
        Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
        Link: https://lkml.kernel.org/r/357ac325-4c61-497a-92a3-bdbd230d5ec9@gmail.com
      Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
      Fixes: 5da226db ("mm: skip CMA pages when they are not available")
      Signed-off-by: default avatarUsama Arif <usamaarif642@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
      Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bfe0857c
    • Liam R. Howlett's avatar
      maple_tree: remove rcu_read_lock() from mt_validate() · f806de88
      Liam R. Howlett authored
      The write lock should be held when validating the tree to avoid updates
      racing with checks.  Holding the rcu read lock during a large tree
      validation may also cause a prolonged rcu read window and "rcu_preempt
      detected stalls" warnings.
      
      Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
      Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Reported-by: syzbot+036af2f0c7338a33b0cd@syzkaller.appspotmail.com
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f806de88
    • Petr Tesarik's avatar
      kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y · 6dacd79d
      Petr Tesarik authored
      Fix the condition to exclude the elfcorehdr segment from the SHA digest
      calculation.
      
      The j iterator is an index into the output sha_regions[] array, not into
      the input image->segment[] array.  Once it reaches
      image->elfcorehdr_index, all subsequent segments are excluded.  Besides,
      if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
      may be wrongly included in the calculation.
      
      Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
      Fixes: f7cc804a ("kexec: exclude elfcorehdr from the segment digest")
      Signed-off-by: default avatarPetr Tesarik <ptesarik@suse.com>
      Acked-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: Eric DeVolder <eric_devolder@yahoo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6dacd79d
    • Hao Ge's avatar
      mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook · ab7ca095
      Hao Ge authored
      When enable CONFIG_MEMCG & CONFIG_KFENCE & CONFIG_KMEMLEAK, the following
      warning always occurs,This is because the following call stack occurred:
      mem_pool_alloc
          kmem_cache_alloc_noprof
              slab_alloc_node
                  kfence_alloc
      
      Once the kfence allocation is successful,slab->obj_exts will not be empty,
      because it has already been assigned a value in kfence_init_pool.
      
      Since in the prepare_slab_obj_exts_hook function,we perform a check for
      s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE),the alloc_tag_add function
      will not be called as a result.Therefore,ref->ct remains NULL.
      
      However,when we call mem_pool_free,since obj_ext is not empty, it
      eventually leads to the alloc_tag_sub scenario being invoked.  This is
      where the warning occurs.
      
      So we should add corresponding checks in the alloc_tagging_slab_free_hook.
      For __GFP_NO_OBJ_EXT case,I didn't see the specific case where it's using
      kfence,so I won't add the corresponding check in
      alloc_tagging_slab_free_hook for now.
      
      [    3.734349] ------------[ cut here ]------------
      [    3.734807] alloc_tag was not set
      [    3.735129] WARNING: CPU: 4 PID: 40 at ./include/linux/alloc_tag.h:130 kmem_cache_free+0x444/0x574
      [    3.735866] Modules linked in: autofs4
      [    3.736211] CPU: 4 UID: 0 PID: 40 Comm: ksoftirqd/4 Tainted: G        W          6.11.0-rc3-dirty #1
      [    3.736969] Tainted: [W]=WARN
      [    3.737258] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
      [    3.737875] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [    3.738501] pc : kmem_cache_free+0x444/0x574
      [    3.738951] lr : kmem_cache_free+0x444/0x574
      [    3.739361] sp : ffff80008357bb60
      [    3.739693] x29: ffff80008357bb70 x28: 0000000000000000 x27: 0000000000000000
      [    3.740338] x26: ffff80008207f000 x25: ffff000b2eb2fd60 x24: ffff0000c0005700
      [    3.740982] x23: ffff8000804229e4 x22: ffff800082080000 x21: ffff800081756000
      [    3.741630] x20: fffffd7ff8253360 x19: 00000000000000a8 x18: ffffffffffffffff
      [    3.742274] x17: ffff800ab327f000 x16: ffff800083398000 x15: ffff800081756df0
      [    3.742919] x14: 0000000000000000 x13: 205d344320202020 x12: 5b5d373038343337
      [    3.743560] x11: ffff80008357b650 x10: 000000000000005d x9 : 00000000ffffffd0
      [    3.744231] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008237bad0 x6 : c0000000ffff7fff
      [    3.744907] x5 : ffff80008237ba78 x4 : ffff8000820bbad0 x3 : 0000000000000001
      [    3.745580] x2 : 68d66547c09f7800 x1 : 68d66547c09f7800 x0 : 0000000000000000
      [    3.746255] Call trace:
      [    3.746530]  kmem_cache_free+0x444/0x574
      [    3.746931]  mem_pool_free+0x44/0xf4
      [    3.747306]  free_object_rcu+0xc8/0xdc
      [    3.747693]  rcu_do_batch+0x234/0x8a4
      [    3.748075]  rcu_core+0x230/0x3e4
      [    3.748424]  rcu_core_si+0x14/0x1c
      [    3.748780]  handle_softirqs+0x134/0x378
      [    3.749189]  run_ksoftirqd+0x70/0x9c
      [    3.749560]  smpboot_thread_fn+0x148/0x22c
      [    3.749978]  kthread+0x10c/0x118
      [    3.750323]  ret_from_fork+0x10/0x20
      [    3.750696] ---[ end trace 0000000000000000 ]---
      
      Link: https://lkml.kernel.org/r/20240816013336.17505-1-hao.ge@linux.dev
      Fixes: 4b873696 ("mm/slab: add allocation accounting into slab allocation and free paths")
      Signed-off-by: default avatarHao Ge <gehao@kylinos.cn>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ab7ca095
    • Ryusuke Konishi's avatar
      nilfs2: fix state management in error path of log writing function · 6576dd66
      Ryusuke Konishi authored
      After commit a694291a ("nilfs2: separate wait function from
      nilfs_segctor_write") was applied, the log writing function
      nilfs_segctor_do_construct() was able to issue I/O requests continuously
      even if user data blocks were split into multiple logs across segments,
      but two potential flaws were introduced in its error handling.
      
      First, if nilfs_segctor_begin_construction() fails while creating the
      second or subsequent logs, the log writing function returns without
      calling nilfs_segctor_abort_construction(), so the writeback flag set on
      pages/folios will remain uncleared.  This causes page cache operations to
      hang waiting for the writeback flag.  For example,
      truncate_inode_pages_final(), which is called via nilfs_evict_inode() when
      an inode is evicted from memory, will hang.
      
      Second, the NILFS_I_COLLECTED flag set on normal inodes remain uncleared. 
      As a result, if the next log write involves checkpoint creation, that's
      fine, but if a partial log write is performed that does not, inodes with
      NILFS_I_COLLECTED set are erroneously removed from the "sc_dirty_files"
      list, and their data and b-tree blocks may not be written to the device,
      corrupting the block mapping.
      
      Fix these issues by uniformly calling nilfs_segctor_abort_construction()
      on failure of each step in the loop in nilfs_segctor_do_construct(),
      having it clean up logs and segment usages according to progress, and
      correcting the conditions for calling nilfs_redirty_inodes() to ensure
      that the NILFS_I_COLLECTED flag is cleared.
      
      Link: https://lkml.kernel.org/r/20240814101119.4070-1-konishi.ryusuke@gmail.com
      Fixes: a694291a ("nilfs2: separate wait function from nilfs_segctor_write")
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6576dd66
    • Ryusuke Konishi's avatar
      nilfs2: fix missing cleanup on rollforward recovery error · 5787fcaa
      Ryusuke Konishi authored
      In an error injection test of a routine for mount-time recovery, KASAN
      found a use-after-free bug.
      
      It turned out that if data recovery was performed using partial logs
      created by dsync writes, but an error occurred before starting the log
      writer to create a recovered checkpoint, the inodes whose data had been
      recovered were left in the ns_dirty_files list of the nilfs object and
      were not freed.
      
      Fix this issue by cleaning up inodes that have read the recovery data if
      the recovery routine fails midway before the log writer starts.
      
      Link: https://lkml.kernel.org/r/20240810065242.3701-1-konishi.ryusuke@gmail.com
      Fixes: 0f3e1c7f ("nilfs2: recovery functions")
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5787fcaa
    • Ryusuke Konishi's avatar
      nilfs2: protect references to superblock parameters exposed in sysfs · 68340825
      Ryusuke Konishi authored
      The superblock buffers of nilfs2 can not only be overwritten at runtime
      for modifications/repairs, but they are also regularly swapped, replaced
      during resizing, and even abandoned when degrading to one side due to
      backing device issues.  So, accessing them requires mutual exclusion using
      the reader/writer semaphore "nilfs->ns_sem".
      
      Some sysfs attribute show methods read this superblock buffer without the
      necessary mutual exclusion, which can cause problems with pointer
      dereferencing and memory access, so fix it.
      
      Link: https://lkml.kernel.org/r/20240811100320.9913-1-konishi.ryusuke@gmail.com
      Fixes: da7141fb ("nilfs2: add /sys/fs/nilfs2/<device> group")
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      68340825
    • Jann Horn's avatar
      userfaultfd: don't BUG_ON() if khugepaged yanks our page table · 4828d207
      Jann Horn authored
      Since khugepaged was changed to allow retracting page tables in file
      mappings without holding the mmap lock, these BUG_ON()s are wrong - get
      rid of them.
      
      We could also remove the preceding "if (unlikely(...))" block, but then we
      could reach pte_offset_map_lock() with transhuge pages not just for file
      mappings but also for anonymous mappings - which would probably be fine
      but I think is not necessarily expected.
      
      Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-2-5efa61078a41@google.com
      Fixes: 1d65b771 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarQi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4828d207