1. 29 Jun, 2021 3 commits
    • mm/page_alloc: correct return value of populated elements if bulk array is populated · ff4b2b40
      Mel Gorman authored
      Dave Jones reported the following
      
      	This made it into 5.13 final, and completely breaks NFSD for me
      	(Serving tcp v3 mounts).  Existing mounts on clients hang, as do
      	new mounts from new clients.  Rebooting the server back to rc7
      	everything recovers.
      
      Commit b3b64ebd ("mm/page_alloc: do bulk array bounds check after
      checking populated elements") returns the wrong value if the array is
      already fully populated, and that value is interpreted as an
      allocation failure.  Dave reported that this patch fixes his problem,
      and it also passed a test running dbench over NFS.
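
      A minimal sketch of the return-value contract, in plain C with
      illustrative names (this is not the kernel's __alloc_pages_bulk(),
      just the shape of the bug):

      	#include <stddef.h>

      	/*
      	 * A bulk allocator that fills a caller-supplied array must report
      	 * the number of populated elements even when the array is already
      	 * full on entry; returning 0 there reads as "allocation failed"
      	 * to callers such as NFSD.
      	 */
      	static int bulk_alloc_sketch(void **array, int nr_pages)
      	{
      		int nr_populated = 0;

      		/* Skip over elements the caller already populated. */
      		while (nr_populated < nr_pages && array[nr_populated])
      			nr_populated++;

      		/* The broken path returned 0 here. */
      		if (nr_populated == nr_pages)
      			return nr_populated;

      		/* ... allocate into array[nr_populated .. nr_pages - 1] ... */
      		return nr_populated;
      	}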
      
      Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net
      Fixes: b3b64ebd ("mm/page_alloc: do bulk array bounds check after checking populated elements")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Tested-by: Dave Jones <davej@codemonkey.org.uk>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org> [5.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: fix memory map initialization for descending nodes · 122e093c
      Mike Rapoport authored
      On systems with memory nodes sorted in descending order, for instance
      the Dell Precision WorkStation T5500, the struct pages for higher PFNs
      (and hence lower nodes) could be overwritten by the initialization of
      struct pages corresponding to holes in the memory sections.
      
      For example, given the memory layout below,
      
      [    0.245624] Early memory node ranges
      [    0.248496]   node   1: [mem 0x0000000000001000-0x0000000000090fff]
      [    0.251376]   node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
      [    0.254256]   node   1: [mem 0x0000000100000000-0x0000001423ffffff]
      [    0.257144]   node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      the range 0x1424000000 - 0x1428000000 at the beginning of node 0 starts
      in the middle of a section and is treated as a hole during the
      initialization of the last section in node 1.
      
      The wrong initialization of the memory map causes a panic on boot when
      CONFIG_DEBUG_VM is enabled.
      
      Reorder the loops of the memory map initialization so that the outer
      loop always iterates over populated memory regions in ascending order
      and the inner loop selects the zone corresponding to the PFN range.

      This way, initialization of the struct pages for memory holes is
      always done for ranges that are actually not populated.
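
      A runnable plain-C sketch of the reordered traversal, using the PFN
      ranges from the layout above (the two-zone split at the 4G PFN and
      all names are illustrative assumptions, not the free_area_init()
      code):

      	#include <stdio.h>

      	/* PFN ranges from the layout above, in ascending order. */
      	struct range { unsigned long start, end; int nid; };
      	static const struct range ranges[] = {
      		{ 0x0000001, 0x0000091, 1 },
      		{ 0x0000100, 0x00dbdf9, 1 },
      		{ 0x0100000, 0x1424000, 1 },
      		{ 0x1424000, 0x2024000, 0 },
      	};

      	/* Assumed zone split: zone 0 below the 4G PFN, zone 1 above. */
      	static const unsigned long zone_end[] = { 0x100000, ~0UL };

      	int main(void)
      	{
      		unsigned long hole = 0;	/* first PFN not yet initialized */

      		/* Outer loop: populated regions in ascending PFN order. */
      		for (unsigned int i = 0; i < 4; i++) {
      			/* Inner loop: clamp the region to each zone. */
      			for (unsigned int z = 0; z < 2; z++) {
      				unsigned long zs = z ? zone_end[z - 1] : 0;
      				unsigned long s = ranges[i].start > zs ?
      						  ranges[i].start : zs;
      				unsigned long e = ranges[i].end < zone_end[z] ?
      						  ranges[i].end : zone_end[z];

      				if (s >= e)
      					continue;
      				/*
      				 * The hole [hole, s) always precedes this
      				 * span, so initializing it can no longer
      				 * clobber struct pages of an already
      				 * initialized (lower-PFN) node.
      				 */
      				printf("node %d zone %u: [%#lx-%#lx), hole [%#lx-%#lx)\n",
      				       ranges[i].nid, z, s, e, hole, s);
      				hole = e;
      			}
      		}
      		return 0;
      	}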
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
      Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
      Fixes: 0740a50b ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Boris Petkov <bp@alien8.de>
      Cc: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix try_grab_compound_head() race with split_huge_page() · c24d3732
      Jann Horn authored
      try_grab_compound_head() is used by get_user_pages_fast() to grab a
      reference to a page.  That fast path is only protected against
      concurrent freeing of page tables (via local_irq_save()), but not
      against concurrent TLB flushes, freeing of data pages, or splitting of
      compound pages.
      
      Because no reference is held to the page when try_grab_compound_head() is
      called, the page may have been freed and reallocated by the time its
      refcount has been elevated; therefore, once we're holding a stable
      reference to the page, the caller re-checks whether the PTE still points
      to the same page (with the same access rights).
      
      The problem is that try_grab_compound_head() has to grab a reference on
      the head page; but between the time we look up what the head page is and
      the time we actually grab a reference on the head page, the compound page
      may have been split up (either explicitly through split_huge_page() or by
      freeing the compound page to the buddy allocator and then allocating its
      individual order-0 pages).  If that happens, get_user_pages_fast() may end
      up returning the right page but lifting the refcount on a now-unrelated
      page, leading to use-after-free of pages.
      
      To fix it: Re-check whether the pages still belong together after lifting
      the refcount on the head page.  Move anything else that checks
      compound_head(page) below the refcount increment.
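
      A plain-C sketch of that recheck, with C11 atomics standing in for
      the kernel's page refcounting (the types and names here are
      illustrative, not the gup.c code; the kernel additionally refuses
      the refcount bump if the count is already zero):

      	#include <stdatomic.h>
      	#include <stddef.h>

      	/* Illustrative stand-in for struct page. */
      	struct fake_page {
      		/* points to itself for a head page */
      		_Atomic(struct fake_page *) head;
      		atomic_int refcount;
      	};

      	static struct fake_page *
      	try_grab_head_sketch(struct fake_page *page, int refs)
      	{
      		struct fake_page *head = atomic_load(&page->head);

      		/* Speculatively take refs on what we think is the head. */
      		atomic_fetch_add(&head->refcount, refs);

      		/*
      		 * The compound page may have been split between the
      		 * lookup and the refcount bump; re-check now that 'head'
      		 * is stable, and back out if the pages no longer belong
      		 * together.
      		 */
      		if (atomic_load(&page->head) != head) {
      			atomic_fetch_sub(&head->refcount, refs);
      			return NULL;
      		}
      		return head;
      	}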
      
      This can't actually happen on bare-metal x86 (because there, disabling
      IRQs locks out remote TLB flushes), but it can happen on virtualized x86
      (e.g.  under KVM) and probably also on arm64.  The race window is pretty
      narrow, and constantly allocating and shattering hugepages isn't exactly
      fast; for now I've only managed to reproduce this in an x86 KVM guest with
      an artificially widened timing window (by adding a loop that repeatedly
      calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
      that PV TLB flushes are used instead of IPIs).
      
      As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
      a warning and bailout.  Since the existing code only performed the BUG_ON
      check on DEBUG_VM kernels, ensure that the new code also only performs the
      check under that configuration - I don't want to mix two logically
      separate changes together too much.  The macro VM_WARN_ON_ONCE_PAGE()
      doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
      block.  An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
      definition for !DEBUG_VM such that it always returns false, but since that
      would differ from the behavior of the normal WARN macros, it might be too
      confusing for readers.
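
      Sketched, the resulting shape is roughly as follows (the condition
      shown is illustrative, not the exact gup.c hunk):

      	#ifdef CONFIG_DEBUG_VM
      		/*
      		 * VM_WARN_ON_ONCE_PAGE() yields no usable value on
      		 * !CONFIG_DEBUG_VM builds, so the bailout must live
      		 * inside the #ifdef rather than in an if () condition.
      		 */
      		if (unlikely(page_ref_count(page) <= 0)) {
      			VM_WARN_ON_ONCE_PAGE(1, page);
      			return NULL;
      		}
      	#endif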
      
      Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
      Fixes: 7aef4172 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 27 Jun, 2021 2 commits
    • Linux 5.13 · 62fb9874
      Linus Torvalds authored
    • Revert "signal: Allow tasks to cache one sigqueue struct" · b4b27b9e
      Linus Torvalds authored
      This reverts commits 4bad58eb (and
      399f8dd9, which tried to fix it).
      
      I do not believe these are correct, and I'm about to release 5.13, so am
      reverting them out of an abundance of caution.
      
      The locking is odd, and appears broken.
      
      On the allocation side (in __sigqueue_alloc()), the locking is somewhat
      straightforward: it depends on sighand->siglock.  Since one caller
      doesn't hold that lock, the code additionally tests 'sigqueue_flags' to
      avoid the case with no locks held.
      
      On the freeing side (in sigqueue_cache_or_free()), there is no locking
      at all, and the logic instead depends on 'current' being a single
      thread, and not able to race with itself.
      
      To make things more exciting, there's also the data race between freeing
      a signal and allocating one, which is handled by using WRITE_ONCE() and
      READ_ONCE(), and by being mutually exclusive with respect to the initial
      state (i.e. freeing will only free if the old state was NULL, while
      allocating will obviously only use the value if it was non-NULL, so only
      one or the other will actually act on the value).
      
      However, while the free->alloc paths do seem mutually exclusive thanks
      to just the data value dependency, it's not clear what the memory
      ordering constraints are on it.  Could writes from the previous
      allocation possibly be delayed and seen by the new allocation later,
      causing logical inconsistencies?
      
      So it's all very exciting and unusual.
      
      And in particular, it seems that the freeing side is incorrect in
      depending on "current" being single-threaded.  Yes, 'current' is a
      single thread, but in the presence of asynchronous events even a single
      thread can have data races.
      
      And such asynchronous events can and do happen, with interrupts causing
      signals to be flushed and thus freed (for example, sending a
      SIGCONT/SIGSTOP can happen from interrupt context, and can flush
      previously queued process control signals).
      
      So regardless of all the other questions about the memory ordering and
      locking for this new cached allocation, the sigqueue_cache_or_free()
      assumptions seem to be fundamentally incorrect.
      
      It may be that people will show me the errors of my ways, and tell me
      why this is all safe after all.  We can reinstate it if so.  But my
      current belief is that the WRITE_ONCE() that sets the cached entry needs
      to be a smp_store_release(), and the READ_ONCE() that finds a cached
      entry needs to be a smp_load_acquire() to handle memory ordering
      correctly.
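
      In C11-atomics terms, the pairing described above would look roughly
      like this (a sketch with stand-ins for smp_store_release() and
      smp_load_acquire(); the sigqueue_cache field name follows the
      reverted patch, everything else is illustrative):

      	#include <stdatomic.h>
      	#include <stddef.h>

      	struct sigqueue;	/* opaque here */

      	struct task_sketch {
      		_Atomic(struct sigqueue *) sigqueue_cache;
      	};

      	/*
      	 * Free side: publish the entry with release semantics so all
      	 * prior writes to *q are visible to whoever later loads it.
      	 */
      	static void cache_put_sketch(struct task_sketch *t,
      				     struct sigqueue *q)
      	{
      		atomic_store_explicit(&t->sigqueue_cache, q,
      				      memory_order_release);
      	}

      	/*
      	 * Alloc side: the acquire pairs with the release above, so the
      	 * cached entry's contents cannot be observed stale.
      	 */
      	static struct sigqueue *cache_get_sketch(struct task_sketch *t)
      	{
      		return atomic_load_explicit(&t->sigqueue_cache,
      					    memory_order_acquire);
      	}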
      
      And the sequence in sigqueue_cache_or_free() would need to either use a
      lock or at least be interrupt-safe some way (perhaps by using something
      like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
      percpu operations it needs to be interrupt-safe).
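
      Continuing the sketch above, one hedged reading of that suggestion:
      make the NULL test and the store a single atomic step, so an
      interrupt that frees a sigqueue between the two can no longer
      clobber the slot (a full compare-exchange is shown; Linus notes a
      cheaper, merely interrupt-safe local variant would suffice, since
      only 'current' touches the field):

      	/* Hypothetical stand-in for kmem_cache_free(sigqueue_cachep, q). */
      	static void really_free_sketch(struct sigqueue *q) { (void)q; }

      	static void cache_or_free_sketch(struct task_sketch *t,
      					 struct sigqueue *q)
      	{
      		struct sigqueue *expected = NULL;

      		/* Install q only if the slot is empty; otherwise free it. */
      		if (!atomic_compare_exchange_strong_explicit(
      				&t->sigqueue_cache, &expected, q,
      				memory_order_release, memory_order_relaxed))
      			really_free_sketch(q);
      	}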
      
      Fixes: 399f8dd9 ("signal: Prevent sigqueue caching after task got released")
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>