  1. 02 Sep, 2024 1 commit
  2. 11 Apr, 2024 1 commit
    • mmu_notifier: remove the .change_pte() callback · 997308f9
      Paolo Bonzini authored
       The scope of set_pte_at_notify() has narrowed more and more through the
       years.  Initially, it was meant for when the change to the PTE was
       not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
       that has not been the case for over ten years.  During all this period
       the only implementation of .change_pte() was KVM's, and it
       had no actual functionality, because it was called after
       mmu_notifier_invalidate_range_start() had already zapped the secondary PTE.
      
      Now that this (nonfunctional) user of the .change_pte() callback is
      gone, the whole callback can be removed.  For now, leave in place
      set_pte_at_notify() even though it is just a synonym for set_pte_at().
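       
       As a hedged sketch (not the literal patch), the surviving synonym
       amounts to no more than:
       
       	/* kept for now; notification is gone, so this is plain set_pte_at() */
       	static inline void set_pte_at_notify(struct mm_struct *mm,
       					     unsigned long addr,
       					     pte_t *ptep, pte_t pte)
       	{
       		set_pte_at(mm, addr, ptep, pte);
       	}
       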
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
       Reviewed-by: David Hildenbrand <david@redhat.com>
       Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com>
       Acked-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 18 Aug, 2023 3 commits
  4. 03 Feb, 2023 1 commit
  5. 22 Apr, 2022 1 commit
    • mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() · 31956166
      Alistair Popple authored
      In some cases it is possible for mmu_interval_notifier_remove() to race
      with mn_tree_inv_end() allowing it to return while the notifier data
      structure is still in use.  Consider the following sequence:
      
        CPU0 - mn_tree_inv_end()            CPU1 - mmu_interval_notifier_remove()
        ----------------------------------- ------------------------------------
                                            spin_lock(subscriptions->lock);
                                            seq = subscriptions->invalidate_seq;
        spin_lock(subscriptions->lock);     spin_unlock(subscriptions->lock);
        subscriptions->invalidate_seq++;
                                            wait_event(invalidate_seq != seq);
                                            return;
        interval_tree_remove(interval_sub); kfree(interval_sub);
        spin_unlock(subscriptions->lock);
        wake_up_all();
      
       Because the wait_event() condition is already true, it returns immediately.  This
      can lead to use-after-free type errors if the caller frees the data
      structure containing the interval notifier subscription while it is
      still on a deferred list.  Fix this by taking the appropriate lock when
      reading invalidate_seq to ensure proper synchronisation.
      
       I observed this whilst stress testing during development.
      You do have to be pretty unlucky, but it leads to the usual problems of
      use-after-free (memory corruption, kernel crash, difficult to diagnose
      WARN_ON, etc).
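       
       A hedged sketch of the fixed removal path, reusing the names from the
       sequence above (mn_itree_is_invalidating() is the helper that checks
       whether an invalidation is in progress); the key point is that
       invalidate_seq is sampled under subscriptions->lock, so the read can no
       longer race with the increment in mn_tree_inv_end():
       
       	spin_lock(&subscriptions->lock);
       	if (mn_itree_is_invalidating(subscriptions)) {
       		/* on the deferred list: sample the sequence under the lock */
       		seq = subscriptions->invalidate_seq;
       	} else {
       		interval_tree_remove(&interval_sub->interval_tree,
       				     &subscriptions->itree);
       	}
       	spin_unlock(&subscriptions->lock);
       
       	/* cannot return before mn_tree_inv_end() bumps the sequence */
       	if (seq)
       		wait_event(subscriptions->wq,
       			   READ_ONCE(subscriptions->invalidate_seq) != seq);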
      
      Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com
      Fixes: 99cb252f ("mm/mmu_notifier: add an interval tree notifier")
       Signed-off-by: Alistair Popple <apopple@nvidia.com>
       Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
       Cc: Christian König <christian.koenig@amd.com>
       Cc: John Hubbard <jhubbard@nvidia.com>
       Cc: Ralph Campbell <rcampbell@nvidia.com>
       Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 25 Mar, 2021 1 commit
    • mm/mmu_notifiers: ensure range_end() is paired with range_start() · c2655835
      Sean Christopherson authored
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to either track which notifiers rejected _start(),
      or had already succeeded prior to a failed _start().
      
       Note, the existing behavior of calling _start() on all notifiers even
       after a previous notifier failed _start() was an unintended "feature".
       Make it canonical now that correctness depends on this behavior.
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
         4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
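       
       A hedged sketch of the resulting unwind logic in
       __mmu_notifier_invalidate_range_start() (helper names are close to, but
       not guaranteed to match, the file at this point in history):
       
       	hlist_for_each_entry_rcu(sub, &subscriptions->list, hlist) {
       		if (sub->ops->invalidate_range_start) {
       			int _ret = sub->ops->invalidate_range_start(sub, range);
       			if (_ret) {
       				WARN_ON(mmu_notifier_range_blockable(range) ||
       					_ret != -EAGAIN);
       				ret = _ret;
       			}
       		}
       	}
       	if (ret)
       		/* keep everyone balanced: pair _end() for all notifiers */
       		mn_hlist_invalidate_end(subscriptions, range, false);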
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
       Signed-off-by: Sean Christopherson <seanjc@google.com>
       Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
       Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
       Cc: David Rientjes <rientjes@google.com>
       Cc: Ben Gardon <bgardon@google.com>
       Cc: Michal Hocko <mhocko@suse.com>
       Cc: "Jérôme Glisse" <jglisse@redhat.com>
       Cc: Andrea Arcangeli <aarcange@redhat.com>
       Cc: Johannes Weiner <hannes@cmpxchg.org>
       Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 15 Dec, 2020 1 commit
    • mm: track mmu notifiers in fs_reclaim_acquire/release · f920e413
      Daniel Vetter authored
      fs_reclaim_acquire/release nicely catch recursion issues when allocating
      GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
      the excessive caches in check).  For mmu notifier recursions we do have
      lockdep annotations since 23b68395 ("mm/mmu_notifiers: add a lockdep
      map for invalidate_range_start/end").
      
       But these only fire if a path actually results in some pte invalidation -
       for most small allocations that's very rarely the case.  The other trouble
       is that pte invalidation can happen any time when __GFP_RECLAIM is set.
       This means that really only GFP_ATOMIC is a safe choice; GFP_NOIO isn't
       good enough to avoid potential mmu notifier recursion.
      
       I was pondering whether we should just do the general annotation, but
       there's always the risk of false positives.  Plus I'm assuming that the
       core fs and io code is a lot better reviewed and tested than random mmu
       notifier code in drivers.  Hence why I decided to only annotate for that
       specific case.
      
       Furthermore even if we'd create a lockdep map for direct reclaim, we'd
       still need to explicitly pull in the mmu notifier map - there are a lot
       more places that do pte invalidation than just direct reclaim, and these
       two contexts aren't the same.
      
       Note that the mmu notifiers needing their own independent lockdep map is
       also the reason we can't hold them from fs_reclaim_acquire to
       fs_reclaim_release - it would nest with the acquisition in the pte
       invalidation code, causing a lockdep splat.  And we can't remove the
       annotations from pte invalidation and all the other places since they're
       called from many more places than just page reclaim.  Hence we can only
       do the equivalent of might_lock, but on the raw lockdep map.
      
      With this we can also remove the lockdep priming added in 66204f1d
      ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
      more powerful.
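       
       A hedged sketch of the annotation, close in shape to what the patch adds
       to fs_reclaim_acquire(): a might_lock-style acquire/release pair on the
       raw mmu notifier lockdep map whenever the allocation could enter
       reclaim (treat the helper names as approximate):
       
       	void fs_reclaim_acquire(gfp_t gfp_mask)
       	{
       		gfp_mask = current_gfp_context(gfp_mask);
       		if (__need_reclaim(gfp_mask)) {
       			if (gfp_mask & __GFP_FS)
       				__fs_reclaim_acquire();
       	#ifdef CONFIG_MMU_NOTIFIER
       			/* "might take" the invalidate_range_start map */
       			lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
       			lock_map_release(&__mmu_notifier_invalidate_range_start_map);
       	#endif
       		}
       	}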
      
       Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.ch
       Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
       Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 16 Oct, 2020 1 commit
  9. 12 Aug, 2020 1 commit
  10. 09 Jun, 2020 4 commits
  11. 22 Mar, 2020 1 commit
    • mm/mmu_notifier: silence PROVE_RCU_LIST warnings · 63886bad
      Qian Cai authored
      It is safe to traverse mm->notifier_subscriptions->list either under
      SRCU read lock or mm->notifier_subscriptions->lock using
      hlist_for_each_entry_rcu().  Silence the PROVE_RCU_LIST false positives,
      for example,
      
        WARNING: suspicious RCU usage
        -----------------------------
        mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        3 locks held by libvirtd/802:
         #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
         #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
         #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460
      
        stack backtrace:
        CPU: 7 PID: 802 Comm: libvirtd Tainted: G          I       5.6.0-rc6-next-20200317+ #2
        Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
        Call Trace:
          dump_stack+0xa4/0xfe
          lockdep_rcu_suspicious+0xeb/0xf5
          __mmu_notifier_invalidate_range_start+0x3ff/0x460
          change_p4d_range+0x746/0x800
          change_protection+0x1df/0x300
          mprotect_fixup+0x245/0x3e0
          do_mprotect_pkey+0x23b/0x3e0
          __x64_sys_mprotect+0x51/0x70
          do_syscall_64+0x91/0xae8
          entry_SYSCALL_64_after_hwframe+0x49/0xb3
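       
       A hedged sketch of the silencing: pass the lockdep condition documenting
       the two legal locking contexts as the optional fourth argument of
       hlist_for_each_entry_rcu() (the srcu instance assumed here is the
       file-local one):
       
       	hlist_for_each_entry_rcu(subscription,
       				 &mm->notifier_subscriptions->list, hlist,
       				 srcu_read_lock_held(&srcu) ||
       				 lockdep_is_held(&mm->notifier_subscriptions->lock)) {
       		...
       	}
       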
       Signed-off-by: Qian Cai <cai@lca.pw>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
       Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
       Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 14 Jan, 2020 3 commits
  13. 23 Nov, 2019 1 commit
    • mm/mmu_notifier: add an interval tree notifier · 99cb252f
      Jason Gunthorpe authored
       Of the 13 users of mmu_notifiers, 8 use only
       invalidate_range_start/end() and immediately intersect the
       mmu_notifier_range with some kind of internal list of VAs: 4 use an
       interval tree (i915_gem, radeon_mn, umem_odp, hfi1) and 4 use a linked
       list of some kind (scif_dma, vhost, gntdev, hmm).
       
       The remaining 5 either don't use invalidate_range_start() or do
       something special with it.
      
      It turns out that building a correct scheme with an interval tree is
      pretty complicated, particularly if the use case is synchronizing against
      another thread doing get_user_pages().  Many of these implementations have
      various subtle and difficult to fix races.
      
      This approach puts the interval tree as common code at the top of the mmu
      notifier call tree and implements a shareable locking scheme.
      
      It includes:
       - An interval tree tracking VA ranges, with per-range callbacks
       - A read/write locking scheme for the interval tree that avoids
         sleeping in the notifier path (for OOM killer)
       - A sequence counter based collision-retry locking scheme to tell
         device page fault that a VA range is being concurrently invalidated.
      
      This is based on various ideas:
      - hmm accumulates invalidated VA ranges and releases them when all
        invalidates are done, via active_invalidate_ranges count.
        This approach avoids having to intersect the interval tree twice (as
        umem_odp does) at the potential cost of a longer device page fault.
      
      - kvm/umem_odp use a sequence counter to drive the collision retry,
        via invalidate_seq
      
      - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
        This makes adding/removing interval tree members more deterministic
      
      - seqlock, except this version makes the seqlock idea multi-holder on the
        write side by protecting it with active_invalidate_ranges and a spinlock
      
      To minimize MM overhead when only the interval tree is being used, the
      entire SRCU and hlist overheads are dropped using some simple
      branches. Similarly the interval tree overhead is dropped when in hlist
      mode.
      
      The overhead from the mandatory spinlock is broadly the same as most of
      existing users which already had a lock (or two) of some sort on the
      invalidation path.
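       
       A hedged sketch of the consumer pattern this adds (the driver names are
       hypothetical; the mmu_interval_* calls are the new API):
       
       	static bool my_invalidate(struct mmu_interval_notifier *mni,
       				  const struct mmu_notifier_range *range,
       				  unsigned long cur_seq)
       	{
       		/* bump the sequence so concurrent faulters retry */
       		mmu_interval_set_seq(mni, cur_seq);
       		/* ... zap device PTEs for range->start..range->end ... */
       		return true;
       	}
       
       	/* device fault path: collision-retry against invalidation */
       	again:
       		seq = mmu_interval_read_begin(&mni);
       		/* ... get_user_pages() and program device PTEs ... */
       		if (mmu_interval_read_retry(&mni, seq))
       			goto again;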
      
       Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
       Acked-by: Christian König <christian.koenig@amd.com>
       Tested-by: Philip Yang <Philip.Yang@amd.com>
       Tested-by: Ralph Campbell <rcampbell@nvidia.com>
       Reviewed-by: John Hubbard <jhubbard@nvidia.com>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  14. 13 Nov, 2019 1 commit
  15. 06 Nov, 2019 1 commit
  16. 07 Sep, 2019 3 commits
  17. 28 Aug, 2019 1 commit
  18. 21 Aug, 2019 1 commit
  19. 20 Aug, 2019 1 commit
    • mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail · 8402ce61
      Daniel Vetter authored
      Just a bit of paranoia, since if we start pushing this deep into
      callchains it's hard to spot all places where an mmu notifier
      implementation might fail when it's not allowed to.
      
       Inspired by some confusion we had discussing i915 mmu notifiers and
       whether we could use the newly-introduced return value to handle some
       corner cases - until we realized that the return value is only meant
       for tasks killed by the oom reaper.
      
      An alternative approach would be to split the callback into two versions,
      one with the int return value, and the other with void return value like
      in older kernels. But that's a lot more churn for fairly little gain I
      think.
      
      Summary from the m-l discussion on why we want something at warning level:
      This allows automated tooling in CI to catch bugs without humans having to
      look at everything. If we just upgrade the existing pr_info to a pr_warn,
      then we'll have false positives. And as-is, no one will ever spot the
      problem since it's lost in the massive amounts of overall dmesg noise.
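       
       A hedged sketch of the check, placed where the _start() callbacks are
       invoked (a failure is only legitimate on the non-blockable oom path,
       and only as -EAGAIN):
       
       	int _ret = subscription->ops->invalidate_range_start(subscription, range);
       	if (_ret) {
       		WARN_ON(mmu_notifier_range_blockable(range) ||
       			_ret != -EAGAIN);
       		ret = _ret;
       	}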
      
       Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
       Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
       Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
       Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  20. 16 Aug, 2019 3 commits
  21. 12 Jul, 2019 1 commit
    • mm/mmu_notifier: use hlist_add_head_rcu() · 543bdb2d
      Jean-Philippe Brucker authored
      Make mmu_notifier_register() safer by issuing a memory barrier before
      registering a new notifier.  This fixes a theoretical bug on weakly
      ordered CPUs.  For example, take this simplified use of notifiers by a
      driver:
      
      	my_struct->mn.ops = &my_ops; /* (1) */
      	mmu_notifier_register(&my_struct->mn, mm)
      		...
      		hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
      		...
      
      Once mmu_notifier_register() releases the mm locks, another thread can
      invalidate a range:
      
      	mmu_notifier_invalidate_range()
      		...
      		hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
      			if (mn->ops->invalidate_range)
      
      The read side relies on the data dependency between mn and ops to ensure
      that the pointer is properly initialized.  But the write side doesn't have
      any dependency between (1) and (2), so they could be reordered and the
      readers could dereference an invalid mn->ops.  mmu_notifier_register()
       does take all the mm locks before adding to the hlist, but those have
       acquire semantics, which is not sufficient.
      
       By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
       hlist using a store-release, ensuring that readers see prior
       initialization of my_struct.  This situation is better illustrated by the
       litmus test MP+onceassign+derefonce.
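       
       The fix itself is one line; a sketch using the simplified names from the
       example above:
       
       	my_struct->mn.ops = &my_ops;                         /* (1) */
       	...
       	/* store-release publishes (1) before the list add (2) */
       	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifiers);  /* (2) */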
      
      Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
      Fixes: cddb8a5c ("mmu-notifiers: core")
       Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
       Cc: Jérôme Glisse <jglisse@redhat.com>
       Cc: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 19 Jun, 2019 1 commit
  23. 14 May, 2019 2 commits
  24. 28 Dec, 2018 3 commits
    • mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 · ac46d4f3
      Jérôme Glisse authored
       To avoid having to change many call sites every time we want to add a
       parameter, use a structure to group all parameters for the mmu_notifier
       invalidate_range_start/end calls.  No functional changes with this patch.
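       
       A hedged sketch of the call-site pattern after this change (the init
       arguments grew in later patches, so treat the exact signature as
       approximate):
       
       	struct mmu_notifier_range range;
       
       	/* one struct replaces the repeated (mm, start, end) argument lists */
       	mmu_notifier_range_init(&range, mm, start, end);
       	mmu_notifier_invalidate_range_start(&range);
       	/* ... update the page tables ... */
       	mmu_notifier_invalidate_range_end(&range);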
      
      [akpm@linux-foundation.org: coding style fixes]
       Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
       Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
       Acked-by: Christian König <christian.koenig@amd.com>
       Acked-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      From: Jérôme Glisse <jglisse@redhat.com>
      Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
      
      fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
      
       Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
       Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse authored
      Patch series "mmu notifier contextual informations", v2.
      
       This patchset adds contextual information (why an invalidation is
       happening) to the mmu notifier callbacks.  This is necessary for users of
       mmu notifiers that wish to maintain their own data structures without
       having to add new fields to struct vm_area_struct (vma).
      
       For instance, a device can have its own page table that mirrors the
       process address space.  When a vma is unmapped (munmap() syscall) the
       device driver can free the device page table for the range.
      
       Today we do not have any information on why an mmu notifier callback is
       happening, so device drivers have to assume that it is always a
       munmap().  This is inefficient, as it means the driver needs to
       re-allocate the device page table on the next page fault and rebuild the
       whole device driver data structure for the range.
      
       Other use cases besides munmap() also exist; for instance, it is
       pointless for a device driver to invalidate the device page table when
       the invalidation is only for soft-dirty tracking, and a driver can skip
       work for an mprotect() that merely changes the page table access
       permissions for the range.
      
       This patchset enables all of these optimizations for device drivers.  I
       do not include any of them in this series, but another patchset I am
       posting will leverage this.
      
      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it is
      easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).
      
      This patch (of 3):
      
       To avoid having to change many callback definitions every time we want
       to add a parameter, use a structure to group all parameters for the
       mmu_notifier invalidate_range_start/end callback.  No functional changes
       with this patch.
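       
       The contextual information itself lands in the last patch of the series;
       a hedged sketch of how a driver callback can use it once it exists (the
       event value is from the eventual enum mmu_notifier_event, the driver
       function is hypothetical):
       
       	static int my_invalidate_range_start(struct mmu_notifier *mn,
       				const struct mmu_notifier_range *range)
       	{
       		/* soft-dirty tracking doesn't change what the device may access */
       		if (range->event == MMU_NOTIFY_SOFT_DIRTY)
       			return 0;
       		/* ... tear down device mappings for range->start..range->end ... */
       		return 0;
       	}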
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
       Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
       Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
       Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier.c: remove mmu_notifier_synchronize() · 6a90a83f
      Sean Christopherson authored
       Contrary to its name, mmu_notifier_synchronize() does not synchronize the
       notifier's SRCU instance, but rather waits for RCU callbacks to finish;
       i.e., it invokes rcu_barrier().  The RCU documentation is quite clear on
       this matter, explicitly calling out that rcu_barrier() does not imply
       synchronize_rcu().
      
      As there are no callers of mmu_notifier_synchronize() and it's unclear
      whether any user of mmu_notifier_call_srcu() will ever want to barrier on
      their callbacks, simply remove the function.
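       
       For reference, the removed helper amounted to (a sketch of the shape,
       not the verbatim source):
       
       	void mmu_notifier_synchronize(void)
       	{
       		/* waits for queued RCU callbacks, not for SRCU readers */
       		rcu_barrier();
       	}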
      
       Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
       Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
       Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 26 Oct, 2018 1 commit
  26. 22 Aug, 2018 1 commit
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko authored
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
       We can do much better though.  Even if mmu notifiers use sleepable locks,
       there is no reason to automatically assume those locks are held.
       Moreover, the majority of notifiers only care about a portion of the
       address space, and there is absolutely no reason to fail when we are
       unmapping an unrelated range.  Some notifiers do really block and wait
       for HW, which is harder to handle, and there we have to bail out.
      
       This patch handles the low hanging fruit.
       __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
       are not allowed to sleep if the flag is set to false.  This is achieved by
       using trylock instead of the sleepable lock for most callbacks and
       continuing as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
       The simplest way for driver developers to test this code path is to wrap
       userspace code which uses these notifiers into a memcg and set the hard
       limit to hit the oom.  This can be done e.g. by having the test fault in
       all the mmu notifier managed memory and then setting the hard limit to
       something really small.  Then we are looking for a proper process tear
       down.
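       
       A hedged sketch of the resulting driver-side pattern (the lock name is
       hypothetical; the bool blockable parameter is what this patch adds to
       invalidate_range_start):
       
       	static int my_invalidate_range_start(struct mmu_notifier *mn,
       				struct mm_struct *mm,
       				unsigned long start, unsigned long end,
       				bool blockable)
       	{
       		if (blockable)
       			mutex_lock(&my_lock);
       		else if (!mutex_trylock(&my_lock))
       			return -EAGAIN;	/* let the oom_reaper retry */
       		/* ... invalidate mappings for start..end ... */
       		mutex_unlock(&my_lock);
       		return 0;
       	}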
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
       Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
       Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
       Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>