1. 14 Apr, 2015 1 commit
  2. 07 Aug, 2014 1 commit
    • Vlastimil Babka's avatar
      mm: try_to_unmap_cluster() should lock_page() before mlocking · 21dbd45f
      Vlastimil Babka authored
      commit 57e68e9c
      
       upstream.
      
      A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
      fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
      the pages other than its check_page parameter (which is already locked).
      
      The BUG_ON in mlock_vma_page() is not documented and its purpose is
      somewhat unclear, but apparently it serializes against page migration,
      which could otherwise fail to transfer the PG_mlocked flag.  This would
      not be fatal, as the page would be eventually encountered again, but
      NR_MLOCK accounting would become distorted nevertheless.  This patch adds
      a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
      effect.
      
      The call site try_to_unmap_cluster() is fixed so that for page !=
      check_page, trylock_page() is attempted (to avoid possible deadlocks as we
      already have check_page locked) and mlock_vma_page() is performed only
      upon success.  If the page lock cannot be obtained, the page is left
      without PG_mlocked, which is again not a problem in the whole unevictable
      memory design.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Yijing Wang <wangyijing@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21dbd45f
  3. 01 Jul, 2014 1 commit
  4. 11 Jun, 2014 1 commit
  5. 08 Jan, 2014 1 commit
  6. 31 Oct, 2012 1 commit
    • Jan Kara's avatar
      mm: fix XFS oops due to dirty pages without buffers on s390 · 71a36b53
      Jan Kara authored
      commit ef5d437f
      
       upstream.
      
      On s390 any write to a page (even from kernel itself) sets architecture
      specific page dirty bit.  Thus when a page is written to via buffered
      write, HW dirty bit gets set and when we later map and unmap the page,
      page_remove_rmap() finds the dirty bit and calls set_page_dirty().
      
      Dirtying of a page which shouldn't be dirty can cause all sorts of
      problems to filesystems.  The bug we observed in practice is that
      buffers from the page get freed, so when the page gets later marked as
      dirty and writeback writes it, XFS crashes due to an assertion
      BUG_ON(!PagePrivate(page)) in page_buffers() called from
      xfs_count_page_state().
      
      Similar problem can also happen when zero_user_segment() call from
      xfs_vm_writepage() (or block_write_full_page() for that matter) set the
      hardware dirty bit during writeback, later buffers get freed, and then
      page unmapped.
      
      Fix the issue by ignoring s390 HW dirty bit for page cache pages of
      mappings with mapping_cap_account_dirty().  This is safe because for
      such mappings when a page gets marked as writeable in PTE it is also
      marked dirty in do_wp_page() or do_page_fault().  When the dirty bit is
      cleared by clear_page_dirty_for_io(), the page gets writeprotected in
      page_mkclean().  So pagecache page is writeable if and only if it is
      dirty.
      
      Thanks to Hugh Dickins for pointing out mapping has to have
      mapping_cap_account_dirty() for things to work and proposing a cleaned
      up variant of the patch.
      
      The patch has survived about two hours of running fsx-linux on tmpfs
      while heavily swapping and several days of running on out build machines
      where the original problem was triggered.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      71a36b53
  7. 22 Mar, 2012 3 commits
  8. 13 Jan, 2012 1 commit
  9. 11 Jan, 2012 1 commit
    • Andrea Arcangeli's avatar
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli authored
      
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lots of parents or childs
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed special care is needed in the error path where
      move_page_tables goes in the reverse direction, a second
      anon_vma_moveto_tail() call is needed in the error path.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarNai Xia <nai.xia@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      948f017b
  10. 01 Nov, 2011 1 commit
  11. 31 Oct, 2011 1 commit
  12. 24 Jul, 2011 1 commit
    • Martin Schwidefsky's avatar
      [S390] reference bit testing for unmapped pages · 50a15981
      Martin Schwidefsky authored
      
      On x86 a page without a mapper is by definition not referenced / old.
      The s390 architecture keeps the reference bit in the storage key and
      the current code will check the storage key for page without a mapper.
      This leads to an interesting effect: the first time an s390 system
      needs to write pages to swap it only finds referenced pages. This
      causes a lot of pages to get added and written to the swap device.
      To avoid this behaviour change page_referenced to query the storage
      key only if there is a mapper of the page.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      50a15981
  13. 21 Jul, 2011 1 commit
    • Christoph Hellwig's avatar
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig authored
      
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and it's write side is always mirrored by
      real exclusion.  It's intended use it to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall way
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches non-zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      bd5fe6c5
  14. 28 Jun, 2011 1 commit
  15. 18 Jun, 2011 3 commits
    • Linus Torvalds's avatar
      mm: avoid anon_vma_chain allocation under anon_vma lock · dd34739c
      Linus Torvalds authored
      
      Hugh Dickins points out that lockdep (correctly) spots a potential
      deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
      of anon_vma_chain while doing anon_vma_clone().  The problem is that
      page reclaim will want to take the anon_vma lock of any anonymous pages
      that it will try to reclaim.
      
      So re-organize the code in anon_vma_clone() slightly: first do just a
      GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
      let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
      
      End result: not only do we avoid the locking problem, this also ends up
      getting better concurrency in case the allocation does need to block.
      Tim Chen reports that with all these anon_vma locking tweaks, we're now
      almost back up to the spinlock performance.
      Reported-and-tested-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd34739c
    • Peter Zijlstra's avatar
      mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas() · eee2acba
      Peter Zijlstra authored
      
      This matches the anon_vma_clone() case, and uses the same lock helper
      functions.  Because of the need to potentially release the anon_vma's,
      it's a bit more complex, though.
      
      We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
      the anon_vma lock (with the helper function that only takes the lock
      once for the whole loop), and removes any entries that don't need any
      more processing.
      
      The second phase just traverses the remaining list entries (without
      holding the anon_vma lock), and does any actual freeing of the
      anon_vma's that is required.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eee2acba
    • Linus Torvalds's avatar
      mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone() · bb4aa396
      Linus Torvalds authored
      
      In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
      vma, locking the anon_vma for each entry.
      
      But they are all going to have the same root entry, which means that
      we're locking and unlocking the same lock over and over again.  Which is
      expensive in locked operations, but can get _really_ expensive when that
      root entry sees any kind of lock contention.
      
      In fact, Tim Chen reports a big performance regression due to this: when
      we switched to use a mutex instead of a spinlock, the contention case
      gets much worse.
      
      So to alleviate this all, this commit creates a small helper function
      (lock_anon_vma_root()) that can be used to take the lock just once
      rather than taking and releasing it over and over again.
      
      We still have the same "take the lock and release" it behavior in the
      exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
      since we're actually freeing the anon_vma entries as we go, and that
      will touch the lock too.
      Reported-and-tested-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb4aa396
  16. 08 Jun, 2011 1 commit
    • Christoph Hellwig's avatar
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig authored
      
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      f758eeab
  17. 29 May, 2011 1 commit
  18. 28 May, 2011 2 commits
    • Hugh Dickins's avatar
      mm: fix page_lock_anon_vma leaving mutex locked · eee0f252
      Hugh Dickins authored
      On one machine I've been getting hangs, a page fault's anon_vma_prepare()
      waiting in anon_vma_lock(), other processes waiting for that page's lock.
      
      This is a replay of last year's f1819427
      
       "mm: fix hang on
      anon_vma->root->lock".
      
      The new page_lock_anon_vma() places too much faith in its refcount: when
      it has acquired the mutex_trylock(), it's possible that a racing task in
      anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
      to 1, and is about to reset its anon_vma->root.
      
      Fix this by saving anon_vma->root, and relying on the usual page_mapped()
      check instead of a refcount check: if page is still mapped, the anon_vma
      is still ours; if page is not still mapped, we're no longer interested.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eee0f252
    • Hugh Dickins's avatar
      mm: fix kernel BUG at mm/rmap.c:1017! · 5dbe0af4
      Hugh Dickins authored
      I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
      just once.  The stack showed khugepaged allocation trying to compact
      pages: the call to page_add_anon_rmap() coming from remove_migration_pte().
      
      That path holds anon_vma lock, but does not hold mmap_sem: it can
      therefore race with a split_vma(), and in commit 5f70b962
      
       "mmap:
      avoid unnecessary anon_vma lock" we just took away the anon_vma lock
      protection when adjusting vma->vm_end.
      
      I don't think that particular BUG_ON ever caught anything interesting,
      so better replace it by a comment, than reinstate the anon_vma locking.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dbe0af4
  19. 25 May, 2011 6 commits
  20. 23 May, 2011 1 commit
    • Martin Schwidefsky's avatar
      [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Martin Schwidefsky authored
      
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition move the function page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong. This
      requires to change the parameter from a struct page * to a page
      frame number.
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      2d42552d
  21. 25 Mar, 2011 2 commits
    • Dave Chinner's avatar
      fs: move i_wb_list out from under inode_lock · a66979ab
      Dave Chinner authored
      
      Protect the inode writeback list with a new global lock
      inode_wb_list_lock and use it to protect the list manipulations and
      traversals. This lock replaces the inode_lock as the inodes on the
      list can be validity checked while holding the inode->i_lock and
      hence the inode_lock is no longer needed to protect the list.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      a66979ab
    • Dave Chinner's avatar
      fs: protect inode->i_state with inode->i_lock · 250df6ed
      Dave Chinner authored
      
      Protect inode state transitions and validity checks with the
      inode->i_lock. This enables us to make inode state transitions
      independently of the inode_lock and is the first step to peeling
      away the inode_lock from the code.
      
      This requires that __iget() is done atomically with i_state checks
      during list traversals so that we don't race with another thread
      marking the inode I_FREEING between the state check and grabbing the
      reference.
      
      Also remove the unlock_new_inode() memory barrier optimisation
      required to avoid taking the inode_lock when clearing I_NEW.
      Simplify the code by simply taking the inode->i_lock around the
      state change and wakeup. Because the wakeup is no longer tricky,
      remove the wake_up_inode() function and open code the wakeup where
      necessary.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      250df6ed
  22. 23 Mar, 2011 3 commits
  23. 13 Mar, 2011 1 commit
  24. 14 Jan, 2011 4 commits