1. 04 Jun, 2014 40 commits
    • mm/dmapool.c: reuse devres_release() to free resources · 172cb4b3
      Andy Shevchenko authored
      Instead of calling an additional routine in dmam_pool_destroy() rely on
      what dmam_pool_release() is doing.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cma: increase CMA_ALIGNMENT upper limit to 12 · fe54b1fd
      Marc Carino authored
      Some systems require a larger maximum PAGE_SIZE order for CMA allocations.
      To accommodate such systems, increase the upper bound of the
      CMA_ALIGNMENT range to 12 (which ends up being 16MB on systems with 4K
      pages).
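
      The 16MB figure follows from the order arithmetic: an order-12
      allocation spans 2^12 pages. A minimal sketch of that calculation,
      where order_to_bytes() is a hypothetical helper written for this
      illustration and not a kernel function:

```c
#include <stddef.h>

/*
 * Bytes spanned by an allocation of the given page order, i.e.
 * page_size * 2^order. Hypothetical helper, for illustration only.
 */
size_t order_to_bytes(unsigned int order, size_t page_size)
{
    return page_size << order;
}
```

      With 4K pages, order_to_bytes(12, 4096) evaluates to 16777216 bytes,
      which is the 16MB mentioned above.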
      Signed-off-by: Marc Carino <marc.ceeeee@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: change swap_list_head to plist, add swap_avail_head · 18ab4d4c
      Dan Streetman authored
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries are simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for the swap_avail_head plist
      allows each swap_info_struct to add or remove itself from the
      plist when it becomes full or not-full; previously it could not
      do so because the swap_info_struct->lock is held when it changes
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/plist: add plist_requeue · a75f232c
      Dan Streetman authored
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or swap file) is added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
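
      The take-first-then-requeue scheme described above can be sketched
      outside the kernel. The following is a simplified, array-backed
      stand-in, not the lib/plist implementation: struct prio_list and the
      prio_list_*() functions are hypothetical names, and the sketch assumes
      entries are kept in ascending order of their prio value.

```c
#include <stddef.h>

#define MAX_NODES 8

struct node {
    int id;     /* identifies a device in this sketch */
    int prio;   /* sort key; lower values sort first */
};

struct prio_list {
    struct node nodes[MAX_NODES];
    size_t len;
};

/* Insert in ascending prio order, after existing equal-prio entries. */
int prio_list_add(struct prio_list *pl, int id, int prio)
{
    size_t pos = 0, i;

    if (pl->len == MAX_NODES)
        return -1;
    while (pos < pl->len && pl->nodes[pos].prio <= prio)
        pos++;
    for (i = pl->len; i > pos; i--)
        pl->nodes[i] = pl->nodes[i - 1];
    pl->nodes[pos].id = id;
    pl->nodes[pos].prio = prio;
    pl->len++;
    return 0;
}

/* Move the entry at idx behind all other entries of the same prio. */
void prio_list_requeue(struct prio_list *pl, size_t idx)
{
    struct node moved = pl->nodes[idx];
    size_t i = idx;

    while (i + 1 < pl->len && pl->nodes[i + 1].prio == moved.prio) {
        pl->nodes[i] = pl->nodes[i + 1];
        i++;
    }
    pl->nodes[i] = moved;
}

/* Use the first entry, then requeue it; returns its id or -1 if empty. */
int prio_list_pick(struct prio_list *pl)
{
    int id;

    if (pl->len == 0)
        return -1;
    id = pl->nodes[0].id;
    prio_list_requeue(pl, 0);
    return id;
}
```

      With two entries of equal prio at the front of the list, repeated
      prio_list_pick() calls alternate between them, while an entry with a
      different prio value keeps its position behind them.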
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/plist: add helper functions · fd16618e
      Dan Streetman authored
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
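
      The idea of wrapping the embedded list_head navigation can be
      illustrated with a standalone sketch. The types and names below
      (struct pnode, entry_of(), pnode_next(), pnode_prev()) are
      hypothetical stand-ins, and the sketch assumes a ring that contains
      only nodes, with no separate list head:

```c
#include <stddef.h>

struct list_head {
    struct list_head *prev, *next;
};

/* Simplified node: the links live in an embedded list_head. */
struct pnode {
    int prio;
    struct list_head node_list;
};

/* Recover the containing struct from a pointer to its member. */
#define entry_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Step to the neighbouring node via the embedded list_head. */
struct pnode *pnode_next(struct pnode *n)
{
    return entry_of(n->node_list.next, struct pnode, node_list);
}

struct pnode *pnode_prev(struct pnode *n)
{
    return entry_of(n->node_list.prev, struct pnode, node_list);
}
```

      Callers then write pnode_next(node) instead of spelling out the
      member name at every call site.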
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: change swap_info singly-linked list to list_head · adfab836
      Dan Streetman authored
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      This complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested, using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fold mlocked_vma_newpage() into its only call site · 7ee07a44
      Jianyu Zhan authored
      In the previous commit (mm: use the light version __mod_zone_page_state in
      mlocked_vma_newpage()) an irq-unsafe __mod_zone_page_state() is used.  As
      suggested by Andrew, to reduce the risk that new call sites incorrectly
      use mlocked_vma_newpage() without knowing they are adding a race, this
      patch folds mlocked_vma_newpage() into its only call site,
      page_add_new_anon_rmap(), making it open-coded so that people know what is
      going on.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use the light version __mod_zone_page_state in mlocked_vma_newpage() · bea04b07
      Jianyu Zhan authored
      mlocked_vma_newpage() is called with the pte lock held (a spinlock), which
      implies preemption is disabled, and the vm stat counter is not modified from
      interrupt context, so we need not use the irq-safe mod_zone_page_state()
      here; the light-weight version __mod_zone_page_state() is sufficient.
      
      This patch also documents __mod_zone_page_state() and some of its
      callsites.  The comment above __mod_zone_page_state() is from Hugh
      Dickins, and acked by Christoph.
      
      Most credits to Hugh and Christoph for the clarification on the usage of
      the __mod_zone_page_state().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: avoid rescanning pageblocks in isolate_freepages · e9ade569
      Vlastimil Babka authored
      The compaction free scanner in isolate_freepages() currently remembers PFN
      of the highest pageblock where it successfully isolates, to be used as the
      starting pageblock for the next invocation.  The rationale behind this is
      that page migration might return free pages to the allocator when
      migration fails and we don't want to skip them if the compaction
      continues.
      
      Since migration now returns free pages back to compaction code where they
      can be reused, this is no longer a concern.  This patch changes
      isolate_freepages() so that the PFN for restarting is updated with each
      pageblock where isolation is attempted.  Using stress-highalloc from
      mmtests, this resulted in 10% reduction of the pages scanned by the free
      scanner.
      
      Note that the somewhat similar functionality that records highest
      successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
      This cache is used when the whole compaction is restarted, not for
      multiple invocations of the free scanner during single compaction.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: do not count migratepages when unnecessary · f8c9301f
      Vlastimil Babka authored
      During compaction, update_nr_listpages() has been used to count remaining
      non-migrated and free pages after a call to migrate_pages().  The
      freepages counting has become unnecessary, and it turns out that
      migratepages counting is also unnecessary in most cases.
      
      The only situation when it's needed to count cc->migratepages is when
      migrate_pages() returns with a negative error code.  Otherwise, the
      non-negative return value is the number of pages that were not migrated,
      which is exactly the count of remaining pages in the cc->migratepages
      list.
      
      Furthermore, any non-zero count is only interesting for the tracepoint of
      mm_compaction_migratepages events, because after that all remaining
      unmigrated pages are put back and their count is set to 0.
      
      This patch therefore removes update_nr_listpages() completely, and changes
      the tracepoint definition so that the manual counting is done only when
      the tracepoint is enabled, and only when migrate_pages() returns a
      negative error code.
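
      The rule above (trust a non-negative return value, count manually
      only on error) can be sketched as follows. This is an illustration
      and not the patch itself: struct page_node, count_list() and
      nr_remaining() are hypothetical names, and rc merely mimics the
      return convention described in this message.

```c
#include <stddef.h>

/* Stand-in for an entry on the cc->migratepages list. */
struct page_node {
    struct page_node *next;
};

/* Walk the list and count its entries (the "manual counting"). */
size_t count_list(const struct page_node *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/*
 * rc follows the convention described above: negative on error,
 * otherwise the number of pages that were not migrated.
 */
size_t nr_remaining(int rc, const struct page_node *migratepages)
{
    if (rc >= 0)
        return (size_t)rc;            /* return value is the count */
    return count_list(migratepages);  /* error: count what is left */
}
```

      The list walk only happens on the error path, which is what lets the
      common case skip the counting entirely.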
      
      Furthermore, migrate_pages() and the tracepoints won't be called when
      there's nothing to migrate.  This potentially avoids some wasted cycles
      and reduces the volume of uninteresting mm_compaction_migratepages events
      where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
      was about 75% of the events.  The mm_compaction_isolate_migratepages event
      is better for determining that nothing was isolated for migration, and
      this one was just duplicating the info.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: terminate async compaction when rescheduling · aeef4b83
      David Rientjes authored
      Async compaction terminates prematurely when need_resched(), see
      compact_checklock_irqsave().  This can never trigger, however, if the
      cond_resched() in isolate_migratepages_range() always takes care of the
      scheduling.
      
      If the cond_resched() actually triggers, then terminate this pageblock
      scan for async compaction as well.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: avoid excessive compaction latency during fault · 75f30861
      David Rientjes authored
      Synchronous memory compaction can be very expensive: it can iterate an
      enormous amount of memory without aborting, constantly rescheduling,
      waiting on page locks and lru_lock, etc, if a pageblock cannot be
      defragmented.
      
      Unfortunately, it's too expensive for transparent hugepage page faults and
      it's much better to simply fall back to pages.  On 128GB machines, we find
      that synchronous memory compaction can take O(seconds) for a single thp
      fault.
      
      Now that async compaction remembers where it left off without strictly
      relying on sync compaction, this makes thp allocations best-effort without
      causing egregious latency during fault.  We still need to retry async
      compaction after reclaim, but this won't stall for seconds.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: embed migration mode in compact_control · e0b9daeb
      David Rientjes authored
      We're going to want to manipulate the migration mode for compaction in the
      page allocator, and currently compact_control's sync field is only a bool.
      
      Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
      depending on the value of this bool.  Convert the bool to enum
      migrate_mode and pass the migration mode in directly.  Later, we'll want
      to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
      avoid unnecessary latency.
      
      This also alters compaction triggered from sysfs, either for the entire
      system or for a node, to force MIGRATE_SYNC.
      
      [akpm@linux-foundation.org: fix build]
      [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: add per-zone migration pfn cache for async compaction · 35979ef3
      David Rientjes authored
      Each zone has a cached migration scanner pfn for memory compaction so that
      subsequent calls to memory compaction can start where the previous call
      left off.
      
      Currently, the compaction migration scanner only updates the per-zone
      cached pfn when pageblocks were not skipped for async compaction.  This
      creates a dependency on calling sync compaction to prevent subsequent
      calls to async compaction from scanning an enormous amount of non-MOVABLE
      pageblocks each time it is called.  On large machines, this could be
      potentially very expensive.
      
      This patch adds a per-zone cached migration scanner pfn only for async
      compaction.  It is updated every time a pageblock has been scanned in its
      entirety and when no pages from it were successfully isolated.  The cached
      migration scanner pfn for sync compaction is updated only when called for
      sync compaction.
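
      One way to read the update rule above is sketched below. It is not
      the patch: struct zone_cache, enum scan_mode and update_cached_pfn()
      are hypothetical names. The sketch also assumes that the cached
      positions only ever move forward and that the async slot advances
      regardless of the mode of the current call; neither assumption is
      stated in this message.

```c
enum scan_mode { SCAN_ASYNC = 0, SCAN_SYNC = 1 };

/* Per-zone cache with one scanner position per mode. */
struct zone_cache {
    unsigned long cached_migrate_pfn[2]; /* [SCAN_ASYNC], [SCAN_SYNC] */
};

/*
 * Called after a pageblock has been scanned in its entirety.
 * Nothing is cached if any page was isolated from the block.
 */
void update_cached_pfn(struct zone_cache *z, enum scan_mode mode,
                       unsigned long pfn, unsigned int nr_isolated)
{
    if (nr_isolated)
        return;

    /* Assumption: the async position advances in either mode. */
    if (pfn > z->cached_migrate_pfn[SCAN_ASYNC])
        z->cached_migrate_pfn[SCAN_ASYNC] = pfn;

    /* The sync position is touched only by sync compaction. */
    if (mode == SCAN_SYNC && pfn > z->cached_migrate_pfn[SCAN_SYNC])
        z->cached_migrate_pfn[SCAN_SYNC] = pfn;
}
```

      Keeping two slots means an async caller can resume from its own
      position without waiting for a sync pass to move the shared one.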
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: return failed migration target pages back to freelist · d53aea3d
      David Rientjes authored
      Greg reported that he found isolated free pages were returned back to the
      VM rather than the compaction freelist.  This will cause holes behind the
      free scanner and cause it to reallocate additional memory if necessary
      later.
      
      He detected the problem at runtime seeing that ext4 metadata pages (esp
      the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
      constantly visited by compaction calls of migrate_pages().  These pages
      had a non-zero b_count which caused fallback_migrate_page() ->
      try_to_release_page() -> try_to_free_buffers() to fail.
      
      Memory compaction works by having a "freeing scanner" scan from one end of
      a zone which isolates pages as migration targets while another "migrating
      scanner" scans from the other end of the same zone which isolates pages
      for migration.
      
      When page migration fails for an isolated page, the target page is
      returned to the system rather than the freelist built by the freeing
      scanner.  This may require the freeing scanner to continue scanning memory
      after suitable migration targets have already been returned to the system
      needlessly.
      
      This patch returns destination pages to the freeing scanner freelist when
      page migration fails.  This prevents unnecessary work done by the freeing
      scanner but also encourages memory to be as compacted as possible at the
      end of the zone.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, migration: add destination page freeing callback · 68711a74
      David Rientjes authored
      Memory migration uses a callback defined by the caller to determine how to
      allocate destination pages.  When migration fails for a source page,
      however, it frees the destination page back to the system.
      
      This patch adds a memory migration callback defined by the caller to
      determine how to free destination pages.  If a caller, such as memory
      compaction, builds its own freelist for migration targets, this can reuse
      already freed memory instead of scanning additional memory.
      
      If the caller provides a function to handle freeing of destination pages,
      it is called when page migration fails.  If the caller passes NULL then
      freeing back to the system will be handled as usual.  This patch
      introduces no functional change.
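
      The callback-with-NULL-fallback pattern described above can be
      sketched in isolation. All names here (free_dst_fn, struct freelist,
      freelist_put(), free_to_system(), release_failed_destination()) are
      hypothetical and do not reflect the signatures used by the patch;
      pages are represented by opaque pointers.

```c
#include <stddef.h>

#define FREELIST_CAP 4

/* Caller-supplied hook for handing back an unused destination page. */
typedef void (*free_dst_fn)(void *page, void *priv);

/* A caller-private freelist, as compaction is described as keeping. */
struct freelist {
    void *pages[FREELIST_CAP];
    size_t n;
};

/* Counts pages handed back to the system in this sketch. */
int system_frees;

void free_to_system(void *page)
{
    (void)page;
    system_frees++;
}

/* Example hook: put the page back on the caller's own freelist. */
void freelist_put(void *page, void *priv)
{
    struct freelist *fl = priv;

    if (fl->n < FREELIST_CAP)
        fl->pages[fl->n++] = page;
}

/* On migration failure: use the hook if given, else free as usual. */
void release_failed_destination(void *page, free_dst_fn put_cb, void *priv)
{
    if (put_cb)
        put_cb(page, priv);
    else
        free_to_system(page);
}
```

      Passing NULL for the hook keeps the previous behavior, which is why
      adding the parameter alone changes nothing for existing callers.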
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: memcg_kmem_create_cache: make memcg_name_buf statically allocated · 93f39eea
      Vladimir Davydov authored
      It isn't worth complicating the code by allocating it on the first access,
      because it only takes 256 bytes.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: get rid of memcg_create_cache_name · 073ee1c6
      Vladimir Davydov authored
      Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
      order to just create the name of a per memcg cache, let's allocate it in
      place.  We only need to pass the memcg name to kmem_cache_create_memcg for
      that - everything else can be done in slab_common.c.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: correct comments for __mem_cgroup_begin_update_page_stat · b5ffc856
      Qiang Huang authored
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fold mem_cgroup_stolen · bdcbb659
      Qiang Huang authored
      It is only used in __mem_cgroup_begin_update_page_stat(); the name is
      confusing, and having two routines for one thing also confuses people, so
      folding this function into its caller seems clearer.
      
      [akpm@linux-foundation.org: fix typo, per Michal]
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update comment for DEFAULT_MAX_MAP_COUNT · 3fb1c8dc
      Kirill A. Shutemov authored
      With ELF extended numbering, the 16-bit bound is no longer a hard limit.
      
      [akpm@linux-foundation.org: fix typo]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/x86/mm/numa.c: use for_each_memblock() · af4459d3
      Emil Medve authored
      Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af4459d3
    • Fabian Frederick's avatar
      mm/mempolicy.c: parameter doc uniformization · b46e14ac
      Fabian Frederick authored
      Also fixes a kernel-doc warning.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap.c: make page_referenced_one() and try_to_unmap_one() static · ac769501
      Kirill A. Shutemov authored
      KSM was converted to use rmap_walk() and now nobody uses these functions
      outside mm/rmap.c.
      
      Let's convert them back to static.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: x86 pgtable: require X86_64 for soft-dirty tracker · 2bf01f9f
      Cyrill Gorcunov authored
      Tracking dirty status on 2-level page tables requires very ugly macros.
      Considering how old the machines that can only operate without PAE
      mode are, let's drop the soft-dirty tracker from them for code
      simplicity. (Note that I can't drop all the macros for 2-level page
      tables yet, since _PAGE_BIT_PROTNONE and _PAGE_BIT_FILE are still used
      even without the tracker.)
      
      Linus proposed ripping out soft-dirty support on x86-32 completely (even
      with PAE), and since we're not planning to support native x86-32 mode
      for CRIU, let's do that.
      
      (The soft-dirty tracker is a relatively new feature that is mostly used
      by CRIU, so I don't expect this API change to cause problems for
      userspace.)
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: x86 pgtable: drop unneeded preprocessor ifdef · 2373eaec
      Cyrill Gorcunov authored
      _PAGE_BIT_FILE (bit 6) is always less than _PAGE_BIT_PROTNONE (bit 8), so
      drop the redundant #ifdef.
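
The reasoning can be checked at compile time. The following is a minimal sketch that only reuses the two bit positions quoted in the message (6 and 8). The macro names `PAGE_BIT_FILE`, `PAGE_BIT_PROTNONE`, `FILE_BIT_IS_LOWER` and the helper `file_bit_is_lower()` are illustrative stand-ins, not the kernel's headers or the removed code itself.

```c
#include <assert.h>

/* Bit positions as stated in the commit message. The names are
 * hypothetical stand-ins for _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE. */
#define PAGE_BIT_FILE     6
#define PAGE_BIT_PROTNONE 8

/* Both operands are fixed constants, so the preprocessor can only ever
 * take the first arm; the #else arm is dead code. */
#if PAGE_BIT_FILE < PAGE_BIT_PROTNONE
#define FILE_BIT_IS_LOWER 1
#else
#define FILE_BIT_IS_LOWER 0
#endif

/* Fails the build if the ordering ever changes. */
_Static_assert(PAGE_BIT_FILE < PAGE_BIT_PROTNONE,
               "bit ordering changed; the conditional would matter again");

static inline int file_bit_is_lower(void)
{
	return FILE_BIT_IS_LOWER;
}
```

A static assertion of this kind is one way to document an ordering assumption after the conditional is gone; whether the kernel needs one here is a separate question.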
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup __get_user_pages() · fa5bb209
      Kirill A. Shutemov authored
      Get rid of two nested loops over nr_pages, extract vma flags checking
      into a separate function, and make other assorted cleanups.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: extract code to fault in a page from __get_user_pages() · 16744483
      Kirill A. Shutemov authored
      The nesting level in __get_user_pages() is just insane. Let's try to
      fix it a bit.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cleanup follow_page_mask() · 69e68b4f
      Kirill A. Shutemov authored
      Cleanups:
       - move pte-related code to a separate function. It's about half of the
         function;
       - get rid of some goto-logic;
       - use 'return NULL' instead of 'return page' where page can only be
         NULL.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: extract in_gate_area() case from __get_user_pages() · f2b495ca
      Kirill A. Shutemov authored
      The case is special and distracts from reading the main
      __get_user_pages() code path. Let's move it to a separate function.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move get_user_pages()-related code to separate file · 4bbd4c77
      Kirill A. Shutemov authored
      mm/memory.c is overloaded: over 4k lines. The get_user_pages() code is
      pretty much self-contained, so let's move it to a separate file.
      
      No other changes made.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: replace seq_printf by seq_puts · f4527c90
      Fabian Frederick authored
      Replace seq_printf() with seq_puts() where possible.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: remove NULL assignment on static · ada4ba59
      Fabian Frederick authored
      Static pointers are automatically initialized to NULL.
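
This follows from the C rule that objects with static storage duration are zero-initialized when they have no explicit initializer. A minimal sketch of the equivalence follows; `struct node`, the two variable names, and the helper are hypothetical and are not taken from mm/memcontrol.c.

```c
#include <assert.h>
#include <stddef.h>

struct node { int value; };   /* hypothetical type, for illustration only */

/* Explicit initializer: what the patch removes. */
static struct node *explicit_ptr = NULL;

/* No initializer: static storage duration means the pointer starts
 * out as a null pointer anyway. */
static struct node *implicit_ptr;

/* Returns 1 when both pointers hold the same (null) initial value. */
static inline int both_are_null(void)
{
	return explicit_ptr == NULL && implicit_ptr == NULL;
}
```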
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shrinker: add nid to tracepoint output · df9024a8
      Dave Hansen authored
      Now that we are doing NUMA-aware shrinking, and can have shrinkers
      running in parallel, or working on individual nodes, it seems like we
      should also be sticking the node in the output.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shrinker trace points: fix negatives · 7fe70475
      Dave Hansen authored
      I was looking at a trace of the slab shrinkers (attachment in this comment):
      
      	https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67
      
      and noticed that "total_scan" can go negative in some cases.  We
      used to dump out the "total_scan" variable directly, but some of
      the shrinker modifications along the way changed that.
      
      This patch just dumps it out directly, again.  It doesn't make
      any sense to derive it from new_nr and nr any more since there
      are now other shrinkers that can be running in parallel and
      mucking with those values.
      
      Here's an example of the negative numbers in the output:
      
      >          kswapd0-840   [000]   160.869398: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
      >          kswapd0-840   [000]   160.869618: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
      >          kswapd0-840   [000]   160.870031: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
      >          kswapd0-840   [000]   160.870464: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
      >          kswapd0-840   [000]   163.384144: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
      >          kswapd0-840   [000]   163.384297: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
      >          kswapd0-840   [000]   163.384414: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
      >          kswapd0-840   [000]   163.384657: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
      >          kswapd0-840   [000]   163.384880: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
      >          kswapd0-840   [000]   163.385256: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
      >          kswapd0-840   [000]   163.385598: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512
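
The negative values in the trace are consistent with total_scan being derived as "new scan count" minus "unused scan count". A small sketch of that arithmetic follows; the helper name `derived_total_scan()` and its parameter names are assumptions made for illustration, not the tracepoint code.

```c
#include <assert.h>

/* Hypothetical helper modelling the derived value seen in the trace:
 * total_scan = new scan count - unused scan count. When another
 * shrinker lowers the count in parallel, the difference goes negative. */
static inline long derived_total_scan(long unused_scan_cnt, long new_scan_cnt)
{
	return new_scan_cnt - unused_scan_cnt;
}
```

For the third trace line above (unused scan count 102, new scan count 47) this gives 47 - 102 = -55, matching the reported total_scan.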
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/dmapool.c: remove redundant NULL check for dev in dma_pool_create() · cc6b664a
      Daeseok Youn authored
      "dev" cannot be NULL because it is already checked before calling
      dma_pool_create().
      
      If dev ever was NULL, the code would oops in dev_to_node() after enabling
      CONFIG_NUMA.
      
      It is possible that some driver is using dev==NULL and has never been run
      on a NUMA machine.  Such a driver is probably outdated, possibly buggy and
      will need some attention if it starts triggering NULL derefs.
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/bootmem.h: cleanup the comment for BOOTMEM_ flags · 1754e44e
      Wang Sheng-Hui authored
      Use BOOTMEM_DEFAULT instead of 0 in the comment.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introdule compound_head_by_tail() · d2ee40ea
      Jianyu Zhan authored
      Currently, in put_compound_page(), we have
      
      ======
      if (likely(!PageTail(page))) {                  <------  (1)
              if (put_page_testzero(page)) {
                      /*
                       * By the time all refcounts have been released
                       * split_huge_page cannot run anymore from under us.
                       */
                      if (PageHead(page))
                              __put_compound_page(page);
                      else
                              __put_single_page(page);
              }
              return;
      }
      
      /* __split_huge_page_refcount can run under us */
      page_head = compound_head(page);        <------------ (2)
      ======
      
      If we fail the check at (1), the page is *likely* a tail page.
      
      Then at (2), since compound_head(page) is inlined, the code becomes:
      
      ======
      static inline struct page *compound_head(struct page *page)
      {
              if (unlikely(PageTail(page))) {           <----------- (3)
                      struct page *head = page->first_page;
      
                      smp_rmb();
                      if (likely(PageTail(page)))
                              return head;
              }
              return page;
      }
      ======
      
      Here, the unlikely() at (3) is the wrong hint, because the page is
      *likely* a tail page.  The check at (3) is therefore counterproductive
      in this case, so I introduce a helper for it.
      
      This patch introduces compound_head_by_tail(), which deals with a
      possible tail page (though it could be split by a racing thread), and
      makes compound_head() a wrapper around it.
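
The split described above can be sketched in userspace as follows. This is a simplified model, not the kernel code: `struct page` is reduced to the two fields the example needs, `PageTail()` reads a plain flag, `likely()`/`unlikely()` are defined locally via the GCC/Clang `__builtin_expect` builtin, and a C11 acquire fence stands in for `smp_rmb()`. All of these are assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the kernel's branch-prediction hints. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Simplified model of struct page: just a tail flag and the head link. */
struct page {
	bool tail;
	struct page *first_page;
};

static inline bool PageTail(const struct page *page)
{
	return page->tail;
}

/* For callers that already expect a tail page: no unlikely() check first. */
static inline struct page *compound_head_by_tail(struct page *tail)
{
	struct page *head = tail->first_page;

	/* Stand-in for smp_rmb(): read first_page before rechecking the flag,
	 * since the compound page may have been split in the meantime. */
	atomic_thread_fence(memory_order_acquire);
	if (likely(PageTail(tail)))
		return head;
	return tail;
}

/* The general entry point becomes a thin wrapper around the helper. */
static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		return compound_head_by_tail(page);
	return page;
}
```

In this model, a caller such as put_compound_page() that has already failed the `!PageTail(page)` fast-path test would call `compound_head_by_tail()` directly and skip the mispredicted `unlikely()` branch.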
      
      This patch has no functional change, and it reduces the object
      size slightly:
         text    data     bss     dec     hex  filename
        11003    1328      16   12347    303b  mm/swap.o.orig
        10971    1328      16   12315    301b  mm/swap.o.patched
      
      I ran "perf top -e branch-miss" to observe branch misses in this case.
      As Michael points out, it's a slow path, so this case happens only
      rarely.  But I grepped the code base and found some other call sites
      that could benefit from this helper.  Given that it bloats the source
      by only 5 lines while reducing the object size, I still believe this
      helper deserves to exist.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: split put_compound_page() · 4bd3e8f7
      Jianyu Zhan authored
      Currently, put_compound_page() carefully handles tricky cases to avoid
      racing with compound page releasing or splitting, which makes it quite
      lengthy (about 200+ lines) and deeply indented, and therefore quite
      hard to follow and maintain.
      
      Now, based on the two helpers introduced in the previous patch ("mm/swap.c:
      introduce put_[un]refcounted_compound_page helpers for splitting
      put_compound_page()"), this patch replaces those two lengthy code paths
      with the two helpers, respectively.  It also rephrases some comments.
      
      After this patch, put_compound_page() is very compact, and thus easy to
      read and maintain.
      
      After splitting, the object file is the same size as the original one.
      In fact, I diffed the disassembly of the original put_compound_page()
      against the patched one, and they are 100% identical.
      
      This shows that the split makes no functional change, but it improves
      readability.
      
      This patch and the previous one grow the code by 32 lines, mostly due to
      comments.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: introduce put_[un]refcounted_compound_page helpers for splitting put_compound_page() · c747ce79
      Jianyu Zhan authored
      Currently, put_compound_page() carefully handles tricky cases to avoid
      racing with compound page releasing or splitting, which makes it quite
      lengthy (about 200+ lines) and deeply indented, and therefore quite
      hard to follow and maintain.
      
      This patch and the next patch refactor this function.
      
      Based on the code skeleton of put_compound_page:
      
      put_compound_page:
              if !PageTail(page)
                      put head page fastpath;
                      return;
      
              /* else PageTail */
              page_head = compound_head(page)
              if !__compound_tail_refcounted(page_head)
                      put head page optimal path; <--- (1)
                      return;
              else
                      put head page slowpath; <--- (2)
                      return;
      
      This patch introduces two helpers, put_unrefcounted_compound_page() and
      put_refcounted_compound_page(), which handle code path (1) and code
      path (2), respectively.  Both are tagged __always_inline, which
      eliminates function call overhead and makes them operate the same way
      as before.
      
      They are copied almost verbatim (except in one place, where a
      "goto out_put_single" is expanded), with some comments rephrased.
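
The resulting control flow can be sketched as a small userspace model. Only the dispatch structure is taken from the skeleton above; the helper bodies are placeholders that report which path was taken. The `enum put_path` values, the `tail_refcounted` field (a stand-in for the `__compound_tail_refcounted()` test), the `ALWAYS_INLINE` macro (a stand-in for the kernel's `__always_inline`), and the helper signatures are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's __always_inline (GCC/Clang attribute). */
#define ALWAYS_INLINE inline __attribute__((always_inline))

/* Which branch of the skeleton handled the page (illustrative only). */
enum put_path { PATH_HEAD_FASTPATH, PATH_UNREFCOUNTED, PATH_REFCOUNTED };

/* Simplified model of struct page for this sketch. */
struct page {
	bool tail;             /* models PageTail() */
	bool tail_refcounted;  /* models __compound_tail_refcounted() on a head */
	struct page *first_page;
};

/* Code path (1): tail pages of this head are not refcounted. */
static ALWAYS_INLINE enum put_path
put_unrefcounted_compound_page(struct page *page_head, struct page *page)
{
	(void)page_head;
	(void)page;
	return PATH_UNREFCOUNTED;  /* placeholder for the real release logic */
}

/* Code path (2): the slow path that must cope with a racing split. */
static ALWAYS_INLINE enum put_path
put_refcounted_compound_page(struct page *page_head, struct page *page)
{
	(void)page_head;
	(void)page;
	return PATH_REFCOUNTED;    /* placeholder for the real release logic */
}

/* After the refactoring, the top-level function is a short dispatcher. */
static enum put_path put_compound_page(struct page *page)
{
	struct page *page_head;

	if (!page->tail)
		return PATH_HEAD_FASTPATH;

	page_head = page->first_page;
	if (!page_head->tail_refcounted)
		return put_unrefcounted_compound_page(page_head, page);
	return put_refcounted_compound_page(page_head, page);
}
```

Because both helpers are forced inline, a compiler can generate the same code as for one large function, which matches the message's point that behavior is unchanged.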
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>