1. 09 Oct, 2014 22 commits
    • Joonsoo Kim's avatar
      mm/compaction: disallow high-order page for migration target · 96b3fde4
      Joonsoo Kim authored
      commit 7d348b9e upstream.
      
      Purpose of compaction is to get a high order page.  Currently, if we
      find high-order page while searching migration target page, we break it
      to order-0 pages and use them as migration target.  It is contrary to
      purpose of compaction, so disallow high-order page to be used for
      migration target.
      
      Additionally, clean-up logic in suitable_migration_target() to simplify
      the code.  There is no functional changes from this clean-up.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      96b3fde4
    • David Rientjes's avatar
      mm, compaction: avoid isolating pinned pages · 1190e5c6
      David Rientjes authored
      commit 119d6d59 upstream.
      
      Page migration will fail for memory that is pinned in memory with, for
      example, get_user_pages().  In this case, it is unnecessary to take
      zone->lru_lock or isolating the page and passing it to page migration
      which will ultimately fail.
      
      This is a racy check, the page can still change from under us, but in
      that case we'll just fail later when attempting to move the page.
      
      This avoids very expensive memory compaction when faulting transparent
      hugepages after pinning a lot of memory with a Mellanox driver.
      
      On a 128GB machine and pinning ~120GB of memory, before this patch we
      see the enormous disparity in the number of page migration failures
      because of the pinning (from /proc/vmstat):
      
      	compact_pages_moved 8450
      	compact_pagemigrate_failed 15614415
      
      0.05% of pages isolated are successfully migrated and explicitly
      triggering memory compaction takes 102 seconds.  After the patch:
      
      	compact_pages_moved 9197
      	compact_pagemigrate_failed 7
      
      99.9% of pages isolated are now successfully migrated in this
      configuration and memory compaction takes less than one second.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1190e5c6
    • Dan Streetman's avatar
      swap: change swap_list_head to plist, add swap_avail_head · d540b168
      Dan Streetman authored
      commit 18ab4d4c upstream.
      
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries and simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d540b168
    • Dan Streetman's avatar
      lib/plist: add plist_requeue · ae604916
      Dan Streetman authored
      commit a75f232c upstream.
      
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or swap file) are added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae604916
    • Dan Streetman's avatar
      lib/plist: add helper functions · fd6d61cc
      Dan Streetman authored
      commit fd16618e upstream.
      
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd6d61cc
    • Dan Streetman's avatar
      swap: change swap_info singly-linked list to list_head · bcbfe6fd
      Dan Streetman authored
      commit adfab836 upstream.
      
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bcbfe6fd
    • Michal Hocko's avatar
      mm: exclude memoryless nodes from zone_reclaim · a4c51bde
      Michal Hocko authored
      commit 70ef57e6 upstream.
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4c51bde
    • Andrew Hunter's avatar
      jiffies: Fix timeval conversion to jiffies · 5c0f0c01
      Andrew Hunter authored
      commit d78c9300 upstream.
      
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
      
      Tested: the following program:
      
      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
          return 0;
      }
      
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: default avatarPaul Turner <pjt@google.com>
      Reported-by: default avatarAaron Jacobs <jacobsa@google.com>
      Signed-off-by: default avatarAndrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      [bwh: Backported to 3.16: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5c0f0c01
    • Hans Verkuil's avatar
      media: vb2: fix VBI/poll regression · cbb87efb
      Hans Verkuil authored
      commit 58d75f4b upstream.
      
      The recent conversion of saa7134 to vb2 unconvered a poll() bug that
      broke the teletext applications alevt and mtt. These applications
      expect that calling poll() without having called VIDIOC_STREAMON will
      cause poll() to return POLLERR. That did not happen in vb2.
      
      This patch fixes that behavior. It also fixes what should happen when
      poll() is called when STREAMON is called but no buffers have been
      queued. In that case poll() will also return POLLERR, but only for
      capture queues since output queues will always return POLLOUT
      anyway in that situation.
      
      This brings the vb2 behavior in line with the old videobuf behavior.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Acked-by: default avatarLaurent Pinchart <laurent.pinchart@ideasonboard.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbb87efb
    • Mel Gorman's avatar
      mm: numa: Do not mark PTEs pte_numa when splitting huge pages · 5f50c44d
      Mel Gorman authored
      commit abc40bd2 upstream.
      
      This patch reverts 1ba6e0b5 ("mm: numa: split_huge_page: transfer the
      NUMA type from the pmd to the pte"). If a huge page is being split due
      a protection change and the tail will be in a PROT_NONE vma then NUMA
      hinting PTEs are temporarily created in the protected VMA.
      
       VM_RW|VM_PROTNONE
      |-----------------|
            ^
            split here
      
      In the specific case above, it should get fixed up by change_pte_range()
      but there is a window of opportunity for weirdness to happen. Similarly,
      if a huge page is shrunk and split during a protection update but before
      pmd_numa is cleared then a pte_numa can be left behind.
      
      Instead of adding complexity trying to deal with the case, this patch
      will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
      will not be triggered which is marginal in comparison to the complexity
      in dealing with the corner cases during THP split.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f50c44d
    • Waiman Long's avatar
      mm, thp: move invariant bug check out of loop in __split_huge_page_map · 1da286eb
      Waiman Long authored
      commit f8303c25 upstream.
      
      In __split_huge_page_map(), the check for page_mapcount(page) is
      invariant within the for loop.  Because of the fact that the macro is
      implemented using atomic_read(), the redundant check cannot be optimized
      away by the compiler leading to unnecessary read to the page structure.
      
      This patch moves the invariant bug check out of the loop so that it will
      be done only once.  On a 3.16-rc1 based kernel, the execution time of a
      microbenchmark that broke up 1000 transparent huge pages using munmap()
      had an execution time of 38,245us and 38,548us with and without the
      patch respectively.  The performance gain is about 1%.
      Signed-off-by: default avatarWaiman Long <Waiman.Long@hp.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1da286eb
    • Nishanth Aravamudan's avatar
      hugetlb: ensure hugepage access is denied if hugepages are not supported · de1fc405
      Nishanth Aravamudan authored
      commit 457c1b27 upstream.
      
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is properly not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de1fc405
    • Pavel Shilovsky's avatar
      CIFS: Fix SMB2 readdir error handling · 04ceeee9
      Pavel Shilovsky authored
      commit 52755808 upstream.
      
      SMB2 servers indicates the end of a directory search with
      STATUS_NO_MORE_FILE error code that is not processed now.
      This causes generic/257 xfstest to fail. Fix this by triggering
      the end of search by this error code in SMB2_query_directory.
      
      Also when negotiating CIFS protocol we tell the server to close
      the search automatically at the end and there is no need to do
      it itself. In the case of SMB2 protocol, we need to close it
      explicitly - separate close directory checks for different
      protocols.
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      04ceeee9
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Fix infinite spin in reading buffer · 3a525e23
      Steven Rostedt (Red Hat) authored
      commit 24607f11 upstream.
      
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues down when
      its not interrupted by a consuming reader. If a consuming read
      occurs, the iterator starts from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      Then if the iterator calls rb_iter_reset(), it will go into an
      infinite loop by checking if the cache doesn't match, doing the reset
      and retrying, just to see that the cache still doesn't match! Which
      should never happen as the reset is suppose to set the cache to the
      current value and there's locks that keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2 "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a525e23
    • Josh Triplett's avatar
      init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu · 769721db
      Josh Triplett authored
      commit 62b4d204 upstream.
      
      commit 03b8c7b6 ("futex: Allow
      architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
      HAVE_FUTEX_CMPXCHG symbol right below FUTEX.  This placed it right in
      the middle of the options for the EXPERT menu.  However,
      HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
      placing items in the EXPERT menu, and displays the remaining several
      EXPERT items (starting with EPOLL) directly in the General Setup menu.
      
      Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
      HAVE_FUTEX_CMPXCHG itself depend on FUTEX.  With this change, the
      subsequent items display as part of the EXPERT menu again; the EMBEDDED
      menu now appears as the next top-level item in the General Setup menu,
      which makes General Setup much shorter and more usable.
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      769721db
    • Steve French's avatar
      Fix problem recognizing symlinks · f012edf6
      Steve French authored
      commit 19e81573 upstream.
      
      Changeset eb85d94b introduced a problem where if a cifs open
      fails during query info of a file we
      will still try to close the file (happens with certain types
      of reparse points) even though the file handle is not valid.
      
      In addition for SMB2/SMB3 we were not mapping the return code returned
      by Windows when trying to open a file (like a Windows NFS symlink)
      which is a reparse point.
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Reviewed-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f012edf6
    • Chris Wilson's avatar
      drm/i915: Flush the PTEs after updating them before suspend · 7365af49
      Chris Wilson authored
      commit 91e56499 upstream.
      
      As we use WC updates of the PTE, we are responsible for notifying the
      hardware when to flush its TLBs. Do so after we zap all the PTEs before
      suspend (and the BIOS tries to read our GTT).
      
      Fixes a regression from
      
      commit 828c7908
      Author: Ben Widawsky <benjamin.widawsky@intel.com>
      Date:   Wed Oct 16 09:21:30 2013 -0700
      
          drm/i915: Disable GGTT PTEs on GEN6+ suspend
      
      that survived and continue to cause harm even after
      
      commit e568af1c
      Author: Daniel Vetter <daniel.vetter@ffwll.ch>
      Date:   Wed Mar 26 20:08:20 2014 +0100
      
          drm/i915: Undo gtt scratch pte unmapping again
      
      v2: Trivial rebase.
      v3: Fixes requires pointer dances.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82340
      Tested-by: ming.yao@intel.com
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
      Cc: Todd Previte <tprevite@gmail.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarJani Nikula <jani.nikula@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7365af49
    • NeilBrown's avatar
      md/raid5: disable 'DISCARD' by default due to safety concerns. · 6af970fa
      NeilBrown authored
      commit 8e0e99ba upstream.
      
      It has come to my attention (thanks Martin) that 'discard_zeroes_data'
      is only a hint.  Some devices in some cases don't do what it
      says on the label.
      
      The use of DISCARD in RAID5 depends on reads from discarded regions
      being predictably zero.  If a write to a previously discarded region
      performs a read-modify-write cycle it assumes that the parity block
      was consistent with the data blocks.  If all were zero, this would
      be the case.  If some are and some aren't this would not be the case.
      This could lead to data corruption after a device failure when
      data needs to be reconstructed from the parity.
      
      As we cannot trust 'discard_zeroes_data', ignore it by default
      and so disallow DISCARD on all raid4/5/6 arrays.
      
      As many devices are trustworthy, and as there are benefits to using
      DISCARD, add a module parameter to over-ride this caution and cause
      DISCARD to work if discard_zeroes_data is set.
      
      If a site want to enable DISCARD on some arrays but not on others they
      should select DISCARD support at the filesystem level, and set the
      raid456 module parameter.
          raid456.devices_handle_discard_safely=Y
      
      As this is a data-safety issue, I believe this patch is suitable for
      -stable.
      DISCARD support for RAID456 was added in 3.7
      
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Heinz Mauelshagen <heinzm@redhat.com>
      Acked-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: default avatarMike Snitzer <snitzer@redhat.com>
      Fixes: 620125f2Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6af970fa
    • Arnd Bergmann's avatar
      cpufreq: integrator: fix integrator_cpufreq_remove return type · da893cbe
      Arnd Bergmann authored
      commit d62dbf77 upstream.
      
      When building this driver as a module, we get a helpful warning
      about the return type:
      
      drivers/cpufreq/integrator-cpufreq.c:232:2: warning: initialization from incompatible pointer type
        .remove = __exit_p(integrator_cpufreq_remove),
      
      If the remove callback returns void, the caller gets an undefined
      value as it expects an integer to be returned. This fixes the
      problem by passing down the value from cpufreq_unregister_driver.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      da893cbe
    • Mel Gorman's avatar
      mm: migrate: Close race between migration completion and mprotect · 5b81f936
      Mel Gorman authored
      commit d3cb8bf6 upstream.
      
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be write that should have been read. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b81f936
    • Peter Zijlstra's avatar
      perf: fix perf bug in fork() · fddac5ed
      Peter Zijlstra authored
      commit 6c72e350 upstream.
      
      Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
      calling perf_event_free_task() when failing sched_fork() we will not yet
      have done the memset() on ->perf_event_ctxp[] and will therefore try and
      'free' the inherited contexts, which are still in use by the parent
      process.  This is bad..
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarSylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fddac5ed
    • Jan Kara's avatar
      udf: Avoid infinite loop when processing indirect ICBs · 82335226
      Jan Kara authored
      commit c03aa9f6 upstream.
      
      We did not implement any bound on number of indirect ICBs we follow when
      loading inode. Thus corrupted medium could cause kernel to go into an
      infinite loop, possibly causing a stack overflow.
      
      Fix the possible stack overflow by removing recursion from
      __udf_read_inode() and limit number of indirect ICBs we follow to avoid
      infinite loops.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82335226
  2. 05 Oct, 2014 18 commits