1. 18 Mar, 2015 20 commits
    • Vlastimil Babka's avatar
      mm: when stealing freepages, also take pages created by splitting buddy page · a4f3f96f
      Vlastimil Babka authored
      commit 99592d59 upstream.
      
      When studying page stealing, I noticed some weird looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
      RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
      so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always clear win and Joonsoo questioned the
      results.  Here I used different baseline which includes RFC compaction
      improvements from [1].  I found that the compaction improvements reduce
      variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  First
      column is after fresh reboot, other two are reiterations of test without
      reboot.  That was all accumulater over 5 re-iterations (so the benchmark
      was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we've seen apparent regression with Patch 1 for unmovable events,
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      patch 1 and then either the same modulo noise, or perhaps sligtly worse -
      a small price for unmovable improvements, IMHO.  The number of movable
      allocations falling back to other migratetypes is most noisy, but it's
      reduced to half at Patch 2 nevertheless.  These are least critical as
      compaction can move them around.
      
      If we look at success rates, the patches don't affect them, that didn't change.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider reduced fragmentation events
      to be worth on its own.  Patch 2 also seems to reduce scanning for free
      pages, and migrations in compaction, suggesting it has somewhat less work
      to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given how [1] offers lower improvements
      in this scenario due to less restarts after deferred compaction which
      would change compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for first iteration are clear, the rest is much noisier
      and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, decision for 1) has implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it makes thus sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4f3f96f
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration entry check in __unmap_hugepage_range · 99537c23
      Naoya Horiguchi authored
      commit 9fbc1f63 upstream.
      
      If __unmap_hugepage_range() tries to unmap the address range over which
      hugepage migration is on the way, we get the wrong page because pte_page()
      doesn't work for migration entries.  This patch simply clears the pte for
      migration entries as we do for hwpoison entries.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      99537c23
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection · d0fc2ca5
      Naoya Horiguchi authored
      commit a8bda28d upstream.
      
      There is a race condition between hugepage migration and
      change_protection(), where hugetlb_change_protection() doesn't care about
      migration entries and wrongly overwrites them.  That causes unexpected
      results like kernel crash.  HWPoison entries also can cause the same
      problem.
      
      This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
      function to do proper actions.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d0fc2ca5
    • Jiri Pirko's avatar
      team: don't traverse port list using rcu in team_set_mac_address · 67b58fb7
      Jiri Pirko authored
      [ Upstream commit 9215f437 ]
      
      Currently the list is traversed using rcu variant. That is not correct
      since dev_set_mac_address can be called which eventually calls
      rtmsg_ifinfo_build_skb and there, skb allocation can sleep. So fix this
      by remove the rcu usage here.
      
      Fixes: 3d249d4c "net: introduce ethernet teaming device"
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67b58fb7
    • Lorenzo Colitti's avatar
      net: ping: Return EAFNOSUPPORT when appropriate. · f15e90f9
      Lorenzo Colitti authored
      [ Upstream commit 9145736d ]
      
      1. For an IPv4 ping socket, ping_check_bind_addr does not check
         the family of the socket address that's passed in. Instead,
         make it behave like inet_bind, which enforces either that the
         address family is AF_INET, or that the family is AF_UNSPEC and
         the address is 0.0.0.0.
      2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
         if the socket family is not AF_INET6. Return EAFNOSUPPORT
         instead, for consistency with inet6_bind.
      3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
         instead of EINVAL if an incorrect socket address structure is
         passed in.
      4. Make IPv6 ping sockets be IPv6-only. The code does not support
         IPv4, and it cannot easily be made to support IPv4 because
         the protocol numbers for ICMP and ICMPv6 are different. This
         makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
         of making the socket unusable.
      
      Among other things, this fixes an oops that can be triggered by:
      
          int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
          struct sockaddr_in6 sin6 = {
              .sin6_family = AF_INET6,
              .sin6_addr = in6addr_any,
          };
          bind(s, (struct sockaddr *) &sin6, sizeof(sin6));
      
      Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f15e90f9
    • Michal Kubeček's avatar
      udp: only allow UFO for packets from SOCK_DGRAM sockets · 8a0cafc9
      Michal Kubeček authored
      [ Upstream commit acf8dd0a ]
      
      If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
      UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
      CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
      checksum is to be computed on segmentation. However, in this case,
      skb->csum_start and skb->csum_offset are never set as raw socket
      transmit path bypasses udp_send_skb() where they are usually set. As a
      result, driver may access invalid memory when trying to calculate the
      checksum and store the result (as observed in virtio_net driver).
      
      Moreover, the very idea of modifying the userspace provided UDP header
      is IMHO against raw socket semantics (I wasn't able to find a document
      clearly stating this or the opposite, though). And while allowing
      CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
      too intrusive change just to handle a corner case like this. Therefore
      disallowing UFO for packets from SOCK_DGRAM seems to be the best option.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a0cafc9
    • Ben Shelton's avatar
      usb: plusb: Add support for National Instruments host-to-host cable · 37aa4dd0
      Ben Shelton authored
      [ Upstream commit 42c972a1 ]
      
      The National Instruments USB Host-to-Host Cable is based on the Prolific
      PL-25A1 chipset.  Add its VID/PID so the plusb driver will recognize it.
      Signed-off-by: default avatarBen Shelton <ben.shelton@ni.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      37aa4dd0
    • Eric Dumazet's avatar
      macvtap: make sure neighbour code can push ethernet header · 00ecb618
      Eric Dumazet authored
      [ Upstream commit 2f1d8b9e ]
      
      Brian reported crashes using IPv6 traffic with macvtap/veth combo.
      
      I tracked the crashes in neigh_hh_output()
      
      -> memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);
      
      Neighbour code assumes headroom to push Ethernet header is
      at least 16 bytes.
      
      It appears macvtap has only 14 bytes available on arches
      where NET_IP_ALIGN is 0 (like x86)
      
      Effect is a corruption of 2 bytes right before skb->head,
      and possible crashes if accessing non existing memory.
      
      This fix should also increase IPv4 performance, as paranoid code
      in ip_finish_output2() wont have to call skb_realloc_headroom()
      Reported-by: default avatarBrian Rak <brak@vultr.com>
      Tested-by: default avatarBrian Rak <brak@vultr.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      00ecb618
    • Catalin Marinas's avatar
      net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg · a382f4b2
      Catalin Marinas authored
      [ Upstream commit d720d8ce ]
      
      With commit a7526eb5 (net: Unbreak compat_sys_{send,recv}msg), the
      MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
      changing the kernel compat behaviour from the one before the commit it
      was trying to fix (1be374a0, net: Block MSG_CMSG_COMPAT in
      send(m)msg and recv(m)msg).
      
      On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
      32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
      the kernel). However, on a 64-bit kernel, the compat ABI is different
      with commit a7526eb5.
      
      This patch changes the compat_sys_{send,recv}msg behaviour to the one
      prior to commit 1be374a0.
      
      The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
      kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
      but the general rule is not to break user ABI (even when the user
      behaviour is not entirely sane).
      
      Fixes: a7526eb5 (net: Unbreak compat_sys_{send,recv}msg)
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a382f4b2
    • Jiri Pirko's avatar
      team: fix possible null pointer dereference in team_handle_frame · fdb35e5a
      Jiri Pirko authored
      [ Upstream commit 57e59563 ]
      
      Currently following race is possible in team:
      
      CPU0                                        CPU1
                                                  team_port_del
                                                    team_upper_dev_unlink
                                                      priv_flags &= ~IFF_TEAM_PORT
      team_handle_frame
        team_port_get_rcu
          team_port_exists
            priv_flags & IFF_TEAM_PORT == 0
          return NULL (instead of port got
                       from rx_handler_data)
                                                    netdev_rx_handler_unregister
      
      The thing is that the flag is removed before rx_handler is unregistered.
      If team_handle_frame is called in between, team_port_exists returns 0
      and team_port_get_rcu will return NULL.
      So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
      that team_handle_frame will always see valid rx_handler_data pointer.
      Signed-off-by: default avatarJiri Pirko <jiri@resnulli.us>
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdb35e5a
    • Matthew Thode's avatar
      net: reject creation of netdev names with colons · f0bde501
      Matthew Thode authored
      [ Upstream commit a4176a93 ]
      
      colons are used as a separator in netdev device lookup in dev_ioctl.c
      
      Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME
      Signed-off-by: default avatarMatthew Thode <mthode@mthode.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f0bde501
    • Ignacy Gawędzki's avatar
      ematch: Fix auto-loading of ematch modules. · 96c65236
      Ignacy Gawędzki authored
      [ Upstream commit 34eea79e ]
      
      In tcf_em_validate(), after calling request_module() to load the
      kind-specific module, set em->ops to NULL before returning -EAGAIN, so
      that module_put() is not called again by tcf_em_tree_destroy().
      Signed-off-by: default avatarIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      96c65236
    • Guenter Roeck's avatar
      net: phy: Fix verification of EEE support in phy_init_eee · 058969ab
      Guenter Roeck authored
      [ Upstream commit 54da5a8b ]
      
      phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
      to find a valid entry in the settings array for the given speed
      and duplex value. For full duplex 1000baseT, this will return
      the first matching entry, which is the entry for 1000baseKX_Full.
      
      If the phy eee does not support 1000baseKX_Full, this entry will not
      match, causing phy_init_eee to fail for no good reason.
      
      Fixes: 9a9c56cb ("net: phy: fix a bug when verify the EEE support")
      Fixes: 3e707706 ("phy: Expand phy speed/duplex settings array")
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Acked-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      058969ab
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should not assume that skb_network_offset is zero · 7ac8d452
      Alexander Drozdov authored
      [ Upstream commit 3e32e733 ]
      
      ip_check_defrag() may be used by af_packet to defragment outgoing packets.
      skb_network_offset() of af_packet's outgoing packets is not zero.
      Signed-off-by: default avatarAlexander Drozdov <al.drozdov@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ac8d452
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should correctly check return value of skb_copy_bits · 0b04f65d
      Alexander Drozdov authored
      [ Upstream commit fba04a9e ]
      
      skb_copy_bits() returns zero on success and negative value on error,
      so it is needed to invert the condition in ip_check_defrag().
      
      Fixes: 1bf3751e ("ipv4: ip_check_defrag must not modify skb before unsharing")
      Signed-off-by: default avatarAlexander Drozdov <al.drozdov@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b04f65d
    • Ignacy Gawędzki's avatar
      gen_stats.c: Duplicate xstats buffer for later use · f1796b11
      Ignacy Gawędzki authored
      [ Upstream commit 1c4cff0c ]
      
      The gnet_stats_copy_app() function gets called, more often than not, with its
      second argument a pointer to an automatic variable in the caller's stack.
      Therefore, to avoid copying garbage afterwards when calling
      gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
      memory that gets freed after use.
      
      [xiyou.wangcong@gmail.com: remove a useless kfree()]
      Signed-off-by: default avatarIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1796b11
    • WANG Cong's avatar
      rtnetlink: call ->dellink on failure when ->newlink exists · b830ba60
      WANG Cong authored
      [ Upstream commit 7afb8886 ]
      
      Ignacy reported that when eth0 is down and add a vlan device
      on top of it like:
      
        ip link add link eth0 name eth0.1 up type vlan id 1
      
      We will get a refcount leak:
      
        unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
      
      The problem is when rtnl_configure_link() fails in rtnl_newlink(),
      we simply call unregister_device(), but for stacked device like vlan,
      we almost do nothing when we unregister the upper device, more work
      is done when we unregister the lower device, so call its ->dellink().
      Reported-by: default avatarIgnacy Gawedzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b830ba60
    • Martin KaFai Lau's avatar
      ipv6: fix ipv6_cow_metrics for non DST_HOST case · 8f923964
      Martin KaFai Lau authored
      [ Upstream commit 3b471175 ]
      
      ipv6_cow_metrics() currently assumes only DST_HOST routes require
      dynamic metrics allocation from inetpeer.  The assumption breaks
      when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
      Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
      is called after the route is created.
      
      This patch creates the metrics array (by calling dst_cow_metrics_generic) in
      ipv6_cow_metrics().
      
      Test:
      radvd.conf:
      interface qemubr0
      {
      	AdvLinkMTU 1300;
      	AdvCurHopLimit 30;
      
      	prefix fd00:face:face:face::/64
      	{
      		AdvOnLink on;
      		AdvAutonomous on;
      		AdvRouterAddr off;
      	};
      };
      
      Before:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
      fe80::/64 dev eth0  proto kernel  metric 256
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec
      
      After:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
      fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30
      
      Fixes: 8e2ec639 (ipv6: don't use inetpeer to store metrics for routes.)
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f923964
    • Daniel Borkmann's avatar
      rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 07e756ea
      Daniel Borkmann authored
      [ Upstream commit 364d5716 ]
      
      ifla_vf_policy[] is wrong in advertising its individual member types as
      NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
      len member as *max* attribute length [0, len].
      
      The issue is that when do_setvfinfo() is being called to set up a VF
      through ndo handler, we could set corrupted data if the attribute length
      is less than the size of the related structure itself.
      
      The intent is exactly the opposite, namely to make sure to pass at least
      data of minimum size of len.
      
      Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07e756ea
    • Sabrina Dubroca's avatar
      pktgen: fix UDP checksum computation · dfef467f
      Sabrina Dubroca authored
      [ Upstream commit 7744b5f3 ]
      
      This patch fixes two issues in UDP checksum computation in pktgen.
      
      First, the pseudo-header uses the source and destination IP
      addresses. Currently, the ports are used for IPv4.
      
      Second, the UDP checksum covers both header and data.  So we need to
      generate the data earlier (move pktgen_finalize_skb up), and compute
      the checksum for UDP header + data.
      
      Fixes: c26bf4a5 ("pktgen: Add UDPCSUM flag to support UDP checksums")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dfef467f
  2. 06 Mar, 2015 20 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.14.35 · e8f616ae
      Greg Kroah-Hartman authored
      e8f616ae
    • Hector Marco-Gisbert's avatar
      x86, mm/ASLR: Fix stack randomization on 64-bit systems · 14a3e0c9
      Hector Marco-Gisbert authored
      commit 4e7c22d4 upstream.
      
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
                 return PAGE_ALIGN(stack_top) + random_variable;
                 return PAGE_ALIGN(stack_top) - random_variable;
        }
      
      Note that, it declares the "random_variable" variable as "unsigned int".
      Since the result of the shifting operation between STACK_RND_MASK (which
      is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
      
      	  random_variable <<= PAGE_SHIFT;
      
      then the two leftmost bits are dropped when storing the result in the
      "random_variable". This variable shall be at least 34 bits long to hold
      the (22+12) result.
      
      These two dropped bits have an impact on the entropy of process stack.
      Concretely, the total stack entropy is reduced by four: from 2^28 to
      2^30 (One fourth of expected entropy).
      
      This patch restores back the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: default avatarHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: default avatarIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.netSigned-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      14a3e0c9
    • Thadeu Lima de Souza Cascardo's avatar
      blk-throttle: check stats_cpu before reading it from sysfs · 03265186
      Thadeu Lima de Souza Cascardo authored
      commit 045c47ca upstream.
      
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
      
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      
      And here is one code that allows to easily reproduce this, although this
      has first been found by running docker.
      
      void run(pid_t pid)
      {
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	close(fd);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      		close(fd);
      		wait(&status);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      		close(fd);
      	}
      	free(buffer);
      	exit(0);
      }
      
      void test(void)
      {
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      		wait(&status);
      	else
      		run(getpid());
      	rmdir(CGPATH "/test");
      }
      
      int main(int argc, char **argv)
      {
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      		test();
      	return 0;
      }
      Reported-by: default avatarRicardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      03265186
    • David Sterba's avatar
      btrfs: fix leak of path in btrfs_find_item · f9e2ba63
      David Sterba authored
      commit 381cf658 upstream.
      
      If btrfs_find_item is called with NULL path it allocates one locally but
      does not free it. Affected paths are inserting an orphan item for a file
      and for a subvol root.
      
      Move the path allocation to the callers.
      
      Fixes: 3f870c28 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f9e2ba63
    • David Sterba's avatar
      btrfs: set proper message level for skinny metadata · 74e42361
      David Sterba authored
      commit 5efa0490 upstream.
      
      This has been confusing people for too long, the message is really just
      informative.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74e42361
    • Chen Jie's avatar
      jffs2: fix handling of corrupted summary length · b82eaa1b
      Chen Jie authored
      commit 164c2406 upstream.
      
      sm->offset maybe wrong but magic maybe right, the offset do not have CRC.
      
      Badness at c00c7580 [verbose debug info unavailable]
      NIP: c00c7580 LR: c00c718c CTR: 00000014
      REGS: df07bb40 TRAP: 0700   Not tainted  (2.6.34.13-WR4.3.0.0_standard)
      MSR: 00029000 <EE,ME,CE>  CR: 22084f84  XER: 00000000
      TASK = df84d6e0[908] 'mount' THREAD: df07a000
      GPR00: 00000001 df07bbf0 df84d6e0 00000000 00000001 00000000 df07bb58 00000041
      GPR08: 00000041 c0638860 00000000 00000010 22084f88 100636c8 df814ff8 00000000
      GPR16: df84d6e0 dfa558cc c05adb90 00000048 c0452d30 00000000 000240d0 000040d0
      GPR24: 00000014 c05ae734 c05be2e0 00000000 00000001 00000000 00000000 c05ae730
      NIP [c00c7580] __alloc_pages_nodemask+0x4d0/0x638
      LR [c00c718c] __alloc_pages_nodemask+0xdc/0x638
      Call Trace:
      [df07bbf0] [c00c718c] __alloc_pages_nodemask+0xdc/0x638 (unreliable)
      [df07bc90] [c00c7708] __get_free_pages+0x20/0x48
      [df07bca0] [c00f4a40] __kmalloc+0x15c/0x1ec
      [df07bcd0] [c01fc880] jffs2_scan_medium+0xa58/0x14d0
      [df07bd70] [c01ff38c] jffs2_do_mount_fs+0x1f4/0x6b4
      [df07bdb0] [c020144c] jffs2_do_fill_super+0xa8/0x260
      [df07bdd0] [c020230c] jffs2_fill_super+0x104/0x184
      [df07be00] [c0335814] get_sb_mtd_aux+0x9c/0xec
      [df07be20] [c033596c] get_sb_mtd+0x84/0x1e8
      [df07be60] [c0201ed0] jffs2_get_sb+0x1c/0x2c
      [df07be70] [c0103898] vfs_kern_mount+0x78/0x1e8
      [df07bea0] [c0103a58] do_kern_mount+0x40/0x100
      [df07bec0] [c011fe90] do_mount+0x240/0x890
      [df07bf10] [c0120570] sys_mount+0x90/0xd8
      [df07bf40] [c00110d8] ret_from_syscall+0x0/0x4
      
      === Exception: c01 at 0xff61a34
          LR = 0x100135f0
      Instruction dump:
      38800005 38600000 48010f41 4bfffe1c 4bfc2d15 4bfffe8c 72e90200 4082fc28
      3d20c064 39298860 8809000d 68000001 <0f000000> 2f800000 419efc0c 38000001
      mount: mounting /dev/mtdblock3 on /common failed: Input/output error
      Signed-off-by: default avatarChen Jie <chenjie6@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b82eaa1b
    • Daniel J Blueman's avatar
      EDAC, amd64_edac: Prevent OOPS with >16 memory controllers · a36a2517
      Daniel J Blueman authored
      commit 0c510cc8 upstream.
      
      When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
      the kernel fatally dereferences unallocated structures, see splat below;
      this occurs on at least NumaConnect systems.
      
      Fix by checking if a memory controller info structure was found.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
      IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
      Oops: 0000 [#2] SMP
      Modules linked in:
      CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
      Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
      task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
      RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
      RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
      RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
      RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
      R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
      R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
      FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
      Stack:
       0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
       000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
       ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
      Call Trace:
       <IRQ>
       ? vprintk_default
       ? printk
       amd_decode_mce
       notifier_call_chain
       atomic_notifier_call_chain
       mce_log
       machine_check_poll
       mce_timer_fn
       ? mce_cpu_restart
       call_timer_fn.isra.29
       run_timer_softirq
       __do_softirq
       irq_exit
       smp_apic_timer_interrupt
       apic_timer_interrupt
       <EOI>
       ? down_read_trylock
       __do_page_fault
       ? __schedule
       do_page_fault
       page_fault
      Signed-off-by: default avatarDaniel J Blueman <daniel@numascale.com>
      Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
      [ Boris: massage commit message ]
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a36a2517
    • Tomáš Hodek's avatar
      md/raid1: fix read balance when a drive is write-mostly. · 1e3240b0
      Tomáš Hodek authored
      commit d1901ef0 upstream.
      
      When a drive is marked write-mostly it should only be the
      target of reads if there is no other option.
      
      This behaviour was broken by
      
      commit 9dedf603
          md/raid1: read balance chooses idlest disk for SSD
      
      which causes a write-mostly device to be *preferred* is some cases.
      
      Restore correct behaviour by checking and setting
      best_dist_disk and best_pending_disk rather than best_disk.
      
      We only need to test one of these as they are both changed
      from -1 or >=0 at the same time.
      
      As we leave min_pending and best_dist unchanged, any non-write-mostly
      device will appear better than the write-mostly device.
      Reported-by: default avatarTomáš Hodek <tomas.hodek@volny.cz>
      Reported-by: default avatarDark Penguin <darkpenguin@yandex.ru>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Link: http://marc.info/?l=linux-raid&m=135982797322422
      Fixes: 9dedf603Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e3240b0
    • NeilBrown's avatar
      md/raid5: Fix livelock when array is both resyncing and degraded. · 0ed85c57
      NeilBrown authored
      commit 26ac1073 upstream.
      
      Commit a7854487:
        md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
      
      Causes an RCW cycle to be forced even when the array is degraded.
      A degraded array cannot support RCW as that requires reading all data
      blocks, and one may be missing.
      
      Forcing an RCW when it is not possible causes a live-lock and the code
      spins, repeatedly deciding to do something that cannot succeed.
      
      So change the condition to only force RCW on non-degraded arrays.
      Reported-by: default avatarManibalan P <pmanibalan@amiindia.co.in>
      Bisected-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Tested-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Fixes: a7854487Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ed85c57
    • James Hogan's avatar
      metag: Fix KSTK_EIP() and KSTK_ESP() macros · 6318c31d
      James Hogan authored
      commit c2996cb2 upstream.
      
      The KSTK_EIP() and KSTK_ESP() macros should return the user program
      counter (PC) and stack pointer (A0StP) of the given task. These are used
      to determine which VMA corresponds to the user stack in
      /proc/<pid>/maps, and for the user PC & A0StP in /proc/<pid>/stat.
      
      However for Meta the PC & A0StP from the task's kernel context are used,
      resulting in broken output. For example in following /proc/<pid>/maps
      output, the 3afff000-3b021000 VMA should be described as the stack:
      
        # cat /proc/self/maps
        ...
        100b0000-100b1000 rwxp 00000000 00:00 0          [heap]
        3afff000-3b021000 rwxp 00000000 00:00 0
      
      And in the following /proc/<pid>/stat output, the PC is in kernel code
      (1074234964 = 0x40078654) and the A0StP is in the kernel heap
      (1335981392 = 0x4fa17550):
      
        # cat /proc/self/stat
        51 (cat) R ... 1335981392 1074234964 ...
      
      Fix the definitions of KSTK_EIP() and KSTK_ESP() to use
      task_pt_regs(tsk)->ctx rather than (tsk)->thread.kernel_context. This
      gets the registers from the user context stored after the thread info at
      the base of the kernel stack, which is from the last entry into the
      kernel from userland, regardless of where in the kernel the task may
      have been interrupted, which results in the following more correct
      /proc/<pid>/maps output:
      
        # cat /proc/self/maps
        ...
        0800b000-08070000 r-xp 00000000 00:02 207        /lib/libuClibc-0.9.34-git.so
        ...
        100b0000-100b1000 rwxp 00000000 00:00 0          [heap]
        3afff000-3b021000 rwxp 00000000 00:00 0          [stack]
      
      And /proc/<pid>/stat now correctly reports the PC in libuClibc
      (134320308 = 0x80190b4) and the A0StP in the [stack] region (989864576 =
      0x3b002280):
      
        # cat /proc/self/stat
        51 (cat) R ... 989864576 134320308 ...
      Reported-by: default avatarAlexey Brodkin <Alexey.Brodkin@synopsys.com>
      Reported-by: default avatarVineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: linux-metag@vger.kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6318c31d
    • Jan Kara's avatar
      xfs: Fix quota type in quota structures when reusing quota file · 6d135a3d
      Jan Kara authored
      commit dfcc70a8 upstream.
      
      For filesystems without separate project quota inode field in the
      superblock we just reuse project quota file for group quotas (and vice
      versa) if project quota file is allocated and we need group quota file.
      When we reuse the file, quota structures on disk suddenly have wrong
      type stored in d_flags though. Nobody really cares about this (although
      structure type reported to userspace was wrong as well) except
      that after commit 14bf61ff (quota: Switch ->get_dqblk() and
      ->set_dqblk() to use bytes as space units) assertion in
      xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
      was testing without XFS_DEBUG so I didn't notice when submitting the
      above commit).
      
      Fix the problem by properly resetting ddq->d_flags when running quotacheck
      for a quota file.
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d135a3d
    • Nicolas Saenz Julienne's avatar
      gpio: tps65912: fix wrong container_of arguments · 2c80f141
      Nicolas Saenz Julienne authored
      commit 2f97c20e upstream.
      
      The gpio_chip operations receive a pointer the gpio_chip struct which is
      contained in the driver's private struct, yet the container_of call in those
      functions point to the mfd struct defined in include/linux/mfd/tps65912.h.
      Signed-off-by: default avatarNicolas Saenz Julienne <nicolassaenzj@gmail.com>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c80f141
    • Hans Holmberg's avatar
      gpiolib: of: allow of_gpiochip_find_and_xlate to find more than one chip per node · e2d7fdcf
      Hans Holmberg authored
      commit 9cf75e9e upstream.
      
      The change:
      
      7b8792bb
      gpiolib: of: Correct error handling in of_get_named_gpiod_flags
      
      assumed that only one gpio-chip is registred per of-node.
      Some drivers register more than one chip per of-node, so
      adjust the matching function of_gpiochip_find_and_xlate to
      not stop looking for chips if a node-match is found and
      the translation fails.
      
      Fixes: 7b8792bb ("gpiolib: of: Correct error handling in of_get_named_gpiod_flags")
      Signed-off-by: default avatarHans Holmberg <hans.holmberg@intel.com>
      Acked-by: default avatarAlexandre Courbot <acourbot@nvidia.com>
      Tested-by: default avatarRobert Jarzmik <robert.jarzmik@free.fr>
      Tested-by: default avatarTyler Hall <tylerwhall@gmail.com>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2d7fdcf
    • Catalin Marinas's avatar
      arm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian · 676b72e2
      Catalin Marinas authored
      commit 9d42d48a upstream.
      
      The native (64-bit) sigval_t union contains sival_int (32-bit) and
      sival_ptr (64-bit). When a compat application invokes a syscall that
      takes a sigval_t value (as part of a larger structure, e.g.
      compat_sys_mq_notify, compat_sys_timer_create), the compat_sigval_t
      union is converted to the native sigval_t with sival_int overlapping
      with either the least or the most significant half of sival_ptr,
      depending on endianness. When the corresponding signal is delivered to a
      compat application, on big endian the current (compat_uptr_t)sival_ptr
      cast always returns 0 since sival_int corresponds to the top part of
      sival_ptr. This patch fixes copy_siginfo_to_user32() so that sival_int
      is copied to the compat_siginfo_t structure.
      Reported-by: default avatarBamvor Jian Zhang <bamvor.zhangjian@huawei.com>
      Tested-by: default avatarBamvor Jian Zhang <bamvor.zhangjian@huawei.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      676b72e2
    • Martin Vajnar's avatar
      hx4700: regulator: declare full constraints · f8226d02
      Martin Vajnar authored
      commit a52d2093 upstream.
      
      Since the removal of CONFIG_REGULATOR_DUMMY option, the touchscreen stopped
      working. This patch enables the "replacement" for REGULATOR_DUMMY and
      allows the touchscreen to work even though there is no regulator for "vcc".
      Signed-off-by: default avatarMartin Vajnar <martin.vajnar@gmail.com>
      Signed-off-by: default avatarRobert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8226d02
    • Marcelo Tosatti's avatar
      KVM: x86: update masterclock values on TSC writes · cd1fcaf5
      Marcelo Tosatti authored
      commit 7f187922 upstream.
      
      When the guest writes to the TSC, the masterclock TSC copy must be
      updated as well along with the TSC_OFFSET update, otherwise a negative
      tsc_timestamp is calculated at kvm_guest_time_update.
      
      Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
      "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
      becomes redundant, so remove the do_request boolean and collapse
      everything into a single condition.
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd1fcaf5
    • James Hogan's avatar
      KVM: MIPS: Don't leak FPU/DSP to guest · 623347ab
      James Hogan authored
      commit f798217d upstream.
      
      The FPU and DSP are enabled via the CP0 Status CU1 and MX bits by
      kvm_mips_set_c0_status() on a guest exit, presumably in case there is
      active state that needs saving if pre-emption occurs. However neither of
      these bits are cleared again when returning to the guest.
      
      This effectively gives the guest access to the FPU/DSP hardware after
      the first guest exit even though it is not aware of its presence,
      allowing FP instructions in guest user code to intermittently actually
      execute instead of trapping into the guest OS for emulation. It will
      then read & manipulate the hardware FP registers which technically
      belong to the user process (e.g. QEMU), or are stale from another user
      process. It can also crash the guest OS by causing an FP exception, for
      which a guest exception handler won't have been registered.
      
      First lets save and disable the FPU (and MSA) state with lose_fpu(1)
      before entering the guest. This simplifies the problem, especially for
      when guest FPU/MSA support is added in the future, and prevents FR=1 FPU
      state being live when the FR bit gets cleared for the guest, which
      according to the architecture causes the contents of the FPU and vector
      registers to become UNPREDICTABLE.
      
      We can then safely remove the enabling of the FPU in
      kvm_mips_set_c0_status(), since there should never be any active FPU or
      MSA state to save at pre-emption, which should plug the FPU leak.
      
      DSP state is always live rather than being lazily restored, so for that
      it is simpler to just clear the MX bit again when re-entering the guest.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Sanjay Lal <sanjayl@kymasys.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: kvm@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: <stable@vger.kernel.org> # v3.10+: 044f0f03: MIPS: KVM: Deliver guest interrupts
      Cc: <stable@vger.kernel.org> # v3.10+
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      623347ab
    • Alexey Brodkin's avatar
      ARC: fix page address calculation if PAGE_OFFSET != LINUX_LINK_BASE · 645177f3
      Alexey Brodkin authored
      commit 06f34e1c upstream.
      
      We used to calculate page address differently in 2 cases:
      
      1. In virt_to_page(x) we do
       --->8---
       mem_map + (x - CONFIG_LINUX_LINK_BASE) >> PAGE_SHIFT
       --->8---
      
      2. In in pte_page(x) we do
       --->8---
       mem_map + (pte_val(x) - PAGE_OFFSET) >> PAGE_SHIFT
       --->8---
      
      That leads to problems in case PAGE_OFFSET != CONFIG_LINUX_LINK_BASE -
      different pages will be selected depending on where and how we calculate
      page address.
      
      In particular in the STAR 9000853582 when gdb attempted to read memory
      of another process it got improper page in get_user_pages() because this
      is exactly one of the places where we search for a page by pte_page().
      
      The fix is trivial - we need to calculate page address similarly in both
      cases.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      645177f3
    • John Stultz's avatar
      ntp: Fixup adjtimex freq validation on 32-bit systems · 5e2b0b3f
      John Stultz authored
      commit 29183a70 upstream.
      
      Additional validation of adjtimex freq values to avoid
      potential multiplication overflows were added in commit
      5e5aeb43 (time: adjtimex: Validate the ADJ_FREQUENCY values)
      
      Unfortunately the patch used LONG_MAX/MIN instead of
      LLONG_MAX/MIN, which was fine on 64-bit systems, but being
      much smaller on 32-bit systems caused false positives
      resulting in most direct frequency adjustments to fail w/
      EINVAL.
      
      ntpd only does direct frequency adjustments at startup, so
      the issue was not as easily observed there, but other time
      sync applications like ptpd and chrony were more effected by
      the bug.
      
      See bugs:
      
        https://bugzilla.kernel.org/show_bug.cgi?id=92481
        https://bugzilla.redhat.com/show_bug.cgi?id=1188074
      
      This patch changes the checks to use LLONG_MAX for
      clarity, and additionally the checks are disabled
      on 32-bit systems since LLONG_MAX/PPM_SCALE is always
      larger then the 32-bit long freq value, so multiplication
      overflows aren't possible there.
      Reported-by: default avatarJosh Boyer <jwboyer@fedoraproject.org>
      Reported-by: default avatarGeorge Joseph <george.joseph@fairview5.com>
      Tested-by: default avatarGeorge Joseph <george.joseph@fairview5.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org
      [ Prettified the changelog and the comments a bit. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e2b0b3f
    • Jay Lan's avatar
      kdb: fix incorrect counts in KDB summary command output · 519586df
      Jay Lan authored
      commit 14675592 upstream.
      
      The output of KDB 'summary' command should report MemTotal, MemFree
      and Buffers output in kB. Current codes report in unit of pages.
      
      A define of K(x) as
      is defined in the code, but not used.
      
      This patch would apply the define to convert the values to kB.
      Please include me on Cc on replies. I do not subscribe to linux-kernel.
      Signed-off-by: default avatarJay Lan <jlan@sgi.com>
      Signed-off-by: default avatarJason Wessel <jason.wessel@windriver.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      519586df