1. 03 Feb, 2018 10 commits
    • Roman Gushchin's avatar
      Revert "defer call to mem_cgroup_sk_alloc()" · edbe69ef
      Roman Gushchin authored
      This patch effectively reverts commit 9f1c2674 ("net: memcontrol:
      defer call to mem_cgroup_sk_alloc()").
      
      Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
      memcg socket memory accounting, as packets received before memcg
      pointer initialization are not accounted and are causing refcounting
      underflow on socket release.
      
      Actually the free-after-use problem was fixed by
      commit c0576e39 ("net: call cgroup_sk_alloc() earlier in
      sk_clone_lock()") for the cgroup pointer.
      
      So, let's revert it and call mem_cgroup_sk_alloc() just before
      cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
      we're cloning, and it holds a reference to the memcg.
      
      Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
      mem_cgroup_sk_alloc(). I see no reasons why bumping the root
      memcg counter is a good reason to panic, and there are no realistic
      ways to hit it.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      edbe69ef
    • Eric Dumazet's avatar
      soreuseport: fix mem leak in reuseport_add_sock() · 4db428a7
      Eric Dumazet authored
      reuseport_add_sock() needs to deal with attaching a socket having
      its own sk_reuseport_cb, after a prior
      setsockopt(SO_ATTACH_REUSEPORT_?BPF)
      
      Without this fix, not only a WARN_ONCE() was issued, but we were also
      leaking memory.
      
      Thanks to sysbot and Eric Biggers for providing us nice C repros.
      
      ------------[ cut here ]------------
      socket already in reuseport group
      WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119  
      reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS  
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x194/0x257 lib/dump_stack.c:53
        panic+0x1e4/0x41c kernel/panic.c:183
        __warn+0x1dc/0x200 kernel/panic.c:547
        report_bug+0x211/0x2d0 lib/bug.c:184
        fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
        fixup_bug arch/x86/kernel/traps.c:247 [inline]
        do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
        do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
        invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
      
      Fixes: ef456144 ("soreuseport: define reuseport groups")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: syzbot+c0ea2226f77a42936bf7@syzkaller.appspotmail.com
      Acked-by: default avatarCraig Gallek <kraig@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4db428a7
    • Arnd Bergmann's avatar
      net: qlge: use memmove instead of skb_copy_to_linear_data · cfabb177
      Arnd Bergmann authored
      gcc-8 points out that the skb_copy_to_linear_data() argument points to
      the skb itself, which makes it run into a problem with overlapping
      memcpy arguments:
      
      In file included from include/linux/ip.h:20,
                       from drivers/net/ethernet/qlogic/qlge/qlge_main.c:26:
      drivers/net/ethernet/qlogic/qlge/qlge_main.c: In function 'ql_realign_skb':
      include/linux/skbuff.h:3378:2: error: 'memcpy' source argument is the same as destination [-Werror=restrict]
        memcpy(skb->data, from, len);
      
      It's unclear to me what the best solution is, maybe it ought to use a
      different helper that adjusts the skb data in a safe way. Simply using
      memmove() here seems like the easiest workaround.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cfabb177
    • Arnd Bergmann's avatar
      net: qed: use correct strncpy() size · 11f71108
      Arnd Bergmann authored
      passing the strlen() of the source string as the destination
      length is pointless, and gcc-8 now warns about it:
      
      drivers/net/ethernet/qlogic/qed/qed_debug.c: In function 'qed_grc_dump':
      include/linux/string.h:253: error: 'strncpy' specified bound depends on the length of the source argument [-Werror=stringop-overflow=]
      
      This changes qed_grc_dump_big_ram() to instead uses the length of
      the destination buffer, and use strscpy() to guarantee nul-termination.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11f71108
    • Arnd Bergmann's avatar
      net: cxgb4: avoid memcpy beyond end of source buffer · 1a91649f
      Arnd Bergmann authored
      Building with link-time-optimizations revealed that the cxgb4 driver does
      a fixed-size memcpy() from a variable-length constant string into the
      network interface name:
      
      In function 'memcpy',
          inlined from 'cfg_queues_uld.constprop' at drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:335:2,
          inlined from 'cxgb4_register_uld.constprop' at drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c:719:9:
      include/linux/string.h:350:3: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
         __read_overflow2();
         ^
      
      I can see two equally workable solutions: either we use a strncpy() instead
      of the memcpy() to stop at the end of the input, or we make the source buffer
      fixed length as well. This implements the latter.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1a91649f
    • Paolo Abeni's avatar
      cls_u32: add missing RCU annotation. · 058a6c03
      Paolo Abeni authored
      In a couple of points of the control path, n->ht_down is currently
      accessed without the required RCU annotation. The accesses are
      safe, but sparse complaints. Since we already held the
      rtnl lock, let use rtnl_dereference().
      
      Fixes: a1b7c5fd ("net: sched: add cls_u32 offload hooks for netdevs")
      Fixes: de5df632 ("net: sched: cls_u32 changes to knode must appear atomic to readers")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      058a6c03
    • David S. Miller's avatar
      Merge branch 'r8152-fix-rx-issues' · 0072f0c4
      David S. Miller authored
      Hayes Wang says:
      
      ====================
      r8152: fix rx issues
      
      The two patched are used to fix rx issues.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0072f0c4
    • Hayes Wang's avatar
      r8152: set rx mode early when linking on · aece4770
      Hayes Wang authored
      Set rx mode before calling netif_wake_queue() when linking on to avoid
      the device missing the receiving packets.
      
      The transmission may start after calling netif_wake_queue(), and the
      packets of resopnse may reach before calling rtl8152_set_rx_mode()
      which let the device could receive packets. Then, the packets of
      response would be missed.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aece4770
    • Hayes Wang's avatar
      r8152: fix wrong checksum status for received IPv4 packets · ea6499e1
      Hayes Wang authored
      The device could only check the checksum of TCP and UDP packets. Therefore,
      for the IPv4 packets excluding TCP and UDP, the check of checksum is necessary,
      even though the IP checksum is correct.
      
      Take ICMP for example, The IP checksum may be correct, but the ICMP checksum
      may be wrong.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ea6499e1
    • Edwin Peer's avatar
      nfp: fix TLV offset calculation · 1d8ef0c0
      Edwin Peer authored
      The data pointer in the config space TLV parser already includes
      NFP_NET_CFG_TLV_BASE, it should not be added again. Incorrect
      offset values were only used in printed user output, rendering
      the bug merely cosmetic.
      
      Fixes: 73a0329b ("nfp: add TLV capabilities to the BAR")
      Signed-off-by: default avatarEdwin Peer <edwin.peer@netronome.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1d8ef0c0
  2. 01 Feb, 2018 30 commits
    • Alexander Monakov's avatar
      net: pxa168_eth: add netconsole support · 743ffffe
      Alexander Monakov authored
      This implements ndo_poll_controller callback which is necessary to
      enable netconsole.
      Signed-off-by: default avatarAlexander Monakov <amonakov@ispras.ru>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      743ffffe
    • Eric Dumazet's avatar
      net: igmp: add a missing rcu locking section · e7aadb27
      Eric Dumazet authored
      Newly added igmpv3_get_srcaddr() needs to be called under rcu lock.
      
      Timer callbacks do not ensure this locking.
      
      =============================
      WARNING: suspicious RCU usage
      4.15.0+ #200 Not tainted
      -----------------------------
      ./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      3 locks held by syzkaller616973/4074:
       #0:  (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
       #1:  ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
       #2:  (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600
      
      stack backtrace:
      CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
       __in_dev_get_rcu include/linux/inetdevice.h:216 [inline]
       igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline]
       igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389
       add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432
       add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565
       igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605
       igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722
       igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831
       call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:541 [inline]
       smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938
      
      Fixes: a46182b0 ("net: igmp: Use correct source address on IGMPv3 reports")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7aadb27
    • Desnes Augusto Nunes do Rosario's avatar
      ibmvnic: fix firmware version when no firmware level has been provided by the VIOS server · a107311d
      Desnes Augusto Nunes do Rosario authored
      Older versions of VIOS servers do not send the firmware level in the VPD
      buffer for the ibmvnic driver. Thus, not only the current message is mis-
      leading but the firmware version in the ethtool will be NULL. Therefore,
      this patch fixes the firmware string and its warning.
      
      Fixes: 4e6759be ("ibmvnic: Feature implementation of VPD for the ibmvnic driver")
      Signed-off-by: default avatarDesnes A. Nunes do Rosario <desnesn@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a107311d
    • Colin Ian King's avatar
      vmxnet3: remove redundant initialization of pointer 'rq' · 5e264e2b
      Colin Ian King authored
      Pointer rq is being initialized but this value is never read, it
      is being updated inside a for-loop. Remove the initialization and
      move it into the scope of the for-loop.
      
      Cleans up clang warning:
      drivers/net/vmxnet3/vmxnet3_drv.c:2763:27: warning: Value stored
      to 'rq' during its initialization is never read
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Acked-by: default avatarShrikrishna Khare <skhare@vmware.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e264e2b
    • Colin Ian King's avatar
      lan78xx: remove redundant initialization of pointer 'phydev' · 3b51cc75
      Colin Ian King authored
      Pointer phydev is initialized and this value is never read, phydev
      is immediately updated to a new value, hence this initialization
      is redundant and can be removed
      
      Cleans up clang warning:
      drivers/net/usb/lan78xx.c:2009:21: warning: Value stored to 'phydev'
      during its initialization is never read
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b51cc75
    • Colin Ian King's avatar
      net: jme: remove unused initialization of 'rxdesc' · f14d244f
      Colin Ian King authored
      Pointer rxdesc is assigned a value that is never read, it is overwritten
      by a new assignment inside a while loop hence the initial assignment
      is redundant and can be removed.
      
      Cleans up clang warning:
      drivers/net/ethernet/jme.c:1074:17: warning: Value stored to 'rxdesc'
      during its initialization is never read
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f14d244f
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · b9a40729
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for your net tree,
      they are:
      
      1) Fix OOM that syskaller triggers with ipt_replace.size = -1 and
         IPT_SO_SET_REPLACE socket option, from Dmitry Vyukov.
      
      2) Check for too long extension name in xt_request_find_{match|target}
         that result in out-of-bound reads, from Eric Dumazet.
      
      3) Fix memory exhaustion bug in ipset hash:*net* types when adding ranges
         that look like x.x.x.x-255.255.255.255, from Jozsef Kadlecsik.
      
      4) Fix pointer leaks to userspace in x_tables, from Dmitry Vyukov.
      
      5) Insufficient sanity checks in clusterip_tg_check(), also from Dmitry.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b9a40729
    • Christian Brauner's avatar
      rtnetlink: remove check for IFLA_IF_NETNSID · 7973bfd8
      Christian Brauner authored
      RTM_NEWLINK supports the IFLA_IF_NETNSID property since
      5bb8ed07 so we should not error out
      when it is passed.
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7973bfd8
    • Jiri Pirko's avatar
      rocker: fix possible null pointer dereference in rocker_router_fib_event_work · a83165f0
      Jiri Pirko authored
      Currently, rocker user may experience following null pointer
      derefence bug:
      
      [    3.062141] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
      [    3.065163] IP: rocker_router_fib_event_work+0x36/0x110 [rocker]
      
      The problem is uninitialized rocker->wops pointer that is initialized
      only with the first initialized port. So move the port initialization
      before registering the fib events.
      
      Fixes: 936bd486 ("rocker: use FIB notifications instead of switchdev calls")
      Signed-off-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a83165f0
    • Geert Uytterhoeven's avatar
      inet: Avoid unitialized variable warning in inet_unhash() · 0ba98718
      Geert Uytterhoeven authored
      With gcc-4.1.2:
      
          net/ipv4/inet_hashtables.c: In function ‘inet_unhash’:
          net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function
      
      While this is a false positive, it can easily be avoided by using the
      pointer itself as the canary variable.
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0ba98718
    • Geert Uytterhoeven's avatar
      net: bridge: Fix uninitialized error in br_fdb_sync_static() · 367dc658
      Geert Uytterhoeven authored
      With gcc-4.1.2.:
      
          net/bridge/br_fdb.c: In function ‘br_fdb_sync_static’:
          net/bridge/br_fdb.c:996: warning: ‘err’ may be used uninitialized in this function
      
      Indeed, if the list is empty, err will be uninitialized, and will be
      propagated up as the function return value.
      
      Fix this by preinitializing err to zero.
      
      Fixes: eb793583 ("net: bridge: use rhashtable for fdbs")
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      367dc658
    • Ed Swierk's avatar
      openvswitch: Remove padding from packet before L3+ conntrack processing · 9382fe71
      Ed Swierk authored
      IPv4 and IPv6 packets may arrive with lower-layer padding that is not
      included in the L3 length. For example, a short IPv4 packet may have
      up to 6 bytes of padding following the IP payload when received on an
      Ethernet device with a minimum packet length of 64 bytes.
      
      Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
      and help() in nf_conntrack_ftp) assume skb->len reflects the length of
      the L3 header and payload, rather than referring back to
      ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
      lower-layer padding.
      
      In the normal IPv4 receive path, ip_rcv() trims the packet to
      ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
      path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
      in the br_netfilter receive path, br_validate_ipv4() and
      br_validate_ipv6() trim the packet to the L3 length before invoking
      netfilter hooks.
      
      Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
      the skb to the L3 header but does not trim it to the L3 length before
      calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
      nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
      nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
      message. While extra zero bytes don't affect the checksum, the length
      in the IP pseudoheader does. That length is based on skb->len, and
      without trimming, it doesn't match the length the sender used when
      computing the checksum.
      
      In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
      processing.
      Signed-off-by: default avatarEd Swierk <eswierk@skyportsystems.com>
      Acked-by: default avatarPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9382fe71
    • Neal Cardwell's avatar
      tcp_bbr: fix pacing_gain to always be unity when using lt_bw · 3aff3b4b
      Neal Cardwell authored
      This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
      using lt_bw and returning from the PROBE_RTT state to PROBE_BW.
      
      Previously, when using lt_bw, upon exiting PROBE_RTT and entering
      PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
      end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
      set a pacing gain above 1.0. In such cases this would result in a
      pacing rate that is 1.25x higher than intended, potentially resulting
      in a high loss rate for a little while until we stop using the lt_bw a
      bit later.
      
      This commit is a stable candidate for kernels back as far as 4.9.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Reported-by: default avatarBeyers Cronje <bcronje@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3aff3b4b
    • Colin Ian King's avatar
      be2net: remove redundant initialization of 'head' and pointer txq · 2e85283d
      Colin Ian King authored
      Variable head is initialized to a value that is never read and is
      being updated to a new value a few lines later, hence this
      initialization is redundant and can be safely removed as well
      as the now unused pointer txq.
      
      Cleans up clang warning:
      drivers/net/ethernet/emulex/benet/be_main.c:996:6: warning: Value
      stored to 'head' during its initialization is never read
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2e85283d
    • David S. Miller's avatar
      Merge branch 'bnx2x-disable-GSO-on-too-large-packets' · 26c26ab0
      David S. Miller authored
      Daniel Axtens says:
      
      ====================
      bnx2x: disable GSO on too-large packets
      
      We observed a case where a packet received on an ibmveth device had a
      GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
      device, where it caused a firmware assert. This is described in detail
      at [0].
      
      Ultimately we want a fix in the core, but that is very tricky to
      backport. So for now, just stop the bnx2x driver from crashing.
      
      When net-next re-opens I will send the fix to the core and a revert
      for this.
      
      v4 changes:
        - fix compilation error with EXPORTs (patch 1)
        - only do slow test if gso_size is greater than 9000 bytes (patch 2)
      
      Thanks,
      Daniel
      
      [0]: https://patchwork.ozlabs.org/patch/859410/
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26c26ab0
    • Daniel Axtens's avatar
      bnx2x: disable GSO where gso_size is too big for hardware · 8914a595
      Daniel Axtens authored
      If a bnx2x card is passed a GSO packet with a gso_size larger than
      ~9700 bytes, it will cause a firmware error that will bring the card
      down:
      
      bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
      bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
      bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x00000000 0x25e43e47 0x00463e01 0x00010052
      bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
      ... (dump of values continues) ...
      
      Detect when the mac length of a GSO packet is greater than the maximum
      packet size (9700 bytes) and disable GSO.
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8914a595
    • Daniel Axtens's avatar
      net: create skb_gso_validate_mac_len() · 2b16f048
      Daniel Axtens authored
      If you take a GSO skb, and split it into packets, will the MAC
      length (L2 + L3 + L4 headers + payload) of those packets be small
      enough to fit within a given length?
      
      Move skb_gso_mac_seglen() to skbuff.h with other related functions
      like skb_gso_network_seglen() so we can use it, and then create
      skb_gso_validate_mac_len to do the full calculation.
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2b16f048
    • Linus Torvalds's avatar
      Merge tag 'docs-4.16' of git://git.lwn.net/linux · 255442c9
      Linus Torvalds authored
      Pull documentation updates from Jonathan Corbet:
       "Documentation updates for 4.16.
      
        New stuff includes refcount_t documentation, errseq documentation,
        kernel-doc support for nested structure definitions, the removal of
        lots of crufty kernel-doc support for unused formats, SPDX tag
        documentation, the beginnings of a manual for subsystem maintainers,
        and lots of fixes and updates.
      
        As usual, some of the changesets reach outside of Documentation/ to
        effect kerneldoc comment fixes. It also adds the new LICENSES
        directory, of which Thomas promises I do not need to be the
        maintainer"
      
      * tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
        linux-next: docs-rst: Fix typos in kfigure.py
        linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
        Documentation: Fix misconversion of #if
        docs: add index entry for networking/msg_zerocopy
        Documentation: security/credentials.rst: explain need to sort group_list
        LICENSES: Add MPL-1.1 license
        LICENSES: Add the GPL 1.0 license
        LICENSES: Add Linux syscall note exception
        LICENSES: Add the MIT license
        LICENSES: Add the BSD-3-clause "Clear" license
        LICENSES: Add the BSD 3-clause "New" or "Revised" License
        LICENSES: Add the BSD 2-clause "Simplified" license
        LICENSES: Add the LGPL-2.1 license
        LICENSES: Add the LGPL 2.0 license
        LICENSES: Add the GPL 2.0 license
        Documentation: Add license-rules.rst to describe how to properly identify file licenses
        scripts: kernel_doc: better handle show warnings logic
        fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
        doc: md: Fix a file name to md-fault.c in fault-injection.txt
        errseq: Add to documentation tree
        ...
      255442c9
    • Linus Torvalds's avatar
      Merge branch 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · d76e0a05
      Linus Torvalds authored
      Pull vmci iov_iter updates from Al Viro:
       "Get rid of "is it an iovec or an entire array?" flags in vmxi - just
        use iov_iter. Simplifies the living hell out of that code..."
      
      * 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        vmci: the same on the send side...
        vmci: simplify qp_dequeue_locked()
        vmci: get rid of qp_memcpy_from_queue()
        vmci: fix buf_size in case of iovec-based accesses
      d76e0a05
    • Linus Torvalds's avatar
      Merge branch 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 40b9672a
      Linus Torvalds authored
      Pull asm/uaccess.h whack-a-mole from Al Viro:
       "It's linux/uaccess.h, damnit... Oh, well - eventually they'll stop
        cropping up..."
      
      * 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        asm-prototypes.h: use linux/uaccess.h, not asm/uaccess.h
        riscv: use linux/uaccess.h, not asm/uaccess.h...
        ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
      40b9672a
    • Linus Torvalds's avatar
      Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · dc1efc3c
      Linus Torvalds authored
      Pull dcache updates from Al Viro:
       "Neil Brown's d_move()/d_path() race fix"
      
      * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        VFS: close race between getcwd() and d_move()
      dc1efc3c
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 73da9e1a
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - misc fixes
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
        mm: remove PG_highmem description
        tools, vm: new option to specify kpageflags file
        mm/swap.c: make functions and their kernel-doc agree
        mm, memory_hotplug: fix memmap initialization
        mm: correct comments regarding do_fault_around()
        mm: numa: do not trap faults on shared data section pages.
        hugetlb, mbind: fall back to default policy if vma is NULL
        hugetlb, mempolicy: fix the mbind hugetlb migration
        mm, hugetlb: further simplify hugetlb allocation API
        mm, hugetlb: get rid of surplus page accounting tricks
        mm, hugetlb: do not rely on overcommit limit during migration
        mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
        mm, hugetlb: unify core page allocation accounting and initialization
        mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
        mm/memcontrol.c: make local symbol static
        mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
        include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
        mm/compaction.c: fix comment for try_to_compact_pages()
        mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
        zsmalloc: use U suffix for negative literals being shifted
        ...
      73da9e1a
    • Miles Chen's avatar
      mm: remove PG_highmem description · 3f56a2f8
      Miles Chen authored
      Commit cbe37d09 ("[PATCH] mm: remove PG_highmem") removed PG_highmem
      to save a page flag.  So the description of PG_highmem is no longer
      needed.
      
      Link: http://lkml.kernel.org/r/1517391212-2950-1-git-send-email-miles.chen@mediatek.comSigned-off-by: default avatarMiles Chen <miles.chen@mediatek.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f56a2f8
    • David Rientjes's avatar
      tools, vm: new option to specify kpageflags file · c7905f20
      David Rientjes authored
      page-types currently hardcodes /proc/kpageflags as the file to parse.
      This works when using the tool to examine the state of pageflags on the
      same system, but does not allow storing a snapshot of pageflags at a
      given time to debug issues nor on a different system.
      
      This allows the user to specify a saved version of kpageflags with a new
      page-types -F option.
      
      [akpm@linux-foundation.org: add "filename" to fix usage() string]
      [rientjes@google.com: fix layout]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301840050.140969@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301458180.153857@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7905f20
    • Randy Dunlap's avatar
      mm/swap.c: make functions and their kernel-doc agree · e02a9f04
      Randy Dunlap authored
      Fix some basic kernel-doc notation in mm/swap.c:
      
       - for function lru_cache_add_anon(), make its kernel-doc function name
         match its function name and change colon to hyphen following the
         function name
      
       - for function pagevec_lookup_entries(), change the function parameter
         name from nr_pages to nr_entries since that is more descriptive of
         what the parameter actually is and then it matches the kernel-doc
         comments also
      
      Fix function kernel-doc to match the change in commit 67fd707f:
      
       - drop the kernel-doc notation for @nr_pages from
         pagevec_lookup_range() and correct the function description for that
         change
      
      Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
      Fixes: 67fd707f ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e02a9f04
    • Michal Hocko's avatar
      mm, memory_hotplug: fix memmap initialization · 9bb5a391
      Michal Hocko authored
      Bharata has noticed that onlining a newly added memory doesn't increase
      the total memory, pointing to commit f7f99100 ("mm: stop zeroing
      memory during allocation in vmemmap") as a culprit.  This commit has
      changed the way how the memory for memmaps is initialized and moves it
      from the allocation time to the initialization time.  This works
      properly for the early memmap init path.
      
      It doesn't work for the memory hotplug though because we need to mark
      page as reserved when the sparsemem section is created and later
      initialize it completely during onlining.  memmap_init_zone is called in
      the early stage of onlining.  With the current code it calls
      __init_single_page and as such it clears up the whole stage and
      therefore online_pages_range skips those pages.
      
      Fix this by skipping mm_zero_struct_page in __init_single_page for
      memory hotplug path.  This is quite uggly but unifying both early init
      and memory hotplug init paths is a large project.  Make sure we plug the
      regression at least.
      
      Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Tested-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9bb5a391
    • William Kucharski's avatar
      mm: correct comments regarding do_fault_around() · da391d64
      William Kucharski authored
      There are multiple comments surrounding do_fault_around that memtion
      fault_around_pages() and fault_around_mask(), two routines that do not
      exist.  These comments should be reworded to reference
      fault_around_bytes, the value which is used to determine how much
      do_fault_around() will attempt to read when processing a fault.
      
      These comments should have been updated when fault_around_pages() and
      fault_around_mask() were removed in commit aecd6f44 ("mm: close race
      between do_fault_around() and fault_around_bytes_set()").
      
      Fixes: aecd6f44 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
      Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.comSigned-off-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Reviewed-by: default avatarLarry Bassel <larry.bassel@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da391d64
    • Henry Willard's avatar
      mm: numa: do not trap faults on shared data section pages. · 859d4adc
      Henry Willard authored
      Workloads consisting of a large number of processes running the same
      program with a very large shared data segment may experience performance
      problems when numa balancing attempts to migrate the shared cow pages.
      This manifests itself with many processes or tasks in
      TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
      
      The program listed below simulates the conditions with these results
      when run with 288 processes on a 144 core/8 socket machine.
      
      Average throughput 	Average throughput     Average throughput
      with numa_balancing=0	with numa_balancing=1  with numa_balancing=1
           			without the patch      with the patch
      ---------------------	---------------------  ---------------------
      2118782			2021534		       2107979
      
      Complex production environments show less variability and fewer poorly
      performing outliers accompanied with a smaller number of processes
      waiting on NUMA page migration with this patch applied.  In some cases,
      %iowait drops from 16%-26% to 0.
      
        // SPDX-License-Identifier: GPL-2.0
        /*
         * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
         */
        #include <sys/time.h>
        #include <stdio.h>
        #include <wait.h>
        #include <sys/mman.h>
      
        int a[1000000] = {13};
      
        int  main(int argc, const char **argv)
        {
      	int n = 0;
      	int i;
      	pid_t pid;
      	int stat;
      	int *count_array;
      	int cpu_count = 288;
      	long total = 0;
      
      	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};
      
      	if (argc > 2)
      		cpu_count = atoi(argv[2]);
      
      	count_array = mmap(NULL, cpu_count * sizeof(int),
      			   (PROT_READ|PROT_WRITE),
      			   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);
      
      	if (count_array == MAP_FAILED) {
      		perror("mmap:");
      		return 0;
      	}
      
      	for (i = 0; i < cpu_count; ++i) {
      		pid = fork();
      		if (pid <= 0)
      			break;
      		if ((i & 0xf) == 0)
      			usleep(2);
      	}
      
      	if (pid != 0) {
      		if (i == 0) {
      			perror("fork:");
      			return 0;
      		}
      
      		for (;;) {
      			pid = wait(&stat);
      			if (pid < 0)
      				break;
      		}
      
      		for (i = 0; i < cpu_count; ++i)
      			total += count_array[i];
      
      		printf("Total %ld\n", total);
      		munmap(count_array, cpu_count * sizeof(int));
      		return 0;
      	}
      
      	gettimeofday(&t1, 0);
      	timeradd(&t1, &t2, &t1);
      	while (timercmp(&t2, &t1, <)) {
      		int b = 0;
      		int j;
      
      		for (j = 0; j < 1000000; j++)
      			b += a[j];
      		gettimeofday(&t2, 0);
      		n++;
      	}
      	count_array[i] = n;
      	return 0;
        }
      
      This patch changes change_pte_range() to skip shared copy-on-write pages
      when called from change_prot_numa().
      
      NOTE: change_prot_numa() is nominally called from task_numa_work() and
      queue_pages_test_walk().  task_numa_work() is the auto NUMA balancing
      path, and queue_pages_test_walk() is part of explicit NUMA policy
      management.  However, queue_pages_test_walk() only calls
      change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
      not allowed, so change_prot_numa() is only called from auto NUMA
      balancing.
      
      In the case of explicit NUMA policy management, shared pages are not
      migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
      depends on CAP_SYS_NICE.  Currently, there is no way to pass information
      about MPOL_MF_MOVE_ALL to change_pte_range.  This will have to be fixed
      if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
      migration mode.
      
      task_numa_work() skips the read-only VMAs of programs and shared
      libraries.
      
      Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.comSigned-off-by: default avatarHenry Willard <henry.willard@oracle.com>
      Reviewed-by: default avatarHåkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: default avatarSteve Sistare <steven.sistare@oracle.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      859d4adc
    • Michal Hocko's avatar
      hugetlb, mbind: fall back to default policy if vma is NULL · 389c8178
      Michal Hocko authored
      Dan Carpenter has noticed that mbind migration callback (new_page) can
      get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
      relies on the VMA to get the hstate.  We used to BUG_ON this case but
      the BUG_+ON has been removed recently by "hugetlb, mempolicy: fix the
      mbind hugetlb migration".
      
      The proper way to handle this is to get the hstate from the migrated
      page and rely on huge_node (resp.  get_vma_policy) do the right thing
      with null VMA.  We are currently falling back to the default mempolicy
      in that case which is in line what THP path is doing here.
      
      Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.czSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      389c8178
    • Michal Hocko's avatar
      hugetlb, mempolicy: fix the mbind hugetlb migration · ebd63723
      Michal Hocko authored
      do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
      pages.  alloc_huge_page_noerr uses alloc_huge_page which is a highlevel
      allocation function which has to take care of reserves, overcommit or
      hugetlb cgroup accounting.  None of that is really required for the page
      migration because the new page is only temporal and either will replace
      the original page or it will be dropped.  This is essentially as for
      other migration call paths and there shouldn't be any reason to handle
      mbind in a special way.
      
      The current implementation is even suboptimal because the migration
      might fail just because the hugetlb cgroup limit is reached, or the
      overcommit is saturated.
      
      Fix this by making mbind like other hugetlb migration paths.  Add a new
      migration helper alloc_huge_page_vma as a wrapper around
      alloc_huge_page_nodemask with additional mempolicy handling.
      
      alloc_huge_page_noerr has no more users and it can go.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebd63723