1. 28 Oct, 2015 7 commits
    • amd-xgbe: Fix race between access of desc and desc index · 20986ed8
      Lendacky, Thomas authored
      During Tx cleanup it's still possible for the descriptor data to be
      read ahead of the descriptor index. A memory barrier is required between
      the read of the descriptor index and the start of the Tx cleanup loop.
      This allows a change to a lighter-weight barrier in the Tx transmit
      routine just before updating the current descriptor index.
      
      Since the memory barrier does result in extra overhead on arm64, keep
      the previous change to not chase the current descriptor value. This
      prevents the execution of the barrier for each loop performed.
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
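The ordering described in this commit can be modelled in user space with C11 atomics. This is only a sketch under stated assumptions: `tx_ring`, `ring_xmit` and `ring_clean` are hypothetical names, the hardware side of the ring is left out, and release/acquire ordering stands in for the kernel barriers the driver actually uses.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical ring size, power of two */

struct tx_desc {
    uint32_t len;
    uint32_t own; /* 1 while the slot is in flight */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    atomic_uint cur;    /* next slot the transmit path will fill */
    unsigned int dirty; /* next slot the cleanup path will reclaim */
};

/* Transmit path: write the descriptor first, then publish the new index.
 * No ring-full check is done here; the sketch assumes the caller never
 * queues more than RING_SIZE packets between cleanups. */
static void ring_xmit(struct tx_ring *r, uint32_t len)
{
    unsigned int cur = atomic_load_explicit(&r->cur, memory_order_relaxed);
    struct tx_desc *d = &r->desc[cur & (RING_SIZE - 1)];

    d->len = len;
    d->own = 1;
    /* Release store: the descriptor writes above are visible before the
     * updated index is. */
    atomic_store_explicit(&r->cur, cur + 1, memory_order_release);
}

/* Cleanup path: read the index once, issue one barrier, then loop on the
 * snapshot instead of re-reading (chasing) the live index. */
static unsigned int ring_clean(struct tx_ring *r)
{
    unsigned int cur = atomic_load_explicit(&r->cur, memory_order_relaxed);
    unsigned int cleaned = 0;

    /* Acquire fence: descriptor reads below cannot be satisfied ahead of
     * the index read above. */
    atomic_thread_fence(memory_order_acquire);

    while (r->dirty != cur) {
        struct tx_desc *d = &r->desc[r->dirty & (RING_SIZE - 1)];

        d->own = 0;
        d->len = 0;
        r->dirty++;
        cleaned++;
    }
    return cleaned;
}
```

Because the index is snapshotted on entry, the barrier cost is paid once per cleanup call rather than once per loop iteration, which is the trade-off the message describes for arm64.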
    • RDS-TCP: Recover correctly from pskb_pull()/pskb_trim() failure in rds_tcp_data_recv · 8ce675ff
      Sowmini Varadhan authored
      Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
      If rds_tcp_data_recv() ignores such failures, the application will
      receive corrupted data because the skb has not been correctly
      carved to the RDS datagram size.
      
      Avoid this by handling pskb_pull/pskb_trim failure in the same
      manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
      retry via the deferred call to rds_send_worker() that gets set up on
      ENOMEM from rds_tcp_read_sock().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • forcedeth: fix unilateral interrupt disabling in netpoll path · 0b7c8743
      Neil Horman authored
      Forcedeth currently uses disable_irq_lockdep and enable_irq_lockdep, which in
      some configurations simply call local_irq_disable.  This causes errant warnings
      in the netpoll path as in netpoll_send_skb_on_dev, where we disable irqs using
      local_irq_save, leading to the following warning:
      
      WARNING: at net/core/netpoll.c:352 netpoll_send_skb_on_dev+0x243/0x250() (Not
      tainted)
      Hardware name:
      netpoll_send_skb_on_dev(): eth0 enabled interrupts in poll
      (nv_start_xmit_optimized+0x0/0x860 [forcedeth])
      Modules linked in: netconsole(+) configfs ipv6 iptable_filter ip_tables ppdev
      parport_pc parport sg microcode serio_raw edac_core edac_mce_amd k8temp
      snd_hda_codec_realtek snd_hda_codec_generic forcedeth snd_hda_intel
      snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore
      snd_page_alloc i2c_nforce2 i2c_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod
      crc_t10dif pata_amd ata_generic pata_acpi sata_nv dm_mirror dm_region_hash
      dm_log dm_mod [last unloaded: scsi_wait_scan]
      Pid: 1940, comm: modprobe Not tainted 2.6.32-573.7.1.el6.x86_64.debug #1
      Call Trace:
       [<ffffffff8107bbc1>] ? warn_slowpath_common+0x91/0xe0
       [<ffffffff8107bcc6>] ? warn_slowpath_fmt+0x46/0x60
       [<ffffffffa00fe5b0>] ? nv_start_xmit_optimized+0x0/0x860 [forcedeth]
       [<ffffffff814b3593>] ? netpoll_send_skb_on_dev+0x243/0x250
       [<ffffffff814b37c9>] ? netpoll_send_udp+0x229/0x270
       [<ffffffffa02e3299>] ? write_msg+0x39/0x110 [netconsole]
       [<ffffffffa02e331b>] ? write_msg+0xbb/0x110 [netconsole]
       [<ffffffff8107bd55>] ? __call_console_drivers+0x75/0x90
       [<ffffffff8107bdba>] ? _call_console_drivers+0x4a/0x80
       [<ffffffff8107c445>] ? release_console_sem+0xe5/0x250
       [<ffffffff8107d200>] ? register_console+0x190/0x3e0
       [<ffffffffa02e71a6>] ? init_netconsole+0x1a6/0x216 [netconsole]
       [<ffffffffa02e7000>] ? init_netconsole+0x0/0x216 [netconsole]
       [<ffffffff810020d0>] ? do_one_initcall+0xc0/0x280
       [<ffffffff810d4933>] ? sys_init_module+0xe3/0x260
       [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
      ---[ end trace f349c7af88e6a6d5 ]---
      console [netcon0] enabled
      netconsole: network logging started
      
      Fix it by modifying the forcedeth code to use
      disable_irq_nosync_lockdep_irqsave instead, which saves and restores irq
      state properly.  This also saves us a little code in the process.
      
      Tested by the reporter, with successful results.
      
      Patch applies to the head of the net tree.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Reported-by: Vasily Averin <vvs@sw.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Fix skb leak using IPv6 defrag · 6f5cadee
      Joe Stringer authored
      nf_ct_frag6_gather() makes a clone of each skb passed to it, and if the
      reassembly is successful, expects the caller to free all of the original
      skbs using nf_ct_frag6_consume_orig(). This call was previously missing,
      meaning that the original fragments were never freed (with the exception
      of the last fragment to arrive).
      
      Fix this by ensuring that all original fragments except for the last
      fragment are freed via nf_ct_frag6_consume_orig(). The last fragment
      will be morphed into the head, so it must not be freed yet. Furthermore,
      retain the ->next pointer for the head after skb_morph().
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Export nf_ct_frag6_consume_orig() · 190b8ffb
      Joe Stringer authored
      This is needed in openvswitch to fix an skb leak in the next patch.
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Fix double-free on ip_defrag() errors · 74c16618
      Joe Stringer authored
      If ip_defrag() returns an error other than -EINPROGRESS, then the skb is
      freed. When handle_fragments() passes this back up to
      do_execute_actions(), it will be freed again. Prevent this double free
      by never freeing the skb in do_execute_actions() for errors returned by
      ovs_ct_execute. Always free it in ovs_ct_execute() error paths instead.
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
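The ownership rule this commit settles on (the callee frees on every error path, the caller never frees after an error) can be illustrated with a small user-space model. Everything here is hypothetical: `pkt`, `pkt_free`, `ct_execute_sketch` and `execute_actions_sketch` only stand in for the skb and the two kernel functions named above, and the `frees` counter exists purely so the behaviour can be checked.

```c
#include <stdlib.h>

struct pkt {
    int id;
};

static int frees; /* counts pkt_free() calls so the rule can be verified */

static void pkt_free(struct pkt *p)
{
    free(p);
    frees++;
}

/* Callee: on any error it consumes the packet itself. */
static int ct_execute_sketch(struct pkt *p, int fail)
{
    if (fail) {
        pkt_free(p);
        return -1;
    }
    return 0;
}

/* Caller: after an error from the callee the packet is already gone,
 * so it must not be touched or freed again. */
static int execute_actions_sketch(struct pkt *p, int fail)
{
    int err = ct_execute_sketch(p, fail);

    if (err)
        return err;

    pkt_free(p); /* normal end of processing in this model */
    return 0;
}
```

Keeping the free in exactly one place per path means each packet is released once whether processing succeeds or fails.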
    • fib_trie: leaf_walk_rcu should not compute key if key is less than pn->key · c2229fe1
      Alexander Duyck authored
      We were computing the child index in cases where the key value we were
      looking for was actually less than the base key of the tnode.  As a result
      we were getting incorrect index values that would cause us to skip over
      some children.
      
      To fix this I have added a test that will force us to use child index 0 if
      the key we are looking for is less than the key of the current tnode.
      
      Fixes: 8be33e95 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
      Reported-by: Brian Rak <brak@gameservers.com>
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
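A small sketch of the index problem described above, assuming the child index is derived by XOR-ing the search key with the tnode's base key and shifting by the tnode's bit position. `tnode_sketch`, `raw_cindex` and `walk_cindex` are hypothetical names for illustration, not the kernel's definitions.

```c
#include <stdint.h>

typedef uint32_t t_key;

/* Hypothetical, trimmed-down tnode: only the fields the index math needs. */
struct tnode_sketch {
    t_key key;          /* base key of the subtree */
    unsigned char pos;  /* bit position of the low end of the index field */
    unsigned char bits; /* number of index bits, so 1 << bits children */
};

/* Unchecked index: only meaningful when key lies inside the tnode's range. */
static unsigned long raw_cindex(t_key key, const struct tnode_sketch *tn)
{
    return (unsigned long)(key ^ tn->key) >> tn->pos;
}

/* Guarded index: a key below the base key sorts before every child,
 * so the walk should start at child 0. */
static unsigned long walk_cindex(t_key key, const struct tnode_sketch *tn)
{
    return (key > tn->key) ? raw_cindex(key, tn) : 0;
}
```

For a tnode with base key 0x0A000000, pos 16 and 8 index bits, looking up 0x09FF0000 gives a raw index of 0x3FF, far past the 256 children, while the guarded version returns 0 and the walk visits every child.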
  2. 27 Oct, 2015 11 commits
  3. 26 Oct, 2015 10 commits
  4. 23 Oct, 2015 12 commits
    • net: sysctl: fix a kmemleak warning · ce9d9b8e
      Li RongQing authored
      The pointer returned by register_sysctl() is stored in the net_header
      variable, but net_header is never used afterwards, so the compiler may
      optimise the variable out. kmemleak then reports the warning below:
      
      	comm "swapper/0", pid 1, jiffies 4294937448 (age 267.270s)
      	hex dump (first 32 bytes):
      	90 38 8b 01 c0 ff ff ff 00 00 00 00 01 00 00 00 .8..............
      	01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      	backtrace:
      	[<ffffffc00020f134>] create_object+0x10c/0x2a0
      	[<ffffffc00070ff44>] kmemleak_alloc+0x54/0xa0
      	[<ffffffc0001fe378>] __kmalloc+0x1f8/0x4f8
      	[<ffffffc00028e984>] __register_sysctl_table+0x64/0x5a0
      	[<ffffffc00028eef0>] register_sysctl+0x30/0x40
      	[<ffffffc00099c304>] net_sysctl_init+0x20/0x58
      	[<ffffffc000994dd8>] sock_init+0x10/0xb0
      	[<ffffffc0000842e0>] do_one_initcall+0x90/0x1b8
      	[<ffffffc000966bac>] kernel_init_freeable+0x218/0x2f0
      	[<ffffffc00070ed6c>] kernel_init+0x1c/0xe8
      	[<ffffffc000083bfc>] ret_from_fork+0xc/0x50
      	[<ffffffffffffffff>] 0xffffffffffffffff <<end check kmemleak>>
      
      Before fix, the objdump result on ARM64:
      0000000000000000 <net_sysctl_init>:
         0:   a9be7bfd        stp     x29, x30, [sp,#-32]!
         4:   90000001        adrp    x1, 0 <net_sysctl_init>
         8:   90000000        adrp    x0, 0 <net_sysctl_init>
         c:   910003fd        mov     x29, sp
        10:   91000021        add     x1, x1, #0x0
        14:   91000000        add     x0, x0, #0x0
        18:   a90153f3        stp     x19, x20, [sp,#16]
        1c:   12800174        mov     w20, #0xfffffff4                // #-12
        20:   94000000        bl      0 <register_sysctl>
        24:   b4000120        cbz     x0, 48 <net_sysctl_init+0x48>
        28:   90000013        adrp    x19, 0 <net_sysctl_init>
        2c:   91000273        add     x19, x19, #0x0
        30:   9101a260        add     x0, x19, #0x68
        34:   94000000        bl      0 <register_pernet_subsys>
        38:   2a0003f4        mov     w20, w0
        3c:   35000060        cbnz    w0, 48 <net_sysctl_init+0x48>
        40:   aa1303e0        mov     x0, x19
        44:   94000000        bl      0 <register_sysctl_root>
        48:   2a1403e0        mov     w0, w20
        4c:   a94153f3        ldp     x19, x20, [sp,#16]
        50:   a8c27bfd        ldp     x29, x30, [sp],#32
        54:   d65f03c0        ret
      After:
      0000000000000000 <net_sysctl_init>:
         0:   a9bd7bfd        stp     x29, x30, [sp,#-48]!
         4:   90000000        adrp    x0, 0 <net_sysctl_init>
         8:   910003fd        mov     x29, sp
         c:   a90153f3        stp     x19, x20, [sp,#16]
        10:   90000013        adrp    x19, 0 <net_sysctl_init>
        14:   91000000        add     x0, x0, #0x0
        18:   91000273        add     x19, x19, #0x0
        1c:   f90013f5        str     x21, [sp,#32]
        20:   aa1303e1        mov     x1, x19
        24:   12800175        mov     w21, #0xfffffff4                // #-12
        28:   94000000        bl      0 <register_sysctl>
        2c:   f9002260        str     x0, [x19,#64]
        30:   b40001a0        cbz     x0, 64 <net_sysctl_init+0x64>
        34:   90000014        adrp    x20, 0 <net_sysctl_init>
        38:   91000294        add     x20, x20, #0x0
        3c:   9101a280        add     x0, x20, #0x68
        40:   94000000        bl      0 <register_pernet_subsys>
        44:   2a0003f5        mov     w21, w0
        48:   35000080        cbnz    w0, 58 <net_sysctl_init+0x58>
        4c:   aa1403e0        mov     x0, x20
        50:   94000000        bl      0 <register_sysctl_root>
        54:   14000004        b       64 <net_sysctl_init+0x64>
        58:   f9402260        ldr     x0, [x19,#64]
        5c:   94000000        bl      0 <unregister_sysctl_table>
        60:   f900227f        str     xzr, [x19,#64]
        64:   2a1503e0        mov     w0, w21
        68:   f94013f5        ldr     x21, [sp,#32]
        6c:   a94153f3        ldp     x19, x20, [sp,#16]
        70:   a8c37bfd        ldp     x29, x30, [sp],#48
        74:   d65f03c0        ret
      
      Add error handling that frees net_header on failure, which also removes
      the kmemleak warning.
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ppp: fix pppoe_dev deletion condition in pppoe_release() · 1acea4f6
      Guillaume Nault authored
      We can't rely on PPPOX_ZOMBIE to decide whether to clear po->pppoe_dev.
      PPPOX_ZOMBIE can be set by pppoe_disc_rcv() even when po->pppoe_dev is
      NULL. So we have no guarantee that (sk->sk_state & PPPOX_ZOMBIE) implies
      (po->pppoe_dev != NULL).
      Since we're releasing a PPPoE socket, we want to release the pppoe_dev
      if it exists and reset sk_state to PPPOX_DEAD, no matter the previous
      value of sk_state. So we can just check for po->pppoe_dev and avoid any
      assumption on sk->sk_state.
      
      Fixes: 2b018d57 ("pppoe: drop PPPOX_ZOMBIEs in pppoe_release")
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_key: fix two typos · f6b8dec9
      Li RongQing authored
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • amd-xgbe: Use wmb before updating current descriptor count · 20a41fba
      Lendacky, Thomas authored
      The code currently uses the lightweight dma_wmb barrier before updating
      the current descriptor count. Under heavy load, the Tx cleanup routine
      was seeing the updated current descriptor count before the updated
      descriptor information. As a result, the Tx descriptor was being cleaned
      up before it was used because it was not "owned" by the hardware yet,
      resulting in a Tx queue hang.
      
      Using the wmb barrier ensures that the descriptor is updated before the
      descriptor counter, preventing the Tx queue hang. For extra insurance,
      the Tx cleanup routine is changed to grab the current descriptor count on
      entry and use that initial value in the processing loop rather than
      trying to chase the current value.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Tested-by: Christoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/phy: micrel: Add workaround for bad autoneg · d2fd719b
      Nathan Sullivan authored
      Very rarely, the KSZ9031 will appear to complete autonegotiation, but
      will drop all traffic afterwards.  When this happens, the idle error
      count will read 0xFF after autonegotiation completes.  Reset the PHY
      when in that state.
      Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
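The check described above might look roughly like the following. The message does not say which register holds the idle error count; this sketch assumes it is the low byte of the standard 1000BASE-T status register, and `IDLE_ERR_MASK` and `autoneg_looks_stuck` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the idle error count occupies bits 7:0 of the value read
 * from the 1000BASE-T status register. */
#define IDLE_ERR_MASK 0x00FFu

/* Returns true when the counter is saturated at 0xFF, the state in which
 * the commit message says the PHY should be reset. */
static bool autoneg_looks_stuck(uint16_t stat1000)
{
    return (stat1000 & IDLE_ERR_MASK) == IDLE_ERR_MASK;
}
```

A caller would read the register once autonegotiation reports completion and trigger a PHY reset only when this predicate holds, leaving healthy links untouched.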
    • Merge branch 'ipv6-overflow-arith' · ec3661b4
      David S. Miller authored
      Hannes Frederic Sowa says:
      
      ====================
      overflow-arith: begin to add support for overflow builtin functions
      
      I add a new header, linux/overflow-arith.h, as the central place to add
      overflow and wrap-around checking functions. The reason I am doing so
      is that it can make use of compiler supported builtin functions which
      can leverage hardware.
      
      As I need this for a fix in the ipv6 stack, which is also included in
      this series, I propose to add it sooner rather than later via Davem's
      net tree. This is also the reason why I start slowly with only the one
      function I need at this time.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues · b72a2b01
      Hannes Frederic Sowa authored
      Raw sockets with hdrincl enabled can insert ipv6 extension headers
      right into the data stream. In case we need to fragment those packets,
      we reparse the options header to find the place where we can insert
      the fragment header. If the extension headers exceed the link's MTU we
      actually cannot make progress in such a case.
      
      Instead of ending up in broken arithmetic or rounding towards 0 and
      entering an endless loop in ip6_fragment, just prevent those cases by
      aborting early and signal -EMSGSIZE to user space.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
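The early abort described above can be sketched as a checked subtraction followed by a minimum-size test. This is an illustration under stated assumptions, not the patch itself: `usub_wraps` and `frag_payload_room` are hypothetical helpers, and the logic relies only on the IPv6 fragment header being 8 bytes and non-final fragment payloads being multiples of 8 bytes.

```c
#include <errno.h>
#include <stdbool.h>

#define FRAG_HDR_LEN 8u /* size of the IPv6 fragment header */

/* Portable wrap-around check for unsigned subtraction (hypothetical helper). */
static bool usub_wraps(unsigned int a, unsigned int b, unsigned int *res)
{
    *res = a - b;
    return *res > a;
}

/* Computes how many payload bytes fit in each fragment once the
 * unfragmentable headers (hlen) and the fragment header are accounted for.
 * Returns 0 and stores the room in *room, or -EMSGSIZE when the headers
 * leave fewer than 8 usable bytes, in which case a fragmentation loop
 * could never make progress. */
static int frag_payload_room(unsigned int mtu, unsigned int hlen,
                             unsigned int *room)
{
    unsigned int left;

    if (usub_wraps(mtu, hlen + FRAG_HDR_LEN, &left) || left <= 7)
        return -EMSGSIZE;

    *room = left & ~7u; /* round down to a multiple of 8 */
    return 0;
}
```

With an MTU of 1280 and 40 bytes of headers this yields 1232 bytes per fragment. With 1268 bytes of headers only 4 bytes remain, which would round down to 0 and stall the loop, so the function reports -EMSGSIZE instead.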
    • overflow-arith: begin to add support for overflow builtin functions · 79907146
      Hannes Frederic Sowa authored
      The idea of the overflow-arith.h header is to collect overflow checking
      functions in one central place.
      
      If the compiler supports the __builtin_*_overflow builtins, we use them
      because they might give better performance; otherwise the code falls
      back to normal overflow checking functions.
      
      The builtin_overflow functions are supported by gcc-5 and clang. The
      matter of supporting clang is to just provide a corresponding
      CC_HAVE_BUILTIN_OVERFLOW, because the specific overflow checking builtins
      don't differ between gcc and clang.
      
      I provide only the overflow_usub function here, as I intend this to get
      merged into net; more functions will definitely follow as they are needed.
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
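A minimal user-space sketch of the pattern described above: use the compiler builtin when it is available and fall back to a plain comparison otherwise. The `HAVE_USUB_BUILTIN` macro and the detection logic are assumptions made for this example (the message names `CC_HAVE_BUILTIN_OVERFLOW` for the kernel build); `__builtin_usub_overflow` is the GCC 5+ and clang builtin for unsigned int subtraction.

```c
#include <stdbool.h>

/* Feature detection for this sketch only; the kernel uses its own macro. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_usub_overflow)
#    define HAVE_USUB_BUILTIN 1
#  endif
#elif defined(__GNUC__) && (__GNUC__ >= 5)
#  define HAVE_USUB_BUILTIN 1
#endif

/* Stores a - b in *res and returns true if the subtraction wrapped around. */
static inline bool overflow_usub(unsigned int a, unsigned int b,
                                 unsigned int *res)
{
#ifdef HAVE_USUB_BUILTIN
    return __builtin_usub_overflow(a, b, res);
#else
    *res = a - b;
    return *res > a; /* unsigned wrap-around makes the result exceed a */
#endif
}
```

Both paths return the same answer, so callers can write `if (overflow_usub(x, y, &x)) ...` without caring which one the compiler picked.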
    • tcp: allow dctcp alpha to drop to zero · c80dbe04
      Andrew Shewmaker authored
      If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
      than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
      given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
      remains 15. The effect isn't noticeable in this case below cwnd=137, but
      could gradually drive uncongested flows with leftover alpha down to
      cwnd=137. A larger dctcp_shift_g would have a greater effect.
      
      This change causes alpha=15 to drop to 0 instead of being decremented by 1
      as it would when alpha=16. However, it requires one less conditional to
      implement since it doesn't have to guard against subtracting 1 from 0U. A
      decay of 15 is not unreasonable since an equal or greater amount occurs at
      alpha >= 240.
      Signed-off-by: Andrew G. Shewmaker <agshew@gmail.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
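The decay arithmetic in this message can be checked with a few lines of C. This sketch covers only the decay term, not the full alpha update, and `min_not_zero_u32` and `dctcp_decay` are hypothetical helpers written for illustration.

```c
#include <stdint.h>

/* Smaller of the two values, ignoring an operand that is zero. */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Decay alpha by alpha >> shift_g, but when that shift rounds to zero
 * subtract alpha itself so that small values still reach 0. */
static uint32_t dctcp_decay(uint32_t alpha, unsigned int shift_g)
{
    return alpha - min_not_zero_u32(alpha, alpha >> shift_g);
}
```

With shift_g=4, alpha=16 decays to 15 and alpha=15 then drops straight to 0, while alpha=240 loses 15 in one step, matching the figures quoted in the message. No separate guard against subtracting from zero is needed because the subtrahend never exceeds alpha.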
    • ipv6: fix the incorrect return value of throw route · ab997ad4
      lucien authored
      The error condition -EAGAIN, which is signaled by throw routes, tells
      the rules framework to walk on searching for next matches. If the walk
      ends and we stop walking the rules with the result of a throw route we
      have to translate the error conditions to -ENETUNREACH.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
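The translation described above can be pictured as a loop over rule results. This is a simplified model, not the rules framework itself: `lookup_rules_sketch` is a hypothetical function, and each array element stands for the result a single rule produced.

```c
#include <errno.h>
#include <stddef.h>

/* Walks rule results in order. -EAGAIN (a throw route) means "keep
 * searching"; any other value ends the walk. If every rule said -EAGAIN,
 * the caller sees -ENETUNREACH rather than the internal -EAGAIN. */
static int lookup_rules_sketch(const int *results, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (results[i] != -EAGAIN)
            return results[i];
    }
    return -ENETUNREACH;
}
```

The point of the translation is that -EAGAIN is only meaningful between rules; once the walk is over it has to be replaced with an error user space can interpret.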
    • macvtap: unbreak receiving of gro skb with frag list · f23d538b
      Jason Wang authored
      We don't have fraglist support in TAP_FEATURES. This leads to
      software segmentation of GRO skbs with a frag list. Fix this by adding
      frag list support to TAP_FEATURES.
      
      With this patch, single-session netperf receive throughput was restored
      from about 5Gb/s to about 12Gb/s on mlx4.
      
      Fixes: a567dd62 ("macvtap: simplify usage of tap_features")
      Cc: Vlad Yasevich <vyasevic@redhat.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar authored
      While transitioning to netdev-based vports we broke the OVS
      feature that allows a user to retrieve tunnel packet egress
      information for lwtunnel devices.  The following patch fixes it
      by introducing an ndo operation to get the tunnel egress info.
      The same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices, so after adding this device operation
      we can remove the similar operation from ovs-vport.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>