1. 30 Jul, 2015 25 commits
    • Sowmini Varadhan's avatar
      net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket · 8a681736
      Sowmini Varadhan authored
      The newsk returned by sk_clone_lock should hold a get_net()
      reference if, and only if, the parent is not a kernel socket
      (making this similar to sk_alloc()).
      
      E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
      sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
      socket is a kernel socket (defined in sk_alloc() as having
      sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
      and should not hold a get_net() reference.
      
      Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the
            netns of kernel sockets.")
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a681736
    • Daniel Borkmann's avatar
      net: sched: fix refcount imbalance in actions · 28e6b67f
      Daniel Borkmann authored
      Since commit 55334a5d ("net_sched: act: refuse to remove bound action
      outside"), we end up with a wrong reference count for a tc action.
      
      Test case 1:
      
        FOO="1,6 0 0 4294967295,"
        BAR="1,6 0 0 4294967294,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
           action bpf bytecode "$FOO"
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 1 bind 1
        tc actions replace action bpf bytecode "$BAR" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
          index 1 ref 2 bind 1
        tc actions replace action bpf bytecode "$FOO" index 1
        tc actions show action bpf
          action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
          index 1 ref 3 bind 1
      
      Test case 2:
      
        FOO="1,6 0 0 4294967295,"
        tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
        tc actions show action gact
          action order 0: gact action pass
          random type none pass val 0
           index 1 ref 1 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 2 bind 1
        tc actions add action drop index 1
          RTNETLINK answers: File exists [...]
        tc actions show action gact
          action order 0: gact action pass
           random type none pass val 0
           index 1 ref 3 bind 1
      
      What happens is that in tcf_hash_check(), we check tcf_common for a given
      index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
      found an existing action. Now there are the following cases:
      
        1) We do a late binding of an action. In that case, we leave the
           tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
           handler. This is correctly handeled.
      
        2) We replace the given action, or we try to add one without replacing
           and find out that the action at a specific index already exists
           (thus, we go out with error in that case).
      
      In case of 2), we have to undo the reference count increase from
      tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
      do so because of the 'tcfc_bindcnt > 0' check which bails out early with
      an -EPERM error.
      
      Now, while commit 55334a5d prevents 'tc actions del action ...' on an
      already classifier-bound action to drop the reference count (which could
      then become negative, wrap around etc), this restriction only accounts for
      invocations outside a specific action's ->init() handler.
      
      One possible solution would be to add a flag thus we possibly trigger
      the -EPERM ony in situations where it is indeed relevant.
      
      After the patch, above test cases have correct reference count again.
      
      Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28e6b67f
    • David S. Miller's avatar
      Merge branch 'r8152-fixes' · 990c9b34
      David S. Miller authored
      Hayes Wang says:
      
      ====================
      r8152: device reset
      
      v3:
      For patch #2, remove cancel_delayed_work().
      
      v2:
      For patch #1, remove usb_autopm_get_interface(), usb_autopm_put_interface(), and
      the checking of intf->condition.
      
      For patch #2, replace the original method with usb_queue_reset_device() to reset
      the device.
      
      v1:
      Although the driver works normally, we find the device may get all 0xff data when
      transmitting packets on certain platforms. It would break the device and no packet
      could be transmitted. The reset is necessary to recover the hw for this situation.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      990c9b34
    • hayeswang's avatar
      r8152: reset device when tx timeout · 37608f3e
      hayeswang authored
      The device reset is necessary if the hw becomes abnormal and stops
      transmitting packets.
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      37608f3e
    • hayeswang's avatar
      r8152: add pre_reset and post_reset · e501139a
      hayeswang authored
      Add rtl8152_pre_reset() and rtl8152_post_reset() which are used when
      calling usb_reset_device(). The two functions could reduce the time
      of reset when calling usb_reset_device() after probe().
      Signed-off-by: default avatarHayes Wang <hayeswang@realtek.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e501139a
    • Shahed Shaikh's avatar
      qlcnic: Fix corruption while copying · 15f1bb1f
      Shahed Shaikh authored
      Use proper typecasting while performing byte-by-byte copy
      Signed-off-by: default avatarShahed Shaikh <shahed.shaikh@qlogic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      15f1bb1f
    • Daniel Borkmann's avatar
      act_bpf: fix memory leaks when replacing bpf programs · f4eaed28
      Daniel Borkmann authored
      We currently trigger multiple memory leaks when replacing bpf
      actions, besides others:
      
        comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
        hex dump (first 32 bytes):
          01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
          18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
        backtrace:
          [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0
          [<ffffffff8120a37a>] __vmalloc+0x4a/0x50
          [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0
          [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0
          [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
          [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0
          [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0
          [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0
          [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
          [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf]
          [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910
          [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240
          [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0
          [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40
          [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0
      
      Issue is that the old content from tcf_bpf is allocated and needs
      to be released when we replace it. We seem to do that since the
      beginning of act_bpf on the filter and insns, later on the name as
      well.
      
      Example test case, after patch:
      
        # FOO="1,6 0 0 4294967295,"
        # BAR="1,6 0 0 4294967294,"
        # tc actions add action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$BAR" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
         index 2 ref 1 bind 0
        # tc actions replace action bpf bytecode "$FOO" index 2
        # tc actions show action bpf
         action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
         index 2 ref 1 bind 0
        # tc actions del action bpf index 2
        [...]
        # echo "scan" > /sys/kernel/debug/kmemleak
        # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
        0
      
      Fixes: d23b8ad8 ("tc: add BPF based action")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f4eaed28
    • David S. Miller's avatar
      Merge branch 'thunderx-fixes' · f68b1231
      David S. Miller authored
      Aleksey Makarov says:
      
      ====================
      net: thunderx: Misc fixes
      
      Miscellaneous fixes for the ThunderX VNIC driver
      
      All the patches can be applied individually.
      It's ok to drop some if the maintainer feels uncomfortable
      with applying for 4.2.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f68b1231
    • Thanneeru Srinivasulu's avatar
      net: thunderx: Fix for crash while BGX teardown · 60f83c89
      Thanneeru Srinivasulu authored
      Cortina phy does not have kernel driver and we don't attach
      device with phy layer for intefaces like XFI, XLAUI etc,
      Hence check for interface type before calling disconnect.
      Signed-off-by: default avatarThanneeru Srinivasulu <tsrinivasulu@caviumnetworks.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      60f83c89
    • Sunil Goutham's avatar
    • Sunil Goutham's avatar
      net: thunderx: Fix crash when changing rss with mutliple traffic flows · b49087dd
      Sunil Goutham authored
      This fixes a crash when changing rss with multiple traffic flows.
      
      While interface teardown, disable tx queues after all NAPI threads
      are done. If done otherwise tx queues might be woken up inside NAPI
      if any CQE_TX are processed.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b49087dd
    • Sunil Goutham's avatar
      net: thunderx: Set watchdog timeout value · 3d7a8aaa
      Sunil Goutham authored
      If a txq (SQ) remains in stopped state after this timeout its
      considered as stuck and interface is reinited.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3d7a8aaa
    • Sunil Goutham's avatar
      net: thunderx: Wakeup TXQ only if CQE_TX are processed · 74840b83
      Sunil Goutham authored
      Previously TXQ is wakedup whenever napi is executed
      and irrespective of if any CQE_TX are processed or not.
      Added 'txq_stop' and 'txq_wake' counters to aid in debugging
      if there are any future issues.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      74840b83
    • Sunil Goutham's avatar
      net: thunderx: Suppress alloc_pages() failure warnings · f8ce9666
      Sunil Goutham authored
      Suppressing standard alloc_pages() warnings. Some kernel configs limit
      alloc size and the network driver may fail. Do not drop a kernel
      warning in this case, instead just drop a oneliner that the network
      driver could not be loaded since the buffer could not be allocated.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8ce9666
    • Sunil Goutham's avatar
      net: thunderx: Fix TSO packet statistic · 2cb468e0
      Sunil Goutham authored
      Fixing TSO packages not being counted.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2cb468e0
    • Sunil Goutham's avatar
      net: thunderx: Fix memory leak when changing queue count · c62cd3c4
      Sunil Goutham authored
      Fix for memory leak when changing queue/channel count via ethtool
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c62cd3c4
    • Sunil Goutham's avatar
      net: thunderx: Fix RQ_DROP miscalculation · 32c1b965
      Sunil Goutham authored
      With earlier configured value sufficient number of CQEs are not
      being reserved for transmitted packets. Hence under heavy incoming
      traffic load, receive notifications will take away most of the CQ
      thus transmit notifications will be lost resulting in tx skbs not
      being freed.
      
      Finally SQ will be full and it will be stopped, watchdog timer
      will kick in. After this fix receive notifications will not take
      morethan half of CQ reserving the rest for transmit notifications.
      
      Also changed CQ & SQ sizes from 16k to 4k.
      This is also due to the receive notifications taking first half of
      CQ under heavy load and time taken by NAPI to clear transmit notifications
      will increase with higher queue sizes. Again results in SQ being stopped.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      32c1b965
    • Sunil Goutham's avatar
      net: thunderx: Fix memory leak while tearing down interface · 143ceb0b
      Sunil Goutham authored
      Fixed 'tso_hdrs' memory not being freed properly.
      Also fixed SQ skbuff maintenance issues.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      143ceb0b
    • Sunil Goutham's avatar
      net: thunderx: Fix data integrity issues with LDWB · 4b561c17
      Sunil Goutham authored
      Switching back to LDD transactions from LDWB.
      
      While transmitting packets out with LDWB transactions
      data integrity issues are seen very frequently.
      hence switching back to LDD.
      Signed-off-by: default avatarSunil Goutham <sgoutham@cavium.com>
      Signed-off-by: default avatarRobert Richter <rrichter@cavium.com>
      Signed-off-by: default avatarAleksey Makarov <aleksey.makarov@caviumnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4b561c17
    • Eric Dumazet's avatar
      ipv6: flush nd cache on IFF_NOARP change · c8507fb2
      Eric Dumazet authored
      This patch is the IPv6 equivalent of commit
      6c8b4e3f ("arp: flush arp cache on IFF_NOARP change")
      
      Without it, we keep buggy neighbours in the cache, with destination
      MAC address equal to our own MAC address.
      
      Tested:
       tcpdump -i eth0 -s 0 ip6 -n -e &
       ip link set dev eth0 arp off
       ping6 remote   // sends buggy frames
       ip link set dev eth0 arp on
       ping6 remote   // should work once kernel is patched
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarMario Fanelli <mariofanelli@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8507fb2
    • David S. Miller's avatar
      Merge branch 'netcp-fixes' · b2428f94
      David S. Miller authored
      Murali Karicheri says:
      
      ====================
      net: netcp: bug fixes for dynamic module support
      
      This series fixes few bugs to allow keystone netcp modules to be
      dynamically loaded and removed. Currently it allows following
      sequence multiple times
      
       insmod cpsw_ale.ko
       insmod davinci_mdio.ko
       insmod keystone_netcp.ko
       insmod keystone_netcp_ethss.ko
       ifup eth0
       ifup eth1
       ping <hosts on eth0>
       ping <hosts on eth1>
       ifdown eth1
       ifdown eth0
       rmmod keystone_netcp_ethss.ko
       rmmod keystone_netcp.ko
       rmmod davinci_mdio.ko
       rmmod cpsw_ale.ko
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b2428f94
    • Karicheri, Muralidharan's avatar
      net: netcp: ethss: cleanup gbe_probe() and gbe_remove() functions · 31a184b7
      Karicheri, Muralidharan authored
      This patch clean up error handle code to use goto label properly. In some
      cases, the code unnecessarily use goto instead of just returning the error
      code.  Code also make explicit calls to devm_* APIs on error which is
      not necessary. In the gbe_remove() also it makes similar calls which is
      also unnecessary.
      
      Also fix few checkpatch warnings
      Signed-off-by: default avatarMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      31a184b7
    • Karicheri, Muralidharan's avatar
      net: netcp: ethss: fix up incorrect use of list api · c20afae7
      Karicheri, Muralidharan authored
      The code seems to assume a null is returned when the list is empty
      from first_sec_slave() to break the loop which is incorrect. Fix the
      code by using list_empty().
      Signed-off-by: default avatarMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c20afae7
    • Karicheri, Muralidharan's avatar
      net: netcp: fix cleanup interface list in netcp_remove() · 01a03099
      Karicheri, Muralidharan authored
      Currently if user do rmmod keystone_netcp.ko following warning is
      seen :-
      
      [   59.035891] ------------[ cut here ]------------
      [   59.040535] WARNING: CPU: 2 PID: 1619 at drivers/net/ethernet/ti/
      netcp_core.c:2127 netcp_remove)
      
      This is because the interface list is not cleaned up in netcp_remove.
      This patch fixes this. Also fix some checkpatch related warnings.
      Signed-off-by: default avatarMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      01a03099
    • Daniel Borkmann's avatar
      ebpf, x86: fix general protection fault when tail call is invoked · 2482abb9
      Daniel Borkmann authored
      With eBPF JIT compiler enabled on x86_64, I was able to reliably trigger
      the following general protection fault out of an eBPF program with a simple
      tail call, f.e. tracex5 (or a stripped down version of it):
      
        [  927.097918] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
        [...]
        [  927.100870] task: ffff8801f228b780 ti: ffff880016a64000 task.ti: ffff880016a64000
        [  927.102096] RIP: 0010:[<ffffffffa002440d>]  [<ffffffffa002440d>] 0xffffffffa002440d
        [  927.103390] RSP: 0018:ffff880016a67a68  EFLAGS: 00010006
        [  927.104683] RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000000000 RCX: 0000000000000001
        [  927.105921] RDX: 0000000000000000 RSI: ffff88014e438000 RDI: ffff880016a67e00
        [  927.107137] RBP: ffff880016a67c90 R08: 0000000000000000 R09: 0000000000000001
        [  927.108351] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880016a67e00
        [  927.109567] R13: 0000000000000000 R14: ffff88026500e460 R15: ffff880220a81520
        [  927.110787] FS:  00007fe7d5c1f740(0000) GS:ffff880265000000(0000) knlGS:0000000000000000
        [  927.112021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  927.113255] CR2: 0000003e7bbb91a0 CR3: 000000006e04b000 CR4: 00000000001407e0
        [  927.114500] Stack:
        [  927.115737]  ffffc90008cdb000 ffff880016a67e00 ffff88026500e460 ffff880220a81520
        [  927.117005]  0000000100000000 000000000000001b ffff880016a67aa8 ffffffff8106c548
        [  927.118276]  00007ffcdaf22e58 0000000000000000 0000000000000000 ffff880016a67ff0
        [  927.119543] Call Trace:
        [  927.120797]  [<ffffffff8106c548>] ? lookup_address+0x28/0x30
        [  927.122058]  [<ffffffff8113d176>] ? __module_text_address+0x16/0x70
        [  927.123314]  [<ffffffff8117bf0e>] ? is_ftrace_trampoline+0x3e/0x70
        [  927.124562]  [<ffffffff810c1a0f>] ? __kernel_text_address+0x5f/0x80
        [  927.125806]  [<ffffffff8102086f>] ? print_context_stack+0x7f/0xf0
        [  927.127033]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.128254]  [<ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
        [  927.129461]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.130654]  [<ffffffff8119ee4a>] trace_call_bpf+0x8a/0x140
        [  927.131837]  [<ffffffff8119edfa>] ? trace_call_bpf+0x3a/0x140
        [  927.133015]  [<ffffffff8119f008>] kprobe_perf_func+0x28/0x220
        [  927.134195]  [<ffffffff811a1668>] kprobe_dispatcher+0x38/0x60
        [  927.135367]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.136523]  [<ffffffff81061400>] kprobe_ftrace_handler+0xf0/0x150
        [  927.137666]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.138802]  [<ffffffff8117950c>] ftrace_ops_recurs_func+0x5c/0xb0
        [  927.139934]  [<ffffffffa022b0d5>] 0xffffffffa022b0d5
        [  927.141066]  [<ffffffff81174b91>] ? seccomp_phase1+0x1/0x230
        [  927.142199]  [<ffffffff81174b95>] seccomp_phase1+0x5/0x230
        [  927.143323]  [<ffffffff8102c0a4>] syscall_trace_enter_phase1+0xc4/0x150
        [  927.144450]  [<ffffffff81174b95>] ? seccomp_phase1+0x5/0x230
        [  927.145572]  [<ffffffff8102c0a4>] ? syscall_trace_enter_phase1+0xc4/0x150
        [  927.146666]  [<ffffffff817f9a9f>] tracesys+0xd/0x44
        [  927.147723] Code: 48 8b 46 10 48 39 d0 76 2c 8b 85 fc fd ff ff 83 f8 20 77 21 83
                             c0 01 89 85 fc fd ff ff 48 8d 44 d6 80 48 8b 00 48 83 f8 00 74
                             0a <48> 8b 40 20 48 83 c0 33 ff e0 48 89 d8 48 8b 9d d8 fd ff
                             ff 4c
        [  927.150046] RIP  [<ffffffffa002440d>] 0xffffffffa002440d
      
      The code section with the instructions that traps points into the eBPF JIT
      image of the root program (the one invoking the tail call instruction).
      
      Using bpf_jit_disasm -o on the eBPF root program image:
      
        [...]
        4e:   mov    -0x204(%rbp),%eax
              8b 85 fc fd ff ff
        54:   cmp    $0x20,%eax               <--- if (tail_call_cnt > MAX_TAIL_CALL_CNT)
              83 f8 20
        57:   ja     0x000000000000007a
              77 21
        59:   add    $0x1,%eax                <--- tail_call_cnt++
              83 c0 01
        5c:   mov    %eax,-0x204(%rbp)
              89 85 fc fd ff ff
        62:   lea    -0x80(%rsi,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 44 d6 80
        67:   mov    (%rax),%rax
              48 8b 00
        6a:   cmp    $0x0,%rax                <--- check for NULL
              48 83 f8 00
        6e:   je     0x000000000000007a
              74 0a
        70:   mov    0x20(%rax),%rax          <--- GPF triggered here! fetch of bpf_func
              48 8b 40 20                              [ matches <48> 8b 40 20 ... from above ]
        74:   add    $0x33,%rax               <--- prologue skip of new prog
              48 83 c0 33
        78:   jmpq   *%rax                    <--- jump to new prog insns
              ff e0
        [...]
      
      The problem is that rax has 5a5a5a5a5a5a5a5a, which suggests a tail call
      jump to map slot 0 is pointing to a poisoned page. The issue is the following:
      
      lea instruction has a wrong offset, i.e. it should be ...
      
        lea    0x80(%rsi,%rdx,8),%rax
      
      ... but it actually seems to be ...
      
        lea   -0x80(%rsi,%rdx,8),%rax
      
      ... where 0x80 is offsetof(struct bpf_array, prog), thus the offset needs
      to be positive instead of negative. Disassembling the interpreter, we btw
      similarly do:
      
        [...]
        c88:  lea     0x80(%rax,%rdx,8),%rax  <--- prog = array->prog[index]
              48 8d 84 d0 80 00 00 00
        c90:  add     $0x1,%r13d
              41 83 c5 01
        c94:  mov     (%rax),%rax
              48 8b 00
        [...]
      
      Now the other interesting fact is that this panic triggers only when things
      like CONFIG_LOCKDEP are being used. In that case offsetof(struct bpf_array,
      prog) starts at offset 0x80 and in non-CONFIG_LOCKDEP case at offset 0x50.
      Reason is that the work_struct inside struct bpf_map grows by 48 bytes in my
      case due to the lockdep_map member (which also has CONFIG_LOCK_STAT enabled
      members).
      
      Changing the emitter to always use the 4 byte displacement in the lea
      instruction fixes the panic on my side. It increases the tail call instruction
      emission by 3 more byte, but it should cover us from various combinations
      (and perhaps other future increases on related structures).
      
      After patch, disassembly:
      
        [...]
        9e:   lea    0x80(%rsi,%rdx,8),%rax   <--- CONFIG_LOCKDEP/CONFIG_LOCK_STAT
              48 8d 84 d6 80 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
        [...]
        9e:   lea    0x50(%rsi,%rdx,8),%rax   <--- No CONFIG_LOCKDEP
              48 8d 84 d6 50 00 00 00
        a6:   mov    (%rax),%rax
              48 8b 00
        [...]
      
      Fixes: b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2482abb9
  2. 29 Jul, 2015 7 commits
    • Nikolay Aleksandrov's avatar
      bridge: mdb: fix delmdb state in the notification · 7ae90a4f
      Nikolay Aleksandrov authored
      Since mdb states were introduced when deleting an entry the state was
      left as it was set in the delete request from the user which leads to
      the following output when doing a monitor (for example):
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 temp
      ^^^
      Note the "temp" state in the delete notification which is wrong since
      the entry was permanent, the state in a delete is always reported as
      "temp" regardless of the real state of the entry.
      
      After this patch:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      There's one important note to make here that the state is actually not
      matched when doing a delete, so one can delete a permanent entry by
      stating "temp" in the end of the command, I've chosen this fix in order
      not to break user-space tools which rely on this (incorrect) behaviour.
      
      So to give an example after this patch and using the wrong state:
      $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
      (monitor) dev br0 port eth3 grp 239.0.0.1 permanent
      
      Note the state of the entry that got deleted is correct in the
      notification.
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: ccb1c31a ("bridge: add flags to distinguish permanent mdb entires")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ae90a4f
    • Satish Ashok's avatar
      bridge: mcast: give fast leave precedence over multicast router and querier · 544586f7
      Satish Ashok authored
      When fast leave is configured on a bridge port and an IGMP leave is
      received for a group, the group is not deleted immediately if there is
      a router detected or if multicast querier is configured.
      Ideally the group should be deleted immediately when fast leave is
      configured.
      Signed-off-by: default avatarSatish Ashok <sashok@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      544586f7
    • Toshiaki Makita's avatar
      bridge: Fix network header pointer for vlan tagged packets · df356d5e
      Toshiaki Makita authored
      There are several devices that can receive vlan tagged packets with
      CHECKSUM_PARTIAL like tap, possibly veth and xennet.
      When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
      by bridge to a device with the IP_CSUM feature, they end up with checksum
      error because before entering bridge, the network header is set to
      ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
      get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.
      
      Since the network header is exepected to be ETH_HLEN in flow-dissection
      and hash-calculation in RPS in rx path, and since the header pointer fix
      is needed only in tx path, set the appropriate network header on forwarding
      packets.
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df356d5e
    • Alexander Drozdov's avatar
      packet: tpacket_snd(): fix signed/unsigned comparison · dbd46ab4
      Alexander Drozdov authored
      tpacket_fill_skb() can return a negative value (-errno) which
      is stored in tp_len variable. In that case the following
      condition will be (but shouldn't be) true:
      
      tp_len > dev->mtu + dev->hard_header_len
      
      as dev->mtu and dev->hard_header_len are both unsigned.
      
      That may lead to just returning an incorrect EMSGSIZE errno
      to the user.
      
      Fixes: 52f1454f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
      Signed-off-by: default avatarAlexander Drozdov <al.drozdov@gmail.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dbd46ab4
    • Eric Dumazet's avatar
      arp: filter NOARP neighbours for SIOCGARP · 11c91ef9
      Eric Dumazet authored
      When arp is off on a device, and ioctl(SIOCGARP) is queried,
      a buggy answer is given with MAC address of the device, instead
      of the mac address of the destination/gateway.
      
      We filter out NUD_NOARP neighbours for /proc/net/arp,
      we must do the same for SIOCGARP ioctl.
      
      Tested:
      
      lpaa23:~# ./arp 10.246.7.190
      MAC=00:01:e8:22:cb:1d      // correct answer
      
      lpaa23:~# ip link set dev eth0 arp off
      lpaa23:~# cat /proc/net/arp   # check arp table is now 'empty'
      IP address       HW type     Flags       HW address    Mask     Device
      lpaa23:~# ./arp 10.246.7.190
      MAC=00:1a:11:c3:0d:7f   // buggy answer before patch (this is eth0 mac)
      
      After patch :
      
      lpaa23:~# ip link set dev eth0 arp off
      lpaa23:~# ./arp 10.246.7.190
      ioctl(SIOCGARP) failed: No such device or address
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarVytautas Valancius <valas@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11c91ef9
    • David Ward's avatar
      net/ipv4: suppress NETDEV_UP notification on address lifetime update · 865b8042
      David Ward authored
      This notification causes the FIB to be updated, which is not needed
      because the address already exists, and more importantly it may undo
      intentional changes that were made to the FIB after the address was
      originally added. (As a point of comparison, when an address becomes
      deprecated because its preferred lifetime expired, a notification on
      this chain is not generated.)
      
      The motivation for this commit is fixing an incompatibility between
      DHCP clients which set and update the address lifetime according to
      the lease, and a commercial VPN client which replaces kernel routes
      in a way that outbound traffic is sent only through the tunnel (and
      disconnects if any further route changes are detected via netlink).
      Signed-off-by: default avatarDavid Ward <david.ward@ll.mit.edu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      865b8042
    • Nikolay Aleksandrov's avatar
      bridge: stp: when using userspace stp stop kernel hello and hold timers · 76b91c32
      Nikolay Aleksandrov authored
      These should be handled only by the respective STP which is in control.
      They become problematic for devices with limited resources with many
      ports because the hold_timer is per port and fires each second and the
      hello timer fires each 2 seconds even though it's global. While in
      user-space STP mode these timers are completely unnecessary so it's better
      to keep them off.
      Also ensure that when the bridge is up these timers are started only when
      running with kernel STP.
      Signed-off-by: default avatarSatish Ashok <sashok@cumulusnetworks.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      76b91c32
  3. 27 Jul, 2015 8 commits