1. 16 Oct, 2015 40 commits
    • mlxsw: reg: Add Switch FDB Notification register definition · f5d88f58
      Jiri Pirko authored
      Add SFN register which is used to poll for newly added and aged-out FDB
      entries.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add Switch Filtering Database register definition · 236033b3
      Jiri Pirko authored
      Add the SFD register which is responsible for filtering database
      manipulation, including static and dynamic FDB entries.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: item: Add MLXSW_ITEM_BUF_INDEXED helper · d64b1592
      Jiri Pirko authored
      Add a missing item helper which allows accessing char bufs at multiple
      offsets. This is needed by SFD and SFN register definitions.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jiri Pirko's avatar
      7b0989b5
    • mlxsw: cmd: Introduce FID-offset flooding tables · 12fd35ab
      Ido Schimmel authored
      Packets destined to offloaded netdevs will be classified to FIDs in the
      device and flooded in case of BUM.
      
      The flooding table used is of type FID-offset, which allows one to
      create different flooding domains for different FIDs and specify the
      offset in the flooding table for each FID (not necessarily equal to FID
      or VID).
      
      Add support for this flooding table type, by exposing the configuration
      of the number of tables from this type and their size.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: cmd: Introduce per-FID flooding tables · 453b6a8d
      Ido Schimmel authored
      In the newly introduced Spectrum switch ASIC, packets destined to
      non-offloaded netdevs will be classified to special FIDs (vFIDs) in the
      device and flooded to the CPU port.
      
      The flooding table used is of type per-FID, which allows one to create
      different flooding domains for different vFIDs.
      
      While using a simple single-entry flood table is certainly sufficient at
      this point, we do plan to offload 802.1D bridges involving VLAN
      interfaces, thus making this change necessary.
      
      Add support for this flooding table type, by exposing the configuration
      of the number of tables from this type and their size.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Enable configuration of flooding domains · bc2055f8
      Ido Schimmel authored
      As part of the introduction of L2 offloads, allow different ports to
      join/leave the flooding domain, according to user configuration.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce pre-change upper device notifier · 573c7ba0
      Jiri Pirko authored
      This newly introduced netdevice notifier is called before the actual
      upper change happens. That lets notifier handlers know that an upper
      change is about to happen and react to it, including the possibility to
      forbid the change. That is valuable for drivers which can check if the
      upper device linkage is supported and forbid that in case it is not.
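
The veto flow described above can be modelled in plain userspace C. This is only a sketch: the event names, the `upper_change` struct, and the "bridge uppers only" policy are hypothetical stand-ins for illustration, not the kernel's notifier API or any driver's real rule.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical events: one sent before the linkage changes, one after. */
enum upper_event { EV_PRECHANGEUPPER, EV_CHANGEUPPER };

/* Hypothetical description of the requested change. */
struct upper_change {
	bool upper_is_bridge; /* kind of upper device being linked */
	bool linking;         /* true when linking, false when unlinking */
};

/* Hypothetical driver handler: refuse uppers it cannot offload. */
static int drv_handler(enum upper_event ev, const struct upper_change *c)
{
	if (ev == EV_PRECHANGEUPPER && c->linking && !c->upper_is_bridge)
		return -EOPNOTSUPP; /* veto before anything has changed */
	return 0;
}

/* Core side: ask first, change only if nobody objected, then announce. */
static int link_upper(const struct upper_change *c, bool *linked)
{
	int err = drv_handler(EV_PRECHANGEUPPER, c);

	if (err)
		return err;          /* state is left untouched on veto */
	*linked = true;              /* the actual change */
	drv_handler(EV_CHANGEUPPER, c);
	return 0;
}
```

Because the veto arrives before `*linked` is written, a rejected request needs no rollback.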
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 125ecf4b
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2015-10-16
      
      This series contains updates to e1000, e1000e, igb, igbvf, ixgbe, ixgbevf,
      i40e, i40evf and fm10k.
      
      Alex Duyck fixes the polling routine for i40e/i40evf where the NAPI budget
      for receive cleanup was being rounded up to 1 but the netpoll call was
      expecting no Rx to be processed as the budget passed was 0.  Also cleaned
      up the IN_NETPOLL flag, which was not adding any value because the receive
      cleanup was handled in NAPI.  Added support for netpoll for i40evf as
      well.
      
      Jesse updates all of our drivers to use napi_complete_done() instead of
      napi_complete(), which allows us to use
      /sys/class/net/ethX/gro_flush_timeout.  Added ethtool support to control
      and report the new Interrupt Limit register, since the XL710 hardware
      has a different interrupt moderation design that can support a limit of
      total interrupts per second per vector.
      
      Shannon cleans up startup log entries to cut down the number by putting
      a couple behind debug flags and combining others into a single line.  Added
      support to enable/disable printing VEB statistics via ethtool.
      
      Jingjing fixes a compile issue by adding const to functions that return
      strings that are not going to be modified.
      
      Greg Rose cleans up defines that were not used and were causing customer
      confusion.
      
      Greg Bowers adds support for setting a new bit in the Set Local LLDP MIB
      admin queue command Type field.
      
      Mitch fixes an issue where vlan_features field was set to the same value
      as netdev features field, but before the features were actually being
      set up, leaving the vlan_features empty.  Resolve the issue by setting
      up the netdev features first, then mask out the VLAN feature bits when
      assigning vlan_features.  Fixed VF init timing, where in some instances
      the VFs would fail to initialize the first time you loaded the driver.
      To correct this, increased the delay time for the init task and wait
      longer before giving up.
      
      v2: fix missing space in function header comment in patch 3, based on
          feedback from Sergei Shtylyov.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e/i40evf: Bump i40e to 1.3.34 and i40evf to 1.3.21 · d1d39516
      Catherine Sullivan authored
      Bump.
      
      Change-ID: I7ec818a507554648675b9b245ced9e6b6bd9ed4e
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: increase AQ work limit · 628f096d
      Mitch Williams authored
      With 64 VFs, we can easily overwhelm the AQ on the PF if we have too low
      a limit on the number of AQ requests. This leads to ARQ overflow errors,
      and occasionally VFs that fail to initialize.
      
      Since we really only hit this condition on initial VF driver load, the
      requests that we process are lightweight, so this extra work doesn't
      cause problems for the PF driver.
      
      Change-ID: I620221520d8af987df6ace9ba938ffaf22107681
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: relax and stagger init timing a bit · 3f7e5c33
      Mitch Williams authored
      On some devices, in some systems, in some configurations, the VFs would
      fail to initialize the first time you loaded the driver.
      
      To correct this, increase the delay time for the init task slightly, and
      wait longer before giving up.
      
      If we enable VFs and load the VF driver in the same kernel as the PF
      driver, we can totally overwhelm the PF driver with AQ requests because
      all of the instances try to initialize at the same time.
      
      To help alleviate this, stagger the initial scheduling of the init task
      using the PCIe function as a multiplier. We mask off the function to
      only three bits so no instance has to wait too long.
      
      With these two changes, initializing 128 VFs on a single device goes
      from four minutes to just a few seconds.
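
A minimal sketch of the staggering described above, assuming a made-up base delay (`INIT_BASE_DELAY_MS` is hypothetical; the commit message does not give the real value or the driver's helper names):

```c
/* Hypothetical base delay; the real driver value is not stated here. */
#define INIT_BASE_DELAY_MS 5u

/*
 * Stagger the first run of the init task by PCIe function number.
 * Only the low three bits are kept, so the multiplier is 0..7 and no
 * instance waits longer than 7 * INIT_BASE_DELAY_MS.
 */
static unsigned int init_task_delay_ms(unsigned int pci_func)
{
	return INIT_BASE_DELAY_MS * (pci_func & 0x7u);
}
```

Masking means functions 0, 8, 16, ... share the same slot, which bounds the wait at the cost of some instances still starting together.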
      
      Change-ID: If3d8720c1c4e838ab36d8781d9ec295a62380936
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Recognize 1000Base_T_Optical phy type when link is up · 48becae6
      Catherine Sullivan authored
      1000Base_T_Optical got added to the function that figures out what
      is supported when link is down but not when link is up. Add it in there
      too so that we display the correct information.
      
      Change-ID: I85ebcdfa7c02d898c44c673b1500552a53c8042e
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: correctly populate vlan_features · cc7e406c
      Mitch Williams authored
      The vlan_features field was correctly being set to the same value as the
      netdev features field. However, this was being done before the features
      were actually being set up, leaving the vlan_features empty.
      
      Also, after a reset, vlan_features will be incorrectly assigned the
      previous netdev feature flags, which can contain VLAN feature bits. This
      makes the VLAN code angry and will cause a stack dump.
      
      To fix these issues, set up the netdev features first, then mask out the
      VLAN feature bits when assigning vlan_features.
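
The ordering fix can be sketched as follows. The flag names and the `netdev_model` struct are hypothetical placeholders for illustration; the kernel's real feature bits and device structure differ.

```c
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define F_SG          (1u << 0)
#define F_IP_CSUM     (1u << 1)
#define F_VLAN_TX     (1u << 2)
#define F_VLAN_RX     (1u << 3)
#define F_VLAN_FILTER (1u << 4)
#define F_VLAN_ALL    (F_VLAN_TX | F_VLAN_RX | F_VLAN_FILTER)

struct netdev_model {
	uint32_t features;
	uint32_t vlan_features;
};

static void setup_features(struct netdev_model *dev, uint32_t hw_features)
{
	/* Step 1: populate the device features first. */
	dev->features = hw_features;
	/* Step 2: derive vlan_features from them, minus the VLAN bits. */
	dev->vlan_features = dev->features & ~F_VLAN_ALL;
}
```

Running the same function again after a reset yields the same result, since the VLAN bits are masked out every time rather than copied from stale state.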
      
      Change-ID: Ib0548869dc83cf6a841cb8697dd94c12359ba4d2
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: reset the invalid msg counter in vf when a valid msg is received · 5d38c93e
      Jingjing Wu authored
      When a VF sends more invalid messages than the allowed limit, the VF
      is disabled.  This can happen if other VF drivers (like DPDK) send a
      message through the driver's mailbox (aka virtchannel) interface that
      is not supported by the i40e PF driver, such as
      CONFIG_PROMISCUOUS_MODE.
      
      This patch changes the num_invalid_msgs in struct i40e_vf to record
      the continuous invalid msgs, and it will be reset when a valid msg
      is received.
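
A small model of the counting change, assuming a made-up threshold (`MAX_CONSECUTIVE_INVALID` and `vf_model` are hypothetical; only the `num_invalid_msgs` field name comes from the commit message):

```c
#include <stdbool.h>

#define MAX_CONSECUTIVE_INVALID 10u /* hypothetical threshold */

struct vf_model {
	unsigned int num_invalid_msgs; /* consecutive invalid messages */
	bool disabled;
};

/* Record one message from the VF and update its state. */
static void vf_note_msg(struct vf_model *vf, bool valid)
{
	if (valid) {
		vf->num_invalid_msgs = 0; /* a valid message resets the run */
		return;
	}
	if (++vf->num_invalid_msgs > MAX_CONSECUTIVE_INVALID)
		vf->disabled = true;
}
```

With the reset in place, a VF that occasionally sends an unsupported message among valid ones never reaches the limit; only an unbroken run of invalid messages disables it.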
      
      Change-ID: Iaec42fd3dcdd281476b3518be23261dd46fc3718
      Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: moderate interrupts differently · ac26fc13
      Jesse Brandeburg authored
      The XL710 hardware has a different interrupt moderation design
      that can support a limit of total interrupts per second per
      vector, in addition to the "number of interrupts per second"
      controls already established in the driver.  This combination
      of hardware features allows us to set very low default latency
      settings but minimize the total CPU utilization by not
      making too many interrupts, should the user desire.
      
      The current driver implementation is still enabling the dynamic
      moderation in the driver, and only using the rx/tx-usecs
      limit in ethtool to limit the interrupt rate per second, by default.
      
      The new code implemented in this patch
      2) adds init/use of the new "Interrupt Limit" register
      3) adds ethtool knob to control/report the limits above
      
      Usage is ethtool -C ethx rx-usecs-high <value> Where <value> is number
      of microseconds to create a rate of 1/N interrupts per second,
      regardless of rx-usecs or tx-usecs values. Since there is a credit based
      scheme in the hardware, the rx-usecs and tx-usecs can be configured for
      very low latency for short bursts, but once the credit runs out the
      refill rate on the credits is limited by rx-usecs-high.
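
As a worked example of the arithmetic, reading the rx-usecs-high value as N microseconds per interrupt gives a ceiling of 1,000,000 / N interrupts per second per vector. The helper below is a hypothetical illustration of that conversion, not driver code, and treating 0 as "no limit" is an assumption of this sketch.

```c
/*
 * Ceiling on interrupts per second per vector for a given
 * rx-usecs-high value. Returning 0 for an input of 0 is a convention
 * of this sketch only (meaning "no limit configured").
 */
static unsigned long max_ints_per_sec(unsigned long usecs_high)
{
	if (usecs_high == 0)
		return 0;
	return 1000000UL / usecs_high;
}
```

For instance, a value of 50 corresponds to at most 20000 interrupts per second, however low rx-usecs and tx-usecs are set.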
      
      Change-ID: I3a1075d3296123b0f4f50623c779b027af5b188d
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Add support for non-willing Apps · 947570e8
      Greg Bowers authored
      Adds support for setting a new bit in the Set Local LLDP MIB AQ command
      Type field.  When set to 1, the bit indicates to FW that Apps should be
      treated as non-willing.  When 0, FW behaves as before.
      
      Change-ID: I0d2101c1606c59c7188d3e6a0c7810e0f205233a
      Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: priv flag for controlling VEB stats · 1cdfd88f
      Shannon Nelson authored
      Add an ethtool priv flag to enable and disable printing
      the VEB statistics.
      
      Change-ID: I7654054a3a73b08aa8310d94ee8fce6219107dd8
      Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Removed unused defines · d9d17cf7
      Greg Rose authored
      Two defines that are not used are causing customer confusion - remove
      them.
      
      Change-ID: Icef0325aca8e0f4fcdfc519e026bdd375e791200
      Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: remove read/write failed messages from nvmupdate · 3c5c4205
      Shannon Nelson authored
      Allow the nvmupdate application to decide when a read or write error
      should be exposed to the user.  Since the application needs to use
      write probes to find the ReadOnly sections on a potentially unknown NVM
      version in the HW and read probes to check the status of the last write,
      some error messages are expected, but need not be shown to the users.
      The driver can't tell ignorable errors from real ones, so it needs
      to let the application make the decision.
      
      Change-ID: I78fca8ab672bede11c10c820b83c26adfd536d03
      Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Fix compile issue related to const string · 4e68adfe
      Jingjing Wu authored
      Add const to functions that return strings that aren't going to be
      modified. This addresses some reported compile complaints.
      
      Change-ID: Ic56b1e814ab4d23a50480e7fdec652445f776ee8
      Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: generate fewer startup messages · 6dec1017
      Shannon Nelson authored
      Cut down on the number of startup log entries by putting a couple behind
      debug flags and combining a couple others into a single line.
      
      Change-ID: I708089f086308f84d43f8b6f0e8a634a02d058fb
      Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • drivers/net/intel: use napi_complete_done() · 32b3e08f
      Jesse Brandeburg authored
      As per Eric Dumazet's previous patches:
      (see commit (24d2e4a5) - tg3: use napi_complete_done())
      
      Quoting verbatim:
      Using napi_complete_done() instead of napi_complete() allows
      us to use /sys/class/net/ethX/gro_flush_timeout
      
      GRO layer can aggregate more packets if the flush is delayed a bit,
      without having to set too big coalescing parameters that impact
      latencies.
      </end quote>
      
      Tested
      configuration: low latency via ethtool -C ethx adaptive-rx off
      				rx-usecs 10 adaptive-tx off tx-usecs 15
      workload: streaming rx using netperf TCP_MAERTS
      
      igb:
      MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
      ...
      Interim result:  941.48 10^6bits/s over 1.000 seconds ending at 1440193171.589
      
      Alignment      Offset         Bytes    Bytes       Recvs   Bytes    Sends
      Local  Remote  Local  Remote  Xfered   Per                 Per
      Recv   Send    Recv   Send             Recv (avg)          Send (avg)
          8       8      0       0 1176930056  1475.36    797726   16384.00  71905
      
      MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
      ...
      Interim result:  941.49 10^6bits/s over 0.997 seconds ending at 1440193142.763
      
      Alignment      Offset         Bytes    Bytes       Recvs   Bytes    Sends
      Local  Remote  Local  Remote  Xfered   Per                 Per
      Recv   Send    Recv   Send             Recv (avg)          Send (avg)
          8       8      0       0 1175182320  50476.00     23282   16384.00  71816
      
      i40e:
      Hard to test because the traffic is incoming so fast (24Gb/s) that GRO
      always receives 87kB, even at the highest interrupt rate.
      
      Other drivers were only compile tested.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Drop useless "IN_NETPOLL" flag · 8b650359
      Alexander Duyck authored
      The code in i40e and i40evf is using an "IN_NETPOLL" flag that has never
      added any value due to the fact that the Rx clean-up is handled in NAPI.
      As such the flag was set, the queue was scheduled via NAPI, and then polled
      from the netpoll controller, and if any Rx packets were processed they were
      processed in the wrong context.
      
      In addition the flag itself just added an unneeded conditional to the
      hot-path so it can safely be dropped and save us a few instructions.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Fix handling of napi budget · c67caceb
      Alexander Duyck authored
      The polling routine for i40e was rounding up the budget for Rx cleanup to
      1.  This is incorrect as the netpoll poll call is expecting no Rx to be
      processed as the budget passed was 0.
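
The budget rule can be illustrated with a short sketch. This is a hypothetical helper written for this example, not the driver's actual polling code: it splits a budget across ring pairs and rounds up to 1 only when the caller passed a non-zero budget.

```c
/*
 * Split a NAPI budget across ring pairs. A budget of 0 (as passed by
 * netpoll) must stay 0 so that no Rx work is done.
 */
static int rx_budget_per_ring(int budget, int num_ringpairs)
{
	int per_ring;

	if (budget <= 0 || num_ringpairs <= 0)
		return 0;
	per_ring = budget / num_ringpairs;
	/* Round up to 1 only for non-zero budgets. */
	return per_ring > 0 ? per_ring : 1;
}
```

An unconditional round-up would turn the netpoll case into a per-ring budget of 1, which is the behaviour the commit describes as incorrect.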
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • net: Fix suspicious RCU usage in fib_rebalance · 51161aa9
      David Ahern authored
      This command:
        ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2
      
      generated this suspicious RCU usage message:
      
      [ 63.249262]
      [ 63.249939] ===============================
      [ 63.251571] [ INFO: suspicious RCU usage. ]
      [ 63.253250] 4.3.0-rc3+ #298 Not tainted
      [ 63.254724] -------------------------------
      [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      [ 63.259450]
      [ 63.259450] other info that might help us debug this:
      [ 63.259450]
      [ 63.262297]
      [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
      [ 63.264647] 1 lock held by ip/2870:
      [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
      [ 63.268858]
      [ 63.268858] stack backtrace:
      [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
      [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
      [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
      [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
      [ 63.283177] Call Trace:
      [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
      [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
      [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
      [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
      [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
      [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
      [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
      [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
      [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
      [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
      [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
      [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
      [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
      [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
      [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
      [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
      [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
      [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
      [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
      [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
      [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
      [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
      [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
      [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
      [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
      [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It looks like all of the code paths to fib_rebalance are under rtnl.
      
      Fixes: 0e884c78 ("ipv4: L3 hash-based multipath")
      Cc: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put · ac00737f
      Tom Herbert authored
      Currently, bpf_prog_uncharge_memlock is only called from __prog_put_rcu
      in the bpf_prog_release path. It also needs to be called from
      bpf_prog_put to get correct accounting.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'robust_listener' · a302afe9
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp/dccp: make our listener code more robust
      
      This patch series addresses request sockets leaks and listener dismantle
      phase. This survives a stress test with listeners being added/removed
      quite randomly.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: fix race at listener dismantle phase · ebb516af
      Eric Dumazet authored
      Under stress, a close() on a listener can trigger the
      WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop()
      
      We need to test if listener is still active before queueing
      a child in inet_csk_reqsk_queue_add()
      
      Create a common inet_child_forget() helper, and use it
      from inet_csk_reqsk_queue_add() and inet_csk_listen_stop()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper · f03f2e15
      Eric Dumazet authored
      Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
      In many cases we also need to release reference on request socket,
      so add a helper to do this, reducing code size and complexity.
      
      Fixes: 4bdc3d66 ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "inet: fix double request socket freeing" · ef84d8ce
      Eric Dumazet authored
      This reverts commit c6973669.
      
      At the time of above commit, tcp_req_err() and dccp_req_err()
      were dead code, as SYN_RECV request sockets were not yet in ehash table.
      
      Real bug was fixed later in a different commit.
      
      We need to revert to not leak a refcount on request socket.
      
      inet_csk_reqsk_queue_drop_and_put() will be added
      in the following commit to make clear that inet_csk_reqsk_queue_drop()
      does not release the reference owned by the caller.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net: get rid of unnecessary initializations in .get_drvinfo() · 47ea0325
      Ivan Vecera authored
      Many drivers needlessly initialize the n_priv_flags, n_stats, testinfo_len,
      eedump_len & regdump_len fields in their .get_drvinfo() ethtool op.
      It's not necessary as these fields are filled in ethtool_get_drvinfo().
      
      v2: removed unused variable
      v3: removed another unused variable
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc-link-improvements' · ae230518
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: some link level code improvements
      
      Extensive testing has revealed some weaknesses and non-optimal solutions
      in the link level code.
      
      This commit series addresses those issues.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: update node FSM when peer RESET message is received · c8199300
      Jon Paul Maloy authored
      The change made in the previous commit revealed a small flaw in the way
      the node FSM is updated. When the function tipc_node_link_down() is
      called for the last link to a node, we should check whether this was
      caused by a local reset or by a received RESET message from the peer.
      In the latter case, we can directly issue a PEER_LOST_CONTACT_EVT to
      the node FSM, so that it is ready to re-establish contact. If this is
      not done, the peer node will sometimes have to go through a second
      establish cycle before the link becomes stable.
      
      We fix this in this commit by conditionally issuing the mentioned
      event in the function tipc_node_link_down(). We also move the LINK_RESET
      FSM event away from the link_reset() function and into the caller
      function, partially because it is easier to follow the code when state
      changes are gathered at a limited number of locations, partially
      because there will be cases in future commits where we don't want the
      link to go RESET mode when link_reset() is called.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: send out RESET immediately when link goes down · 282b3a05
      Jon Paul Maloy authored
      When a link is taken down because of a node local event, such as
      disabling of a bearer or an interface, we currently leave it to the
      peer node to discover the broken communication. The default time for
      such failure discovery is 1.5-2 seconds.
      
      If we instead allow the terminating link endpoint to send out a RESET
      message at the moment it is reset, we can achieve the impression that
      both endpoints are going down instantly. Since this is a very common
      scenario, we find it worthwhile to make this small modification.
      
      Apart from letting the link produce the said message, we also have to
      ensure that the interface is able to transmit it before TIPC is
      detached. We do this by performing the disabling of a bearer in three
      steps:
      
      1) Disable reception of TIPC packets from the interface in question.
      2) Take down the links, while allowing them to send out a RESET message.
      3) Disable transmission of TIPC packets on the interface.
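
The three steps above can be modelled with a toy bearer. All names here (`bearer_model`, `link_down`, `bearer_disable`) are hypothetical and chosen for illustration; the sketch only shows why transmission has to be disabled last.

```c
#include <stdbool.h>

/* Hypothetical model of a bearer and its links. */
struct bearer_model {
	bool rx_enabled;
	bool tx_enabled;
	int links_up;
	int resets_sent;
};

/* Take one link down; a RESET goes out only if Tx is still enabled. */
static void link_down(struct bearer_model *b)
{
	if (b->tx_enabled)
		b->resets_sent++;
	b->links_up--;
}

static void bearer_disable(struct bearer_model *b)
{
	b->rx_enabled = false;   /* 1) stop reception */
	while (b->links_up > 0)  /* 2) take links down, sending RESET */
		link_down(b);
	b->tx_enabled = false;   /* 3) stop transmission */
}
```

If step 3 ran before step 2, `link_down()` would find transmission disabled and no RESET would reach the peer, which would then have to detect the failure on its own.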
      
      Apart from this, we now have to react to the NETDEV_GOING_DOWN event,
      instead of the NETDEV_DOWN event as is currently done, to ensure that such
      transmission is possible during the teardown phase.
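
The three-step ordering above can be illustrated with a minimal sketch. The struct, its fields and the function names are hypothetical stand-ins, not the TIPC bearer API; the only point shown is that transmission stays enabled until every link has had a chance to send its RESET.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a bearer; not the kernel's struct tipc_bearer. */
struct bearer {
	bool rx_enabled;   /* interface may deliver TIPC packets to us */
	bool tx_enabled;   /* we may still transmit on the interface */
	int links_up;      /* links currently using this bearer */
	int resets_sent;   /* RESET messages that actually went out */
};

/* Send a RESET for one link; fails if transmission is already disabled. */
static bool bearer_xmit_reset(struct bearer *b)
{
	if (!b->tx_enabled)
		return false;
	b->resets_sent++;
	return true;
}

static void bearer_disable(struct bearer *b)
{
	/* 1) Stop reception first, so no new traffic re-activates a link. */
	b->rx_enabled = false;

	/* 2) Take down each link while transmission is still possible. */
	while (b->links_up > 0) {
		bearer_xmit_reset(b);
		b->links_up--;
	}

	/* 3) Only now disable transmission. */
	b->tx_enabled = false;
}
```

If step 3 were performed before step 2, bearer_xmit_reset() would fail in this model and the peer would be left to detect the failure by timeout.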
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      282b3a05
    • Jon Paul Maloy's avatar
      tipc: delay ESTABLISH state event when link is established · 73f646ce
      Jon Paul Maloy authored
      Link establishment, just like link teardown, is a non-atomic action, in
      the sense that discovering that conditions are right to establish a link,
      and actually adding the link to one of the node's send slots, are done
      in two different lock contexts. The link FSM is designed to help bridge
      the gap between the two contexts in a safe manner.
      
      We have now discovered a weakness in the implementation of this FSM.
      Because we directly let the link go from state LINK_ESTABLISHING to
      state LINK_ESTABLISHED already in the first lock context, we are unable
      to distinguish between a fully established link, i.e., a link that has
      been added to its slot, and a link that has not yet reached the second
      lock context. It may hence happen that a manual intervention, e.g., when
      disabling an interface, causes the function tipc_node_link_down() to try
      removing the link from the node slots, decrementing its active link
      counter etc, although the link was never added there in the first place.
      
      We solve this by delaying the actual state change until we reach the
      second lock context, inside the function tipc_node_link_up(). This
      makes it possible for potential callers of __tipc_node_link_down() to
      know if they should proceed or not, and the problem is solved.
      
      Unfortunately, the situation described above also has a second problem.
      Since there by necessity is a tipc_node_link_up() call pending once
      the node lock has been released, we must defuse that call by setting
      the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
      us to make a slight modification to the link FSM, which will now look
      as follows.
      
       +------------------------------------+
       |RESET_EVT                           |
       |                                    |
       |                             +--------------+
       |           +-----------------|   SYNCHING   |-----------------+
       |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
       |           |                  A            |                  |
       |           |                  |            |                  |
       |           |                  |            |                  |
       |           |                  |SYNCH_      |SYNCH_            |
       |           |                  |BEGIN_EVT   |END_EVT           |
       |           |                  |            |                  |
       |           V                  |            V                  V
       |    +-------------+          +--------------+          +------------+
       |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
       |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
       |           |        EVT        |    A         RESET_EVT       |
       |           |                   |    |                         |
       |           |  +----------------+    |                         |
       |  RESET_EVT|  |RESET_EVT            |                         |
       |           |  |                     |                         |
       |           |  |                     |ESTABLISH_EVT            |
       |           |  |  +-------------+    |                         |
       |           |  |  | RESET_EVT   |    |                         |
       |           |  |  |             |    |                         |
       |           V  V  V             |    |                         |
       |    +-------------+          +--------------+        RESET_EVT|
       +--->|    RESET    |--------->| ESTABLISHING |<----------------+
            +-------------+ PEER_    +--------------+
             |           A  RESET_EVT       |
             |           |                  |
             |           |                  |
             |FAILOVER_  |FAILOVER_         |FAILOVER_
             |BEGIN_EVT  |END_EVT           |BEGIN_EVT
             |           |                  |
             V           |                  |
            +-------------+                 |
            | FAILINGOVER |<----------------+
            +-------------+
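
As a reading aid, the transitions drawn in the diagram can be written out as a lookup table. This is a sketch transcribed from the diagram only: the enum and function names are hypothetical, and treating any event not drawn for a state as "no state change" is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

enum link_state {
	ST_RESET, ST_ESTABLISHING, ST_ESTABLISHED, ST_SYNCHING,
	ST_RESETTING, ST_PEER_RESET, ST_FAILINGOVER
};

enum link_evt {
	EVT_RESET, EVT_PEER_RESET, EVT_ESTABLISH, EVT_FAILURE,
	EVT_SYNCH_BEGIN, EVT_SYNCH_END, EVT_FAILOVER_BEGIN, EVT_FAILOVER_END
};

struct transition {
	enum link_state from;
	enum link_evt evt;
	enum link_state to;
};

/* One row per arrow in the diagram. */
static const struct transition fsm[] = {
	{ ST_RESET,        EVT_PEER_RESET,     ST_ESTABLISHING },
	{ ST_RESET,        EVT_FAILOVER_BEGIN, ST_FAILINGOVER  },
	{ ST_ESTABLISHING, EVT_ESTABLISH,      ST_ESTABLISHED  },
	{ ST_ESTABLISHING, EVT_RESET,          ST_RESET        }, /* defuses a pending link-up */
	{ ST_ESTABLISHING, EVT_FAILOVER_BEGIN, ST_FAILINGOVER  },
	{ ST_ESTABLISHED,  EVT_SYNCH_BEGIN,    ST_SYNCHING     },
	{ ST_ESTABLISHED,  EVT_FAILURE,        ST_RESETTING    },
	{ ST_ESTABLISHED,  EVT_PEER_RESET,     ST_PEER_RESET   },
	{ ST_ESTABLISHED,  EVT_RESET,          ST_RESET        },
	{ ST_SYNCHING,     EVT_SYNCH_END,      ST_ESTABLISHED  },
	{ ST_SYNCHING,     EVT_FAILURE,        ST_RESETTING    },
	{ ST_SYNCHING,     EVT_PEER_RESET,     ST_PEER_RESET   },
	{ ST_SYNCHING,     EVT_RESET,          ST_RESET        },
	{ ST_RESETTING,    EVT_RESET,          ST_RESET        },
	{ ST_PEER_RESET,   EVT_RESET,          ST_ESTABLISHING },
	{ ST_FAILINGOVER,  EVT_FAILOVER_END,   ST_RESET        },
};

/* Return the next state; events not drawn for 'cur' leave it unchanged (assumption). */
static enum link_state link_fsm_evt(enum link_state cur, enum link_evt evt)
{
	for (size_t i = 0; i < sizeof(fsm) / sizeof(fsm[0]); i++)
		if (fsm[i].from == cur && fsm[i].evt == evt)
			return fsm[i].to;
	return cur;
}
```

In this model a link sitting in ST_ESTABLISHING that receives EVT_RESET falls back to ST_RESET, which is the transition the text describes as necessary to defuse the pending tipc_node_link_up() call.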
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73f646ce
    • Jon Paul Maloy's avatar
      tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Jon Paul Maloy authored
      After the previous commits, we are guaranteed that no attempt will be
      made to add packets of type LINK_PROTOCOL, or packets with illegal
      sequence numbers, to the link deferred queue. This makes it possible to
      make some simplifications to the sorting algorithm in the function
      tipc_skb_queue_sorted().
      
      We also alter the function so that it will drop packets if one with
      the same sequence number is already present in the queue. This is
      necessary because we have identified weird packet sequences, involving
      duplicate packets, where a legitimate in-sequence packet may advance to
      the head of the queue without being detected and de-queued.
      
      Finally, we make this function out-of-line, since it will now be called
      only in exceptional cases.
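
A minimal sketch of a sorted insert that rejects duplicates follows. It uses a plain singly linked list rather than the kernel's sk_buff queue, and all names are hypothetical; the modular comparison is only valid under the assumption that all queued sequence numbers lie within half a 16-bit cycle of each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued buffer. */
struct pkt {
	uint16_t seqno;
	struct pkt *next;
};

/* True if a precedes b in 16-bit modular order. */
static bool seq_less(uint16_t a, uint16_t b)
{
	return a != b && (uint16_t)(b - a) < 32768u;
}

/*
 * Insert p in ascending sequence order.
 * Returns false, leaving the queue untouched, if a packet with the same
 * sequence number is already queued; the caller would then drop p.
 */
static bool queue_insert_sorted(struct pkt **head, struct pkt *p)
{
	struct pkt **link = head;

	/* Skip every queued packet that precedes p. */
	while (*link && seq_less((*link)->seqno, p->seqno))
		link = &(*link)->next;

	/* Duplicate: reject instead of queueing a second copy. */
	if (*link && (*link)->seqno == p->seqno)
		return false;

	p->next = *link;
	*link = p;
	return true;
}
```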
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8306f99a
    • Jon Paul Maloy's avatar
      tipc: improve sequence number checking · 81204c49
      Jon Paul Maloy authored
      The sequence number of an incoming packet is currently only checked
      for less than, equality to, or bigger than the next expected number,
      meaning that the receive window in practice becomes one half sequence
      number cycle, or U16_MAX/2. This does not make sense, and may not even
      be safe if there are extreme delays in the network. Any packet sent by
      the peer during the ongoing cycle must belong inside his current send
      window, or should otherwise be dropped if possible.
      
      Since a link endpoint cannot know its peer's current send window, it
      has to base this sanity check on a worst-case assumption, i.e., that
      the peer is using a maximum sized window of 8191 packets. Using this
      assumption, we now add a check that the sequence number is not bigger
      than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
      done, so that the receive window test is performed before the gap test.
      This way, we are guaranteed that no packets with illegal sequence numbers
      are ever added to the deferred queue.
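
The check order described above can be sketched as follows. The window size of 8191 is taken from the text; the function, enum and macro names are hypothetical, and the sketch relies on unsigned 16-bit wraparound so that packets behind next_expected show up as being far ahead and are dropped by the same window test.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LINK_WIN 8191u /* worst-case peer send window, per the text */

enum rcv_verdict { RCV_DELIVER, RCV_DEFER, RCV_DROP };

/* Forward distance from 'from' to 'to' in 16-bit modular arithmetic. */
static uint16_t seq_dist(uint16_t from, uint16_t to)
{
	return (uint16_t)(to - from);
}

static enum rcv_verdict classify_seqno(uint16_t seqno, uint16_t next_expected)
{
	uint16_t ahead = seq_dist(next_expected, seqno);

	/* Receive window test first: anything beyond the window is dropped. */
	if (ahead > MAX_LINK_WIN)
		return RCV_DROP;

	/* Gap test second: in-window but not next in line, so defer it. */
	if (ahead != 0)
		return RCV_DEFER;

	return RCV_DELIVER;
}
```

For example, with next_expected = 65530, a packet numbered 5 is 11 steps ahead after wraparound and is deferred, while a packet numbered 65529 is 65535 steps ahead and is dropped.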
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81204c49
    • Jon Paul Maloy's avatar
      tipc: simplify tipc_link_rcv() reception loop · f9aa358a
      Jon Paul Maloy authored
      Currently, all packets received in tipc_link_rcv() are unconditionally
      added to the packet deferred queue, whereafter that queue is walked and
      all its buffers evaluated for delivery. This is both non-optimal and
      makes the queue sorting function unnecessarily complex.
      
      This commit changes the loop so that an arrived packet is evaluated
      first, and added to the deferred queue only when a sequence number gap
      is discovered. A non-empty deferred queue is walked until it is empty
      or until its head's sequence number doesn't fit.
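
The reworked loop can be sketched roughly as below. This is a simplified model with hypothetical names, not the body of tipc_link_rcv(): delivery is reduced to a counter, and the receive window and duplicate checks handled elsewhere in the series are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	uint16_t seqno;
	struct pkt *next;
};

struct link {
	uint16_t rcv_nxt;       /* next expected sequence number */
	struct pkt *deferq;     /* out-of-sequence packets, sorted by seqno */
	unsigned int delivered; /* stand-in for handing packets upward */
};

/* Queue p in order of its distance ahead of rcv_nxt. */
static void defer(struct link *l, struct pkt *p)
{
	struct pkt **pos = &l->deferq;

	while (*pos && (uint16_t)((*pos)->seqno - l->rcv_nxt) <
		       (uint16_t)(p->seqno - l->rcv_nxt))
		pos = &(*pos)->next;
	p->next = *pos;
	*pos = p;
}

static void link_rcv(struct link *l, struct pkt *p)
{
	/* Evaluate the arrived packet first; defer only on a sequence gap. */
	if (p->seqno != l->rcv_nxt) {
		defer(l, p);
		return;
	}

	/* Deliver it, then walk the deferred queue while its head fits. */
	do {
		l->rcv_nxt++;
		l->delivered++;
		p = l->deferq;
		if (p && p->seqno == l->rcv_nxt)
			l->deferq = p->next;
		else
			p = NULL;
	} while (p);
}
```

In the common in-sequence case the deferred queue is never touched, which is the saving the commit message refers to.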
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9aa358a