1. 07 Feb, 2017 37 commits
    • Merge branch 'replace-dst_confirm' · 29ba6e74
      David S. Miller authored
      Julian Anastasov says:
      
      ====================
      net: dst_confirm replacement
      
      	This patchset addresses the problem of neighbour
      confirmation where replies received from one nexthop
      can cause confirmation of a different nexthop when using
      the same dst. Thanks to YueHaibing <yuehaibing@huawei.com>
      for tracking down the dst->pending_confirm problem.
      
      	Sockets can obtain a cached output route. Such
      routes can be to a known nexthop (rt_gateway=IP) or can be
      used simultaneously for different nexthop IPs by different
      subnet prefixes (nh->nh_scope = RT_SCOPE_HOST, rt_gateway=0).
      
      	At first look, there are more problems:
      
      - dst_confirm() sets the flag on dst and not on dst->path;
      as a result, the indication is lost when XFRM is used
      
      - DNAT can change the nexthop, so the nexthop actually used is
      not confirmed
      
      	So, the proposed solution is to avoid using
      dst->pending_confirm.
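
The problem described above can be illustrated with a small userspace model. This is only a sketch: the `toy_*` structs and functions are hypothetical stand-ins, not kernel code, and they model just the flag handling (a flag on the shared route versus a flag carried by each packet).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for kernel objects; all names here are hypothetical. */
struct toy_neigh { int confirmations; };
struct toy_dst   { bool pending_confirm; };     /* old: flag on the shared route */
struct toy_skb   { bool dst_pending_confirm; }; /* new: flag on the packet */

/* Old scheme: a reply seen on the route marks the shared dst. */
void toy_dst_confirm(struct toy_dst *dst)
{
    assert(dst);
    dst->pending_confirm = true;
}

/* Old output path: whoever transmits next consumes the flag. */
void toy_output_old(struct toy_dst *dst, struct toy_neigh *nexthop)
{
    assert(dst && nexthop);
    if (dst->pending_confirm) {
        dst->pending_confirm = false;
        nexthop->confirmations++;
    }
}

/* New output path: only a packet that carries the flag confirms its nexthop. */
void toy_output_new(const struct toy_skb *skb, struct toy_neigh *nexthop)
{
    assert(skb && nexthop);
    if (skb->dst_pending_confirm)
        nexthop->confirmations++;
}
```

With the shared flag, a reply from nexthop A followed by a transmit towards nexthop B confirms B. With the per-packet flag, the confirmation stays with the packet that was marked.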
      
      	The current dst_confirm() usage is as follows:
      
      Protocols confirming dst on received packets:
      - TCP (1 dst per socket)
      - SCTP (1 dst per transport)
      - CXGB*
      
      Protocols supporting sendmsg with MSG_CONFIRM [ | MSG_PROBE ] to
      confirm neighbour:
      - UDP IPv4/IPv6
      - ICMPv4 PING
      - RAW IPv4/IPv6
      - L2TP/IPv6
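
From userspace, the MSG_CONFIRM case in the list above looks roughly like the sketch below. MSG_CONFIRM is the Linux-specific flag documented in send(2) for SOCK_DGRAM and SOCK_RAW sockets; the helper name `send_confirmed()` is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send one datagram and tell the kernel that the peer is known to be
 * reachable, for example because an application-level reply was just
 * received from it. send_confirmed() is a hypothetical helper name.
 */
ssize_t send_confirmed(int fd, const void *buf, size_t len,
                       const struct sockaddr *dst, socklen_t dst_len)
{
    assert(buf != NULL && dst != NULL);
    /* MSG_CONFIRM is Linux-specific; see send(2). */
    return sendto(fd, buf, len, MSG_CONFIRM, dst, dst_len);
}
```

A caller would typically use this only right after receiving a reply from the same peer, since the flag asserts forward progress to the neighbour layer.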
      
      MSG_CONFIRM for other purposes (fix not needed):
      - CAN
      
      Sending without locking the socket:
      - UDP (when no cork)
      - RAW (when hdrincl=1)
      
      Redirects from old to new GW:
      - rt6_do_redirect
      
      	The patchset includes the following changes:
      
      1. sock: add sk_dst_pending_confirm flag
      
      - used only by TCP with patch 4 to remember the received
      indication in sk->sk_dst_pending_confirm
      
      2. net: add dst_pending_confirm flag to skbuff
      
      - skb->dst_pending_confirm will be used by all protocols
      in following patches, via skb_{set,get}_dst_pending_confirm
      
      3. sctp: add dst_pending_confirm flag
      
      - SCTP uses per-transport dsts and can not use
      sk->sk_dst_pending_confirm like TCP
      
      4. tcp: replace dst_confirm with sk_dst_confirm
      
      5. net: add confirm_neigh method to dst_ops
      
      - IPv4 and IPv6 provision for slow neigh lookups for MSG_PROBE users.
      I decided to use neigh lookup only for this case because on
      MSG_PROBE the skb may pass MTU checks but it does not reach
      the neigh confirmation code. This patch will be used from patch 6.
      
      - xfrm_confirm_neigh: we use the last tunnel address, if present.
      When there are only transports, the original dest address is used.
      
      6. net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP
      
      - dst_confirm conversion for UDP, RAW, ICMP and L2TP/IPv6
      
      - these protocols use MSG_CONFIRM propagated by ip*_append_data
      to skb->dst_pending_confirm. sk->sk_dst_pending_confirm is not
      used because some sending paths do not lock the socket. For
      MSG_PROBE we use the slow lookup (dst_confirm_neigh).
      
      - there are also 2 cases that need the slow lookup:
      __ip6_rt_update_pmtu and rt6_do_redirect. I hope
      &ipv6_hdr(skb)->saddr is the correct nexthop address to use here.
      
      7. net: pending_confirm is not used anymore
      
      - I failed to understand the CXGB* code, I see dst_confirm()
      calls but I'm not sure dst_neigh_output() was called. For now
      I just removed the dst->pending_confirm flag and left all
      dst_confirm() calls there. Any better idea?
      
      - Now maybe the old function neigh_output() should be restored
      instead of dst_neigh_output?
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pending_confirm is not used anymore · 51ce8bd4
      Julian Anastasov authored
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      As the last step, we can remove the pending_confirm flag.
      Reported-by: YueHaibing <yuehaibing@huawei.com>
      Fixes: 5110effe ("net: Do delayed neigh confirmation.")
      Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP · 0dec879f
      Julian Anastasov authored
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      
      The datagram protocols can use MSG_CONFIRM to confirm the
      neighbour. When used with MSG_PROBE we do not reach the
      code where neighbour is confirmed, so we have to do the
      same slow lookup by using the dst_confirm_neigh() helper.
      When MSG_PROBE is not used, ip_append_data/ip6_append_data
      will set the skb flag dst_pending_confirm.
      Reported-by: YueHaibing <yuehaibing@huawei.com>
      Fixes: 5110effe ("net: Do delayed neigh confirmation.")
      Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add confirm_neigh method to dst_ops · 63fca65d
      Julian Anastasov authored
      Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6
      to look up and confirm the neighbour. Its usage via the new helper
      dst_confirm_neigh() should be restricted to MSG_PROBE users for
      performance reasons.
      
      For XFRM prefer the last tunnel address, if present. With help
      from Steffen Klassert.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: replace dst_confirm with sk_dst_confirm · c3a2e837
      Julian Anastasov authored
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      Use the new sk_dst_confirm() helper to propagate the
      indication from received packets to sock_confirm_neigh().
      Reported-by: YueHaibing <yuehaibing@huawei.com>
      Fixes: 5110effe ("net: Do delayed neigh confirmation.")
      Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
      Tested-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add dst_pending_confirm flag · c86a773c
      Julian Anastasov authored
      Add a new transport flag to allow sockets to confirm the neighbour.
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      The flag is propagated from the transport to every packet.
      It is reset when the cached dst is reset.
      Reported-by: YueHaibing <yuehaibing@huawei.com>
      Fixes: 5110effe ("net: Do delayed neigh confirmation.")
      Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add dst_pending_confirm flag to skbuff · 4ff06203
      Julian Anastasov authored
      Add a new skbuff flag to allow protocols to confirm the neighbour.
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      
      Add sock_confirm_neigh() helper to confirm the neighbour and
      use it for IPv4, IPv6 and VRF before dst_neigh_output.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: add sk_dst_pending_confirm flag · 9b8805a3
      Julian Anastasov authored
      Add a new sock flag to allow sockets to confirm the neighbour.
      When the same struct dst_entry can be used for many different
      neighbours we can not use it for pending confirmations.
      As not all call paths lock the socket, use a full word for
      the flag.
      
      Add sk_dst_confirm as replacement for dst_confirm when
      called for received packets.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: bcm7xxx: Add BCM74371 PHY ID · b08d46b0
      Florian Fainelli authored
      Add the BCM74371 PHY ID to the list of supported chips. This is a 28nm
      technology Gigabit PHY SoC.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: add psample dependency for spectrum · 8d1fb01d
      Arnd Bergmann authored
      When PSAMPLE is a loadable module, spectrum must not be built-in:
      
      drivers/net/built-in.o: In function `mlxsw_sp_rx_listener_sample_func':
      spectrum.c:(.text+0xe357e): undefined reference to `psample_sample_packet'
      
      This adds a Kconfig dependency to enforce usable configurations.
      
      Fixes: 98d0f7b9 ("mlxsw: spectrum: Add packet sample offloading support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Yotam Gigi <yotamg@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: fix non static symbol warnings · bb4005ba
      Wei Yongjun authored
      Fixes the following sparse warnings:
      
      net/ipv6/seg6_iptunnel.c:58:5: warning:
       symbol 'nla_put_srh' was not declared. Should it be static?
      net/ipv6/seg6_iptunnel.c:238:5: warning:
       symbol 'seg6_input' was not declared. Should it be static?
      net/ipv6/seg6_iptunnel.c:254:5: warning:
       symbol 'seg6_output' was not declared. Should it be static?
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_mirred: remove duplicated include from act_mirred.c · 89d82452
      Wei Yongjun authored
      Remove duplicated include.
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: wan: slic_ds26522: Remove .owner field for driver · fee40221
      Wei Yongjun authored
      Remove .owner field if calls are used which set it automatically.
      
      Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: wan: slic_ds26522: Use module_spi_driver to simplify the code · c3afa995
      Wei Yongjun authored
      module_spi_driver() makes the code simpler by eliminating
      boilerplate code.
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa2-pdata' · 521613c5
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: dsa: Support for pdata in dsa2
      
      This is not exactly new, and was sent before, although back then I did not
      have a user of the pre-declared MDIO board information, but now we do. Note
      that I have additional changes queued up to have b53 register platform data for
      MIPS bcm47xx and bcm63xx.
      
      Yes, I know that we should have the Orion platforms eventually be converted to
      Device Tree, but until that happens, I don't want any remaining users of the
      old "dsa" platform device (hence the previous DTS submissions for ARM/mvebu),
      and there will be platforms out there that will most likely never see DT
      coming their way (BCM47xx is almost 100% sure, BCM63xx maybe not in a distant
      future).
      
      We would probably want the whole series to be merged via David Miller's tree
      to simplify things.
      
      Thanks!
      
      Changes in v5:
      
      - dropped changes to drivers/base/ because after more than a month, we cannot
        get any answer from Greg KH
      
      Changes in v4:
      
      - Changed device_find_class() to device_find_in_class_name()
      - Added kerneldoc above device_find_in_class_name() to explain what it does
        and the calling convention regarding device reference counts
      - Changed dev_to_net_device to device_to_net_device() added comments
        about what it does and the caller conventions regarding reference counts
      
      Changes in v3:
      
      - Tested EPROBE_DEFER from a mockup MDIO/DSA switch driver and everything
        is fine, once the driver finally probes we have access to platform data
        as expected
      
      - added comment above dsa_port_is_valid() that port->name is mandatory
        for platform data cases
      
      - added an extra check in dsa_parse_member() for a NULL pdata pointer
      
      - fixed a bunch of checkpatch errors and warnings
      
      Changes in v2:
      
      - Rebased against latest net-next/master
      
      - Moved dev_find_class() to device_find_class() into drivers/base/core.c
      
      - Moved dev_to_net_device into net/core/dev.c
      
      - Utilize dsa_chip_data directly instead of dsa_platform_data
      
      - Augmented dsa_chip_data to be multi-CPU port ready
      
      Changes from last submission (few months back):
      
      - rebased against latest net-next
      
      - do not introduce dsa2_platform_data, which was overkill and was meant to
        allow us to do exactly the same things with platform data and Device Tree;
        we use the existing dsa_platform_data instead
      
      - properly register MDIO devices when the MDIO bus is registered and associate
        platform_data with them
      
      - add a change to the Orion platform code to demonstrate how this can be used
      
      Thank you
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARM: orion: Register DSA switch as a MDIO device · 575e93f7
      Florian Fainelli authored
      Utilize the ability to pass board specific MDIO bus information towards a
      particular MDIO device thus allowing us to provide the per-port switch layout
      to the Marvell 88E6XXX switch driver.
      
      Since we would end up with conflicting registration paths, do not register the
      "dsa" platform device anymore.
      
      Note that the MDIO devices registered by code in net/dsa/dsa2.c do not
      parse a dsa_platform_data, but directly take a dsa_chip_data (specific
      to a single switch chip), so we update the different call sites to pass
      this structure down to orion_ge00_switch_init().
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Allow pre-declaration of MDIO devices · 648ea013
      Florian Fainelli authored
      Allow board support code to collect pre-declarations for MDIO devices by
      registering them with mdiobus_register_board_info(). SPI and I2C buses
      have a similar feature; we were missing this for MDIO devices, but this
      is particularly useful for e.g. MDIO-connected switches which need to
      provide their port layout (often board-specific) to an MDIO Ethernet
      switch driver.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Add support for platform data · 71e0bbde
      Florian Fainelli authored
      Allow drivers to use the new DSA API with platform data. Most of the
      code in net/dsa/dsa2.c does not rely so much on device_nodes and can get
      the same information from platform_data instead.
      
      We purposely do not support distributed configurations with platform
      data, so drivers should be providing a pointer to a 'struct
      dsa_chip_data' structure if they wish to communicate per-port layout.
      
      Multiple CPU ports could potentially be supported, and dsa_chip_data is
      extended to receive up to one reference to an upstream network device
      per port described by a dsa_chip_data structure.
      
      dsa_dev_to_net_device() increments the network device's reference count,
      so we intentionally call dev_put() to be consistent with the DT-enabled
      path, until we have a generic notifier based solution.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Rename and export dev_to_net_device() · 14b89f36
      Florian Fainelli authored
      In preparation for using this function in net/dsa/dsa2.c, rename the function
      to make its scope DSA specific, and export it.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Refactor remaining port setup · a23b2961
      Andrew Lunn authored
      Move the remaining port configuration code which varies per device
      into port.c, using ops where necessary. This makes
      mv88e6xxx_6185_family() and mv88e6xxx_6095_family() unused, so remove
      them.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Implement Clause 45 access to SMI devices · cf3e80df
      Andrew Lunn authored
      The mv88e6390 MDIO bus controllers can support clause 45 accesses.
      The internal SERDES interfaces need this, and it is likely external
      10G PHYs will be clause 45.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mv88e6390-CMODE' · 8661a631
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      Set the CMODE for mv88e6390 ports
      
      The mv88e6390 ports 9 & 10 allow their CMODE to be set. CMODE is part
      of what linux defines as phy-mode. Add the needed phy-modes to linux,
      and add code which will act upon the phy-mode property to configure
      the switch port.
      
      These patches have been posted before as part of a bigger patchset
      which has now been broken up. I've added the received reviewed by
      tags, and added device tree documentation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Set the CMODE for mv88e6390 ports 9 & 10 · f39908d3
      Andrew Lunn authored
      Unlike most ports, ports 9 and 10 of the 6390X family have configurable
      PHY modes. Set the mode as part of adjust_link().
      
      Ordering is important, because the SERDES interfaces connected to
      ports 9 and 10 can be split and assigned to other ports. The CMODE has
      to be correctly set before the SERDES interface on another port can be
      configured. Such configuration is likely to be performed in
      port_enable() and port_disable(), called on slave_open() and
      slave_close().
      
      The simple case is when ports 9 and 10 are used for 'CPU' or 'DSA'. In this
      case, the CMODE is set via a phy-mode in dsa_cpu_dsa_setup(), which is
      called early in the switch setup.
      
      When ports 9 or 10 are used as user ports and have a fixed-phy, then when
      the fixed-phy is attached, dsa_slave_adjust_link() is called,
      which results in the adjust_link function being called, setting the
      CMODE. The port_enable() for other ports will be called much
      later.
      
      When ports 9 or 10 are used as user ports and have a real phy attached
      which does not use all the available SERDES interfaces, e.g. a 1Gbps
      SGMII, there is currently no mechanism in place to set the CMODE of
      the port from software. It must be hoped the strapping resistors are
      correct.
      
      At the same time, add a function to get the cmode. This will be needed
      when configuring the SERDES interfaces.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Add 2000base-x, 2500base-x and rxaui modes · 55601a88
      Andrew Lunn authored
      The mv88e6390 ports 9 and 10 support some additional PHY modes. Add
      these modes to the PHY core so they can be used in the binding.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'virtio_net-XDP-adjust_head' · 108d9c71
      David S. Miller authored
      John Fastabend says:
      
      ====================
      XDP adjust head support for virtio
      
      This series adds adjust head support for virtio. The following is my
      test setup. I use qemu + virtio as follows,
      
      ./x86_64-softmmu/qemu-system-x86_64 \
        -hda /var/lib/libvirt/images/Fedora-test0.img \
        -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
        -device virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9
      
      In order to use XDP with virtio, until LRO is supported, TSO must be
      turned off in the host. The important fields in the above command line
      are the following,
      
        guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off
      
      Also note it is possible to consume more queues than can be supported,
      because when XDP is enabled for retransmit, XDP attempts to use a queue
      per cpu. My standard queue count is 'queues=4'.
      
      After loading the VM I run the relevant XDP test programs in,
      
        ./samples/bpf
      
      For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
      with iperf (-d option to get bidirectional traffic), ping, and pktgen.
      I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
      the normal traffic path to the stack continues to work with XDP loaded.
      
      It would be great to automate this soon. At the moment I do it by hand
      which is starting to get tedious.
      
      v2: original series dropped trace points after merge.
      ====================
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: XDP support for adjust_head · 2de2f7f4
      John Fastabend authored
      Add support for XDP adjust head by allocating a 256B header region
      that XDP programs can grow into. This is only enabled when an XDP
      program is loaded.
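
The role of the reserved header region can be sketched with a userspace model of an adjust-head operation. The `toy_*` names are hypothetical; the 256-byte headroom comes from the text above, and the bounds check is only a simplified model of what an XDP adjust-head helper has to enforce, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_HEADROOM 256 /* the 256B region named in the commit message */
#define TOY_ETH_HLEN 14  /* Ethernet header length */

struct toy_xdp_buff {
    unsigned char *data_hard_start; /* start of the reserved headroom */
    unsigned char *data;            /* current start of the packet */
    unsigned char *data_end;        /* one past the last packet byte */
};

/* Place a packet of pkt_len bytes behind TOY_HEADROOM bytes of headroom. */
void toy_xdp_init(struct toy_xdp_buff *x, unsigned char *buf, size_t pkt_len)
{
    assert(x && buf);
    x->data_hard_start = buf;
    x->data = buf + TOY_HEADROOM;
    x->data_end = x->data + pkt_len;
}

/* Negative delta grows the packet into the headroom; positive delta trims it. */
int toy_xdp_adjust_head(struct toy_xdp_buff *x, int delta)
{
    ptrdiff_t off = (x->data - x->data_hard_start) + delta;
    ptrdiff_t max = (x->data_end - x->data_hard_start) - TOY_ETH_HLEN;

    /* Reject moves outside the buffer or past the last possible header. */
    if (off < 0 || off > max)
        return -EINVAL;
    x->data = x->data_hard_start + off;
    return 0;
}
```

Without headroom in front of `data`, every negative delta would be rejected, which is why buffers posted before the program was loaded cannot be reused as-is.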
      
      In order to ensure that we do not have to unwind queue headroom, push
      queue setup below bpf_prog_add. It reads better to do a prog ref
      unwind vs another queue setup call.
      
      At the moment this code must do a full reset to ensure old buffers
      without headroom on program add or with headroom on program removal
      are not used incorrectly in the datapath. Ideally we would only
      have to disable/enable the RX queues being updated, but there is no
      API to do this at the moment in virtio, so use the big hammer. In
      practice it is likely not that big of a problem, as this will only
      happen when XDP is enabled/disabled; changing programs does not
      require the reset. There is some risk that the driver may either
      have an allocation failure or for some reason fail to correctly
      negotiate with the underlying backend; in this case the driver will
      be left uninitialized. I have not seen this ever happen on my test
      systems and, for what it's worth, this same failure case can occur
      from probe and other contexts in the virtio framework.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: refactor freeze/restore logic into virtnet reset logic · 9fe7bfce
      John Fastabend authored
      For XDP we will need to reset the queues to allow for buffer headroom
      to be configured. In order to do this we need to essentially run the
      freeze()/restore() code path. Unfortunately, the locking requirements
      of the freeze/restore and reset paths are different, so
      we can not simply reuse the code.
      
      This patch refactors the code path and adds a reset helper routine.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: remove duplicate queue pair binding in XDP · 722d8283
      John Fastabend authored
      Factor out qp assignment.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: factor out xdp handler for readability · 0354e4d1
      John Fastabend authored
      At this point the do_xdp_prog is mostly if/else branches handling
      the different modes of virtio_net. So remove it and handle running
      the program in the per mode handlers.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio_net: wrap rtnl_lock in test for calling with lock already held · 47315329
      John Fastabend authored
      For XDP use case and to allow ethtool reset tests it is useful to be
      able to use reset paths from contexts where rtnl lock is already
      held.
      
      This requires updating virtnet_set_queues and free_receive_bufs, the
      two places where rtnl_lock is taken in virtio_net. To do this we
      use the following pattern,
      
      	_foo(...) { do stuff }
      	foo(...) { rtnl_lock(); _foo(...); rtnl_unlock(); }
      
      this allows us to use freeze()/restore() flow from both contexts.
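
The locked-wrapper pattern can be sketched in userspace as follows. Everything here is a hypothetical model: a boolean stands in for the RTNL mutex so that a double acquisition trips an assertion instead of deadlocking, and `toy_set_queues_nolock()` plays the role of `_foo()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the RTNL mutex; all names here are hypothetical. */
static bool toy_rtnl_held;

void toy_rtnl_lock(void)
{
    assert(!toy_rtnl_held); /* taking it twice would deadlock a real mutex */
    toy_rtnl_held = true;
}

void toy_rtnl_unlock(void)
{
    assert(toy_rtnl_held);
    toy_rtnl_held = false;
}

struct toy_dev { int curr_queue_pairs; };

/* Worker (the "_foo" half): the caller must already hold the lock. */
int toy_set_queues_nolock(struct toy_dev *dev, int pairs)
{
    assert(toy_rtnl_held);
    dev->curr_queue_pairs = pairs;
    return 0;
}

/* Wrapper (the "foo" half): for callers that do not hold the lock. */
int toy_set_queues(struct toy_dev *dev, int pairs)
{
    int err;

    toy_rtnl_lock();
    err = toy_set_queues_nolock(dev, pairs);
    toy_rtnl_unlock();
    return err;
}
```

A path that is entered with the lock already held calls the worker directly, while ordinary callers keep using the wrapper.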
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-improve-cache-utilization' · 152bff37
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      bridge: improve cache utilization
      
      This is the first set which begins to deal with the bad bridge cache
      access patterns. The first patch rearranges the bridge and port structs
      a little so the frequently (and closely) accessed members are in the same
      cache line. The second patch then moves the garbage collection to a
      workqueue trying to improve system responsiveness under load (many fdbs)
      and more importantly removes the need to check if the matched entry is
      expired in __br_fdb_get which was a major source of false-sharing.
      The third patch is a preparation for the final one, which limits writes to
      the fdb "used" and "updated" fields to at most once per jiffy.
      If properly configured, i.e. ports bound to CPUs (thus updating "updated"
      locally) then the bridge's HitM goes from 100% to 0%, but even without
      binding we get a win because previously every lookup that iterated over
      the hash chain caused false-sharing due to the first cache line being
      used for both mac/vid and used/updated fields.
      
      Some results from tests I've run:
      (note that these were run in good conditions for the baseline, everything
       ran on a single NUMA node and there were only 3 fdbs)
      
      1. baseline
      100% Load HitM on the fdbs (between everyone who has done lookups and hit
                                  one of the 3 hash chains of the communicating
                                  src/dst fdbs)
      Overall 5.06% Load HitM for the bridge, first place in the list
      
      2. patched & ports bound to CPUs
      0% Local load HitM, bridge is not even in the c2c report list
      Also there's 3% consistent improvement in netperf tests.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: fdb: write to used and updated at most once per jiffy · 83a718d6
      Nikolay Aleksandrov authored
      Writing once per jiffy is enough to limit the bridge's false sharing.
      After this change the bridge doesn't show up in the local load HitM stats.
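
A minimal sketch of the idea, assuming a toy fdb entry and a caller-supplied tick counter in place of jiffies (both hypothetical): the timestamp is compared first and stored only when it changed, so repeated packets within one tick only read the field.

```c
#include <assert.h>

/* Toy fdb entry; the field name follows the commit text, the rest is hypothetical. */
struct toy_fdb {
    unsigned long updated; /* tick of the last packet seen from this entry */
    unsigned long writes;  /* instrumentation only: counts actual stores */
};

/* Store the timestamp only when it differs from the one already recorded. */
void toy_fdb_touch(struct toy_fdb *fdb, unsigned long now)
{
    assert(fdb);
    if (fdb->updated != now) {
        fdb->updated = now;
        fdb->writes++;
    }
}
```

The `writes` counter exists only to make the effect observable in this model; a real entry would not carry it, since the point is to avoid dirtying the cache line.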
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
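
      The once-per-jiffy idea can be illustrated with a minimal userspace
      sketch. It is not the patch itself: the global jiffies variable, the
      trimmed-down struct fdb_entry and the helper touch_once_per_jiffy()
      are stand-ins invented for this illustration. The point is that the
      store is skipped when the field already holds the current tick, so
      repeated packets within one jiffy only read the cache line.

```c
#include <stdbool.h>

/* Stand-in for the kernel's tick counter (assumption: a plain global here). */
static unsigned long jiffies;

/* Hypothetical, trimmed-down fdb entry: only the two timestamps matter. */
struct fdb_entry {
    unsigned long updated;  /* written when a packet is received from the address */
    unsigned long used;     /* written when a packet is forwarded to the address */
};

/*
 * Store the current tick only if the field does not already hold it.
 * Returns true when a write actually happened.
 */
static bool touch_once_per_jiffy(unsigned long *stamp)
{
    unsigned long now = jiffies;

    if (*stamp == now)
        return false;   /* same tick: read-only access, no store issued */
    *stamp = now;
    return true;
}
```

      Under this scheme a burst of packets within one tick costs a single
      store per field instead of one store per packet.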
    • bridge: move write-heavy fdb members in their own cache line · 1214628c
      Nikolay Aleksandrov authored
      Fdb's used and updated fields are written to on every packet forward and
      packet receive respectively. Thus if we are receiving packets from a
      particular fdb, they'll cause false-sharing with everyone who has looked
      it up (even if it didn't match, since mac/vid share a cache line!). The
      "used" field is even worse since it is updated on every packet forward
      to that fdb, thus the standard config where X ports use a single gateway
      results in 100% fdb false-sharing. Note that this patch does not prevent
      the last scenario, but it makes it better for other bridge participants
      which are not using that fdb (and are only doing lookups over it).
      The point is that with this move only communicating parties get the
      false-sharing; a later patch shows how to avoid that too.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
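
      The layout change can be sketched in plain C11. This is only an
      illustration under assumptions: the struct and field names are
      simplified stand-ins, the 64-byte line size is assumed, and alignas
      is used in place of the kernel's own cache-line alignment
      annotations. The lookup key stays in the first cache line, while the
      write-heavy timestamps start on a line of their own.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size */

/* Hypothetical lookup key: MAC address plus VLAN id. */
struct fdb_key {
    unsigned char addr[6];
    unsigned short vlan_id;
};

struct fdb_entry {
    /* Read-mostly part: touched by every lookup that walks the hash chain. */
    struct fdb_entry *next;
    struct fdb_key key;

    /* Write-heavy part: forced onto a fresh cache line. */
    alignas(CACHE_LINE) unsigned long updated;
    unsigned long used;
};
```

      With this layout a lookup that only compares keys never touches the
      line holding the timestamps, so it is not disturbed by writers.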
    • bridge: move to workqueue gc · f7cdee8a
      Nikolay Aleksandrov authored
      Move the fdb garbage collector to a workqueue which fires at least 10
      milliseconds apart and cleans chain by chain allowing for other tasks
      to run in the meantime. With thousands of fdbs the system is much
      more responsive. Most importantly, remove the need to check whether the
      matched entry has expired in __br_fdb_get; that check causes
      false-sharing and is completely unnecessary if we clean up entries. At
      worst we'll get 10ms of traffic for that entry before it gets deleted.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
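
      A rough userspace sketch of such a garbage-collection pass follows.
      All names (struct entry, gc_pass, HASH_SIZE, MIN_GC_DELAY) are
      hypothetical and time is a plain counter in milliseconds. The
      sketch walks the table chain by chain, retires expired entries, and
      returns the delay until the next run, never shorter than 10 ms. In
      the kernel the returned delay would be used to re-arm delayed work,
      which is omitted here.

```c
#include <stddef.h>

#define HASH_SIZE    4    /* tiny table, for illustration only */
#define MIN_GC_DELAY 10   /* ms: minimum spacing between two gc runs */

/* Hypothetical fdb entry; "live" stands in for actually unlinking it. */
struct entry {
    unsigned long updated;   /* time of last refresh, in ms */
    int live;
    struct entry *next;
};

/*
 * One gc pass over all hash chains. Entries whose age has reached
 * "ageing" are retired. Returns how long to wait before the next pass.
 */
static unsigned long gc_pass(struct entry *chains[HASH_SIZE],
                             unsigned long now, unsigned long ageing)
{
    unsigned long next_delay = ageing;

    for (size_t i = 0; i < HASH_SIZE; i++) {
        for (struct entry *e = chains[i]; e; e = e->next) {
            if (!e->live)
                continue;
            unsigned long expires = e->updated + ageing;
            if (expires <= now)
                e->live = 0;                    /* expired: retire it */
            else if (expires - now < next_delay)
                next_delay = expires - now;     /* earliest upcoming expiry */
        }
        /* A real implementation could let other tasks run between chains. */
    }
    return next_delay < MIN_GC_DELAY ? MIN_GC_DELAY : next_delay;
}
```

      Because of the clamp, an entry can outlive its expiry by up to the
      minimum delay, which matches the "at worst 10ms of traffic" remark
      in the commit message.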
    • bridge: modify bridge and port to have often accessed fields in one cache line · 1f90c7f3
      Nikolay Aleksandrov authored
      Reorder net_bridge so the vlan fields are at the beginning, since
      they're checked on every packet even if vlan filtering is disabled.
      For the port, move flags & vlan group to the beginning so they're in the
      same cache line as the port's state (both flags and state are checked
      on each packet).
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable verifier to add 0 to packet ptr · 63dfef75
      William Tu authored
      The patch fixes the case of adding a zero value to the packet
      pointer.  The zero value can come from an immediate operand (BPF_K)
      or from a src_reg of type CONST_IMM.  The patch fixes both; otherwise
      the verifier reports the following error:
        [...]
          R0=imm0,min_value=0,max_value=0
          R1=pkt(id=0,off=0,r=4)
          R2=pkt_end R3=fp-12
          R4=imm4,min_value=4,max_value=4
          R5=pkt(id=0,off=4,r=4)
        269: (bf) r2 = r0     // r2 becomes imm0
        270: (77) r2 >>= 3
        271: (bf) r4 = r1     // r4 becomes pkt ptr
        272: (0f) r4 += r2    // r4 += 0
        addition of negative constant to packet pointer is not allowed
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Mihai Budiu <mbudiu@vmware.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
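
      The intended behaviour can be modelled with a small sketch. It is
      not verifier code: struct pkt_reg and pkt_ptr_add_const() are
      hypothetical names, and the real verifier tracks more state than an
      offset. The log above shows an addition of 0 being rejected with
      the "negative constant" message, which suggests zero was refused
      together with negative values. The sketch rejects only strictly
      negative constants.

```c
#include <stdbool.h>

/* Hypothetical model of a packet-pointer register: only its offset. */
struct pkt_reg {
    int off;
};

/*
 * Add a known constant to a packet pointer.
 * Negative constants are refused; zero and positive ones are accepted.
 */
static bool pkt_ptr_add_const(struct pkt_reg *reg, int imm)
{
    if (imm < 0)
        return false;   /* "addition of negative constant ... not allowed" */
    reg->off += imm;    /* adding 0 leaves the pointer unchanged */
    return true;
}
```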
    • bpf: test for AND edge cases · 29200c19
      Josef Bacik authored
      These two tests are based on the work done for f23cc643.  The first test is
      just a basic one to make sure we don't allow AND'ing negative values, even if it
      would result in a valid index for the array.  The second is a cleaned up version
      of the original testcase provided by Jann Horn that resulted in the commit.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 06 Feb, 2017 3 commits
    • Merge branch 'dsa-add-fabric-notifier' · 9172d2a0
      David S. Miller authored
      Vivien Didelot says:
      
      ====================
      net: dsa: add fabric notifier
      
      When a switch fabric is composed of multiple switch chips, these chips
      must be programmed accordingly when an event occurs on one of them.
      
      One example of such an event is hardware bridging: when a Linux bridge
      spans interconnected chips, they must be programmed to allow external
      ports to ingress frames on their internal ports.
      
      Another example is cross-chip hardware VLANs. Switch chips in-between
      interconnected bridge ports must also configure a given VLAN to allow
      packets to pass through them.
      
      In order to support that, this patchset introduces a non-intrusive
      notifier mechanism. It adds a notifier head in every DSA switch tree
      (the said fabric), and a notifier block in every DSA switch chip.
      
      When an event occurs, it is chained to all notifiers of the fabric.
      Switch chips can react accordingly if they are cross-chip capable.
      
      On a dynamic debug enabled system, bridging a port in a multi-chip
      fabric will print something like this (ZII Rev B board):
      
          # brctl addif br0 lan3
          mv88e6085 0.1:00: crosschip DSA port 1.0 bridged to br0
          mv88e6085 0.4:00: crosschip DSA port 1.0 bridged to br0
          # brctl delif br0 lan3
          mv88e6085 0.1:00: crosschip DSA port 1.0 unbridged from br0
          mv88e6085 0.4:00: crosschip DSA port 1.0 unbridged from br0
      
      Currently only bridging events are added. A patchset introducing support
      for cross-chip hardware bridging configuration in mv88e6xxx will follow
      right after. Then events for switchdev operations are next in line.
      
      We should note that non-switchdev events do not support rolling-back
      switch-wide operations. We'll have to work on closer integration with
      switchdev for that, like introducing new attributes or objects, to
      benefit from the prepare and commit phases.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: introduce bridge notifier · 04d3a4c6
      Vivien Didelot authored
      A slave device will now notify the switch fabric once its port is
      bridged or unbridged, instead of calling its switch operations directly.
      
      This code allows propagating cross-chip bridging events in the fabric.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add switch notifier · f515f192
      Vivien Didelot authored
      Add a notifier block per DSA switch, registered against a notifier head
      in the switch fabric it belongs to.
      
      This infrastructure allows propagating fabric-wide events such as
      port bridging, VLAN configuration, etc. If a DSA switch driver cares
      about cross-chip configuration, such events can be caught.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
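
      The head-plus-blocks arrangement can be sketched in userspace C.
      This does not use the kernel's notifier API. Every name below
      (struct notifier, struct fabric, struct chip, the EV_* codes) is
      made up for illustration. The fabric owns a list head, each chip
      registers one block, and an event raised anywhere is delivered to
      every registered chip.

```c
#include <stddef.h>

/* Hypothetical event codes. */
enum { EV_BRIDGE_JOIN, EV_BRIDGE_LEAVE };

/* One callback node in a singly linked chain. */
struct notifier {
    int (*call)(struct notifier *nb, int event, void *info);
    struct notifier *next;
};

/* The fabric (switch tree) owns the head of the chain. */
struct fabric {
    struct notifier *head;
};

/* A switch chip embeds its notifier as first member so it can be cast back. */
struct chip {
    struct notifier nb;
    int index;
    int bridged_events;   /* counts join events this chip has seen */
};

static void fabric_register(struct fabric *f, struct notifier *nb)
{
    nb->next = f->head;
    f->head = nb;
}

/* Deliver one event to every registered chip; returns how many were called. */
static int fabric_notify(struct fabric *f, int event, void *info)
{
    int called = 0;

    for (struct notifier *nb = f->head; nb; nb = nb->next) {
        nb->call(nb, event, info);
        called++;
    }
    return called;
}

/* Example handler: a cross-chip capable driver reacts, others could ignore. */
static int chip_event(struct notifier *nb, int event, void *info)
{
    struct chip *c = (struct chip *)nb;

    (void)info;
    if (event == EV_BRIDGE_JOIN)
        c->bridged_events++;
    return 0;
}
```

      The caller that raises an event does not need to know which chips
      exist; a driver that does not care about cross-chip events simply
      ignores them in its handler.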