1. 15 Jan, 2015 7 commits
    • Merge branch 'vxlan_group_policy_extension' · 2e62fa69
      David S. Miller authored
      Thomas Graf says:
      
      ====================
      VXLAN Group Policy Extension
      
      Implements support for the Group Policy VXLAN extension [0] to provide
      a lightweight and simple security label mechanism across network peers
      based on VXLAN. The security context and associated metadata are mapped
      to/from skb->mark. This allows further mapping to a SELinux context
      using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
      tc, etc.
      
      The extension is disabled by default and should be run on a distinct
      port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
      which ignore unknown reserved bits will be able to receive VXLAN-GBP
      frames.
      
      Simple usage example:
      
      10.1.1.1:
         # ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
         # iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
      
      10.1.1.2:
         # ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
         # iptables -I INPUT -m mark --mark 0x200 -j DROP
      
      iproute2 [1] and OVS [2] support will be provided in separate patches.
      
      [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
      [1] https://github.com/tgraf/iproute2/tree/vxlan-gbp
      [2] https://github.com/tgraf/ovs/tree/vxlan-gbp
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Support VXLAN Group Policy extension · 1dd144cf
      Thomas Graf authored
      Introduces support for the group policy extension to the VXLAN virtual
      port. The extension is disabled by default and only enabled if the user
      has provided the respective configuration.
      
        ovs-vsctl add-port br0 vxlan0 -- \
           set Interface vxlan0 type=vxlan options:exts=gbp
      
      The configuration interface to enable the extension is based on a new
      attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
      which can carry additional extensions as needed in the future.
      
      The group policy metadata is stored as a binary blob (struct ovs_vxlan_opts)
      internally just like Geneve options but transported as nested Netlink
      attributes to user space.
      
      Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
      binary value kept intact; a new flag TUNNEL_VXLAN_OPT is introduced.
      
      The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
      OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented as mutually exclusive.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Allow for any level of nesting in flow attributes · 81bfe3c3
      Thomas Graf authored
      nlattr_set() is currently hardcoded to two levels of nesting. This change
      introduces struct ovs_len_tbl to define minimal length requirements plus
      next level nesting tables to traverse the key attributes to arbitrary depth.
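
The idea of a length table whose entries can point at a table for the next nesting level can be sketched in plain C. This is a simplified illustration under stated assumptions, not the kernel code: `struct len_tbl`, `struct toy_attr`, `LEN_NESTED`, `TOY_ATTR_MAX` and `attrs_valid()` are hypothetical names, and the toy attribute tree stands in for real Netlink attributes.

```c
#include <stddef.h>

#define TOY_ATTR_MAX 3      /* hypothetical: highest attribute type per level */
#define LEN_NESTED   (-1)   /* hypothetical marker: attribute carries children */

/* One entry per attribute type: minimal length, or a table for the children. */
struct len_tbl {
    int len;                     /* minimal payload length, or LEN_NESTED */
    const struct len_tbl *next;  /* table describing the next nesting level */
};

/* Toy stand-in for a Netlink attribute. */
struct toy_attr {
    int type;
    int len;
    const struct toy_attr *children;
    size_t n_children;
};

/* Walk the attributes recursively; depth is bounded only by the tables. */
static int attrs_valid(const struct toy_attr *a, size_t n,
                       const struct len_tbl *tbl)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i].type < 0 || a[i].type > TOY_ATTR_MAX)
            return 0;                       /* unknown attribute type */
        const struct len_tbl *e = &tbl[a[i].type];
        if (e->len == LEN_NESTED) {
            if (!e->next ||
                !attrs_valid(a[i].children, a[i].n_children, e->next))
                return 0;                   /* a nested level failed */
        } else if (a[i].len < e->len) {
            return 0;                       /* payload shorter than required */
        }
    }
    return 1;
}
```

Because each entry carries its own `next` pointer, adding a third or fourth level only requires another table, not another hardcoded loop.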
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS() · d91641d9
      Thomas Graf authored
      Also factors out Geneve validation code into a new separate function
      validate_and_copy_geneve_opts().
      
      A subsequent patch will introduce VXLAN options. Rename the existing
      GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
      tunnel metadata options.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Only bind to sockets with compatible flags enabled · ac5132d1
      Thomas Graf authored
      A VXLAN net_device looking for an appropriate socket may only consider
      a socket which has a matching set of flags/extensions enabled. If
      incompatible flags are enabled, return a conflict to have the caller
      create a distinct socket with distinct port.
      
      The OVS VXLAN port is kept unaware of extensions at this point.
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Group Policy extension · 3511494c
      Thomas Graf authored
      Implements support for the Group Policy VXLAN extension [0] to provide
      a lightweight and simple security label mechanism across network peers
      based on VXLAN. The security context and associated metadata are mapped
      to/from skb->mark. This allows further mapping to a SELinux context
      using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
      tc, etc.
      
      The group membership is defined by the lower 16 bits of skb->mark, the
      upper 16 bits are used for flags.
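
The 16/16 split of the 32-bit mark described above can be sketched with a few helpers. These are hypothetical user-space functions for illustration only (the names `gbp_mark_pack`, `gbp_mark_group` and `gbp_mark_flags` are assumptions, not kernel API), and the meaning of individual flag bits is left open.

```c
#include <stdint.h>

#define GBP_ID_MASK     0x0000FFFFu  /* lower 16 bits: group membership */
#define GBP_FLAGS_SHIFT 16           /* upper 16 bits: flags */

/* Combine flags and group ID into one 32-bit mark value. */
static inline uint32_t gbp_mark_pack(uint16_t flags, uint16_t group_id)
{
    return ((uint32_t)flags << GBP_FLAGS_SHIFT) | group_id;
}

/* Extract the group ID from the lower half of the mark. */
static inline uint16_t gbp_mark_group(uint32_t mark)
{
    return (uint16_t)(mark & GBP_ID_MASK);
}

/* Extract the flags from the upper half of the mark. */
static inline uint16_t gbp_mark_flags(uint32_t mark)
{
    return (uint16_t)(mark >> GBP_FLAGS_SHIFT);
}
```

With this layout, a mark of 0x200 with no flags set corresponds to group 0x200.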
      
      SELinux allows managing labels to secure local resources. However,
      distributed applications require ACLs to be implemented across hosts. This
      is typically achieved by matching on L2-L4 fields to identify the
      original sending host and process on the receiver. On top of that,
      netlabel and specifically CIPSO [1] allow mapping security contexts to
      universal labels. However, netlabel and CIPSO are relatively complex.
      This patch provides a lightweight alternative for overlay network
      environments with a trusted underlay. No additional control protocol
      is required.
      
                 Host 1:                       Host 2:
      
            Group A        Group B        Group B     Group A
            +-----+   +-------------+    +-------+   +-----+
            | lxc |   | SELinux CTX |    | httpd |   | VM  |
            +--+--+   +--+----------+    +---+---+   +--+--+
                \---+---/                     \----+---/
                    |                              |
                +---+---+                      +---+---+
                | vxlan |                      | vxlan |
                +---+---+                      +---+---+
                    +------------------------------+
      
      Backwards compatibility:
      A VXLAN-GBP socket can receive standard VXLAN frames and will assign
      the default group 0x0000 to such frames. A Linux VXLAN socket will
      drop VXLAN-GBP frames. The extension is therefore disabled by default
      and needs to be specifically enabled:
      
         ip link add [...] type vxlan [...] gbp
      
      In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
      must run on a separate port number.
      
      Examples:
       iptables:
        host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
        host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
      
       OVS:
        # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
        # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
      
      [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
      [1] http://lwn.net/Articles/204905/
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 3f3558bb
      David S. Miller authored
      Conflicts:
      	drivers/net/xen-netfront.c
      
      Minor overlapping changes in xen-netfront.c, mostly to do
      with some buffer management changes alongside the split
      of stats into TX and RX.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Jan, 2015 33 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · a6391a92
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Don't use uninitialized data in IPVS, from Dan Carpenter.
      
       2) conntrack race fixes from Pablo Neira Ayuso.
      
       3) Fix TX hangs with i40e, from Jesse Brandeburg.
      
       4) Fix budget return from poll calls in dnet and alx, from Eric
          Dumazet.
      
       5) Fix bogus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
          Jaeger.
      
       6) Fix bug introduced by conversion to list_head in TIPC retransmit
          code, from Jon Paul Maloy.
      
       7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
          Khoroshilov.
      
       8) Fix bridge build with INET disabled, from Arnd Bergmann.
      
       9) Fix netlink array overrun for PROBE attributes in openvswitch, from
          Thomas Graf.
      
      10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
          Prashant Sreedharan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
        tg3: Release tp->lock before invoking synchronize_irq()
        tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
        tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
        team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
        openvswitch: packet messages need their own probe attribtue
        i40e: adds FCoE configure option
        cxgb4vf: Fix queue allocation for 40G adapter
        netdevice: Add missing parentheses in macro
        bridge: only provide proxy ARP when CONFIG_INET is enabled
        neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
        net: fec: fix MDIO bus assignement for dual fec SoC's
        xen-netfront: use different locks for Rx and Tx stats
        drivers: net: cpsw: fix multicast flush in dual emac mode
        cxgb4vf: Initialize mdio_addr before using it
        net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
        usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
        MAINTAINERS: add me as ibmveth maintainer
        tipc: fix bug in broadcast retransmit code
        update ip-sysctl.txt documentation (v2)
        net/at91_ether: prepare and unprepare clock
        ...
    • Merge branch 'tg3-net' · c637dbce
      David S. Miller authored
      Prashant Sreedharan says:
      
      ====================
      tg3: synchronize_irq() should be called without taking locks
      
      v2: Added Reported-by, Tested-by fields and reference to the thread that
          reported the problem
      
      This series addresses the problem reported by Peter Hurley in mail thread
      https://lkml.org/lkml/2015/1/12/1082
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tg3: Release tp->lock before invoking synchronize_irq() · 932f19de
      Prashant Sreedharan authored
      synchronize_irq() can sleep waiting for pending IRQ handlers, so the driver
      should release the tp->lock spin lock before invoking synchronize_irq().
      Reported-by: Peter Hurley <peter@hurleysoftware.com>
      Tested-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tg3: tg3_reset_task() needs to use rtnl_lock to synchronize · db84bf43
      Prashant Sreedharan authored
      Currently tg3_reset_task() uses only tp->lock for synchronizing with code
      paths like tg3_open() etc. But since tp->lock is released before doing
      synchronize_irq(), rtnl_lock should be taken in tg3_reset_task() to
      synchronize it with other code paths.
      Reported-by: Peter Hurley <peter@hurleysoftware.com>
      Tested-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync · 4fd190a9
      Prashant Sreedharan authored
      This is to avoid the race between tg3_timer() and the execution paths
      which do not invoke tg3_timer_stop() and release tp->lock before
      calling synchronize_irq().
      Reported-by: Peter Hurley <peter@hurleysoftware.com>
      Tested-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 48c53db2
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Two bugfixes for arm64.  I will have another pull request next week,
        but otherwise things are calm"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        arm64: KVM: Fix HCR setting for 32bit guests
        arm64: KVM: Fix TLB invalidation by IPA/VMID
    • team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin · b0d11b42
      Jiri Pirko authored
      This patch fixes a race condition that may set count_pending to -1,
      which results in an unwanted large burst of ARP messages (in the case of
      "notify peers").
      
      Consider the following scenario:
      
      count_pending == 2
         CPU0                                           CPU1
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
       team_notify_peers
         atomic_add (adding 1 to count_pending)
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 1)
      					  schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to 0)
         schedule_delayed_work
      					team_notify_peers_work
      					  atomic_dec_and_test (dec count_pending to -1)
      
      Fix this race by using atomic_dec_if_positive, which prevents
      count_pending from dropping below 0.
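
The decrement-only-if-positive behaviour can be sketched in user space with C11 atomics. This is a stand-in for illustration, not the kernel implementation; the function name `dec_if_positive` is an assumption. Like the kernel helper, it returns the old value minus one, so a negative return value means the counter was left untouched.

```c
#include <stdatomic.h>

/* Decrement *v only if the result stays >= 0.
 * Returns the old value minus one; a negative result means no store happened. */
static inline int dec_if_positive(atomic_int *v)
{
    int old = atomic_load(v);
    int dec;

    do {
        dec = old - 1;
        if (dec < 0)
            break;  /* would underflow: leave the counter as it is */
        /* On failure, 'old' is refreshed with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(v, &old, dec));

    return dec;
}
```

A caller that previously rescheduled work on every decrement would instead reschedule only while the return value is not negative, so a late extra invocation cannot push the counter to -1.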
      
      Fixes: fc423ff0 ("team: add peer notification")
      Fixes: 492b200e ("team: add support for sending multicast rejoins")
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 6fb400d3
      Linus Torvalds authored
      Pull s390 fixes from Martin Schwidefsky:
       "Two small performance tweaks, the plumbing for the execveat system
        call and a couple of bug fixes"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/uprobes: fix user space PER events
        s390/bpf: Fix JMP_JGE_X (A > X) and JMP_JGT_X (A >= X)
        s390/bpf: Fix ALU_NEG (A = -A)
        s390/mm: avoid using pmd_to_page for !USE_SPLIT_PMD_PTLOCKS
        s390/timex: fix get_tod_clock_ext() inline assembly
        s390: wire up execveat syscall
        s390/kernel: use stnsm 255 instead of stosm 0
        s390/vtime: Get rid of redundant WARN_ON
        s390/zcrypt: kernel oops at insmod of the z90crypt device driver
    • openvswitch: packet messages need their own probe attribtue · 1ba39804
      Thomas Graf authored
      User space is currently sending an OVS_FLOW_ATTR_PROBE for both flow
      and packet messages. This leads to an out-of-bounds access in
      ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
      OVS_PACKET_ATTR_MAX.
      
      Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
      as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
      while remaining binary compatible with existing OVS binaries.
      
      Fixes: 05da5898 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.")
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
      Tracked-down-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Reviewed-by: Jesse Gross <jesse@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: adds FCoE configure option · 776d4e9f
      Vasu Dev authored
      Adds the FCoE config option I40E_FCOE, so that FCoE can be enabled
      as needed but is otherwise disabled by default.
      
      This also eliminates multiple FCoE config checks; there is now just
      one config check for CONFIG_I40E_FCOE.
      
      I40E FCoE support was added in the 3.17 kernel, and therefore this patch
      should also be applied to the stable 3.17 kernel.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: Vasu Dev <vasu.dev@intel.com>
      Tested-by: Jim Young <jamesx.m.young@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Hariprasad Shenai's avatar
    • Merge tag 'locks-v3.19-1' of git://git.samba.org/jlayton/linux · fb005c47
      Linus Torvalds authored
      Pull file locking fix from Jeff Layton:
       "Just a simple bugfix for a regression that I introduced into v3.18
        with the internal lease API overhaul -- mea culpa.  Kudos to Linda and
        Neil for tracking this down and fixing it"
      
      * tag 'locks-v3.19-1' of git://git.samba.org/jlayton/linux:
        locks: fix NULL-deref in generic_delete_lease
    • ipv6:icmp:remove unnecessary brackets · 9a6b4b39
      zhuyj authored
      There are too many brackets. Maybe only one bracket is enough.
      Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: Add missing parentheses in macro · 4ccce02e
      Benjamin Poirier authored
      For example, one could conceivably call
      	for_each_netdev_in_bond_rcu(condition ? bond1 : bond2, slave)
      and get an unexpected result.
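
The precedence problem can be reproduced with a much smaller macro. The sketch below uses simplified stand-ins (`MATCHES_BAD`, `MATCHES_GOOD` and the two demo functions are hypothetical, not the kernel macro) to show how an unparenthesized argument changes the meaning of a conditional expression.

```c
/* Argument 'target' is not parenthesized: a ternary passed in gets re-parsed. */
#define MATCHES_BAD(x, target)  ((x) == target)
/* Parenthesized argument: the ternary is evaluated first, as intended. */
#define MATCHES_GOOD(x, target) ((x) == (target))

static inline int demo_bad(int x, int cond, int a, int b)
{
    /* Expands to ((x) == cond ? a : b), which parses as ((x) == cond) ? a : b */
    return MATCHES_BAD(x, cond ? a : b);
}

static inline int demo_good(int x, int cond, int a, int b)
{
    /* Expands to ((x) == (cond ? a : b)), a plain comparison yielding 0 or 1 */
    return MATCHES_GOOD(x, cond ? a : b);
}
```

With x = 5, cond = 1, a = 5 and b = 7, the parenthesized form compares 5 with 5, while the unparenthesized form first compares x with cond and then selects b.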
      Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Introduce ovs_tunnel_route_lookup · 3f4c1d87
      Fan Du authored
      Introduce ovs_tunnel_route_lookup to consolidate route lookup
      shared by vxlan, gre, and geneve ports.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 31238e61
      Linus Torvalds authored
      Pull block layer fixes from Jens Axboe:
       "The major part is an update to the NVMe driver, fixing various issues
        around surprise removal and hung controllers.  Most of that is from
        Keith, and parts are simple blk-mq fixes or exports/additions of minor
        functions to aid this effort, and parts are changes directly to the
        NVMe driver.
      
        Apart from the above, this contains:
      
         - Small blk-mq change from me, killing an unused member of the
           hardware queue structure.
      
         - Small fix from Ming Lei, fixing up a few drivers that didn't
           properly check for ERR_PTR() returns from blk_mq_init_queue()"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        NVMe: Fix locking on abort handling
        NVMe: Start and stop h/w queues on reset
        NVMe: Command abort handling fixes
        NVMe: Admin queue removal handling
        NVMe: Reference count admin queue usage
        NVMe: Start all requests
        blk-mq: End unstarted requests on a dying queue
        blk-mq: Allow requests to never expire
        blk-mq: Add helper to abort requeued requests
        blk-mq: Let drivers cancel requeue_work
        blk-mq: Export if requests were started
        blk-mq: Wake tasks entering queue on dying
        blk-mq: get rid of ->cmd_size in the hardware queue
        block: fix checking return value of blk_mq_init_queue
        block: wake up waiters when a queue is marked dying
        NVMe: Fix double free irq
        blk-mq: Export freeze/unfreeze functions
        blk-mq: Exit queue on alloc failure
    • Merge branch 'vxlan_rco' · 27331353
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      net: Remote checksum offload for VXLAN
      
      This patch set adds support for remote checksum offload in VXLAN.
      
      The remote checksum offload is generalized by creating a common
      function (remcsum_adjust) that does the work of modifying the
      checksum in remote checksum offload. This function can be called
      from normal or GRO path. GUE was modified to use this function.
      
      To support RCO in VXLAN we use the 9th bit in the reserved
      flags to indicate remote checksum offload. The start and offset
      values are encoded in a compressed form in the low order (reserved)
      byte of the vni field.
      
      Remote checksum offload is described in
      https://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
      
      Changes in v2:
        - Add udp_offload_callbacks which has GRO functions that take a
          udp_offload pointer argument. This argument can be used to retrieve
          a per port structure of the encapsulation for use in gro processing
          (mostly by doing container_of on the structure).
        - Use the 10th bit in VXLAN flags for RCO which does not seem to
          conflict with other proposals at this time (i.e. VXLAN-GPE and
          VXLAN-GBP)
        - Require that RCO must be explicitly enabled on the receiver
          as well as the sender.
      
      Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
      
      With UDP checksums and Remote Checksum Offload
        IPv4
            Client
              11.84% CPU utilization
            Server
              12.96% CPU utilization
            9197 Mbps
        IPv6
            Client
              12.46% CPU utilization
            Server
              14.48% CPU utilization
            8963 Mbps
      
      With UDP checksums, no remote checksum offload
        IPv4
            Client
              15.67% CPU utilization
            Server
              14.83% CPU utilization
            9094 Mbps
        IPv6
            Client
              16.21% CPU utilization
            Server
              14.32% CPU utilization
            9058 Mbps
      
      No UDP checksums
        IPv4
            Client
              15.03% CPU utilization
            Server
              23.09% CPU utilization
            9089 Mbps
        IPv6
            Client
              16.18% CPU utilization
            Server
              26.57% CPU utilization
            8954 Mbps
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Remote checksum offload · dfd8645e
      Tom Herbert authored
      Add support for remote checksum offload in VXLAN. This uses a
      reserved bit to indicate that RCO is being done, and uses the low order
      reserved eight bits of the VNI to hold the start and offset values in a
      compressed manner.
      
      Start is encoded in the low order seven bits of VNI. This is start >> 1
      so that the checksum start offset is 0-254 using even values only.
      Checksum offset (transport checksum field) is indicated in the high
      order bit in the low order byte of the VNI. If the bit is set, the
      checksum field is for UDP (so offset = start + 6), else checksum
      field is for TCP (so offset = start + 16). Only TCP and UDP are
      supported in this implementation.
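
The encoding of start and offset into the low-order byte can be sketched as a pair of helpers. The names below (`RCO_START_MASK`, `RCO_UDP_BIT`, `rco_encode`, `rco_decode`) are hypothetical and chosen for illustration; only the bit layout and the +6 / +16 offsets are taken from the description above.

```c
#include <stdint.h>

#define RCO_START_MASK 0x7Fu  /* low seven bits hold start >> 1 */
#define RCO_UDP_BIT    0x80u  /* high bit of the byte: checksum field is UDP */

/* Encode an even checksum start (0..254) and the protocol into one byte. */
static inline uint8_t rco_encode(unsigned int start, int is_udp)
{
    uint8_t b = (uint8_t)((start >> 1) & RCO_START_MASK);

    if (is_udp)
        b |= RCO_UDP_BIT;
    return b;
}

/* Recover start and the offset of the transport checksum field. */
static inline void rco_decode(uint8_t b, unsigned int *start,
                              unsigned int *offset)
{
    *start = (unsigned int)(b & RCO_START_MASK) << 1;
    /* UDP checksum sits 6 bytes into the header, TCP checksum 16 bytes in. */
    *offset = *start + ((b & RCO_UDP_BIT) ? 6u : 16u);
}
```

For example, a TCP segment whose checksum starts at byte 40 is encoded as 20, and decoding yields start 40 and offset 56.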
      
      Remote checksum offload for VXLAN is described in:
      
      https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
      
      Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
      
      With UDP checksums and Remote Checksum Offload
        IPv4
            Client
              11.84% CPU utilization
            Server
              12.96% CPU utilization
            9197 Mbps
        IPv6
            Client
              12.46% CPU utilization
            Server
              14.48% CPU utilization
            8963 Mbps
      
      With UDP checksums, no remote checksum offload
        IPv4
            Client
              15.67% CPU utilization
            Server
              14.83% CPU utilization
            9094 Mbps
        IPv6
            Client
              16.21% CPU utilization
            Server
              14.32% CPU utilization
            9058 Mbps
      
      No UDP checksums
        IPv4
            Client
              15.03% CPU utilization
            Server
              23.09% CPU utilization
            9089 Mbps
        IPv6
            Client
              16.18% CPU utilization
            Server
              26.57% CPU utilization
            8954 Mbps
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: pass udp_offload struct to UDP gro callbacks · a2b12f3c
      Tom Herbert authored
      This patch introduces udp_offload_callbacks, which has the same
      GRO functions (but not a GSO function) as offload_callbacks,
      except that a udp_offload struct pointer is passed as an additional
      argument to the gro_receive and gro_complete functions. This argument
      can be used to retrieve the per port structure of the encapsulation
      for use in gro processing (mostly by doing container_of on the
      structure).
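
The container_of pattern mentioned above can be sketched in user space as follows. The structures are hypothetical stand-ins (`udp_offload_stub`, `tunnel_sock_stub`), and the macro is a simplified version of the kernel's `container_of` without its type checking.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the structure handed to the GRO callbacks. */
struct udp_offload_stub {
    int port;
};

/* Stand-in for a per-port encapsulation structure embedding it. */
struct tunnel_sock_stub {
    int flags;                        /* per-port state the callback needs */
    struct udp_offload_stub offload;  /* embedded member passed to callbacks */
};

/* What a callback could do with the udp_offload pointer it receives. */
static inline struct tunnel_sock_stub *
sock_from_offload(struct udp_offload_stub *uoff)
{
    return container_of(uoff, struct tunnel_sock_stub, offload);
}
```

The callback never needs a lookup table: the pointer it is given already lives inside the per-port structure, so subtracting the member offset is enough to reach it.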
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: only provide proxy ARP when CONFIG_INET is enabled · d92cfdbb
      Arnd Bergmann authored
      When IPV4 support is disabled, we cannot call arp_send from
      the bridge code, which would result in a kernel link error:
      
      net/built-in.o: In function `br_handle_frame_finish':
      :(.text+0x59914): undefined reference to `arp_send'
      :(.text+0x59a50): undefined reference to `arp_tbl'
      
      This makes the newly added proxy ARP support in the bridge
      code depend on the CONFIG_INET symbol and lets the compiler
      optimize the code out to avoid the link error.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 95850116 ("bridge: Add support for IEEE 802.11 Proxy ARP")
      Cc: Kyeyoon Park <kyeyoonp@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: replace tasklet with NAPI · d823ab68
      hayeswang authored
      Replace tasklet with NAPI.
      
      Add rx_queue to queue the remaining rx packets if the number of
      rx packets exceeds the number requested by poll().
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hip04' · 237de6ef
      David S. Miller authored
      Ding Tianhong says:
      
      ====================
      add hisilicon hip04 ethernet driver
      
      v13:
      - Fix the alignment of function parameters and checkpatch warnings.
      
      v12:
      - According to Alex's suggestion, modify the changelog and add MODULE_DEVICE_TABLE
        for hip04 ethernet.
      
      v11:
      - Add ethtool support for getting and setting tx coalesce. xmit_more
        is not supported by this patch, but I think it could work for hip04;
        it will be supported later, after some performance tests.
      
        Here are some performance test results from ping and iperf (with tx_coalesce_frames/usecs added);
        performance and latency look better with tx_coalesce_frames/usecs.
      
        - Before:
          $ ping 192.168.1.1 ...
          === 192.168.1.1 ping statistics ===
          24 packets transmitted, 24 received, 0% packet loss, time 22999ms
          rtt min/avg/max/mdev = 0.180/0.202/0.403/0.043 ms
      
          $ iperf -c 192.168.1.1 ...
          [ ID] Interval       Transfer     Bandwidth
          [  3]  0.0- 1.0 sec   115 MBytes   945 Mbits/sec
      
        - After:
          $ ping 192.168.1.1 ...
          === 192.168.1.1 ping statistics ===
          24 packets transmitted, 24 received, 0% packet loss, time 22999ms
          rtt min/avg/max/mdev = 0.178/0.190/0.380/0.041 ms
      
          $ iperf -c 192.168.1.1 ...
          [ ID] Interval       Transfer     Bandwidth
          [  3]  0.0- 1.0 sec   115 MBytes   965 Mbits/sec
      
      v10:
      - According to Arnd's suggestion, remove the skb_orphan and use the hrtimer
        for the cleanup of the TX queue and add some modification for the hip04
        drivers.
        1) drop the broken skb_orphan call
        2) drop the workqueue
        3) batch cleanup based on tx_coalesce_frames/usecs for better throughput
        4) use a reasonable default tx timeout (200us, could be shortened
           based on measurements) with a range timer
        5) fix napi poll function return value
        6) use a lockless queue for cleanup
      
      v9:
      - There are no tx completion interrupts to free DMAed Tx packets, which means that
        we rely on new tx packets arriving to run the destructors of completed packets,
        which opens up space in their sockets' send queues. Sometimes we don't get such
        new packets, causing Tx to stall; a single UDP transmitter is a good example of
        this situation. So we need a cleanup workqueue to reclaim completed packets;
        the workqueue will only free the last packets, which have already been pending for several jiffies.
        Also includes some format cleanups.
      
      v8:
      - Use poll to reclaim xmitted buffer as workaround since no tx done interrupt
      
      v7:
      - Remove select NET_CORE in 0002
      
      v6:
      - Suggested by Russell: Use netdev_sent_queue & netdev_completed_queue to solve latency issue
        Also shorten the period of timer, which is used to wakeup the queue since no
        tx completed interrupt.
      
      v5:
      - no big change, fix typo
      
      v4:
      - Modify according to the suggestions from Arnd, Florian, Eric, David
        Use of_parse_phandle_with_fixed_args & syscon_node_to_regmap to get ppe info
        Add skb_orphan() and tx_timer for reclaim since there is no tx_finished interrupt
        Update timeout, and move of_phy_connect to probe to reuse open/stop
      
      v3:
      - Suggested by Arnd: use syscon & regmap_write/read to replace static void __iomem *ppebase.
        Modify hisilicon-hip04-net.txt according to suggestions from Florian and Sergei.
      
      v2:
      - Got many suggestions from Russell, Arnd, Florian, Mark and Sergei
        Remove memcpy, use dma_map/unmap_single, use dma_alloc_coherent rather than dma_pool, etc.
        Refer to properties in ethernet.txt, change ppe description, etc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      237de6ef
    • dingtianhong's avatar
      net: hisilicon: new hip04 ethernet driver · a41ea46a
      dingtianhong authored
      Add a driver for the Hisilicon hip04 Ethernet controller, including the
      100M / 1000M controllers. The controller has no tx done interrupt, so
      transmitted buffers are reclaimed in the poll function.
      
      v13: Fix the alignment of function parameters and a checkpatch warning.
      
      v12: Following Alex's suggestion, modify the changelog and add MODULE_DEVICE_TABLE
           for hip04 ethernet.
      
      v11: Add ethtool support for getting and setting tx coalesce parameters. xmit_more
           is not supported by this patch, but I think it could work for hip04;
           support will be added later, after some performance tests.
      
           Here are some performance test results from ping and iperf (with tx_coalesce_frames/usecs added);
           both throughput and latency look better with tx_coalesce_frames/usecs.
      
           - Before:
           $ ping 192.168.1.1 ...
           === 192.168.1.1 ping statistics ===
           24 packets transmitted, 24 received, 0% packet loss, time 22999ms
           rtt min/avg/max/mdev = 0.180/0.202/0.403/0.043 ms
      
           $ iperf -c 192.168.1.1 ...
           [ ID] Interval       Transfer     Bandwidth
           [  3]  0.0- 1.0 sec   115 MBytes   945 Mbits/sec
      
           - After:
           $ ping 192.168.1.1 ...
           === 192.168.1.1 ping statistics ===
           24 packets transmitted, 24 received, 0% packet loss, time 22999ms
           rtt min/avg/max/mdev = 0.178/0.190/0.380/0.041 ms
      
           $ iperf -c 192.168.1.1 ...
           [ ID] Interval       Transfer     Bandwidth
           [  3]  0.0- 1.0 sec   115 MBytes   965 Mbits/sec
      
      v10: Following David Miller's and Arnd Bergmann's suggestions, make some
           modifications to the v9 version
           - drop the workqueue
           - batch cleanup based on tx_coalesce_frames/usecs for better throughput
           - use a reasonable default tx timeout (200us, could be shortened
             based on measurements) with a range timer
           - fix napi poll function return value
           - use a lockless queue for cleanup
      Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a41ea46a
    • Zhangfei Gao's avatar
      net: hisilicon: new hip04 MDIO driver · 4a841ee9
      Zhangfei Gao authored
      Hisilicon hip04 platform mdio driver
      Reuse Marvell phy drivers/net/phy/marvell.c
      Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a841ee9
    • Zhangfei Gao's avatar
      Documentation: add Device tree bindings for Hisilicon hip04 ethernet · ef80c32d
      Zhangfei Gao authored
      This patch adds the Device Tree bindings for the Hisilicon hip04
      Ethernet controller, including 100M / 1000M controller.
      Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef80c32d
    • Jean-Francois Remy's avatar
      neighbour: fix base_reachable_time(_ms) not effective immediately when changed · 4bf6980d
      Jean-Francois Remy authored
      When setting base_reachable_time or base_reachable_time_ms on a
      specific interface through sysctl or netlink, the reachable_time
      value is not updated.
      
      This means that neighbour entries will continue to be updated using the
      old value until it is recomputed in neigh_period_work (which
      recomputes the value every 300*HZ).
      On systems with HZ equal to 1000 for instance, it means 5 minutes before
      the change is effective.
      
      This patch changes this behavior by recomputing reachable_time after
      each set on base_reachable_time or base_reachable_time_ms.
      The new value will become effective the next time the neighbour's timer
      is triggered.
      
      Changes are made in two places: the netlink code for set and the sysctl
      handling code. For sysctl, I use a proc_handler. The ipv6 network
      code does provide its own handler but it already refreshes
      reachable_time correctly so it's not an issue.
      Any other user of neighbour which provides its own handlers must
      refresh reachable_time.
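
The behavior change can be illustrated with a small user-space model. This is a sketch only: the struct and function names are hypothetical, rand() stands in for the kernel's random source, and it assumes reachable_time is drawn uniformly from [base/2, 3*base/2), which is how neigh_rand_reach_time() is understood to behave.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-interface neighbour parameters. */
struct neigh_parms_model {
    unsigned long base_reachable_time; /* value configured by the user */
    unsigned long reachable_time;      /* randomized value actually used */
};

/* Assumed randomization: uniform in [base/2, 3*base/2), or 0 if base is 0. */
static unsigned long rand_reach_time(unsigned long base)
{
    return base ? ((unsigned long)rand() % base) + (base >> 1) : 0;
}

/*
 * Setter as described above: store the new base value and recompute
 * reachable_time right away instead of waiting for the periodic work.
 */
static void set_base_reachable_time(struct neigh_parms_model *p,
                                    unsigned long base)
{
    p->base_reachable_time = base;
    p->reachable_time = rand_reach_time(base);
}
```

After a call such as set_base_reachable_time(&p, 30000), p.reachable_time already lies in the range derived from the new base, so the next timer run uses it.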
      Signed-off-by: Jean-Francois Remy <jeff@melix.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bf6980d
    • Stefan Agner's avatar
      net: fec: fix MDIO bus assignment for dual fec SoC's · 3d125f9c
      Stefan Agner authored
      On i.MX28, the MDIO bus is shared between the two FEC instances.
      The driver makes sure that the second FEC uses the MDIO bus of the
      first FEC. This is done conditionally if FEC_QUIRK_ENET_MAC is set.
      However, in newer designs, such as Vybrid or i.MX6SX, each FEC MAC
      has its own MDIO bus. Simply removing the quirk FEC_QUIRK_ENET_MAC
      is not an option since other logic, triggered by this quirk, is
      still needed.
      
      Furthermore, there are board designs which use the same MDIO bus
      for both PHY's even though the second bus would be available on the
      SoC side. Such layouts are popular since they save pins on the SoC side.
      Due to the above quirk, those boards currently do work fine. The
      boards in the mainline tree with such a layout are:
      - Freescale Vybrid Tower with TWR-SER2 (vf610-twr.dts)
      - Freescale i.MX6 SoloX SDB Board (imx6sx-sdb.dts)
      
      This patch adds a new quirk FEC_QUIRK_SINGLE_MDIO for i.MX28, which
      makes sure that the MDIO bus of the first FEC is used in any case.
      
      However, the boards above do have an SoC with an MDIO bus for each FEC
      instance. But the PHY's are not connected in a 1:1 configuration. A
      proper device tree description is needed to allow the driver to
      figure out where to find its PHY. This patch fixes that shortcoming
      by adding an MDIO bus child node to the first FEC instance, along
      with the two PHY's on that bus, and making use of the phy-handle
      property to add a reference to the PHY's.
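
The bus selection implied by the new quirk can be sketched as below. This is an illustration, not the fec driver code: the quirk names come from the description above, but the bit values and the helper function are assumptions made for the example.

```c
#include <assert.h>

/* Quirk names from the commit message; the bit values are illustrative. */
#define FEC_QUIRK_ENET_MAC    (1u << 0)
#define FEC_QUIRK_SINGLE_MDIO (1u << 1)

/*
 * Hypothetical helper: return the index of the FEC instance whose MDIO
 * bus the instance dev_id should use.
 */
static int fec_mdio_bus_owner(unsigned int quirks, int dev_id)
{
    assert(dev_id >= 0);
    /* Shared-bus SoCs (i.MX28): every instance uses the first FEC's bus. */
    if ((quirks & FEC_QUIRK_SINGLE_MDIO) && dev_id > 0)
        return 0;
    /* Otherwise each FEC has its own bus; the PHY comes from phy-handle. */
    return dev_id;
}
```

In this model FEC_QUIRK_ENET_MAC alone no longer forces the shared bus, which is why boards that wire both PHYs to the first bus need the device tree description mentioned above.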
      Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
      Signed-off-by: Stefan Agner <stefan@agner.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d125f9c
    • Xander Huff's avatar
      net/macb: improved ethtool statistics support · 3ff13f1c
      Xander Huff authored
      Currently `ethtool -S` simply returns "no stats available". It
      would be more useful to see what the various ethtool statistics
      registers' values are. This change implements get_ethtool_stats,
      get_strings, and get_sset_count functions to accomplish this.
      
      Read all GEM statistics registers and sum them into
      macb.ethtool_stats. Add the necessary infrastructure to make this
      accessible via `ethtool -S`.
      
      Update gem_update_stats to utilize ethtool_stats.
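
A rough user-space model of the accumulation step is shown below. It is a sketch under assumptions: the register array, the three statistic names, and the helper functions are hypothetical, and the model assumes the hardware counters reset to zero when read, so each read must be added to a running total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STATS_MODEL 3

/* Hypothetical names, in the order get_strings would report them. */
static const char *const stat_names[NUM_STATS_MODEL] = {
    "tx_frames", "rx_frames", "rx_fcs_errors",
};

struct macb_model {
    uint32_t regs[NUM_STATS_MODEL];          /* stand-in for hardware registers */
    uint64_t ethtool_stats[NUM_STATS_MODEL]; /* running 64-bit totals */
};

/* Assumed hardware behavior: reading a statistics register clears it. */
static uint32_t read_and_clear(struct macb_model *d, size_t i)
{
    uint32_t v = d->regs[i];
    d->regs[i] = 0;
    return v;
}

/* Read every statistics register and add it to the running totals. */
static void update_stats(struct macb_model *d)
{
    for (size_t i = 0; i < NUM_STATS_MODEL; i++)
        d->ethtool_stats[i] += read_and_clear(d, i);
}

/* Name lookup, analogous in spirit to what get_strings exposes. */
static const char *stat_name(size_t i)
{
    assert(i < NUM_STATS_MODEL);
    return stat_names[i];
}
```

Keeping 64-bit totals in software means the reported values keep growing across reads even though each 32-bit register restarts from zero.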
      Signed-off-by: Xander Huff <xander.huff@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ff13f1c
    • Xander Huff's avatar
      net/macb: Adding comments to various #defs to make interpretation easier · 5c2fa0f6
      Xander Huff authored
      This change is to help improve at-a-glance knowledge of the purpose of the
      various Cadence MACB/GEM registers. Comments are more helpful for human
      readability than short acronyms.
      
      Describe the various #define variables for Cadence MACB/GEM registers as documented
      in Xilinx's "Zynq-7000 All Programmable SoC Technical Reference Manual, v1.9.1
      (UG-585)"
      Signed-off-by: Xander Huff <xander.huff@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c2fa0f6
    • David S. Miller's avatar
      Merge branch 'xen-netfront-next' · 6a38cc2b
      David S. Miller authored
      David Vrabel says:
      
      ====================
      xen-netfront: refactor making Tx requests
      
      As netfront has evolved to handle different sorts of skbs the code to
      fill a Tx request has been copied and pasted several times.  The series
      refactors this and a few other areas.
      
      The first patch is to a Xen header but this can be merged via
      net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6a38cc2b
    • David Vrabel's avatar
      xen-netfront: refactor making Tx requests · a55e8bb8
      David Vrabel authored
      Eliminate all the duplicate code for making Tx requests by
      consolidating them into a single xennet_make_one_txreq() function.
      
      xennet_make_one_txreq() and xennet_make_txreqs() work with pages and
      offsets so it will be easier to make netfront handle highmem frags in
      the future.
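
Working with pages and offsets can be sketched as follows. This is a simplified model rather than the netfront code: the request struct, the 4096-byte page size, and the function name are assumptions for illustration. It shows a buffer given as (page, offset, len) being split so that no request crosses a page boundary.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL 4096u /* assumed page size for this sketch */

/* Hypothetical, reduced form of a Tx request: one chunk within one page. */
struct txreq_model {
    size_t page;         /* page index */
    unsigned int offset; /* offset within that page */
    unsigned int size;   /* number of bytes in this request */
};

/*
 * Split the range (page, offset, len) into per-page requests and store
 * them in out[]. Returns the number of requests written.
 */
static size_t make_txreqs_model(size_t page, unsigned int offset,
                                unsigned int len,
                                struct txreq_model *out, size_t max)
{
    size_t n = 0;

    /* Normalize so that offset always lies inside the starting page. */
    page += offset / PAGE_SIZE_MODEL;
    offset %= PAGE_SIZE_MODEL;

    while (len > 0) {
        unsigned int chunk = PAGE_SIZE_MODEL - offset;

        if (chunk > len)
            chunk = len;
        assert(n < max);
        out[n].page = page;
        out[n].offset = offset;
        out[n].size = chunk;
        n++;

        /* Continue at the start of the next page. */
        page++;
        offset = 0;
        len -= chunk;
    }
    return n;
}
```

Because the loop only sees page indices and offsets, the same routine would serve the linear area and the frags alike, which is the kind of consolidation the commit describes.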
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a55e8bb8
    • David Vrabel's avatar
      xen-netfront: refactor skb slot counting · e84448d5
      David Vrabel authored
      A function to count the number of slots an skb needs is more useful
      than one that counts the slots needed for only the frags.
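
One way to picture whole-skb slot counting is the sketch below. It is not the netfront implementation: the frag struct, the page size, and the function name are hypothetical, and it assumes one slot per page touched by the linear area and by each frag.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ_MODEL 4096u /* assumed page size for this sketch */
#define DIV_ROUND_UP_MODEL(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical, reduced description of one skb fragment. */
struct frag_model {
    unsigned int offset; /* offset of the data within its first page */
    unsigned int size;   /* number of bytes in the fragment */
};

/*
 * Count the slots needed for a whole skb: the pages touched by the
 * linear (head) area plus the pages touched by every fragment.
 */
static unsigned int count_skb_slots_model(unsigned int head_offset,
                                          unsigned int head_len,
                                          const struct frag_model *frags,
                                          size_t nr_frags)
{
    unsigned int slots;

    assert(frags != NULL || nr_frags == 0);
    slots = DIV_ROUND_UP_MODEL(head_offset % PAGE_SZ_MODEL + head_len,
                               PAGE_SZ_MODEL);
    for (size_t i = 0; i < nr_frags; i++)
        slots += DIV_ROUND_UP_MODEL(frags[i].offset % PAGE_SZ_MODEL +
                                    frags[i].size, PAGE_SZ_MODEL);
    return slots;
}
```

A caller can compare the single returned total against the free space in the ring, instead of adding a separate head estimate to a frags-only count.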
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e84448d5
    • David Vrabel's avatar
      xen: add page_to_mfn() · 28e98c2c
      David Vrabel authored
      pfn_to_mfn(page_to_pfn(p)) is a common use case so add a generic
      helper for it.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28e98c2c