1. 10 Mar, 2014 5 commits
    • netlink: autosize skb lengthes · 9063e21f
      Eric Dumazet authored
      One known problem with netlink is the fact that NLMSG_GOODSIZE is
      really small on PAGE_SIZE==4096 architectures, and it is difficult
      to know in advance what buffer size is used by the application.
      
      This patch adds automatic learning of that size.
      
      The first netlink message will still be limited to ~4K, but if the user
      used bigger buffers, then the following messages will be able to use up
      to 16KB.
      
      This speeds up dump() operations by a large factor and should be safe
      for legacy applications.
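      The learning described above can be sketched in user-space C as
      follows. This is a minimal illustration, not the kernel code:
      struct sock_sketch, note_recv_buffer(), next_dump_size() and the
      4096-byte stand-in for NLMSG_GOODSIZE are assumptions made for the
      example; only the ~4K initial limit and the 16KB cap come from the
      commit message.

```c
#include <stddef.h>

/* Assumed stand-in for the ~4K NLMSG_GOODSIZE on PAGE_SIZE==4096 systems. */
#define SKETCH_GOODSIZE    4096u
/* Upper bound on the learned size, per the commit message (16KB). */
#define SKETCH_MAX_LEARNED 16384u

/* Hypothetical per-socket state: largest receive buffer seen so far. */
struct sock_sketch {
    size_t max_recvmsg_len;
};

/* Record the buffer length the application passed to a receive call. */
static void note_recv_buffer(struct sock_sketch *sk, size_t user_len)
{
    if (user_len > sk->max_recvmsg_len)
        sk->max_recvmsg_len = user_len;
    if (sk->max_recvmsg_len > SKETCH_MAX_LEARNED)
        sk->max_recvmsg_len = SKETCH_MAX_LEARNED;
}

/* Size to allocate for the next dump message. */
static size_t next_dump_size(const struct sock_sketch *sk)
{
    if (sk->max_recvmsg_len > SKETCH_GOODSIZE)
        return sk->max_recvmsg_len;
    return SKETCH_GOODSIZE;
}
```

      A socket that has never received anything gets the small default;
      once the application has read with a larger buffer, later messages
      grow to that size, capped at 16KB.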
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: Use ether_addr_copy and eth_broadcast_addr · cd84ff4d
      Edward Cree authored
      Faster than memcpy/memset on some architectures.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'gianfar-next' · 19433646
      David S. Miller authored
      Claudiu Manoil says:
      
      ====================
      gianfar: Tx timeout issue
      
      There's an older Tx timeout issue showing up on etsec2 devices
      with 2 CPUs.  I pinned this issue down to processing overhead
      incurred by supporting multiple Tx/Rx rings, as explained in
      the 2nd patch below.  But before this, there's also a concurrency
      issue leading to spurious Rx/Tx interrupts, addressed by the
      'Tx NAPI' patch below.
      The Tx timeout can be triggered with multiple Tx flows,
      'iperf -c -P 8' commands, on a 2-CPU etsec2-based (P1020) board.
      
      Before the patches:
      """
      root@p1020rdb-pc:~# iperf -c 172.16.1.3 -n 1000M -P 8 &
      [...]
      root@p1020rdb-pc:~# NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue 1 timed out
      WARNING: at net/sched/sch_generic.c:279
      Modules linked in:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.0-rc3-03386-g89ea59c #23
      task: ed84ef40 ti: ed868000 task.ti: ed868000
      NIP: c04627a8 LR: c04627a8 CTR: c02fb270
      REGS: ed869d00 TRAP: 0700   Not tainted  (3.13.0-rc3-03386-g89ea59c)
      MSR: 00029000 <CE,EE,ME>  CR: 44000022  XER: 20000000
      [...]
      
      root@p1020rdb-pc:~# [ ID] Interval       Transfer     Bandwidth
      [  5]  0.0-19.3 sec  1000 MBytes    434 Mbits/sec
      [  8]  0.0-39.7 sec  1000 MBytes    211 Mbits/sec
      [  9]  0.0-40.1 sec  1000 MBytes    209 Mbits/sec
      [  3]  0.0-40.2 sec  1000 MBytes    209 Mbits/sec
      [ 10]  0.0-59.0 sec  1000 MBytes    142 Mbits/sec
      [  7]  0.0-74.6 sec  1000 MBytes    112 Mbits/sec
      [  6]  0.0-74.7 sec  1000 MBytes    112 Mbits/sec
      [  4]  0.0-74.7 sec  1000 MBytes    112 Mbits/sec
      [SUM]  0.0-74.7 sec  7.81 GBytes    898 Mbits/sec
      
      root@p1020rdb-pc:~# ifconfig eth1
      eth1      Link encap:Ethernet  HWaddr 00:04:9f:00:13:01
                inet addr:172.16.1.1  Bcast:172.16.255.255  Mask:255.255.0.0
                inet6 addr: fe80::204:9fff:fe00:1301/64 Scope:Link
                UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                RX packets:708722 errors:0 dropped:0 overruns:0 frame:0
                TX packets:8717849 errors:6 dropped:0 overruns:1470 carrier:0
                collisions:0 txqueuelen:1000
                RX bytes:58118018 (55.4 MiB)  TX bytes:274069482 (261.3 MiB)
                Base address:0xa000
      
      """
      
      After applying the patches:
      """
      root@p1020rdb-pc:~# iperf -c 172.16.1.3 -n 1000M -P 8 &
      [...]
      root@p1020rdb-pc:~# [ ID] Interval       Transfer     Bandwidth
      [  9]  0.0-70.5 sec  1000 MBytes    119 Mbits/sec
      [  5]  0.0-70.5 sec  1000 MBytes    119 Mbits/sec
      [  6]  0.0-70.7 sec  1000 MBytes    119 Mbits/sec
      [  4]  0.0-71.0 sec  1000 MBytes    118 Mbits/sec
      [  8]  0.0-71.1 sec  1000 MBytes    118 Mbits/sec
      [  3]  0.0-71.2 sec  1000 MBytes    118 Mbits/sec
      [ 10]  0.0-71.3 sec  1000 MBytes    118 Mbits/sec
      [  7]  0.0-71.3 sec  1000 MBytes    118 Mbits/sec
      [SUM]  0.0-71.3 sec  7.81 GBytes    942 Mbits/sec
      
      root@p1020rdb-pc:~# ifconfig eth1
      eth1      Link encap:Ethernet  HWaddr 00:04:9f:00:13:01
                inet addr:172.16.1.1  Bcast:172.16.255.255  Mask:255.255.0.0
                inet6 addr: fe80::204:9fff:fe00:1301/64 Scope:Link
                UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                RX packets:728446 errors:0 dropped:0 overruns:0 frame:0
                TX packets:8690057 errors:0 dropped:0 overruns:0 carrier:0
                collisions:0 txqueuelen:1000
                RX bytes:59732650 (56.9 MiB)  TX bytes:271554306 (258.9 MiB)
                Base address:0xa000
      """
      v2: PATCH 2:
          Replaced CPP check with run-time condition to
          limit the number of queues. Updated comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gianfar: Use Single-Queue polling for "fsl,etsec2" · 71ff9e3d
      Claudiu Manoil authored
      For the "fsl,etsec2" compatible models the driver currently
      supports 8 Tx and Rx DMA rings (aka HW queues).  However, there
      are only 2 pairs of Rx/Tx interrupt lines, as these controllers
      are integrated in low power SoCs with 2 CPUs at most.  As a result,
      there are at most 2 NAPI instances that have to service multiple
      Tx and Rx queues for these devices.  This complicates the NAPI
      polling routine, which has to iterate over the multiple Rx/Tx queues
      hooked to the same interrupt lines.  And there's also an overhead
      at HW level, as the controller needs to service all the 8 Tx rings
      in a round robin manner.  The combined overhead shows up for multiple
      parallel Tx flows transmitted by the kernel stack, when the driver
      usually starts returning NETDEV_TX_BUSY, leading to the NETDEV WATCHDOG
      Tx timeout triggering if the Tx path is congested for too long.
      
      As an alternative, this patch makes the driver support only one
      Tx/Rx DMA ring per NAPI instance (per interrupt group or pair
      of Tx/Rx interrupt lines) by default.  The simplified single queue
      polling routine (gfar_poll_sq) will be the default napi poll routine
      for the etsec2 devices too.  Some adjustments needed to be made to
      link the Tx/Rx HW queues with each NAPI instance (2 in this case).
      The gfar_poll_sq() is already successfully used by older SQ_SG_MODE
      (single interrupt group) controllers.
      This patch fixes Tx timeout triggering under heavy Tx traffic load
      (i.e. iperf -c -P 8) for the "fsl,etsec2" (currently the only
      MQ_MG_MODE devices).  There's also a significant memory footprint
      reduction by supporting 2 Rx/Tx DMA rings (at most), instead of 8,
      for these devices.
      Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gianfar: Separate out the Tx interrupt handling (Tx NAPI) · aeb12c5e
      Claudiu Manoil authored
      There are some concurrency issues on devices w/ 2 CPUs related
      to the handling of Rx and Tx interrupts.  eTSEC has separate
      interrupt lines for Rx and Tx but a single imask register
      to mask these interrupts and a single NAPI instance to handle
      both Rx and Tx work.  As a result, the Rx and Tx ISRs are
      identical: both invoke gfar_schedule_cleanup().  However,
      both handlers can be entered at the same time when the Rx and
      Tx interrupts are taken by different CPUs.  In this case
      spurious interrupts (SPU) show up (in /proc/interrupts),
      indicating a concurrency issue.  Also, Tx overruns followed
      by Tx timeout have been observed under heavy Tx traffic load.
      
      To address these issues, the schedule cleanup ISR part has
      been changed to handle the Rx and Tx interrupts independently.
      The patch adds a separate NAPI poll routine for Tx cleanup to
      be triggered independently by the Tx confirmation interrupts
      only.  Existing poll functions are modified to handle only
      the Rx path processing.  The Tx poll routine does not need a
      budget, since Tx processing doesn't consume NAPI budget, and
      hence it is registered with minimum NAPI weight.
      NAPI scheduling does not require locking since there are
      different NAPI instances between the Rx and Tx confirmation
      paths now.
      So, the patch fixes the occurrence of spurious Rx/Tx interrupts.
      Tx overruns also occur less frequently now.
      Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Mar, 2014 3 commits
  3. 08 Mar, 2014 16 commits
  4. 07 Mar, 2014 16 commits
    • 6lowpan: reassembly: fix return of init function · 37147652
      Alexander Aring authored
      This patch adds a missing return after fragmentation init. Otherwise we
      register a sysctl interface and deregister it afterwards which makes no
      sense.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'linux-can-next-for-3.15-20140307' of git://gitorious.org/linux-can/linux-can-next · d03e9d07
      David S. Miller authored
      Marc Kleine-Budde says:
      
      ====================
      pull-request: can-next 2014-02-12
      
      This is a pull request of twelve patches for net-next/master.
      
      Alexander Shiyan contributes two patches for the mcp251x, one making
      the driver quieter and the other improving the compile time
      coverage by removing the #ifdef CONFIG_PM_SLEEP. Then two patches for
      the flexcan driver by me, one removing the #ifdef CONFIG_PM_SLEEP, too,
      the other one making use of platform_get_device_id(). Another patch by
      me which converts the janz-ican3 driver to use netdev_<level>(). The
      remaining 7 patches are by Oliver Hartkopp, they add CAN FD support to
      the netlink configuration interface.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'r8152' · a5d5ff57
      David S. Miller authored
      Hayes Wang says:
      
      ====================
      r8152: tx/rx improvement
      
       - Select the suitable spin lock for each function.
       - Add an additional check to reduce spin lock usage.
       - Raise the priority of tx to avoid being interrupted by rx.
       - Support rx checksum, large send, and IPv6 hw checksum.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: support IPv6 · 6128d1bb
      hayeswang authored
      Support hw IPv6 checksum for TCP and UDP packets.
      
      Note that the hw limits the range of the transport offset. Besides,
      the TCP pseudo header used by the hw for IPv6 TSO is based on the
      Microsoft document, which excludes the packet length.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: support TSO · 60c89071
      hayeswang authored
      Support scatter gather and TSO.
      
      Adjust the tx checksum function and set the max gso size to fix the
      size of the tx aggregation buffer.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: support rx checksum · 565cab0a
      hayeswang authored
      Support hw rx checksum for TCP and UDP packets.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: calculate the dropped packets for rx · 5e2f7485
      hayeswang authored
      Continue dealing with the remaining rx packets, even if the allocation
      of an skb fails. This allows the dropped packets to be counted correctly.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: up the priority of the transmission · 0c3121fc
      hayeswang authored
      Move tx_bottom() from delayed_work to a tasklet. This balances rx
      and tx. If the device is in runtime suspend when getting
      a tx packet, wake up the device before transmitting.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: check tx agg list before spin lock · 21949ab7
      hayeswang authored
      Check the tx agg list before taking the spin lock to avoid taking the
      lock every time.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: replace spin_lock_irqsave and spin_unlock_irqrestore · 2685d410
      hayeswang authored
      Use spin_lock and spin_unlock in interrupt context.
      
      The ndo_start_xmit would not be called in interrupt context, so
      replace the related spin_lock_irqsave and spin_unlock_irqrestore
      with spin_lock_bh and spin_unlock_bh.
      Signed-off-by: Hayes Wang <hayeswang@realtek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 91bd66e4
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to i40e and i40evf.
      
      Most notable are:
      Joseph completes the implementation of the ethtool ntuple rule
      management interface by adding the get, update and delete interface
      reset.
      
      Akeem provides a fix to prevent a possible overflow due to the
      multiplication of number and size in a kzalloc call, by using kcalloc instead.
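      The overflow concern can be illustrated with a user-space analogue.
      This is a sketch only, not i40e or kernel code: zalloc_array_sketch()
      is a hypothetical helper that mirrors the idea behind kcalloc(),
      namely refusing the allocation when n * size would wrap around
      instead of silently allocating a too-small buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: allocate a zeroed array of n elements of the
 * given size, returning NULL if n * size would overflow size_t.
 */
static void *zalloc_array_sketch(size_t n, size_t size)
{
    void *p;

    /* Detect overflow before multiplying. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;

    p = malloc(n * size);
    if (p)
        memset(p, 0, n * size);
    return p;
}
```

      With an open-coded multiplication, a huge element count could wrap
      to a small byte count; the explicit check turns that case into an
      allocation failure the caller can handle.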
      
      Jesse provides an implementation for skb_set_hash() and adds the L4 type
      return when we know it is an L4 hash.  He also adds a counter to
      statistics for Tx timeouts to help users.  Lastly he provides a change
      to stay away from the cache line where the done bit may be getting
      written back for the transmit ring since the hardware may be writing the
      whole cache line for a partial update.
      
      Shannon cleans up code comments.
      
      Anjali removes a firmware workaround for newer firmware since the number
      of MSIx vectors is being reported correctly.
      
      v2:
       -  dropped patch 01 of the series based on feedback from the author
          Joe Perches and Shannon Nelson.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'rxrpc-devel-20140304' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 38940042
      David S. Miller authored
      David Howells says:
      
      ====================
      net-next: AF_RXRPC fixes and development
      
      Here are some AF_RXRPC fixes:
      
       (1) Fix to remove incorrect checksum calculation made during recvmsg().  It's
           unnecessary to try to do this there since we check the checksum before
           reading the RxRPC header from the packet.
      
       (2) Fix to prevent the sending of an ABORT packet in response to another
           ABORT packet and inducing a storm.
      
       (3) Fix UDP MTU calculation from parsing ICMP_FRAG_NEEDED packets where we
           don't handle the ICMP packet not specifying an MTU size.
      
      And development patches:
      
       (4) Add sysctls for configuring RxRPC parameters, specifically various delays
           pertaining to ACK generation, the time before we resend a packet for
           which we don't receive an ACK, the maximum time a call is permitted to
           live and the amount of time transport, connection and dead call
           information is cached.
      
       (5) Improve ACK packet production by adjusting the handling of ACK_REQUESTED
           packets, ignoring the MORE_PACKETS flag, delaying the production of
           otherwise immediate ACK_IDLE packets and delaying all ACK_IDLE production
           (barring the call termination) to half a second.
      
       (6) Add more sysctl parameters to expose the Rx window size, the maximum
           packet size that we're willing to receive and the number of jumbo rxrpc
           packets we're willing to handle in a single UDP packet.
      
       (7) Request ACKs on alternate DATA packets so that the other side doesn't
           wait till we fill up the Tx window.
      
       (8) Use a RCU hash table to look up the rxrpc_call for an incoming packet
           rather than stepping through a hierarchy involving several spinlocks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'xen-netback-next' · 4caeccb4
      David S. Miller authored
      Zoltan Kiss says:
      
      ====================
      xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
      
      A long-known problem of the upstream netback implementation is that on the TX
      path (from guest to Dom0) it copies the whole packet from guest memory into
      Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
      huge performance penalty. The classic kernel version of netback used grant
      mapping, and to get notified when the page can be unmapped, it used page
      destructors. Unfortunately that destructor is not an upstreamable solution.
      Ian Campbell's skb fragment destructor patch series [1] tried to solve this
      problem, however it seems to be very invasive on the network stack's code,
      and therefore hasn't progressed very well.
      This patch series uses SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
      know when the skb is freed up. That is the way KVM solved the same problem,
      and based on my initial tests it can do the same for us. Avoiding the extra
      copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
      Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
      running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
      switch)
      Based on my investigations the packet only gets copied if it is delivered to
      Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
      DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
      unfortunate, but luckily it doesn't cause a major regression for this usecase.
      In the future we should try to eliminate that copy somehow.
      There are a few spinoff tasks which will be addressed in separate patches:
      - grant copy the header directly instead of map and memcpy. This should help
        us avoid TLB flushing
      - use something else than ballooned pages
      - fix grant map to use page->index properly
      I've tried to break it down to smaller patches, with mixed results, so I
      welcome suggestions on that part as well:
      1: Use skb->cb to store pending_idx
      2: Some refactoring
      3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
      review it after #4)
      4: Introduce TX grant mapping
      5: Remove old TX grant copy definitions and fix indentations
      6: Add stat counters for zerocopy
      7: Handle guests with too many frags
      8: Timeout packets in RX path
      9: Aggregate TX unmap operations
      
      v2: I've fixed some smaller things, see the individual patches. I've added a
      few new stat counters, and handling the important use case when an older guest
      sends lots of slots. Instead of delayed copy now we timeout packets on the RX
      path, based on the assumption that otherwise packets should get stuck
      anywhere else. Finally some unmap batching to avoid too much TLB flushing.
      
      v3: Apart from fixing a few things mentioned in responses the important change
      is the use of the hypercall directly for grant [un]mapping, therefore we can
      avoid m2p override.
      
      v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue
      timeout logic changed also.
      
      v5: Only minor fixes based on Wei's comments
      
      v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see
      first 2 patches. Also rework of handling packets with too many slots, and
      reorder the series a bit.
      
      v7: Small fixes in comments/log messages/error paths, and merging the frag
      overflow stats patch into its parent.
      
      [1] http://lwn.net/Articles/491522/
      [2] https://lkml.org/lkml/2012/7/20/363
      ====================
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Aggregate TX unmap operations · e9275f5e
      Zoltan Kiss authored
      Unmapping causes TLB flushing, therefore we should do it in the largest
      possible batches. However we shouldn't starve the guest for too long. So if
      the guest has space for at least two big packets and we don't have at least a
      quarter ring to unmap, delay it for at most 1 millisecond.
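      The batching rule above can be written as a small predicate. This
      is a sketch under stated assumptions, not the netback code: the
      function name and all four parameters are hypothetical, and how
      "space for a big packet" is measured in slots is left to the
      caller.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: return true when the unmap should be
 * postponed (for at most 1 millisecond, enforced by the caller).
 *
 * guest_free_slots:      slots the guest can still use to send
 * slots_per_big_packet:  assumed slot cost of one big packet
 * pending_unmaps:        grants currently waiting to be unmapped
 * ring_size:             total number of ring slots
 */
static bool should_delay_unmap(unsigned int guest_free_slots,
                               unsigned int slots_per_big_packet,
                               unsigned int pending_unmaps,
                               unsigned int ring_size)
{
    /* The guest is not starved: it can still send two big packets. */
    bool guest_has_room = guest_free_slots >= 2 * slots_per_big_packet;
    /* The batch is still smaller than a quarter of the ring. */
    bool batch_is_small = pending_unmaps < ring_size / 4;

    return guest_has_room && batch_is_small;
}
```

      Both conditions must hold for the delay: once the guest runs low
      on space or a quarter ring has accumulated, the unmap goes ahead
      immediately.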
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Timeout packets in RX path · 09350788
      Zoltan Kiss authored
      A malicious or buggy guest can leave its queue filled indefinitely, in which
      case qdisc starts to queue packets for that VIF. If those packets came from
      another guest, it can block its slots and prevent shutdown. To avoid that, we
      make sure the queue is drained every 10 seconds.
      In the worst case the QDisc queue usually takes 3 rounds to flush.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Handle guests with too many frags · e3377f36
      Zoltan Kiss authored
      Xen network protocol had implicit dependency on MAX_SKB_FRAGS. Netback has to
      handle guests sending up to XEN_NETBK_LEGACY_SLOTS_MAX slots. To achieve that:
      - create a new skb
      - map the leftover slots to its frags (no linear buffer here!)
      - chain it to the previous through skb_shinfo(skb)->frag_list
      - map them
      - copy and coalesce the frags into a brand new one and send it to the stack
      - unmap the 2 old skb's pages
      
      It also introduces new stat counters, which help determine how often the guest
      sends a packet with more than MAX_SKB_FRAGS frags.
      
      NOTE: if bisect brought you here, you should apply the series up until
      "xen-netback: Timeout packets in RX path", otherwise malicious guests can block
      other guests by not releasing their sent packets.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>