1. 07 Mar, 2014 12 commits
    • Merge branch 'xen-netback-next' · 4caeccb4
      David S. Miller authored
      Zoltan Kiss says:
      
      ====================
      xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
      
      A long-known problem of the upstream netback implementation is that on the TX
      path (from guest to Dom0) it copies the whole packet from guest memory into
      Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
      huge performance penalty. The classic kernel version of netback used grant
      mapping, and to get notified when the page can be unmapped, it used page
      destructors. Unfortunately that destructor is not an upstreamable solution.
      Ian Campbell's skb fragment destructor patch series [1] tried to solve this
      problem, however it seems to be very invasive on the network stack's code,
      and therefore hasn't progressed very well.
      This patch series uses SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
      know when the skb is freed up. That is the way KVM solved the same problem,
      and based on my initial tests it can do the same for us. Avoiding the extra
      copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
      Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
      running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
      switch)
      Based on my investigations the packet only gets copied if it is delivered to
      Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
      DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
      unfortunate, but luckily it doesn't cause a major regression for this usecase.
      In the future we should try to eliminate that copy somehow.
      There are a few spinoff tasks which will be addressed in separate patches:
      - grant copy the header directly instead of map and memcpy. This should help
        us avoid TLB flushing
      - use something other than ballooned pages
      - fix grant map to use page->index properly
      I've tried to break it down to smaller patches, with mixed results, so I
      welcome suggestions on that part as well:
      1: Use skb->cb to store pending_idx
      2: Some refactoring
      3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
      review it after #4)
      4: Introduce TX grant mapping
      5: Remove old TX grant copy definitons and fix indentations
      6: Add stat counters for zerocopy
      7: Handle guests with too many frags
      8: Timeout packets in RX path
      9: Aggregate TX unmap operations
      
      v2: I've fixed some smaller things, see the individual patches. I've added a
      few new stat counters, and handling of the important use case when an older guest
      sends lots of slots. Instead of delayed copy we now time out packets on the RX
      path, based on the assumption that otherwise packets should get stuck
      anywhere else. Finally, some unmap batching to avoid too many TLB flushes.
      
      v3: Apart from fixing a few things mentioned in responses the important change
      is the use of the hypercall directly for grant [un]mapping, therefore we can
      avoid m2p override.
      
      v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue
      timeout logic changed also.
      
      v5: Only minor fixes based on Wei's comments
      
      v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see
      first 2 patches. Also rework of handling packets with too many slots, and
      reorder the series a bit.
      
      v7: Small fixes in comments/log messages/error paths, and merging the frag
      overflow stats patch into its parent.
      
      [1] http://lwn.net/Articles/491522/
      [2] https://lkml.org/lkml/2012/7/20/363
      ====================
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Aggregate TX unmap operations · e9275f5e
      Zoltan Kiss authored
      Unmapping causes TLB flushing, therefore we should do it in the largest
      possible batches. However we shouldn't starve the guest for too long. So if
      the guest has space for at least two big packets and we don't have at least a
      quarter ring to unmap, delay it for at most 1 millisecond.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
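The batching rule described in "xen-netback: Aggregate TX unmap operations" can be sketched as a single predicate. This is a minimal user-space illustration, not the driver's code: the function name, the ring size and the slots-per-packet figure are assumptions made up for the example. Only the quarter-ring threshold, the "two big packets" condition and the 1 millisecond cap come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants, chosen for illustration only. */
#define RING_SIZE         256u  /* slots in the TX ring (assumed) */
#define SLOTS_PER_BIG_PKT 18u   /* slots one large packet may need (assumed) */
#define MAX_DELAY_US      1000u /* "at most 1 millisecond" from the commit text */

/*
 * Returns true when unmapping may be postponed so that a larger batch
 * (and therefore fewer TLB flushes) can accumulate.
 */
static bool should_delay_unmap(unsigned int guest_free_slots,
                               unsigned int pending_unmaps,
                               unsigned int waited_us)
{
    assert(pending_unmaps <= RING_SIZE);

    /* The guest is not starved: it can still send two big packets. */
    bool guest_has_room = guest_free_slots >= 2u * SLOTS_PER_BIG_PKT;
    /* The batch is still smaller than a quarter of the ring. */
    bool batch_too_small = pending_unmaps < RING_SIZE / 4u;
    /* The delay budget has not been used up yet. */
    bool may_wait_longer = waited_us < MAX_DELAY_US;

    return guest_has_room && batch_too_small && may_wait_longer;
}
```

Any one of the three conditions failing triggers the unmap immediately, which keeps the worst-case delay bounded even when traffic is light.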
    • xen-netback: Timeout packets in RX path · 09350788
      Zoltan Kiss authored
      A malicious or buggy guest can leave its queue filled indefinitely, in which
      case qdisc starts to queue packets for that VIF. If those packets came from
      another guest, it can block its slots and prevent shutdown. To avoid that, we
      make sure the queue is drained every 10 seconds.
      In the worst case the QDisc queue usually takes 3 rounds to flush.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Handle guests with too many frags · e3377f36
      Zoltan Kiss authored
      Xen network protocol had implicit dependency on MAX_SKB_FRAGS. Netback has to
      handle guests sending up to XEN_NETBK_LEGACY_SLOTS_MAX slots. To achieve that:
      - create a new skb
      - map the leftover slots to its frags (no linear buffer here!)
      - chain it to the previous through skb_shinfo(skb)->frag_list
      - map them
      - copy and coalesce the frags into a brand new one and send it to the stack
      - unmap the 2 old skb's pages
      
      It also introduces new stat counters, which help determine how often the guest
      sends a packet with more than MAX_SKB_FRAGS frags.
      
      NOTE: if bisect brought you here, you should apply the series up until
      "xen-netback: Timeout packets in RX path", otherwise malicious guests can block
      other guests by not releasing their sent packets.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Add stat counters for zerocopy · 1bb332af
      Zoltan Kiss authored
      These counters help determine how often the buffers had to be copied. Also
      they help find out if packets are leaked, as if "sent != success + fail",
      there are probably packets never freed up properly.
      
      NOTE: if bisect brought you here, you should apply the series up until
      "xen-netback: Timeout packets in RX path", otherwise Windows guests can't work
      properly and malicious guests can block other guests by not releasing their sent
      packets.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
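The leak check quoted in "xen-netback: Add stat counters for zerocopy" ("sent != success + fail") can be written down as follows. This is a sketch only: the struct and function names are hypothetical and simply reuse the three terms from the commit message. Presumably the counters can also disagree briefly while packets are still in flight, so a mismatch is best read as a hint when sampled on an otherwise idle interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter set, named after the terms in the commit message. */
struct zerocopy_counters {
    unsigned long sent;     /* zerocopy packets handed to the stack */
    unsigned long success;  /* completed without an extra copy */
    unsigned long fail;     /* completed, but the data had to be copied */
};

/* Number of sent packets that no completion has accounted for yet. */
static unsigned long zerocopy_unaccounted(const struct zerocopy_counters *c)
{
    assert(c != NULL);
    return c->sent - (c->success + c->fail);
}

/* True when "sent != success + fail", i.e. some packets look unreleased. */
static bool zerocopy_possible_leak(const struct zerocopy_counters *c)
{
    return zerocopy_unaccounted(c) != 0;
}
```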
    • xen-netback: Remove old TX grant copy definitons and fix indentations · 62bad319
      Zoltan Kiss authored
      These became obsolete with grant mapping. I've intentionally left the
      indentation this way, to improve readability of previous patches.
      
      NOTE: if bisect brought you here, you should apply the series up until
      "xen-netback: Timeout packets in RX path", otherwise Windows guests can't work
      properly and malicious guests can block other guests by not releasing their sent
      packets.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Introduce TX grant mapping · f53c3fe8
      Zoltan Kiss authored
      This patch introduces grant mapping on netback TX path. It replaces grant copy
      operations, ditching grant copy coalescing along the way. Another solution for
      copy coalescing is introduced in "xen-netback: Handle guests with too many
      frags"; older guests and Windows can break before that patch is applied.
      There is a callback (xenvif_zerocopy_callback) from core stack to release the
      slots back to the guests when kfree_skb or skb_orphan_frags is called. It feeds a
      separate dealloc thread, as scheduling NAPI instance from there is inefficient,
      therefore we can't do dealloc from the instance.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Handle foreign mapped pages on the guest RX path · 3e2234b3
      Zoltan Kiss authored
      The RX path needs to know if the SKB fragments are stored on pages from another
      domain.
      Logically this patch should be after introducing the grant mapping itself, as
      it makes sense only after that. But to keep bisectability, I moved it here. It
      shouldn't change any functionality here. xenvif_zerocopy_callback and
      ubuf_to_vif are just stubs here, they will be introduced properly later on.
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Minor refactoring of netback code · 121fa4b7
      Zoltan Kiss authored
      This patch contains a few bits of refactoring before introducing the grant
      mapping changes:
      - introducing xenvif_tx_pending_slots_available(), as this is used several
        times, and will be used more often
      - rename the thread to vifX.Y-guest-rx, to signify it does RX work from the
        guest point of view
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xen-netback: Use skb->cb for pending_idx · 8f13dd96
      Zoltan Kiss authored
      Storing the pending_idx at the first byte of the linear buffer never looked
      good, skb->cb is a more proper place for this. It also prevents the header from
      being directly grant copied there, and we don't have the pending_idx after we
      copied the header here, so it's time to change it.
      It also introduces helpers for the RX side
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
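The idea behind "xen-netback: Use skb->cb for pending_idx" is to keep per-packet private data in the skb control block rather than in the first bytes of the packet data. The sketch below shows that pattern in user space with a mock structure. All names (mock_skb, mock_tx_cb and the two helpers) are hypothetical and are not the helpers the patch adds; the 48-byte size mirrors the kernel's skb->cb array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct sk_buff: only the control block matters here. */
struct mock_skb {
    unsigned char cb[48];
};

/* Hypothetical per-packet private data kept in the control block. */
struct mock_tx_cb {
    uint16_t pending_idx;
};

/* Fail the build if the private data ever outgrows the control block. */
static_assert(sizeof(struct mock_tx_cb) <= sizeof(((struct mock_skb *)0)->cb),
              "private data must fit in the control block");

/* Store the index in the control block, leaving the packet data untouched. */
static void mock_set_pending_idx(struct mock_skb *skb, uint16_t idx)
{
    struct mock_tx_cb priv = { .pending_idx = idx };
    memcpy(skb->cb, &priv, sizeof(priv));
}

/* Read the index back from the control block. */
static uint16_t mock_get_pending_idx(const struct mock_skb *skb)
{
    struct mock_tx_cb priv;
    memcpy(&priv, skb->cb, sizeof(priv));
    return priv.pending_idx;
}
```

Because the index no longer lives in the linear buffer, the header can be written there without overwriting it, which is the motivation the commit message gives.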
    • l2tp: keep original skb ownership · 31c70d59
      Eric Dumazet authored
      There is no reason to orphan skb in l2tp.
      
      This breaks things like per socket memory limits, TCP Small queues...
      
      Fix this before more people copy/paste it.
      
      This is very similar to commit 8f646c92
      ("vxlan: keep original skb ownership")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: James Chapman <jchapman@katalix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not leak non zero tstamp in output packets · 21962692
      Eric Dumazet authored
      Usage of skb->tstamp should remain private to TCP stack
      (only set on packets on write queue, not on cloned ones)
      
      Otherwise, packets given to loopback interface with a non null tstamp
      can confuse netif_rx() / net_timestamp_check()
      
      Other possibility would be to clear tstamp in loopback_xmit(),
      as done in skb_scrub_packet()
      
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 06 Mar, 2014 17 commits
  3. 05 Mar, 2014 2 commits
  4. 04 Mar, 2014 9 commits
    • be2net: dma_sync each RX frag before passing it to the stack · e50287be
      Sathya Perla authored
      The driver currently maps a page for DMA, divides the page into multiple
      frags and posts them to the HW. It un-maps the page after data is received
      on all the frags of the page. This scheme doesn't work when bounce buffers
      are used for DMA (swiotlb=force kernel param).
      
      This patch fixes this problem by calling dma_sync_single_for_cpu() for each
      frag (excepting the last one) so that the data is copied from the bounce
      buffers. The page is un-mapped only when DMA finishes on the last frag of
      the page.
      (Thanks Ben H. for suggesting the dma_sync API!)
      
      This patch also renames the "last_page_user" field of be_rx_page_info{}
      struct to "last_frag" to improve readability of the fixed code.
      Reported-by: Li Fengmao <li.fengmao@zte.com.cn>
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mpls_tc' · 9e82e7f4
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      this series contains a header file proposal for MPLS labels. These
      labels do not seem to be properly defined in the kernel so far. We are
      developing a wired/wireless 802.21/MPLS switch and need to check the
      MPLS labels to use the traffic control info for transmissions over
      802.11 networks.
      
      Changes to third version:
      
       * rename mpls_label_stack to mpls_label (thanks Neil)
       * fix over-indented closing brace (thanks Sergei)
       * add Johannes' Ack
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cfg80211: add MPLS and 802.21 classification · 960d97f9
      Simon Wunderlich authored
      MPLS labels may contain traffic control information, which should be
      evaluated and used by the wireless subsystem if present.
      
      Also check for IEEE 802.21 which is always network control traffic.
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Acked-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • UAPI: add MPLS label stack definition · f3baa393
      Simon Wunderlich authored
      Labels for the Multiprotocol Label Switching are defined in RFC 3032
      which was superseded by RFC 5462. Add the definition to UAPI and a stub
      header for include/linux.
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
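For readers unfamiliar with the label stack entry that "UAPI: add MPLS label stack definition" refers to, the 32-bit layout from RFC 3032 (with the 3-bit field renamed to Traffic Class by RFC 5462) is: 20 bits of label, 3 bits of TC, 1 bottom-of-stack bit, and 8 bits of TTL, transmitted in network byte order. The sketch below decodes one entry in user space. The struct, macro and function names are made up for this example and are not claimed to match the definitions the patch adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field masks and shifts for one label stack entry (host byte order). */
#define EX_MPLS_LABEL_MASK  0xFFFFF000u
#define EX_MPLS_LABEL_SHIFT 12
#define EX_MPLS_TC_MASK     0x00000E00u
#define EX_MPLS_TC_SHIFT    9
#define EX_MPLS_S_MASK      0x00000100u
#define EX_MPLS_S_SHIFT     8
#define EX_MPLS_TTL_MASK    0x000000FFu

/* Decoded view of a label stack entry (hypothetical helper type). */
struct ex_mpls_fields {
    uint32_t label;           /* 20-bit label value */
    uint8_t  tc;              /* 3-bit traffic class */
    uint8_t  bottom_of_stack; /* 1 if this is the last entry */
    uint8_t  ttl;             /* time to live */
};

/* Assemble the 32-bit entry from four bytes in network (big-endian) order. */
static uint32_t ex_mpls_entry_from_wire(const uint8_t bytes[4])
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}

/* Split a host-order entry into its four fields. */
static struct ex_mpls_fields ex_mpls_decode(uint32_t entry)
{
    struct ex_mpls_fields f;
    f.label = (entry & EX_MPLS_LABEL_MASK) >> EX_MPLS_LABEL_SHIFT;
    f.tc = (uint8_t)((entry & EX_MPLS_TC_MASK) >> EX_MPLS_TC_SHIFT);
    f.bottom_of_stack = (uint8_t)((entry & EX_MPLS_S_MASK) >> EX_MPLS_S_SHIFT);
    f.ttl = (uint8_t)(entry & EX_MPLS_TTL_MASK);
    return f;
}
```

The TC field extracted this way is the traffic control information that the cfg80211 classification commit in the same series evaluates.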
    • if_ether.h: add IEEE 802.21 Ethertype · b62faf3c
      Simon Wunderlich authored
      Add the Ethertype for IEEE Std 802.21 - Media Independent Handover
      Protocol. This Ethertype is used for network control messages.
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
      Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c3bebc71
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix memory leak in ieee80211_prep_connection(), sta_info leaked on
          error.  From Eytan Lifshitz.
      
       2) Unintentional switch case fallthrough in nft_reject_inet_eval(),
          from Patrick McHardy.
      
       3) Must check if payload length is a power of 2 in
          nft_payload_select_ops(), from Nikolay Aleksandrov.
      
       4) Fix mis-checksumming in xen-netfront driver, ip_hdr() is not in the
          correct place when we invoke skb_checksum_setup().  From Wei Liu.
      
       5) TUN driver should not advertise HW vlan offload features in
          vlan_features.  Fix from Fernando Luis Vazquez Cao.
      
       6) IPV6_VTI needs to select NET_IPV_TUNNEL to avoid build errors, fix
          from Steffen Klassert.
      
       7) Add missing locking in xfrm_migrade_state_find(), we must hold the
          per-namespace xfrm_state_lock while traversing the lists.  Fix from
          Steffen Klassert.
      
       8) Missing locking in ath9k driver, access to tid->sched must be done
          under ath_txq_lock().  Fix from Stanislaw Gruszka.
      
       9) Fix two bugs in TCP fastopen.  First respect the size argument given
          to tcp_sendmsg() in the fastopen path, and secondly prevent
          tcp_send_syn_data() from potentially using order-5 allocations.
          From Eric Dumazet.
      
      10) Fix handling of default neigh garbage collection params, from Jiri
          Pirko.
      
      11) Fix cwnd bloat and over-inflation of RTT when transmit segmentation
          is in use.  From Eric Dumazet.
      
      12) Missing initialization of Realtek r8169 driver's statistics
          seqlocks.  Fix from Kyle McMartin.
      
      13) Fix RTNL assertion failures in 802.3ad and AB ARP monitor of bonding
          driver, from Ding Tianhong.
      
      14) Bonding slave release race can cause divide by zero, fix from
          Nikolay Aleksandrov.
      
      15) Overzealous return from neigh_periodic_work() causes reachability
          time to not be computed.  Fix from Duain Jiong.
      
      16) Fix regression in ipv6_find_hdr(), it should not return -ENOENT when
          a specific target is specified and found.  From Hans Schillstrom.
      
      17) Fix VLAN tag stripping regression in BNA driver, from Ivan Vecera.
      
      18) Tail loss probe can calculate bogus RTTs due to missing packet
          marking on retransmit.  Fix from Yuchung Cheng.
      
      19) We cannot do skb_dst_drop() in iptunnel_pull_header() because
          multicast loopback detection in later code paths need access to
          skb_rtable().  Fix from Xin Long.
      
      20) The macvlan driver regresses in that it propagates lower device
          offload support disables into itself, causing severe slowdowns when
          running over a bridge.  Provide the software offloads always on
          macvlan devices to deal with this and the regression is gone.  From
          Vlad Yasevich.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
        macvlan: Add support for 'always_on' offload features
        net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable
        ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer
        net: cpsw: fix cpdma rx descriptor leak on down interface
        be2net: isolate TX workarounds not applicable to Skyhawk-R
        be2net: Fix skb double free in be_xmit_wrokarounds() failure path
        be2net: clear promiscuous bits in adapter->flags while disabling promiscuous mode
        be2net: Fix to reset transparent vlan tagging
        qlcnic: dcb: a couple off by one bugs
        tcp: fix bogus RTT on special retransmission
        hsr: off by one sanity check in hsr_register_frame_in()
        can: remove CAN FD compatibility for CAN 2.0 sockets
        can: flexcan: factor out soft reset into seperate funtion
        can: flexcan: flexcan_remove(): add missing netif_napi_del()
        can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze
        can: flexcan: factor out transceiver {en,dis}able into seperate functions
        can: flexcan: fix transition from and to low power mode in chip_{en,dis}able
        can: flexcan: flexcan_open(): fix error path if flexcan_chip_start() fails
        can: flexcan: fix shutdown: first disable chip, then all interrupts
        USB AX88179/178A: Support D-Link DUB-1312
        ...
    • Merge tag 'regulator-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator · 16e3f539
      Linus Torvalds authored
      Pull regulator fixes from Mark Brown:
       "A couple of fixes here which ensure that regulators using the core
        support for GPIO enables work in all cases by ensuring that helpers
        are used consistently rather than open coding in places and hence not
        having GPIO support in some of them"
      
      * tag 'regulator-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
        regulator: core: Replace direct ops->disable usage
        regulator: core: Replace direct ops->enable usage
    • Merge branch 'akpm' (patches from Andrew Morton) · 3f803abf
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness
        mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
        MAINTAINERS: add and correct types of some "T:" entries
        MAINTAINERS: use tab for separator
        rapidio/tsi721: fix tasklet termination in dma channel release
        hfsplus: fix remount issue
        zram: avoid null access when fail to alloc meta
        sh: prefix sh-specific "CCR" and "CCR2" by "SH_"
        ocfs2: fix quota file corruption
        drivers/rtc/rtc-s3c.c: fix incorrect way of save/restore of S3C2410_TICNT for TYPE_S3C64XX
        kallsyms: fix absolute addresses for kASLR
        scripts/gen_initramfs_list.sh: fix flags for initramfs LZ4 compression
        mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking
        memcg: reparent charges of children before processing parent
        memcg: fix endless loop in __mem_cgroup_iter_next()
        lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlock
        dma debug: account for cachelines and read-only mappings in overlap tracking
        mm: close PageTail race
        MAINTAINERS: EDAC: add Mauro and Borislav as interim patch collectors
    • mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · 27329369
      Johannes Weiner authored
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix, we can clean up later.
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>