1. 10 Aug, 2018 16 commits
    • bpf: fix bpffs non-array map seq_show issue · dc1508a5
      Yonghong Song authored
      In map_seq_next() in kernel/bpf/inode.c, the first key is always
      "0", regardless of the map type. This works for arrays. For hash
      maps, however, if key "0" happens to be in the map, the bpffs map
      show misses some items whenever key "0" is not the first element
      of the first bucket.
      
      This patch fixes the issue by guaranteeing that the first element
      is returned when seq_show has just started, by passing a NULL key
      pointer to the map_get_next_key() callback. This way, the bpffs
      hash table show no longer misses elements even if key "0" is in
      the map.
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
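
The effect of the fix can be illustrated with a small user-space model. This is only a sketch: `HashMapModel` and `dump` are hypothetical names, the bucket order is a plain list, and the assumption is that the old code seeded the first map_get_next_key() call with a zeroed key while the fixed code passes NULL (modelled as `None`).

```python
from typing import List, Optional


class HashMapModel:
    """Toy stand-in for a BPF hash map with a fixed bucket iteration order."""

    def __init__(self, keys_in_bucket_order: List[int]) -> None:
        self.keys = list(keys_in_bucket_order)

    def get_next_key(self, prev_key: Optional[int]) -> Optional[int]:
        # A NULL key (None) or an unknown key restarts from the first element.
        if prev_key is None or prev_key not in self.keys:
            return self.keys[0] if self.keys else None
        idx = self.keys.index(prev_key) + 1
        return self.keys[idx] if idx < len(self.keys) else None


def dump(m: HashMapModel, first_prev_key: Optional[int]) -> List[int]:
    """Walk the map, seeding the first get_next_key() call with first_prev_key."""
    shown = []
    key = m.get_next_key(first_prev_key)
    while key is not None:
        shown.append(key)
        key = m.get_next_key(key)
    return shown


m = HashMapModel([5, 0, 7])   # key 0 exists but is not the first element
old_walk = dump(m, 0)         # assumed old behaviour: start from a zeroed key
new_walk = dump(m, None)      # fixed behaviour: start from a NULL key
```

In this model the old walk only sees the elements after key 0, while the new walk sees every element.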
    • Merge branch 'bpf-veth-xdp-support' · 60afdf06
      Daniel Borkmann authored
      Toshiaki Makita says:
      
      ====================
      This patch set introduces driver XDP for veth.
      Basically this is used in conjunction with redirect action of another XDP
      program.
      
        NIC -----------> veth===veth
       (XDP) (redirect)        (XDP)
      
      In this case xdp_frame can be forwarded to the peer veth without
      modification, so we can expect far better performance than generic XDP.
      
      Envisioned use-cases
      --------------------
      
      * Container managed XDP program
      Container host redirects frames to containers by XDP redirect action, and
      privileged containers can deploy their own XDP programs.
      
      * XDP program cascading
      Two or more XDP programs can be called for each packet by redirecting
      xdp frames to veth.
      
      * Internal interface for an XDP bridge
      When using XDP redirection to create a virtual bridge, veth can be used
      to create an internal interface for the bridge.
      
      Implementation
      --------------
      
      This changeset is making use of NAPI to implement ndo_xdp_xmit and
      XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
      context.
       - patch 1: Export a function needed for veth XDP.
       - patch 2-3: Basic implementation of veth XDP.
       - patch 4-6: Add ndo_xdp_xmit.
       - patch 7-9: Add XDP_TX and XDP_REDIRECT.
       - patch 10: Performance optimization for multi-queue env.
      
      Tests and performance numbers
      -----------------------------
      
      Tested with a simple XDP program which only redirects packets between
      NIC and veth. I used an i40e 25G NIC (XXV710) for the physical NIC. The
      server has 20 Xeon Silver 2.20 GHz cores.
      
        pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)
      
      The rightmost veth loads XDP progs and just does DROP or TX. The number
      of packets is measured in the XDP progs. The leftmost pktgen sends
      packets at 37.1 Mpps (almost 25G wire speed).
      
      veth XDP action    Flows    Mpps
      ================================
      DROP                   1    10.6
      DROP                   2    21.2
      DROP                 100    36.0
      TX                     1     5.0
      TX                     2    10.0
      TX                   100    31.0
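
As a sanity check on the "almost 25G wire speed" remark, the theoretical packet rate of a 25 Gbit/s link can be computed. This sketch assumes minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap per packet; the cover letter does not state the packet size, so that is an assumption.

```python
LINK_BPS = 25_000_000_000   # 25 Gbit/s link

# Assumption: minimum-size Ethernet frames (64 bytes including FCS), plus
# 8 bytes of preamble/SFD and 12 bytes of inter-frame gap on the wire.
FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 8 + 12


def line_rate_mpps(link_bps: int, frame_bytes: int) -> float:
    """Maximum packets per second (in millions) for the given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame / 1e6


max_mpps = line_rate_mpps(LINK_BPS, FRAME_BYTES)
share_of_line_rate = 37.1 / max_mpps   # pktgen rate quoted in the cover letter
```

Under these assumptions the quoted 37.1 Mpps is just below the theoretical maximum for the link.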
      
      I also measured netperf TCP_STREAM, but performance was not great due
      to the lack of tx/rx checksum offload, TSO, etc.
      
        netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)
      
      Direction         Flows   Gbps
      ==============================
      external->veth        1   20.8
      external->veth        2   23.5
      external->veth      100   23.6
      veth->external        1    9.0
      veth->external        2   17.8
      veth->external      100   22.9
      
      I also repeatedly ran ifup/down and loaded/unloaded an XDP program
      while XDP packets were being processed, in order to check that
      enabling/disabling NAPI works as expected, and found no problems.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head, headroom,
        and xdp_buff.data_hard_start.
      
      v7:
      - Introduce xdp_scrub_frame() to clear kernel pointers in xdp_frame and
        use it instead of memset().
      
      v6:
      - Check skb->len only if reallocation is needed.
      - Add __GFP_NOWARN to alloc_page() since it can be triggered by external
        events.
      - Fix sparse warning around EXPORT_SYMBOL.
      
      v5:
      - Fix broken SOBs.
      
      v4:
      - Don't adjust MTU automatically.
      - Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
        Add comments to explain that.
      - Use redirect_info instead of xdp_mem_info for storing no_direct flag
        to avoid per packet copy cost.
      
      v3:
      - Drop skb bulk xmit patch since it makes little performance
        difference. The hotspot in TCP skb xmit at this point is checksum
        computation in skb_segment and packet copy on XDP_REDIRECT due to
        cloned/nonlinear skb.
      - Fix race on closing device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squash NAPI patch with "Add driver XDP" patch.
      - Remove conversion from xdp_frame to skb when NAPI is not enabled.
      - Introduce per-queue XDP ring (patch 8).
      - Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).
      ====================
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • veth: Support per queue XDP ring · 638264dc
      Toshiaki Makita authored
      Move XDP and napi related fields from veth_priv to newly created veth_rq
      structure.
      
      When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
      selected by current cpu.
      
      When skbs are enqueued from the peer device, rxq is a one-to-one mapping
      of its peer txq. This imposes the restriction that the number of rxqs
      must not be less than the number of peer txqs, but leaves open the
      possibility of bulk skb xmit in the future, because the txq lock would
      make it possible to remove the rxq ptr_ring lock.
      
      v3:
      - Add extack messages.
      - Fix array overrun in veth_xmit.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
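
The two rxq selection rules described in this commit can be sketched as follows. The function names are hypothetical, and wrapping the CPU number with a modulo (for hosts with more CPUs than rxqs) is an assumption of this sketch, not something the commit message states.

```python
def rxq_for_xdp_frame(current_cpu: int, num_rxqs: int) -> int:
    """xdp_frames from ndo_xdp_xmit / XDP_TX: choose the rxq by current CPU.

    Assumption: CPU numbers beyond the rxq count wrap around.
    """
    return current_cpu % num_rxqs


def rxq_for_skb(peer_txq: int, num_rxqs: int) -> int:
    """skbs from the peer device: rxq index equals the peer's txq index."""
    if peer_txq >= num_rxqs:
        raise ValueError("peer txq has no matching rxq")
    return peer_txq


def queue_counts_ok(num_rxqs: int, peer_num_txqs: int) -> bool:
    """The restriction from the commit message: rxqs must cover peer txqs."""
    return num_rxqs >= peer_num_txqs
```

For example, `rxq_for_skb(2, 4)` maps peer txq 2 to rxq 2, and `queue_counts_ok(2, 4)` reports a configuration that the one-to-one mapping cannot serve.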
    • veth: Add XDP TX and REDIRECT · d1396004
      Toshiaki Makita authored
      This allows further redirection of xdp_frames like
      
       NIC   -> veth--veth -> veth--veth
       (XDP)          (XDP)         (XDP)
      
      The intermediate XDP, redirecting packets from NIC to the other veth,
      reuses xdp_mem_info from NIC so that page recycling of the NIC works on
      the destination veth's XDP.
      In this way return_frame is not fully guarded by NAPI, since another
      NAPI handler on another cpu may use the same xdp_mem_info concurrently.
      Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
      NAPI context.
      
      v8:
      - Don't use xdp_frame pointer address for data_hard_start of xdp_buff.
      
      v4:
      - Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
        xdp_mem_info.
      
      v3:
      - Fix double free when veth_xdp_tx() returns a positive value.
      - Convert xdp_xmit and xdp_redir variables into flags.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xdp: Helpers for disabling napi_direct of xdp_return_frame · 2539650f
      Toshiaki Makita authored
      We need some mechanism to disable napi_direct on calling
      xdp_return_frame_rx_napi() from some context.
      When veth gains support for XDP_REDIRECT, it will redirect packets which
      are redirected from other devices. On redirection veth will reuse
      xdp_mem_info of the redirection source device to make return_frame work.
      But in this case .ndo_xdp_xmit() called from veth redirection uses
      xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
      is not called directly from the rxq which owns the xdp_mem_info.
      
      This approach introduces a flag in bpf_redirect_info to indicate that
      napi_direct should be disabled even when _rx_napi variant is used as
      well as helper functions to use it.
      
      A NAPI handler that wants to use this flag needs to call
      xdp_set_return_frame_no_direct() before processing packets, and call
      xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
      exiting NAPI.
      
      v4:
      - Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
        avoid per-frame copy cost.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
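
The set/clear bracket described in this commit can be modelled in user space. This is a sketch under assumptions: `RedirectInfo` stands in for the per-CPU bpf_redirect_info, `return_frame` and `napi_poll_model` are hypothetical helpers, and the "direct"/"safe" labels merely name the two release paths.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RedirectInfo:
    """Stand-in for bpf_redirect_info; only the new flag is modelled."""
    no_direct: bool = False


def xdp_set_return_frame_no_direct(ri: RedirectInfo) -> None:
    ri.no_direct = True


def xdp_clear_return_frame_no_direct(ri: RedirectInfo) -> None:
    ri.no_direct = False


def return_frame(ri: RedirectInfo, napi_direct: bool) -> str:
    """Report which release path a frame takes."""
    # The flag overrides napi_direct even when the _rx_napi variant is used.
    if napi_direct and not ri.no_direct:
        return "direct"
    return "safe"


def napi_poll_model(ri: RedirectInfo, frames: List[bytes]) -> List[str]:
    """Bracket packet processing with the set/clear helpers."""
    xdp_set_return_frame_no_direct(ri)
    paths = [return_frame(ri, napi_direct=True) for _ in frames]
    # In the kernel, xdp_do_flush_map() would run here, before the clear.
    xdp_clear_return_frame_no_direct(ri)
    return paths
```

Inside the bracket every frame takes the non-direct path; once the flag is cleared, the _rx_napi variant behaves as before.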
    • bpf: Make redirect_info accessible from modules · 0b19cc0a
      Toshiaki Makita authored
      We are going to add kern_flags field in redirect_info for kernel
      internal use.
      In order to avoid function call to access the flags, make redirect_info
      accessible from modules. Also as it is now non-static, add prefix bpf_
      to redirect_info.
      
      v6:
      - Fix sparse warning around EXPORT_SYMBOL.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • veth: Add ndo_xdp_xmit · af87a3aa
      Toshiaki Makita authored
      This allows NIC's XDP to redirect packets to veth. The destination veth
      device enqueues redirected packets to the napi ring of its peer, then
      they are processed by XDP on its peer veth device.
      This can be thought of as one XDP program calling another XDP program
      via REDIRECT, when the peer enables driver XDP.
      
      Note that when the peer veth device does not set driver xdp, redirected
      packets will be dropped because the peer is not ready for NAPI.
      
      v4:
      - Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
        Add comments about it and check only MTU.
      
      v2:
      - Drop the part converting xdp_frame into skb when XDP is not enabled.
      - Implement bulk interface of ndo_xdp_xmit.
      - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
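
The redirect path described here (enqueue to the peer's ring, drop when the peer has no driver XDP, check only the frame size) can be sketched as a user-space model. All names are hypothetical, and the ring size, the frame-length limit and the "number of frames accepted" return value are assumptions of this sketch rather than the driver's actual interface.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class VethModel:
    """Toy veth endpoint: an XDP-enabled flag plus a bounded receive ring."""

    def __init__(self, xdp_enabled: bool, ring_size: int = 256,
                 max_frame_len: int = 1514) -> None:
        self.xdp_enabled = xdp_enabled
        self.ring_size = ring_size
        self.max_frame_len = max_frame_len
        self.ring: Deque[bytes] = deque()
        self.peer: Optional["VethModel"] = None


def make_pair(a_xdp: bool, b_xdp: bool) -> Tuple[VethModel, VethModel]:
    a, b = VethModel(a_xdp), VethModel(b_xdp)
    a.peer, b.peer = b, a
    return a, b


def xdp_xmit_model(dev: VethModel, frames: List[bytes]) -> int:
    """Enqueue a bulk of frames on the peer's ring; return how many fit."""
    peer = dev.peer
    if peer is None or not peer.xdp_enabled:
        return 0   # peer is not ready for NAPI: everything is dropped
    sent = 0
    for frame in frames:
        too_big = len(frame) > peer.max_frame_len
        ring_full = len(peer.ring) >= peer.ring_size
        if too_big or ring_full:
            continue   # this frame is dropped
        peer.ring.append(frame)
        sent += 1
    return sent
```

With `a, b = make_pair(False, True)`, frames sent through `a` land on `b`'s ring, while frames sent through `b` are dropped because `a` has no XDP program attached.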
    • veth: Handle xdp_frames in xdp napi ring · 9fc8d518
      Toshiaki Makita authored
      This is preparation for XDP TX and ndo_xdp_xmit.
      This allows napi handler to handle xdp_frames through xdp ring as well
      as sk_buff.
      
      v8:
      - Don't use xdp_frame pointer address to calculate skb->head and
        headroom.
      
      v7:
      - Use xdp_scrub_frame() instead of memset().
      
      v3:
      - Revert v2 change around rings and use a flag to differentiate skb and
        xdp_frame, since bulk skb xmit makes little performance difference
        for now.
      
      v2:
      - Use another ring instead of using flag to differentiate skb and
        xdp_frame. This approach makes bulk skb transmit possible in
        veth_xmit later.
      - Clear xdp_frame fields in skb->head.
      - Implement adjust_tail.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
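
One common way to keep two kinds of pointers in a single ring and tell them apart with a flag is to tag a low bit of the pointer. The sketch below models that idea with plain integers; whether the driver uses exactly this encoding is an assumption, and `XDP_FLAG` and the helper names are hypothetical.

```python
XDP_FLAG = 0x1   # hypothetical tag bit; assumes pointers are at least 2-byte aligned


def tag_xdp_frame(addr: int) -> int:
    """Mark a ring entry as an xdp_frame by setting the low bit."""
    if addr & XDP_FLAG:
        raise ValueError("address is not aligned, low bit already set")
    return addr | XDP_FLAG


def is_xdp_frame(entry: int) -> bool:
    return bool(entry & XDP_FLAG)


def entry_to_addr(entry: int) -> int:
    """Strip the tag to recover the original address."""
    return entry & ~XDP_FLAG


# A ring holding one tagged xdp_frame and one untagged skb.
ring = [tag_xdp_frame(0x1000), 0x2000]
kinds = ["xdp_frame" if is_xdp_frame(e) else "skb" for e in ring]
addrs = [entry_to_addr(e) for e in ring]
```

The consumer checks the flag on each entry, strips it, and then handles the address as either an xdp_frame or an sk_buff.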
    • xdp: Helper function to clear kernel pointers in xdp_frame · a8d5b4ab
      Toshiaki Makita authored
      xdp_frame has kernel pointers which should not be readable from bpf
      programs. When we want to reuse xdp_frame region but it may be read by
      bpf programs later, we can use this helper to clear kernel pointers.
      This is more efficient than calling memset() for the entire struct.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • veth: Avoid drops by oversized packets when XDP is enabled · dc224822
      Toshiaki Makita authored
      Oversized packets including GSO packets can be dropped if XDP is
      enabled on receiver side, so don't send such packets from peer.
      
      Drop TSO and SCTP fragmentation features so that veth devices themselves
      segment packets with XDP enabled. Also cap MTU accordingly.
      
      v4:
      - Don't auto-adjust MTU but cap max MTU.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • veth: Add driver XDP · 948d4f21
      Toshiaki Makita authored
      This is the basic implementation of veth driver XDP.
      
      Incoming packets are sent from the peer veth device in the form of skb,
      so this is generally doing the same thing as generic XDP.
      
      This itself is not so useful, but a starting point to implement other
      useful veth XDP features like TX and REDIRECT.
      
      This introduces NAPI when XDP is enabled, because XDP now relies
      heavily on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
      enqueues packets to the ring and peer NAPI handler drains the ring.
      
      Currently only one ring is allocated for each veth device, so it does
      not scale in a multiqueue env. This can be resolved by allocating rings
      on a per-queue basis later.
      
      Note that NAPI is not used but netif_rx is used when XDP is not loaded,
      so this does not change the default behaviour.
      
      v6:
      - Check skb->len only when allocation is needed.
      - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
        events.
      
      v3:
      - Fix race on closing the device.
      - Add extack messages in ndo_bpf.
      
      v2:
      - Squashed with the patch adding NAPI.
      - Implement adjust_tail.
      - Don't acquire consumer lock because it is guarded by NAPI.
      - Make poll_controller noop since it is unnecessary.
      - Register rxq_info on enabling XDP rather than on opening the device.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
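
The producer/consumer arrangement described here (the Tx function enqueues to a ring, the peer's NAPI handler drains it) can be sketched in user space. `PtrRingModel` and `napi_poll_model` are hypothetical names; the bounded ring and the per-poll budget are modelled with a deque and a counter, which is an assumption about the shape of the logic, not the driver code.

```python
from collections import deque
from typing import Deque, List, Optional


class PtrRingModel:
    """Bounded FIFO standing in for the ptr_ring that emulates a NIC ring."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots: Deque[str] = deque()

    def produce(self, pkt: str) -> bool:
        """Tx side: enqueue a packet, or report failure when the ring is full."""
        if len(self.slots) >= self.size:
            return False
        self.slots.append(pkt)
        return True

    def consume(self) -> Optional[str]:
        """Rx side: dequeue the oldest packet, or None when the ring is empty."""
        return self.slots.popleft() if self.slots else None


def napi_poll_model(ring: PtrRingModel, budget: int) -> List[str]:
    """Drain at most `budget` packets from the ring in one poll."""
    done: List[str] = []
    while len(done) < budget:
        pkt = ring.consume()
        if pkt is None:
            break
        done.append(pkt)
    return done
```

A poll that returns fewer packets than its budget has emptied the ring; one that uses the whole budget leaves the remainder for the next poll.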
    • net: Export skb_headers_offset_update · b0768a86
      Toshiaki Makita authored
      This is needed for veth XDP which does skb_copy_expand()-like operation.
      
      v2:
      - Drop skb_copy_header part because it has already been exported now.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-sample-cpumap-lb' · c4c20217
      Daniel Borkmann authored
      Jesper Dangaard Brouer says:
      
      ====================
      Background: cpumap moves the SKB allocation out of the driver code,
      and instead allocates it on the remote CPU, and invokes the regular
      kernel network stack with the newly allocated SKB.
      
      The idea behind the XDP CPU redirect feature is to use XDP as a
      load-balancer step in front of the regular kernel network stack.  But
      the current sample code does not provide a good example of this.  Part
      of the reason is that I have implemented this as part of the Suricata
      XDP load-balancer.
      
      Given that this is the most frequent feature request I get, this
      patchset implements the same XDP load-balancing as Suricata does, which
      is a symmetric hash based on the IP-pairs + L4-protocol.
      
      The expected setup for the use-case is to reduce the number of NIC RX
      queues via ethtool (as XDP can handle more per core), and via
      smp_affinity assign these RX queues to a set of CPUs, which will be
      handling RX packets.  The CPUs that run the regular network stack are
      supplied to the sample xdp_redirect_cpu tool by specifying
      the --cpu option multiple times on the cmdline.
      
      I do note that cpumap SKB creation is not feature complete yet, and
      more work is coming.  E.g. given GRO is not implemented yet, do expect
      TCP workloads to be slower.  My measurements do indicate UDP workloads
      are faster.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: xdp_redirect_cpu load balance like Suricata · 1bca4e6b
      Jesper Dangaard Brouer authored
      This implements XDP CPU redirection load-balancing across available
      CPUs, based on hashing IP-pairs + L4-protocol.  This is equivalent to
      the xdp-cpu-redirect feature in Suricata, which is inspired by the
      Suricata 'ippair' hashing code.
      
      An important property is that the hashing is flow symmetric, meaning
      that if the source and destination get swapped then the selected CPU
      will remain the same.  This helps locality by placing both directions
      of a flow on the same CPU, in a forwarding/routing scenario.
      
      The hashing INITVAL (15485863, the 10^6th prime number) was chosen
      fairly arbitrarily, but experiments with kernel tree pktgen scripts
      (pktgen_sample04_many_flows.sh + pktgen_sample05_flow_per_thread.sh)
      showed this improved the distribution.
      
      This patch also changes the default loaded XDP program to be this
      load-balancer.  Based on feedback from different users, this seems to
      be the expected behavior of the sample xdp_redirect_cpu.
      
      Link: https://github.com/OISF/suricata/commit/796ec08dd7a63
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
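
The flow-symmetry property can be illustrated with a short sketch. Two things here are assumptions: the direction-independent key is built by adding the two IPv4 addresses, and `zlib.crc32` is used as a stand-in hash (the sample itself uses SuperFastHash). How the seed and the L4 protocol are combined is likewise only one plausible choice; `flow_cpu` is a hypothetical name.

```python
import ipaddress
import zlib

INITVAL = 15485863   # seed value quoted in the commit message


def flow_cpu(src_ip: str, dst_ip: str, l4_proto: int, n_cpus: int) -> int:
    """Pick a CPU index for a flow so that both directions map to the same CPU."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    # Addition is commutative, so swapping src and dst yields the same key.
    pair = (src + dst) & 0xFFFFFFFF
    seed = (INITVAL + l4_proto) & 0xFFFFFFFF
    # Stand-in hash: crc32 seeded with INITVAL + protocol (not SuperFastHash).
    h = zlib.crc32(pair.to_bytes(4, "little"), seed)
    return h % n_cpus


forward = flow_cpu("10.0.0.1", "192.168.1.20", 6, 4)
reverse = flow_cpu("192.168.1.20", "10.0.0.1", 6, 4)
```

Because the key does not depend on direction, `forward` and `reverse` always select the same CPU, which is the locality property the commit message describes.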
    • samples/bpf: add Paul Hsieh's (LGPL 2.1) hash function SuperFastHash · 11395686
      Jesper Dangaard Brouer authored
      Adjusted function call API to take an initval. This allows the API
      user to set the initial value, as a seed. This could also be used for
      inputting the previous hash.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Revert "xdp: add NULL pointer check in __xdp_return()" · eb91e4d4
      Björn Töpel authored
      This reverts commit 36e0f12b.
      
      The reverted commit adds a WARN to check against NULL entries in the
      mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
      driver) fast path is required to make a paired
      xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
      addition, a driver using a different allocation scheme than the
      default MEM_TYPE_PAGE_SHARED is required to additionally call
      xdp_rxq_info_reg_mem_model.
      
      For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
      that the mem_id_ht rhashtable has a properly inserted allocator id. If
      not, this would be a driver bug. A NULL pointer kernel OOPS is
      preferred to the WARN.
      Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 09 Aug, 2018 24 commits