    vmxnet3: Add XDP support. · 54f00cce
    William Tu authored
    The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
    
    Background:
    The vmxnet3 rx path consists of three rings: ring0, ring1, and dataring.
    Buffers on ring0 (r0) are allocated using the alloc_skb APIs and DMA
    mapped to the ring's descriptors. If LRO is enabled and the packet size
    is larger than 3K (VMXNET3_MAX_SKB_BUF_SIZE), then ring1 (r1) is used to
    map the remainder of the packet beyond VMXNET3_MAX_SKB_BUF_SIZE; each
    buffer in r1 is allocated using alloc_page. So for LRO packets, the
    payload is spread across one buffer from r0 and multiple buffers from
    r1, while a non-LRO packet smaller than 3K uses only a single
    descriptor in r0.
    
    When receiving a packet, the first descriptor will have the sop (start of
    packet) bit set, and the last descriptor will have the eop (end of packet)
    bit set. Non-LRO packets will have only one descriptor with both sop and
    eop set.
    
    Besides r0 and r1, the vmxnet3 dataring is designed specifically for
    small packets, typically up to 128 bytes (VMXNET3_DEF_RXDATA_DESC_SIZE):
    the backend driver in ESXi simply copies the packet into the ring's
    memory region in the front-end vmxnet3 driver, avoiding memory
    mapping/unmapping overhead. In summary, by packet size:
        A. < 128B: use dataring
        B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
        C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
    As a result, this patch adds XDP support for packets delivered via the
    dataring and r0 (cases A and B), but not for large packets when LRO is
    enabled (case C).
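
    For reference, the selection above can be pictured with a small sketch;
    the constants mirror the sizes quoted in this message and the function
    itself is only illustrative, not part of the driver:

        /* Sketch of rx buffer selection by packet size. */
        #define RXDATA_DESC_SIZE        128             /* VMXNET3_DEF_RXDATA_DESC_SIZE */
        #define MAX_SKB_BUF_SIZE        (3 * 1024)      /* VMXNET3_MAX_SKB_BUF_SIZE */

        enum rx_path { RX_DATARING, RX_RING0, RX_RING0_PLUS_RING1 };

        static enum rx_path pick_rx_path(unsigned int pkt_len)
        {
                if (pkt_len < RXDATA_DESC_SIZE)
                        return RX_DATARING;             /* case A: XDP supported */
                if (pkt_len <= MAX_SKB_BUF_SIZE)
                        return RX_RING0;                /* case B: XDP supported */
                return RX_RING0_PLUS_RING1;             /* case C: LRO path, no XDP */
        }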
    
    XDP Implementation:
    When the user loads an XDP prog, the vmxnet3 driver checks the
    configuration (e.g. MTU and LRO) and re-allocates the rx buffers to
    reserve the extra headroom, XDP_PACKET_HEADROOM, needed for the XDP
    frame. The XDP prog is then associated with every rx queue of the
    device. Note that for small packets served by the dataring, vmxnet3
    (the front-end driver) does not control the buffer allocation, so a
    new page is allocated and the packet is copied from the dataring into
    the XDP frame.
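
    The dataring case can be sketched as below; the allocation and rxq
    wiring are simplified (e.g. dev_alloc_page() stands in for the
    driver's own buffer handling), so this is an illustration of the idea
    rather than the actual driver code:

        #include <linux/filter.h>
        #include <linux/skbuff.h>
        #include <linux/string.h>
        #include <net/xdp.h>

        /* Copy a small packet out of the dataring into a page that has
         * XDP_PACKET_HEADROOM reserved in front, build an xdp_buff over
         * it, and run the prog. Returns the XDP verdict or a negative
         * errno.
         */
        static int sketch_xdp_from_dataring(void *ring_data, unsigned int len,
                                            struct xdp_rxq_info *rxq,
                                            struct bpf_prog *prog,
                                            struct xdp_buff *xdp)
        {
                struct page *page = dev_alloc_page();
                void *hard_start;

                if (!page)
                        return -ENOMEM;

                hard_start = page_address(page);
                /* place the payload behind the reserved headroom */
                memcpy(hard_start + XDP_PACKET_HEADROOM, ring_data, len);

                xdp_init_buff(xdp, PAGE_SIZE, rxq);
                xdp_prepare_buff(xdp, hard_start, XDP_PACKET_HEADROOM, len, true);

                return bpf_prog_run_xdp(prog, xdp);
        }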
    
    The receive side of XDP is implemented for cases A and B by invoking
    the bpf program in vmxnet3_rq_rx_complete and handling the action it
    returns. vmxnet3_process_xdp() and vmxnet3_process_xdp_small() handle
    the ring0 and dataring cases, respectively, and decide what happens to
    the packet afterward.
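
    The returned verdict is then acted on; a rough sketch of that dispatch
    is below. The XDP_PASS/XDP_TX branches are left as comments because
    they map to the driver-specific paths described in this message;
    xdp_do_redirect(), xdp_do_flush() and trace_xdp_exception() are the
    generic kernel helpers.

        #include <linux/filter.h>
        #include <net/xdp.h>
        #include <trace/events/xdp.h>

        static void sketch_handle_xdp_verdict(struct net_device *dev,
                                              struct bpf_prog *prog,
                                              struct xdp_buff *xdp, u32 act)
        {
                switch (act) {
                case XDP_PASS:
                        /* build an skb from the xdp_buff and hand it to
                         * the network stack
                         */
                        break;
                case XDP_TX:
                        /* DMA map the whole frame and send it back out of
                         * the same device (see the TX description below)
                         */
                        break;
                case XDP_REDIRECT:
                        /* the napi poll loop calls xdp_do_flush() once per
                         * batch after queueing redirects
                         */
                        if (xdp_do_redirect(dev, xdp, prog) < 0)
                                trace_xdp_exception(dev, prog, act);
                        break;
                case XDP_ABORTED:
                        trace_xdp_exception(dev, prog, act);
                        fallthrough;
                case XDP_DROP:
                default:
                        /* drop: recycle the rx buffer */
                        break;
                }
        }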
    
    For TX, vmxnet3 has a split-header design: outgoing packets are parsed
    first and the protocol headers (L2/L3/L4) are copied to the backend,
    while the rest of the payload is DMA mapped. Since XDP_TX does not
    parse the packet protocol, the entire XDP frame is DMA mapped for
    transmission and transmitted in a batch. Afterwards, the frame is
    freed and recycled back to the memory pool.
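
    A minimal sketch of that XDP_TX idea is below; descriptor setup and
    the batched doorbell write are omitted, and the function name is a
    placeholder, so this is not the actual vmxnet3 TX code:

        #include <linux/dma-mapping.h>
        #include <net/xdp.h>

        /* Map the whole frame as-is (no header parsing/splitting) and
         * queue it for transmission; on TX completion the frame is freed
         * and its page recycled back to the pool.
         */
        static int sketch_xdp_tx(struct device *dma_dev, struct xdp_buff *xdp)
        {
                struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
                dma_addr_t dma;

                if (!xdpf)
                        return -ENOMEM;

                dma = dma_map_single(dma_dev, xdpf->data, xdpf->len,
                                     DMA_TO_DEVICE);
                if (dma_mapping_error(dma_dev, dma))
                        return -EFAULT;

                /* ... fill a TX descriptor with dma / xdpf->len and ring
                 * the doorbell once per batch ...
                 */
                return 0;
        }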
    
    Performance:
    Tested with two VMs on one ESXi vSphere 7.0 host, using a single core
    for each vmxnet3 device. The sender runs DPDK testpmd in tx-only mode
    attached to its vmxnet3 device, sending 64B or 512B UDP packets.
    
    VM1 txgen:
    $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
    --forward-mode=txonly --eth-peer=0,<mac addr of vm2>
    option: add "--txonly-multi-flow"
    option: use --txpkts=64 or --txpkts=512
    
    VM2 running XDP:
    $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
    $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
    options: XDP_DROP, XDP_PASS, XDP_TX
    
    To test REDIRECT to cpu 0, use
    $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
    
    Single-core performance comparison, skb-mode vs. native-mode.
    64B:      skb-mode -> native-mode
    XDP_DROP: 1.6Mpps -> 2.4Mpps
    XDP_PASS: 338Kpps -> 367Kpps
    XDP_TX:   1.1Mpps -> 2.3Mpps
    REDIRECT-drop: 1.3Mpps -> 2.3Mpps
    
    512B:     skb-mode -> native-mode
    XDP_DROP: 863Kpps -> 1.3Mpps
    XDP_PASS: 275Kpps -> 376Kpps
    XDP_TX:   554Kpps -> 1.2Mpps
    REDIRECT-drop: 659Kpps -> 1.2Mpps
    
    Demo: https://youtu.be/4lm1CSCi78Q
    
    Future work:
    - XDP frag support
    - use napi_consume_skb() instead of dev_kfree_skb_any at unmap
    - stats using u64_stats_t
    - use the bitfield macro BIT()
    - optimization for DMA synchronization using actual frame length,
      instead of always max_len
    Signed-off-by: William Tu <u9012063@gmail.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>