1. 30 Aug, 2018 8 commits
    • xsk: include XDP meta data in AF_XDP frames · 18baed26
      Björn Töpel authored
      Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
      not include XDP meta data in the data buffers copied out to the user
      application.
      
      In this commit, we check if meta data is available, and if so, it is
      prepended to the frame.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
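The effect of prepending meta data on the copied buffer can be sketched as follows. This is a minimal Python model, not the kernel code: the function name `copy_frame_with_meta`, the offset-based arguments and the `meta_supported` switch are all assumptions made for illustration.

```python
from typing import Tuple


def copy_frame_with_meta(buf: bytes, data_meta: int, data: int, data_end: int,
                         meta_supported: bool = True) -> Tuple[bytes, int]:
    """Model a copy-mode receive that keeps XDP meta data.

    buf holds the driver's packet buffer; data_meta, data and data_end are
    offsets into it with data_meta <= data <= data_end. Returns the bytes
    copied out to the application and the offset of the packet payload
    inside that copy.
    """
    if not (0 <= data_meta <= data <= data_end <= len(buf)):
        raise ValueError("offsets must satisfy data_meta <= data <= data_end")
    # Meta data, when available, occupies [data_meta, data) directly in
    # front of the packet, so starting the copy at data_meta prepends it.
    start = data_meta if meta_supported else data
    copied = buf[start:data_end]
    payload_offset = data - start
    return copied, payload_offset
```

With four bytes of meta data in front of a six-byte packet, the copy is ten bytes long and the payload starts at offset 4; without meta data the payload starts at offset 0.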
    • Merge branch 'bpf-bpffs-bpftool-dump-with-btf' · 56b48c6a
      Daniel Borkmann authored
      Yonghong Song says:
      
      ====================
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to the
      basic arraymap") and Commit 699c86d6 ("bpf: btf: add pretty print
      for hash/lru_hash maps") added bpffs pretty print for array, hash and
      lru hash maps. The pretty print gives users a structurally formatted
      dump for keys/values, which is much easier to understand than raw bytes.
      
      This patch set implemented bpffs pretty print support for
      percpu arraymap, percpu hashmap and percpu lru hashmap.
      For complex key/value types, the pretty print here is even more useful
      due to:
      
        . large volume of data making it even harder to correlate bytes
          to a particular field in a particular cpu.
        . kernel rounds the value size for each cpu to a multiple of 8.
          Users have to be aware of this; otherwise a wrong value may be
          derived for cpu 1/2/...
      
      For example, we may have a bpffs pretty print like below:
         43602: {
              cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
              cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
              cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
              cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
         }
      for a percpu map.
      
      This patch also added percpu formatted print on bpftool. For example,
      bpftool may print like below:
          {
              "key": 0,
              "values": [{
                      "cpu": 0,
                      "value": {
                          "ui32": 0,
                          "ui16": 0
                      }
                  },{
                      "cpu": 1,
                      "value": {
                          "ui32": 1,
                          "ui16": 0
                      }
                  },{
                      "cpu": 2,
                      "value": {
                          "ui32": 2,
                          "ui16": 0
                      }
                  },{
                      "cpu": 3,
                      "value": {
                          "ui32": 3,
                          "ui16": 0
                      }
                  }
              ]
          }
      
      Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash
      in kernel. Patch #2 added the test case in tools bpf selftest test_btf.
      Patch #3 added percpu map btf based dump.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
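The 8-byte rounding mentioned in the cover letter matters to anyone who reads per-CPU values as raw bytes instead of through the pretty print. The sketch below shows the stride computation and why a naive stride gives a wrong value for cpu 1. It assumes a raw buffer that holds one slot per CPU, each slot padded to a multiple of 8 bytes; the helper names `round_up` and `split_percpu_values` are made up for this example.

```python
import struct


def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return (n + multiple - 1) // multiple * multiple


def split_percpu_values(raw: bytes, value_size: int, n_cpus: int) -> list:
    """Slice a raw per-CPU buffer into one value_size-byte chunk per CPU."""
    stride = round_up(value_size, 8)  # each CPU's slot is padded to 8 bytes
    if len(raw) < stride * n_cpus:
        raise ValueError("buffer too small for the given number of CPUs")
    return [raw[cpu * stride:cpu * stride + value_size]
            for cpu in range(n_cpus)]


# Two CPUs, 4-byte counters 7 and 9, each followed by 4 bytes of padding.
raw = struct.pack("<I4xI4x", 7, 9)
values = [struct.unpack("<I", v)[0] for v in split_percpu_values(raw, 4, 2)]
```

Here `values` holds the two counters, whereas reading the second counter at offset 4 (the unpadded value size) would land in cpu 0's padding.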
    • tools/bpf: bpftool: add btf percpu map formated dump · 1a86ad89
      Yonghong Song authored
      The btf pretty print is added to percpu arraymap,
      percpu hashmap and percpu lru hashmap.
      For each <key, value> pair, the following will be
      added to plain/json output:
      
         {
             "key": <pretty_print_key>,
             "values": [{
                   "cpu": 0,
                   "value": <pretty_print_value_on_cpu0>
                },{
                   "cpu": 1,
                   "value": <pretty_print_value_on_cpu1>
                },{
                ....
                },{
                   "cpu": n,
                   "value": <pretty_print_value_on_cpun>
                }
             ]
         }
      
      For example, the following could be part of plain or json formatted
      output:
          {
              "key": 0,
              "values": [{
                      "cpu": 0,
                      "value": {
                          "ui32": 0,
                          "ui16": 0
                      }
                  },{
                      "cpu": 1,
                      "value": {
                          "ui32": 1,
                          "ui16": 0
                      }
                  },{
                      "cpu": 2,
                      "value": {
                          "ui32": 2,
                          "ui16": 0
                      }
                  },{
                      "cpu": 3,
                      "value": {
                          "ui32": 3,
                          "ui16": 0
                      }
                  }
              ]
          }
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
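A consumer of this per-CPU layout might look like the sketch below. It assumes a single entry shaped like the example in the commit message and written as strict JSON; how bpftool wraps entries at the top level is not shown here, and the helper names `percpu_values` and `sum_field` are hypothetical.

```python
import json

SAMPLE = """
{
    "key": 0,
    "values": [
        {"cpu": 0, "value": {"ui32": 0, "ui16": 0}},
        {"cpu": 1, "value": {"ui32": 1, "ui16": 0}},
        {"cpu": 2, "value": {"ui32": 2, "ui16": 0}},
        {"cpu": 3, "value": {"ui32": 3, "ui16": 0}}
    ]
}
"""


def percpu_values(entry: dict) -> dict:
    """Map cpu number -> pretty-printed value for one <key, value> entry."""
    return {item["cpu"]: item["value"] for item in entry["values"]}


def sum_field(entry: dict, field: str) -> int:
    """Add up one integer field of the value across all CPUs."""
    return sum(value[field] for value in percpu_values(entry).values())


entry = json.loads(SAMPLE)
total_ui32 = sum_field(entry, "ui32")
```

Summing a field across CPUs is a common reason to use a per-CPU map in the first place, which is why the sketch includes it.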
    • tools/bpf: add bpffs percpu map pretty print tests in test_btf · 6493ebf7
      Yonghong Song authored
      The bpf selftest test_btf is extended to test bpffs
      percpu map pretty print for percpu array, percpu hash and
      percpu lru hash.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash · c7b27c37
      Yonghong Song authored
      Added bpffs pretty print for percpu arraymap, percpu hashmap
      and percpu lru hashmap.
      
      For each map <key, value> pair, the format is:
         <key_value>: {
      	cpu0: <value_on_cpu0>
      	cpu1: <value_on_cpu1>
      	...
      	cpun: <value_on_cpun>
         }
      
      For example, on my VM, there are 4 cpus, and
      for test_btf test in the next patch:
         cat /sys/fs/bpf/pprint_test_percpu_hash
      
      You may get:
         ...
         43602: {
      	cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
         }
         72847: {
      	cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
         }
         ...
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
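The text format described in this commit is regular enough to be read back by a small script. The parser below is a rough sketch that assumes exactly the layout shown in the commit message (a `<key>: {` line, one `cpuN: <value>` line per CPU, and a closing `}`); the function name `parse_percpu_pprint` is made up, and values are kept as unparsed strings.

```python
import re

CPU_LINE = re.compile(r"^cpu(\d+): (.*)$")
KEY_LINE = re.compile(r"^(.+): \{$")


def parse_percpu_pprint(text: str) -> dict:
    """Return {key: {cpu: value_string}} for a bpffs per-CPU pretty print."""
    result = {}
    key = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        cpu_match = CPU_LINE.match(line)
        if key is not None and cpu_match:
            result[key][int(cpu_match.group(1))] = cpu_match.group(2)
        elif line == "}":
            key = None  # end of the current <key, value> pair
        else:
            key_match = KEY_LINE.match(line)
            if key_match:
                key = key_match.group(1)
                result[key] = {}
    return result


SAMPLE = """\
43602: {
    cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
    cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
}
"""
parsed = parse_percpu_pprint(SAMPLE)
```

Lines that match none of the three shapes, such as the `...` placeholders in the example above, are ignored.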
    • Merge branch 'verifier-liveness-simplification' · 234dbe3d
      Alexei Starovoitov authored
      Edward Cree says:
      
      ====================
      The first patch is a simplification of register liveness tracking by using
       a separate parentage chain for each register and stack slot, thus avoiding
       the need for logic to handle callee-saved registers when applying read
       marks.  In the future this idea may be extended to form use-def chains.
      The second patch adds information about misc/zero data on the stack to the
       state dumps emitted to the log at various points; this information was
       found essential in debugging the first patch, and may be useful elsewhere.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf/verifier: display non-spill stack slot types in print_verifier_state · 8efea21d
      Edward Cree authored
      If a stack slot does not hold a spilled register (STACK_SPILL), then each
       of its eight bytes could potentially have a different slot_type.  This
       information can be important for debugging, and previously we either did
       not print anything for the stack slot, or just printed fp-X=0 in the case
       where its first byte was STACK_ZERO.
      Instead, print eight characters with either 0 (STACK_ZERO), m (STACK_MISC)
       or ? (STACK_INVALID) for any stack slot which is neither STACK_SPILL nor
       entirely STACK_INVALID.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
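The printing rule described in this commit can be modelled in a few lines. The sketch below is not the verifier code: the numeric enum values, the `format_stack_slot` name and the choice to treat a slot as spilled when any of its bytes is STACK_SPILL are assumptions made for illustration, and the `fp-X=` prefix follows the form quoted in the commit message.

```python
from enum import IntEnum
from typing import List, Optional


class SlotType(IntEnum):
    # Numeric values are illustrative, not taken from the commit message.
    STACK_INVALID = 0
    STACK_SPILL = 1
    STACK_MISC = 2
    STACK_ZERO = 3


SLOT_CHAR = {
    SlotType.STACK_INVALID: "?",
    SlotType.STACK_MISC: "m",
    SlotType.STACK_ZERO: "0",
}


def format_stack_slot(offset: int, slot_types: List[SlotType]) -> Optional[str]:
    """Render one 8-byte stack slot, or None if this rule prints nothing."""
    if len(slot_types) != 8:
        raise ValueError("a stack slot has exactly eight bytes")
    if SlotType.STACK_SPILL in slot_types:
        return None  # spilled registers are printed by a different path
    if all(t == SlotType.STACK_INVALID for t in slot_types):
        return None  # entirely invalid slots are not shown
    return "fp%d=%s" % (offset, "".join(SLOT_CHAR[t] for t in slot_types))
```

A slot whose first four bytes are zero and whose last four are misc data would thus render as `fp-8=0000mmmm`.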
    • bpf/verifier: per-register parent pointers · 679c782d
      Edward Cree authored
      By giving each register its own liveness chain, we elide the skip_callee()
       logic.  Instead, each register's parent is the state it inherits from;
       both check_func_call() and prepare_func_exit() automatically connect
       reg states to the correct chain since when they copy the reg state across
       (r1-r5 into the callee as args, and r0 out as the return value) they also
       copy the parent pointer.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
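The idea of one liveness chain per register can be illustrated with a toy model. The code below is a simplified sketch, not the verifier implementation: the `RegState` class, the flag values and both helper functions are assumptions, and only the propagation of read marks along a register's own parent pointers is modelled.

```python
import copy

# Flag values are illustrative.
REG_LIVE_NONE = 0
REG_LIVE_READ = 1
REG_LIVE_WRITTEN = 2


class RegState:
    def __init__(self, parent=None):
        self.live = REG_LIVE_NONE
        # The same register in the state this one inherits its value from.
        self.parent = parent


def mark_reg_read(state: RegState) -> None:
    """Propagate a read of `state` up that register's own parentage chain."""
    parent = state.parent
    while parent is not None:
        # A write in this state screens the read from all older states.
        if state.live & REG_LIVE_WRITTEN:
            break
        parent.live |= REG_LIVE_READ
        state = parent
        parent = state.parent


def copy_reg_across_call(reg: RegState) -> RegState:
    """Copy a register into another frame; the parent pointer comes along."""
    return copy.copy(reg)
```

Because the copy shares the original's parent pointer, a read in the callee marks the correct ancestor without any special handling for callee-saved registers, which is the simplification the commit message describes.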
  2. 29 Aug, 2018 15 commits
    • Merge branch 'AF_XDP-zerocopy-for-i40e' · 29b5e0f3
      Alexei Starovoitov authored
      Björn Töpel says:
      
      ====================
      This patch set introduces zero-copy AF_XDP support for Intel's i40e
      driver. In the first preparatory patch we also add support for
      XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
      redirect them. This was a ToDo from the first AF_XDP zero-copy patch
      set from early June. Special thanks to Alex Duyck and Jesper Dangaard
      Brouer for reviewing earlier versions of this patch set.
      
      The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
      that in the interest of time, to get an AF_XDP zero-copy implementation
      out there for people to try, some code paths have been copied from the
      XDP path to the zero-copy path. It is our goal to merge the two paths
      in later patch sets.
      
      In contrast to the implementation from beginning of June, this patch
      set does not require any extra HW queues for AF_XDP zero-copy
      TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
      AF_XDP zero-copy TX.
      
      Jeff, given that most of changes are in i40e, it is up to you how you
      would like to route these patches. The set is tagged bpf-next, but
      if taking it via the Intel driver tree is easier, let us know.
      
      We have run some benchmarks on a dual socket system with two Broadwell
      E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
      cores which gives a total of 28, but only two cores are used in these
      experiments: one for TX/RX and one for the user space application. The
      memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
      8192MB and with 8 of those DIMMs in the system we have 64 GB of total
      memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
      NIC is Intel I40E 40Gbit/s using the i40e driver.
      
      Below are the results in Mpps of the I40E NIC benchmark runs for 64
      and 1500 byte packets, generated by a commercial packet generator HW
      outputting packets at full 40 Gbit/s line rate. The results are with
      retpoline and all other spectre and meltdown fixes, so these results
      are not comparable to the ones from the zero-copy patch set in June.
      
      AF_XDP performance 64 byte packets:
      Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
      rxdrop       2.6       8.2       15.0
      txpush       2.2       -         21.9
      l2fwd        1.7       2.3       11.3
      
      AF_XDP performance 1500 byte packets:
      Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
      rxdrop       2.0       3.3       3.3
      l2fwd        1.3       1.7       3.1
      
      XDP performance on our system as a base line:
      
      64 byte packets:
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      16      18.4M       0
      
      1500 byte packets:
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      16      3.3M        0
      
      The structure of the patch set is as follows:
      
      Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
      Patches 2-4: Preparatory patches to common xsk and net code
      Patches 5-7: Preparatory patches to i40e driver code for RX
      Patch 8: i40e zero-copy support for RX
      Patch 9: Preparatory patch to i40e driver code for TX
      Patch 10: i40e zero-copy support for TX
      Patch 11: Add flags to sample application to force zero-copy/copy mode
      ====================
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
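To put the benchmark numbers in this cover letter into perspective, the theoretical packet rate of a 40 Gbit/s link can be worked out directly. The calculation below assumes that the quoted packet sizes are full Ethernet frame sizes and that each frame costs an extra 20 bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function name is made up for this example.

```python
# Preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12).
ETH_OVERHEAD_BYTES = 20


def line_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    """Maximum packets per second, in millions, for a given frame size."""
    bits_per_frame = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_per_frame / 1e6


rate_64 = line_rate_mpps(40, 64)      # about 59.5 Mpps
rate_1500 = line_rate_mpps(40, 1500)  # about 3.3 Mpps
```

Under these assumptions the 3.3 Mpps results for 1500 byte packets correspond to line rate, while the 15.0 Mpps zero-copy rxdrop result for 64 byte packets is roughly a quarter of the theoretical maximum.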
    • samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock · 58c50ae4
      Björn Töpel authored
      The -c/--copy and -z/--zero-copy flags enforce either copy or zero-copy
      mode.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: add AF_XDP zero-copy Tx support · 1328dcdd
      Magnus Karlsson authored
      This patch adds zero-copy Tx support for AF_XDP sockets. It implements
      the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
      NAPI context. This means pulling egress packets from the Tx ring,
      placing the frames on the NIC HW descriptor ring and completing sent
      frames back to the application via the completion ring.
      
      The regular XDP Tx ring is used for AF_XDP as well. The rationale for
      this is as follows: XDP_REDIRECT guarantees mutual exclusion between
      different NAPI contexts based on CPU id. In other words, a netdev can
      XDP_REDIRECT to another netdev with a different NAPI context, since
      the operation is bound to a specific core and each core has its own
      hardware ring.
      
      As the AF_XDP Tx action is running in the same NAPI context and using
      the same ring, it will also be protected from XDP_REDIRECT actions
      with the exact same mechanism.
      
      As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
      i40e_xsk.c.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: move common Tx functions to i40e_txrx_common.h · a96e7472
      Magnus Karlsson authored
      This patch prepares for the upcoming zero-copy Tx functionality by
      moving common functions and refactoring chunks of code into re-usable
      functions, used both by the regular path and zero-copy path.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: add AF_XDP zero-copy Rx support · 0a714186
      Björn Töpel authored
      This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
      allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
      allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
      queue.
      
      All AF_XDP specific functions are added to a new file, i40e_xsk.c.
      
      Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
      will allocate a new buffer and copy the zero-copy frame prior to passing
      it to the kernel stack.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: move common Rx functions to i40e_txrx_common.h · 20a739db
      Björn Töpel authored
      This patch prepares for the upcoming zero-copy Rx functionality, by
      moving/changing linkage of common functions, used both by the regular
      path and zero-copy path.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: refactor Rx path for re-use · 6d7aad1d
      Björn Töpel authored
      In this commit, the Rx path is refactored somewhat, as a step towards the
      introduction of AF_XDP Rx zero-copy.
      
      The page re-use counter is moved into the i40e_reuse_rx_page, instead
      of bumping the counter in many places. The Rx buffer page clearing is
      moved for better readability. Lastly, functions to update statistics
      and bump the XDP Tx ring are introduced.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: added queue pair disable/enable functions · 123cecd4
      Björn Töpel authored
      Add functions for queue pair enable/disable. Instead of resetting the
      whole device, only the affected queue pair is disabled or enabled.
      
      This plumbing is used in a later commit, when zero-copy AF_XDP support
      is introduced.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: add napi_if_scheduled_mark_missed · 6c5c9581
      Magnus Karlsson authored
      The function napi_if_scheduled_mark_missed checks whether the NAPI
      context is scheduled and, if so, sets NAPIF_STATE_MISSED and returns
      true. It is used by the AF_XDP zero-copy i40e Tx code implementation in
      order to make sure that irq affinity is honored by the napi context.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
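The check-and-mark behaviour described in this commit can be modelled as below. This is only a sketch: the bit positions of the two flags and the `Napi` class are assumptions, and a lock stands in for whatever atomic update the kernel function actually uses.

```python
import threading

# Bit positions are illustrative.
NAPIF_STATE_SCHED = 1 << 0
NAPIF_STATE_MISSED = 1 << 1


class Napi:
    def __init__(self, state: int = 0):
        self.state = state
        self.lock = threading.Lock()  # stands in for an atomic update


def napi_if_scheduled_mark_missed(napi: Napi) -> bool:
    """If the context is scheduled, flag it as missed and return True."""
    with napi.lock:
        if not (napi.state & NAPIF_STATE_SCHED):
            return False  # not scheduled: nothing is marked
        napi.state |= NAPIF_STATE_MISSED
        return True
```

A caller that gets False back would presumably have to get the NAPI context scheduled by some other means; when True is returned, the already scheduled context is left to notice the missed flag and poll again.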
    • xsk: expose xdp_umem_get_{data,dma} to drivers · 90254034
      Björn Töpel authored
      Move the xdp_umem_get_{data,dma} functions to include/net/xdp_sock.h,
      so that the upcoming zero-copy implementation in the Ethernet drivers
      can utilize them.
      
      Also, supply some dummy function implementations for
      CONFIG_XDP_SOCKETS=n configs.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: export xdp_rxq_info_unreg_mem_model · dce5bd61
      Björn Töpel authored
      Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
      so it can be used from netdev drivers. Also, add additional checks for
      the memory type.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY · b0d1beef
      Björn Töpel authored
      This commit adds proper MEM_TYPE_ZERO_COPY support for
      convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
      xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
      MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
      make sense to implement a more sophisticated thread-safe alloc/free
      scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
      required in the fast-path.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: use --cgroup in test_suite if supplied · 7d2c6cfc
      John Fastabend authored
      If the user supplies a --cgroup value in the arguments when running
      the test_suite, go ahead and run the self tests there. I use this
      to test with multiple cgroup users.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap test remove shutdown() calls · b5d83fec
      John Fastabend authored
      Currently, we do a shutdown(sk, SHUT_RDWR) on both peer sockets and
      a shutdown on the sender as well. However, this is incorrect and can
      occasionally cause issues if you happen to have bad timing. First,
      peer1 or peer2 may still be in use depending on the test and timing.
      Second, we really should only be closing the read side and/or write
      side depending on whether the test is receiving or sending.
      
      But really, none of this is needed, so just remove the shutdown calls.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: remove duplicated include from syscall.c · efbaec89
      YueHaibing authored
      Remove duplicated include.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 28 Aug, 2018 17 commits