1. 11 Sep, 2018 4 commits
    • Yonghong Song's avatar
      tools/bpf: fix a netlink recv issue · 9d0b3c1f
      Yonghong Song authored
      Commit f7010770 ("tools/bpf: move bpf/lib netlink related
      functions into a new file") introduced a while loop for the
      netlink recv path. This while loop is needed since the
      buffer in recv syscall may not be enough to hold all the
      information and in such cases multiple recv calls are needed.
      
      There is a bug introduced by the above commit as
      the while loop may block on recv syscall if there is no
      more messages are expected. The netlink message header
      flag NLM_F_MULTI is used to indicate that more messages
      are expected and this patch fixed the bug by doing
      further recv syscall only if multipart message is expected.
      
      The patch added another fix regarding to message length of 0.
      When netlink recv returns message length of 0, there will be
      no more messages for returning data so the while loop
      can end.
      
      Fixes: f7010770 ("tools/bpf: move bpf/lib netlink related functions into a new file")
      Reported-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Tested-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9d0b3c1f
    • Alexei Starovoitov's avatar
      Merge branch 'progarray_mapinmap_dump' · 2e2a0c96
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      The support to dump program array and map_in_map maps
      for bpffs and bpftool is added. Patch #1 added bpffs support
      and Patch #2 added bpftool support. Please see
      individual patches for example output.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2e2a0c96
    • Yonghong Song's avatar
      tools/bpf: bpftool: support prog array map and map of maps · ad3338d2
      Yonghong Song authored
      Currently, prog array map and map of maps are not supported
      in bpftool. This patch added the support.
      Different from other map types, for prog array map and
      map of maps, the key returned bpf_get_next_key() may not
      point to a valid value. So for these two map types,
      no error will be printed out when such a scenario happens.
      
      The following is the plain and json dump if btf is not available:
        $ ./bpftool map dump id 10
          key: 08 00 00 00  value: 5c 01 00 00
          Found 1 element
        $ ./bpftool -jp map dump id 10
          [{
              "key": ["0x08","0x00","0x00","0x00"
              ],
              "value": ["0x5c","0x01","0x00","0x00"
              ]
          }]
      
      If the BTF is available, the dump looks below:
        $ ./bpftool map dump id 2
          [{
                  "key": 0,
                  "value": 7
              }
          ]
        $ ./bpftool -jp map dump id 2
          [{
              "key": ["0x00","0x00","0x00","0x00"
              ],
              "value": ["0x07","0x00","0x00","0x00"
              ],
              "formatted": {
                  "key": 0,
                  "value": 7
              }
          }]
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ad3338d2
    • Yonghong Song's avatar
      bpf: add bpffs pretty print for program array map · a7c19db3
      Yonghong Song authored
      Added bpffs pretty print for program array map. For a particular
      array index, if the program array points to a valid program,
      the "<index>: <prog_id>" will be printed out like
         0: 6
      which means bpf program with id "6" is installed at index "0".
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a7c19db3
  2. 07 Sep, 2018 9 commits
    • Yonghong Song's avatar
      tools/bpf: bpftool: add net support · f6f3bac0
      Yonghong Song authored
      Add "bpftool net" support. Networking devices are enumerated
      to dump device index/name associated with xdp progs.
      
      For each networking device, tc classes and qdiscs are enumerated
      in order to check their bpf filters.
      In addition, root handle and clsact ingress/egress are also checked for
      bpf filters.  Not all filter information is printed out. Only ifindex,
      kind, filter name, prog_id and tag are printed out, which are good
      enough to show attachment information. If the filter action
      is a bpf action, its bpf program id, bpf name and tag will be
      printed out as well.
      
      For example,
        $ ./bpftool net
        xdp [
        ifindex 2 devname eth0 prog_id 198
        ]
        tc_filters [
        ifindex 2 kind qdisc_htb name prefix_matcher.o:[cls_prefix_matcher_htb]
                  prog_id 111727 tag d08fe3b4319bc2fd act []
        ifindex 2 kind qdisc_clsact_ingress name fbflow_icmp
                  prog_id 130246 tag 3f265c7f26db62c9 act []
        ifindex 2 kind qdisc_clsact_egress name prefix_matcher.o:[cls_prefix_matcher_clsact]
                  prog_id 111726 tag 99a197826974c876
        ifindex 2 kind qdisc_clsact_egress name cls_fg_dscp
                  prog_id 108619 tag dc4630674fd72dcc act []
        ifindex 2 kind qdisc_clsact_egress name fbflow_egress
                  prog_id 130245 tag 72d2d830d6888d2c
        ]
        $ ./bpftool -jp net
        [{
              "xdp": [{
                      "ifindex": 2,
                      "devname": "eth0",
                      "prog_id": 198
                  }
              ],
              "tc_filters": [{
                      "ifindex": 2,
                      "kind": "qdisc_htb",
                      "name": "prefix_matcher.o:[cls_prefix_matcher_htb]",
                      "prog_id": 111727,
                      "tag": "d08fe3b4319bc2fd",
                      "act": []
                  },{
                      "ifindex": 2,
                      "kind": "qdisc_clsact_ingress",
                      "name": "fbflow_icmp",
                      "prog_id": 130246,
                      "tag": "3f265c7f26db62c9",
                      "act": []
                  },{
                      "ifindex": 2,
                      "kind": "qdisc_clsact_egress",
                      "name": "prefix_matcher.o:[cls_prefix_matcher_clsact]",
                      "prog_id": 111726,
                      "tag": "99a197826974c876"
                  },{
                      "ifindex": 2,
                      "kind": "qdisc_clsact_egress",
                      "name": "cls_fg_dscp",
                      "prog_id": 108619,
                      "tag": "dc4630674fd72dcc",
                      "act": []
                  },{
                      "ifindex": 2,
                      "kind": "qdisc_clsact_egress",
                      "name": "fbflow_egress",
                      "prog_id": 130245,
                      "tag": "72d2d830d6888d2c"
                  }
              ]
          }
        ]
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f6f3bac0
    • Yonghong Song's avatar
      tools/bpf: add more netlink functionalities in lib/bpf · 36f1678d
      Yonghong Song authored
      This patch added a few netlink attribute parsing functions
      and the netlink API functions to query networking links, tc classes,
      tc qdiscs and tc filters. For example, the following API is
      to get networking links:
        int nl_get_link(int sock, unsigned int nl_pid,
                        dump_nlmsg_t dump_link_nlmsg,
                        void *cookie);
      
      Note that when the API is called, the user also provided a
      callback function with the following signature:
        int (*dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb);
      
      The "cookie" is the parameter the user passed to the API and will
      be available for the callback function.
      The "msg" is the information about the result, e.g., ifinfomsg or
      tcmsg. The "tb" is the parsed netlink attributes.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      36f1678d
    • Yonghong Song's avatar
      tools/bpf: move bpf/lib netlink related functions into a new file · f7010770
      Yonghong Song authored
      There are no functionality change for this patch.
      
      In the subsequent patches, more netlink related library functions
      will be added and a separate file is better than cluttering bpf.c.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f7010770
    • Yonghong Song's avatar
      tools/bpf: sync kernel uapi header if_link.h to tools · 52b7b784
      Yonghong Song authored
      Among others, this header will be used later for
      bpftool net support.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      52b7b784
    • Mauricio Vasquez B's avatar
      selftests/bpf/test_progs: do not check errno == 0 · f5bd3948
      Mauricio Vasquez B authored
      The errno man page states: "The value in errno is significant only when
      the return value of the call indicated an error..." then it is not correct
      to check it, it could be different than zero even if the function
      succeeded.
      
      It causes some false positives if errno is set by a previous function.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f5bd3948
    • Mauricio Vasquez B's avatar
    • Jesper Dangaard Brouer's avatar
      xdp: split code for map vs non-map redirect · 47b123ed
      Jesper Dangaard Brouer authored
      The compiler does an efficient job of inlining static C functions.
      Perf top clearly shows that almost everything gets inlined into the
      function call xdp_do_redirect.
      
      The function xdp_do_redirect end-up containing and interleaving the
      map and non-map redirect code.  This is sub-optimal, as it would be
      strange for an XDP program to use both types of redirect in the same
      program. The two use-cases are separate, and interleaving the code
      just cause more instruction-cache pressure.
      
      I would like to stress (again) that the non-map variant bpf_redirect
      is very slow compared to the bpf_redirect_map variant, approx half the
      speed.  Measured with driver i40e the difference is:
      
      - map     redirect: 13,250,350 pps
      - non-map redirect:  7,491,425 pps
      
      For this reason, the function name of the non-map variant of redirect
      have been called xdp_do_redirect_slow.  This hopefully gives a hint
      when using perf, that this is not the optimal XDP redirect operating mode.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      47b123ed
    • Jesper Dangaard Brouer's avatar
      xdp: explicit inline __xdp_map_lookup_elem · 2a68d85f
      Jesper Dangaard Brouer authored
      The compiler chooses to not-inline the function __xdp_map_lookup_elem,
      because it can see that it is used by both Generic-XDP and native-XDP
      do redirect calls (xdp_do_generic_redirect_map and xdp_do_redirect_map).
      
      The compiler cannot know that this is a bad choice, as it cannot know
      that a net device cannot run both XDP modes (Generic or Native) at the
      same time.  Thus, mark this function inline, even-though we normally
      leave this up-to the compiler.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2a68d85f
    • Jesper Dangaard Brouer's avatar
      xdp: unlikely instrumentation for xdp map redirect · e1302542
      Jesper Dangaard Brouer authored
      Notice the compiler generated ASM code layout was suboptimal.  It
      assumed map enqueue errors as the likely case, which is shouldn't.
      It assumed that xdp_do_flush_map() was a likely case, due to maps
      changing between packets, which should be very unlikely.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e1302542
  3. 06 Sep, 2018 4 commits
    • Alexei Starovoitov's avatar
      bpf/verifier: fix verifier instability · a9c676bc
      Alexei Starovoitov authored
      Edward Cree says:
      In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access()
      has supplied a reg_type, the other members of the register state are set
      appropriately.  Previously reg.range was set to 0, but as it is in a
      union with reg.map_ptr, which is larger, upper bytes of the latter were
      left in place.  This then caused the memcmp() in regsafe() to fail,
      preventing some branches from being pruned (and occasionally causing the
      same program to take a varying number of processed insns on repeated
      verifier runs).
      
      Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known()
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Debugged-by: default avatarEdward Cree <ecree@solarflare.com>
      Acked-by: default avatarEdward Cree <ecree@solarflare.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a9c676bc
    • Taeung Song's avatar
      libbpf: Remove the duplicate checking of function storage · 69495d2a
      Taeung Song authored
      After the commit eac7d845 ("tools: libbpf: don't return '.text'
      as a program for multi-function programs"), bpf_program__next()
      in bpf_object__for_each_program skips the function storage such as .text,
      so eliminate the duplicate checking.
      
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarTaeung Song <treeze.taeung@gmail.com>
      Acked-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      69495d2a
    • Dmitry Safonov's avatar
      netlink: Make groups check less stupid in netlink_bind() · 428f944b
      Dmitry Safonov authored
      As Linus noted, the test for 0 is needless, groups type can follow the
      usual kernel style and 8*sizeof(unsigned long) is BITS_PER_LONG:
      
      > The code [..] isn't technically incorrect...
      > But it is stupid.
      > Why stupid? Because the test for 0 is pointless.
      >
      > Just doing
      >        if (nlk->ngroups < 8*sizeof(groups))
      >                groups &= (1UL << nlk->ngroups) - 1;
      >
      > would have been fine and more understandable, since the "mask by shift
      > count" already does the right thing for a ngroups value of 0. Now that
      > test for zero makes me go "what's special about zero?". It turns out
      > that the answer to that is "nothing".
      [..]
      > The type of "groups" is kind of silly too.
      >
      > Yeah, "long unsigned int" isn't _technically_ wrong. But we normally
      > call that type "unsigned long".
      
      Cleanup my piece of pointlessness.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: netdev@vger.kernel.org
      Fairly-blamed-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      428f944b
    • Vincent Whitchurch's avatar
      packet: add sockopt to ignore outgoing packets · fa788d98
      Vincent Whitchurch authored
      Currently, the only way to ignore outgoing packets on a packet socket is
      via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
      AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
      if the filter run from packet_rcv() would reject them.  So the presence
      of a packet socket on the interface takes away the benefits of
      MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
      packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
      cloned, but the cost for that is much lower.)
      
      Add a socket option to allow AF_PACKET sockets to ignore outgoing
      packets to solve this.  Note that the *BSDs already have something
      similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
      
      The first intended user is lldpd.
      Signed-off-by: default avatarVincent Whitchurch <vincent.whitchurch@axis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fa788d98
  4. 05 Sep, 2018 23 commits