1. 02 May, 2018 6 commits
    • bpf: sockmap, fix error handling in redirect failures · abaeb096
      John Fastabend authored
      When a redirect fails we release the in-flight buffers without
      calling sk_mem_uncharge(). The uncharge is done before dropping the
      sock lock for the redirect; however, we missed updating the ring
      start index. When no apply actions are in progress this is OK
      because we uncharge the entire buffer before the redirect. But when
      apply logic is running, it's possible that only a portion of the
      buffer is being redirected. In this case we only do memory
      accounting for the buffer slice being redirected and expect to be
      able to loop over the BPF program again and/or, if the sock is
      closed, uncharge the memory at sock destruct time.
      
      With an invalid start index, however, the program logic looks at
      the start pointer index, checks the length, and on seeing that the
      length is zero (from the initial release and the failure to update
      the pointer) aborts without uncharging/releasing the remaining
      memory.
      
      The fix for this is simply to update the start index. To avoid
      fixing this error in two locations we do a small refactor and
      remove one case where it is open-coded. Then fix it in the
      single function.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      abaeb096
    • bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      John Fastabend authored
      When an error occurs during a redirect we have two cases that need
      to be handled: (i) we have a cork'ed buffer, or (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simply consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user but this left a gap where psock and sk_msg_buff
      states were out of sync, which seems fragile. No additional overhead
      is taken here except for a call to check the length and realize it's
      already been freed. This is in the error path as well, so in my
      opinion let's have robust code over optimized error paths.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fec51d40
    • bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      John Fastabend authored
      When the call to do_tcp_sendpage() fails to send the complete block
      requested we either retry if only a partial send was completed or
      abort if we receive an error less than or equal to zero. Before
      returning though we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is torn down because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The fix is simply to do the accounting inside the retry loop,
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3cc9a472
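
The per-iteration accounting described in 3cc9a472 can be sketched in plain C. This is a minimal userspace model, not the kernel code: `struct sg_entry`, `send_fn`, `struct fake_sock` and `fake_send()` are hypothetical names standing in for a scatterlist element and for do_tcp_sendpage(). It shows offset and length being adjusted after every partial send inside the retry loop, so an error return leaves the entry describing exactly the bytes that were not sent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatterlist element. */
struct sg_entry {
	size_t offset;	/* first unsent byte */
	size_t length;	/* bytes still described by this entry */
};

/* Hypothetical sender: returns the number of bytes sent, or <= 0 on error. */
typedef long (*send_fn)(void *ctx, size_t offset, size_t len);

/*
 * Send up to `size` bytes of `sg`. `size` may be smaller than sg->length
 * when a verdict applies to fewer bytes than the entry holds.
 */
long send_with_accounting(struct sg_entry *sg, size_t size,
			  send_fn send, void *ctx)
{
	size_t remaining;

	assert(sg != NULL && send != NULL);
	remaining = size < sg->length ? size : sg->length;

	while (remaining > 0) {
		long ret = send(ctx, sg->offset, remaining);

		if (ret <= 0)
			return ret;	/* sg already reflects what was sent */
		if ((size_t)ret > remaining)
			ret = (long)remaining;

		/* Account for this partial send immediately. */
		sg->offset += (size_t)ret;
		sg->length -= (size_t)ret;
		remaining -= (size_t)ret;
	}
	return 0;
}

/* Fake socket: sends at most `chunk` bytes per call, fails once `budget` is used up. */
struct fake_sock {
	size_t chunk;
	size_t budget;
};

long fake_send(void *ctx, size_t offset, size_t len)
{
	struct fake_sock *fs = ctx;
	size_t n = len;

	(void)offset;
	if (fs->budget == 0)
		return -1;
	if (n > fs->chunk)
		n = fs->chunk;
	if (n > fs->budget)
		n = fs->budget;
	fs->budget -= n;
	return (long)n;
}
```

With a 10-byte entry, an 8-byte apply size, and a fake socket that fails after 5 bytes, the entry ends up with 5 bytes remaining at offset 5, so cleanup code can still uncharge them.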
    • Merge branch 'x86-bpf-jit-fixes' · 0f58e58e
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Fix two memory leaks in x86 JIT. For details, please see
      individual patches in this series. Thanks!
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0f58e58e
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Daniel Borkmann authored
      The JIT logic in jit_subprogs() is as follows: for each subprog we
      allocate a prog via bpf_prog_alloc(), populate it (prog->is_func = 1
      here), and pass it to bpf_int_jit_compile(). If a failure occurred
      during JIT and prog->jited is not set, we bail out from attempting
      to JIT the whole program and punt to the interpreter instead. If
      JITing succeeded, we fix up the BPF call offsets and do another pass
      through bpf_int_jit_compile() (extra_pass is true at that point) to
      complete JITing of the calls. Since that requires passing the JIT
      context around, addrs and jit_data from the x86 JIT are only freed
      in the extra_pass in bpf_int_jit_compile() when calls are involved
      (if not, they can be freed immediately). However, if the JIT image
      didn't converge in the original pass, we leak addrs and jit_data:
      the image itself is NULL, prog->is_func is set and extra_pass is
      false in that case, so both become unreachable and are never cleaned
      up. Therefore we need to free them on !image as well. Only the x64
      JIT is affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      39f56ca9
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Daniel Borkmann authored
      While reviewing x64 JIT code, I noticed that we leak the previously
      allocated JIT image in the case where proglen != oldproglen during the
      JIT passes. Prior to commit e0ee9c12 ("x86: bpf_jit: fix two bugs in
      eBPF JIT compiler") we would just break out of the loop and use the
      image as the JITed prog, since it could only shrink in size anyway.
      After e0ee9c12, we bail out to the out_addrs label, where we free addrs
      and jit_data but not the image coming from bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3aab8884
  2. 01 May, 2018 1 commit
  3. 26 Apr, 2018 2 commits
    • bpf: fix uninitialized variable in bpf tools · 81542556
      John Fastabend authored
      Here the variable cont is used as the save pointer for a call to
      strtok_r(). It is safe to use the value uninitialized in this
      context, however, and the later reference is only ever used if
      the strtok_r() call is successful. But gcc-5, at least, doesn't have
      all this knowledge, so initialize cont to NULL. Additionally, do the
      natural NULL check before accessing it, just for completeness.
      
      The warning is the following:
      
      ./bpf/tools/bpf/bpf_dbg.c: In function ‘cmd_load’:
      ./bpf/tools/bpf/bpf_dbg.c:1077:13: warning: ‘cont’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        } else if (matches(subcmd, "pcap") == 0) {
      
      Fixes: fd981e3c ("filter: bpf_dbg: add minimal bpf debugger")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      81542556
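
The pattern from 81542556 can be illustrated with a small standalone sketch. The function name `split_subcommand()` is hypothetical and this is not the bpf_dbg code; it only shows the save pointer being initialized to NULL and checked before use. Reading the remainder of the line through the save pointer relies on behaviour of common implementations such as glibc, which is an assumption here rather than something POSIX guarantees.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/*
 * Split "subcmd rest-of-line" in place. Returns the subcommand, or NULL
 * if the line has no token. On success *rest points at the remainder.
 */
char *split_subcommand(char *line, char **rest)
{
	char *cont = NULL;	/* initialized so the compiler sees no uninitialized use */
	char *subcmd;

	assert(line != NULL && rest != NULL);
	*rest = NULL;

	subcmd = strtok_r(line, " ", &cont);
	if (subcmd == NULL)
		return NULL;	/* cont is never consulted on failure */

	if (cont != NULL)	/* the "natural NULL check" before accessing */
		*rest = cont;
	return subcmd;
}
```

Initializing `cont` costs nothing at runtime and removes the -Wmaybe-uninitialized warning without changing behaviour on the success path.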
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 25eb0ea7
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-04-25
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix to clear the percpu metadata_dst that could otherwise carry
         stale ip_tunnel_info, from William.
      
      2) Fix that reduces the number of passes in x64 JIT with regards to
         dead code sanitation to avoid risk of prog rejection, from Gianluca.
      
      3) Several fixes of sockmap programs, besides others, fixing a double
         page_put() in error path, missing refcount hold for pinned sockmap,
         adding required -target bpf for clang in sample Makefile, from John.
      
      4) Fix to disable preemption in __BPF_PROG_RUN_ARRAY() paths, from Roman.
      
      5) Fix tools/bpf/ Makefile with regards to a lex/yacc build error
         seen on older gcc-5, from John.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25eb0ea7
  4. 25 Apr, 2018 7 commits
    • bpf: fix for lex/yacc build error with gcc-5 · 9c299a32
      John Fastabend authored
      Fix build error found with Ubuntu shipped gcc-5
      
      ~/git/bpf/tools/bpf$ make all
      
      Auto-detecting system features:
      ...                        libbfd: [ OFF ]
      ...        disassembler-four-args: [ OFF ]
      
        CC       bpf_jit_disasm.o
        LINK     bpf_jit_disasm
        CC       bpf_dbg.o
      /home/john/git/bpf/tools/bpf/bpf_dbg.c: In function ‘cmd_load’:
      /home/john/git/bpf/tools/bpf/bpf_dbg.c:1077:13: warning: ‘cont’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        } else if (matches(subcmd, "pcap") == 0) {
                   ^
        LINK     bpf_dbg
        CC       bpf_asm.o
      make: *** No rule to make target `bpf_exp.yacc.o', needed by `bpf_asm'.  Stop.
      
      Fixes: 5a8997f2 ("tools: bpf: respect output directory during build")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9c299a32
    • rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qp · 91a82529
      Dag Moxnes authored
      The function rds_ib_setup_qp is calling rds_ib_get_client_data and
      should correspondingly call rds_ib_dev_put. This call was lost in
      the non-error path with the introduction of error handling done in
      commit 3b12f73a ("rds: ib: add error handle")
      Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
      Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91a82529
    • net/smc: keep clcsock reference in smc_tcp_listen_work() · 070204a3
      Ursula Braun authored
      The internal CLC socket should exist till the SMC-socket is released.
      Function tcp_listen_worker() releases the internal CLC socket of a
      listen socket, if an smc_close_active() is called. This function
      is called for the final release(), but it is called for shutdown
      SHUT_RDWR as well. This opens a door for protection faults, if
      socket calls using the internal CLC socket are called for a
      shutdown listen socket.
      
      With the changes of
      commit 3d502067 ("net/smc: simplify wait when closing listen socket")
      there is no need anymore to release the internal CLC socket in
      function tcp_listen_worker(). It is sufficient to release it in
      smc_release().
      
      Fixes: 127f4970 ("net/smc: release clcsock from tcp_listen_worker")
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Reported-by: syzbot+9045fc589fcd196ef522@syzkaller.appspotmail.com
      Reported-by: syzbot+28a2c86cf19c81d871fa@syzkaller.appspotmail.com
      Reported-by: syzbot+9605e6cace1b5efd4a0a@syzkaller.appspotmail.com
      Reported-by: syzbot+cf9012c597c8379d535c@syzkaller.appspotmail.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
      070204a3
    • net: phy: allow scanning busses with missing phys · 02a6efca
      Alexandre Belloni authored
      Some MDIO busses will error out when trying to read a phy address with no
      phy present at that address. In that case, probing the bus will fail
      because __mdiobus_register() is scanning the bus for all possible phy
      addresses.
      
      In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
      this address and set the phy ID to 0xffffffff which is then properly
      handled in get_phy_device().
      Suggested-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02a6efca
    • bpf, x64: fix JIT emission for dead code · 1612a981
      Gianluca Borello authored
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove at compile time the code for
      do_something(), so the generated BPF program ends up with a large amount
      of dead instructions. In one specific real life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5-byte JMP
      instructions (0xE9), since the jump offsets become increasingly large
      (> 127) as each instruction gets discovered to be 5 bytes instead of
      the estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2-byte JMPs (0xEB) because the jump
      offsets become <= 127 this time. In particular, N is roughly
      127 / (5 - 2) ~= 42. So each further pass makes the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every
      time. This means that for the entire program to converge there need
      to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
      for translating the dead code. If we add this number to the passes
      needed to translate the other, non-dead code, it brings such a program
      to 40+ passes, and the JIT doesn't complete. Ultimately the userspace
      loader fails because such a BPF program was supposed to be part of a
      prog array owner being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists in noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions: we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump offset
      of -2.
      
      This fix brings the BPF program in the previous example to complete again
      in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1612a981
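
The pass-count arithmetic in 1612a981 can be reproduced with a toy model. The sketch below is not the x64 JIT: it assumes a program made only of ja-1 instructions, the 64-byte initial size estimate mentioned in the commit message, and jump offsets computed from the freshly updated address of the previous instruction against the current instruction's end address from the previous pass. `passes_to_converge()` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Does the displacement fit in a signed 8-bit immediate? */
static int is_imm8(long v)
{
	return v >= -128 && v <= 127;
}

/*
 * Count passes until the total length stops changing, for `n` consecutive
 * ja-1 instructions. With `special_case` set, every ja-1 is emitted as a
 * 2-byte short jump regardless of the estimated addresses.
 * Returns 101 if no convergence is reached within 100 passes, -1 on
 * allocation failure.
 */
int passes_to_converge(int n, int special_case)
{
	long *addrs;	/* addrs[i]: end offset of instruction i */
	long proglen = 0, oldproglen = 0;
	int pass, i;

	assert(n > 0);
	addrs = malloc(sizeof(*addrs) * (size_t)n);
	if (addrs == NULL)
		return -1;

	/* Initial estimate: every instruction is 64 bytes long. */
	for (i = 0; i < n; i++) {
		proglen += 64;
		addrs[i] = proglen;
	}

	for (pass = 1; pass <= 100; pass++) {
		proglen = 0;
		for (i = 0; i < n; i++) {
			/* ja-1 targets its own start, i.e. the end of insn i-1. */
			long target = i ? addrs[i - 1] : 0;
			long off = target - addrs[i];	/* addrs[i] is from the last pass */
			int size = (special_case || is_imm8(off)) ? 2 : 5;

			proglen += size;
			addrs[i] = proglen;
		}
		if (proglen == oldproglen)
			break;	/* image size is stable */
		oldproglen = proglen;
	}
	free(addrs);
	return pass;
}
```

In this model only about 42 more instructions shrink from 5 to 2 bytes on each pass, so 1000 dead instructions need well over 20 passes, while the special-cased variant is stable after the second pass.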
    • bpf: clear the ip_tunnel_info. · 5540fbf4
      William Tu authored
      The percpu metadata_dst might carry stale ip_tunnel_info and cause
      incorrect behavior. When mixing tests using ipv4/ipv6 bpf vxlan and
      geneve tunnels, the ipv6 tunnel info incorrectly uses ipv4's src ip
      addr as its ipv6 src address, because the previous tunnel info is
      not cleaned up. The patch zeros the fields in ip_tunnel_info.
      Signed-off-by: William Tu <u9012063@gmail.com>
      Reported-by: Yifeng Sun <pkusunyifeng@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      5540fbf4
    • Merge branch 'userns-linus' of... · 3be4aaf4
      Linus Torvalds authored
      Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
      
      Pull userns bug fix from Eric Biederman:
       "Just a small fix to properly set the return code on error"
      
      * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        commoncap: Handle memory allocation failure.
      3be4aaf4
  5. 24 Apr, 2018 19 commits
  6. 23 Apr, 2018 5 commits
    • Merge branch 'bpf-sockmap-fixes' · b3f8adee
      Daniel Borkmann authored
      John Fastabend says:
      
      ====================
      While testing sockmap with more programs (besides our test programs)
      I found a couple of issues.
      
      The attached series fixes three of them: pinned maps were not
      working correctly, blocking sockets returned zero, and an error
      path taken when the sock hit an out-of-memory case resulted in a
      double page_put() while doing ingress redirects.
      
      See individual patches for more details.
      
      v2: Incorporated Daniel's feedback to use map ops for uref put op
          which also fixed the build error discovered in v1.
      v3: rename map_put_uref to map_release_uref
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b3f8adee
    • bpf: sockmap, fix double page_put on ENOMEM error in redirect path · 4fcfdfb8
      John Fastabend authored
      In the case where the socket memory boundary is hit the redirect
      path returns an ENOMEM error. However, before checking for this
      condition the redirect scatterlist buffer is set up with a valid
      page and length. This is never unwound, so when the buffers are
      released later in the error path we do a put_page() and clear
      the scatterlist fields. But, because the initial error happens
      before completing the scatterlist buffer we end up with both the
      original buffer and the redirect buffer pointing to the same page
      resulting in duplicate put_page() calls.
      
      To fix this simply move the initial configuration of the redirect
      scatterlist buffer below the sock memory check.
      
      Found this while running TCP_STREAM test with netperf using Cilium.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4fcfdfb8
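
The ordering fix in 4fcfdfb8 can be shown with a simplified ownership model. Everything here is hypothetical (`struct page_ref`, `struct sg_slot`, `queue_redirect()`, `release_slot()`, and the byte-count memory limit); it is a sketch of the idea of checking the memory limit before the destination buffer is populated, not of the sockmap code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted page. */
struct page_ref {
	int refcount;
};

/* Hypothetical stand-in for one scatterlist slot that owns a page reference. */
struct sg_slot {
	struct page_ref *page;
	size_t length;
};

/*
 * Move the page from `src` to the redirect slot `dst`. The memory check
 * comes first, so on -ENOMEM `dst` is left untouched and never aliases
 * the page still owned by `src`.
 */
int queue_redirect(struct sg_slot *dst, struct sg_slot *src,
		   size_t mem_used, size_t mem_limit)
{
	assert(dst != NULL && src != NULL && src->page != NULL);

	if (mem_used + src->length > mem_limit)
		return -ENOMEM;

	/* Only now configure the redirect slot; ownership moves to dst. */
	dst->page = src->page;
	dst->length = src->length;
	src->page = NULL;
	src->length = 0;
	return 0;
}

/* Drop the reference held by a slot, if any, and clear the slot. */
void release_slot(struct sg_slot *slot)
{
	assert(slot != NULL);
	if (slot->page != NULL) {
		slot->page->refcount--;
		slot->page = NULL;
		slot->length = 0;
	}
}
```

Releasing both slots after a failed redirect drops the single reference exactly once; had `dst` been filled in before the check, the same cleanup would have decremented the count twice.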
    • bpf: sockmap, sk_wait_event needed to handle blocking cases · e20f7334
      John Fastabend authored
      In the recvmsg handler we need to add a wait event to support the
      blocking use cases. Without this we return zero and may confuse
      user applications. In the wait event any data received on the
      sk either via sk_receive_queue or the psock ingress list will
      wake up the sock.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e20f7334
    • bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      John Fastabend authored
      Relying on map_release hook to decrement the reference counts when a
      map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this BPF programs may not be in-use which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in-line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ba6b8de4
    • bpf: sockmap sample use clang flag, -target bpf · 4dfe1bb9
      John Fastabend authored
      Per Documentation/bpf/bpf_devel_QA.txt add the -target flag to the
      sockmap Makefile. Relevant text quoted here,
      
         Otherwise, you can use bpf target. Additionally, you _must_ use
         bpf target when:
      
       - Your program uses data structures with pointer or long / unsigned
         long types that interface with BPF helpers or context data
         structures. Access into these structures is verified by the BPF
         verifier and may result in verification failures if the native
         architecture is not aligned with the BPF architecture, e.g. 64-bit.
         An example of this is BPF_PROG_TYPE_SK_MSG require '-target bpf'
      
      Fixes: 69e8cc13 ("bpf: sockmap sample program")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4dfe1bb9