1. 30 Mar, 2020 11 commits
    • Alexei Starovoitov's avatar
      Merge branch 'bpf_sk_assign' · c58b1558
      Alexei Starovoitov authored
      Joe Stringer says:
      
      ====================
      Introduce a new helper that allows assigning a previously-found socket
      to the skb as the packet is received towards the stack, to cause the
      stack to guide the packet towards that socket subject to local routing
      configuration. The intention is to support TProxy use cases more
      directly from eBPF programs attached at TC ingress, to simplify and
      streamline Linux stack configuration in scale environments with Cilium.
      
      Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
      existing socket reference associated with the skb. Existing tproxy
      implementations in netfilter get around this restriction by running the
      tproxy logic after ip_rcv_core() in the PREROUTING table. However, this
      is not an option for TC-based logic (including eBPF programs attached at
      TC ingress).
      
      This series introduces the BPF helper bpf_sk_assign() to associate the
      socket with the skb on the ingress path as the packet is passed up the
      stack. The initial patch in the series simply takes a reference on the
      socket to ensure safety, but later patches relax this for listen
      sockets.
      
      To ensure delivery to the relevant socket, we still consult the routing
      table, for full examples of how to configure see the tests in patch #5;
      the simplest form of the route would look like this:
      
        $ ip route add local default dev lo
      
      This series is laid out as follows:
      * Patch 1 extends the eBPF API to add sk_assign() and defines a new
        socket free function to allow the later paths to understand when the
        socket associated with the skb should be kept through receive.
      * Patches 2-3 optimize the receive path to avoid taking a reference on
        listener sockets during receive.
      * Patches 4-5 extends the selftests with examples of the new
        functionality and validation of correct behaviour.
      
      Changes since v4:
      * Fix build with CONFIG_INET disabled
      * Rebase
      
      Changes since v3:
      * Use sock_gen_put() directly instead of sock_edemux() from sock_pfree()
      * Commit message wording fixups
      * Add acks from Martin, Lorenz
      * Rebase
      
      Changes since v2:
      * Add selftests for UDP socket redirection
      * Drop the early demux optimization patch (defer for more testing)
      * Fix check for orphaning after TC act return
      * Tidy up the tests to clean up properly and be less noisy.
      
      Changes since v1:
      * Replace the metadata_dst approach with using the skb->destructor to
        determine whether the socket has been prefetched. This is much
        simpler.
      * Avoid taking a reference on listener sockets during receive
      * Restrict assigning sockets across namespaces
      * Restrict assigning SO_REUSEPORT sockets
      * Fix cookie usage for socket dst check
      * Rebase the tests against test_progs infrastructure
      * Tidy up commit messages
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c58b1558
    • Joe Stringer's avatar
      selftests: bpf: Extend sk_assign tests for UDP · 8a02a170
      Joe Stringer authored
      Add support for testing UDP sk_assign to the existing tests.
      Signed-off-by: default avatarJoe Stringer <joe@wand.net.nz>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-6-joe@wand.net.nz
      8a02a170
    • Lorenz Bauer's avatar
      selftests: bpf: Add test for sk_assign · 2d7824ff
      Lorenz Bauer authored
      Attach a tc direct-action classifier to lo in a fresh network
      namespace, and rewrite all connection attempts to localhost:4321
      to localhost:1234 (for port tests) and connections to unreachable
      IPv4/IPv6 IPs to the local socket (for address tests). Includes
      implementations for both TCP and UDP.
      
      Keep in mind that both client to server and server to client traffic
      passes the classifier.
      Signed-off-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: default avatarJoe Stringer <joe@wand.net.nz>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-5-joe@wand.net.nzCo-authored-by: default avatarJoe Stringer <joe@wand.net.nz>
      2d7824ff
    • Joe Stringer's avatar
      bpf: Don't refcount LISTEN sockets in sk_assign() · 7ae215d2
      Joe Stringer authored
      Avoid taking a reference on listen sockets by checking the socket type
      in the sk_assign and in the corresponding skb_steal_sock() code in the
      the transport layer, and by ensuring that the prefetch free (sock_pfree)
      function uses the same logic to check whether the socket is refcounted.
      Suggested-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarJoe Stringer <joe@wand.net.nz>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
      7ae215d2
    • Joe Stringer's avatar
      net: Track socket refcounts in skb_steal_sock() · 71489e21
      Joe Stringer authored
      Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
      the determination of whether the socket is reference counted in the case
      where it is prefetched by earlier logic such as early_demux.
      Signed-off-by: default avatarJoe Stringer <joe@wand.net.nz>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
      71489e21
    • Joe Stringer's avatar
      bpf: Add socket assign support · cf7fbe66
      Joe Stringer authored
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (eg by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: default avatarJoe Stringer <joe@wand.net.nz>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66
    • Daniel Borkmann's avatar
      bpf, doc: Add John as official reviewer to BPF subsystem · b49e42a2
      Daniel Borkmann authored
      We've added John Fastabend to our weekly BPF patch review rotation over
      last months now where he provided excellent and timely feedback on BPF
      patches. Therefore, add him to the BPF core reviewer team to the MAINTAINERS
      file to reflect that.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/0e9a74933b3f21f4c5b5a3bc7f8e900b39805639.1585556231.git.daniel@iogearbox.net
      b49e42a2
    • KP Singh's avatar
      bpf: btf: Fix arg verification in btf_ctx_access() · f50b49a0
      KP Singh authored
      The bounds checking for the arguments accessed in the BPF program breaks
      when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or
      BPF_MODIFY_RETURN resulting in no check being done for the default case
      (the programs which do not receive the return value of the attached
      function in its arguments) when the index of the argument being accessed
      is equal to the number of arguments (nr_args).
      
      This was a result of a misplaced "else if" block  introduced by the
      Commit 6ba43b76 ("bpf: Attachment verification for
      BPF_MODIFY_RETURN")
      
      Fixes: 6ba43b76 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarKP Singh <kpsingh@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org
      f50b49a0
    • Jann Horn's avatar
      bpf: Simplify reg_set_min_max_inv handling · 0fc31b10
      Jann Horn authored
      reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(),
      just flipped around. While this makes sense in a cBPF verifier (where ALU
      operations are not symmetric), it does not make sense for eBPF.
      
      Replace reg_set_min_max_inv() with a helper that flips the opcode around,
      then lets reg_set_min_max() do the complicated work.
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net
      0fc31b10
    • Jann Horn's avatar
      bpf: Fix tnum constraints for 32-bit comparisons · 604dca5e
      Jann Horn authored
      The BPF verifier tried to track values based on 32-bit comparisons by
      (ab)using the tnum state via 581738a6 ("bpf: Provide better register
      bounds after jmp32 instructions"). The idea is that after a check like
      this:
      
          if ((u32)r0 > 3)
            exit
      
      We can't meaningfully constrain the arithmetic-range-based tracking, but
      we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003).
      However, the implementation from 581738a6 didn't compute the tnum
      constraint based on the fixed operand, but instead derives it from the
      arithmetic-range-based tracking. This means that after the following
      sequence of operations:
      
          if (r0 >= 0x1'0000'0001)
            exit
          if ((u32)r0 > 7)
            exit
      
      The verifier assumed that the lower half of r0 is in the range (0, 0)
      and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus
      causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was
      incorrect. Provide a fixed implementation.
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net
      604dca5e
    • Daniel Borkmann's avatar
      bpf: Undo incorrect __reg_bound_offset32 handling · f2d67fec
      Daniel Borkmann authored
      Anatoly has been fuzzing with kBdysch harness and reported a hang in
      one of the outcomes:
      
        0: (b7) r0 = 808464432
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
        12: (95) exit
      
        from 8 to 9: safe
      
        from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
        9: safe
      
        from 8 to 9: safe
        verification time 589 usec
        stack depth 0
        processed 17 insns (limit 1000000) [...]
      
      The underlying program was xlated as follows:
      
        # bpftool p d x i 9
         0: (b7) r0 = 808464432
         1: (7f) r0 >>= r0
         2: (14) w0 -= 808464432
         3: (07) r0 += 808464432
         4: (b7) r1 = 808464432
         5: (de) if w1 s<= w0 goto pc+0
         6: (07) r0 += -2144337872
         7: (14) w0 -= -1607454672
         8: (25) if r0 > 0x30303030 goto pc+0
         9: (76) if w0 s>= 0x303030 goto pc+2
        10: (05) goto pc-1
        11: (05) goto pc-1
        12: (95) exit
      
      The verifier rewrote original instructions it recognized as dead code with
      'goto pc-1', but reality differs from verifier simulation in that we're
      actually able to trigger a hang due to hitting the 'goto pc-1' instructions.
      
      Taking different examples to make the issue more obvious: in this example
      we're probing bounds on a completely unknown scalar variable in r1:
      
        [...]
        5: R0_w=inv1 R1_w=inv(id=0) R10=fp0
        5: (18) r2 = 0x4000000000
        7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0
        7: (18) r3 = 0x2000000000
        9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0
        9: (18) r4 = 0x400
        11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0
        11: (18) r5 = 0x200
        13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        13: (2d) if r1 > r2 goto pc+4
         R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        14: (ad) if r1 < r3 goto pc+3
         R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
        15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        15: (2e) if w1 > w4 goto pc+2
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        16: (ae) if w1 < w5 goto pc+1
         R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
        [...]
      
      We're first probing lower/upper bounds via jmp64, later we do a similar
      check via jmp32 and examine the resulting var_off there. After fall-through
      in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked
      bits in the variable section.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds
      in jmp32.
      
      Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                          w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111100000000000000000000000000000000 / 0x7f00000000
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      The lower/upper bounds haven't changed since they have high bits set in
      u64 space and the jmp32 tests can only refine bounds in the low bits.
      
      However, for the var part the expectation would have been 0x7f000007ff
      or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000
      is not correct since it would contradict the earlier probed bounds
      where we know that the result should have been in [0x200,0x400] in u32
      space. Therefore, tests with such info will lead to wrong verifier
      assumptions later on like falsely predicting conditional jumps to be
      always taken, etc.
      
      The issue here is that __reg_bound_offset32()'s implementation from
      commit 581738a6 ("bpf: Provide better register bounds after jmp32
      instructions") makes an incorrect range assumption:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              u64 mask = 0xffffFFFF;
              struct tnum range = tnum_range(reg->umin_value & mask,
                                             reg->umax_value & mask);
              struct tnum lo32 = tnum_cast(reg->var_off, 4);
              struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
      
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the above walk-through example, __reg_bound_offset32() as-is chose
      a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000
      and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as
      well. However, in the umin:0x2000000000 and umax:0x4000000000 range above
      we'd end up with an actual possible interval of [0x0,0xffffffff] for u32
      space instead.
      
      In case of the original reproducer, the situation looked as follows at
      insn 5 for r0:
      
        [...]
        5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
                                     0x30303030           0x13030302f
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0
                                   0x30303030           0x13030302f
        [...]
      
      After the fall-through, we similarly forced the var_off result into
      the wrong range [0x30303030,0x3030302f] suggesting later on that fixed
      bits must only be of 0x30303020 with 0x10000001f unknowns whereas such
      assumption can only be made when both bounds in hi32 range match.
      
      Originally, I was thinking to fix this by moving reg into a temp reg and
      use proper coerce_reg_to_size() helper on the temp reg where we can then
      based on that define the range tnum for later intersection:
      
        static void __reg_bound_offset32(struct bpf_reg_state *reg)
        {
              struct bpf_reg_state tmp = *reg;
              struct tnum lo32, hi32, range;
      
              coerce_reg_to_size(&tmp, 4);
              range = tnum_range(tmp.umin_value, tmp.umax_value);
              lo32 = tnum_cast(reg->var_off, 4);
              hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
              reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
        }
      
      In the case of the concrete example, this gives us a more conservative unknown
      section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                                   w1 <= 0x400        and w1 >= 0x200:
      
        max: 0b100000000000000000000000000000000000000 / 0x4000000000
        var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
        min: 0b010000000000000000000000000000000000000 / 0x2000000000
      
      However, above new __reg_bound_offset32() has no effect on refining the
      knowledge of the register contents. Meaning, if the bounds in hi32 range
      mismatch we'll get the identity function given the range reg spans
      [0x0,0xffffffff] and we cast var_off into lo32 only to later on binary
      or it again with the hi32.
      
      Likewise, if the bounds in hi32 range match, then we mask both bounds
      with 0xffffffff, use the resulting umin/umax for the range to later
      intersect the lo32 with it. However, _prior_ called __reg_bound_offset()
      did already such intersection on the full reg and we therefore would only
      repeat the same operation on the lo32 part twice.
      
      Given this has no effect and the original commit had false assumptions,
      this patch reverts the code entirely which is also more straight forward
      for stable trees: apparently 581738a6 got auto-selected by Sasha's
      ML system and misclassified as a fix, so it got sucked into v5.4 where
      it should never have landed. A revert is low-risk also from a user PoV
      since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU
      to generate jmp32 instructions. A proper bounds refinement would need a
      significantly more complex approach which is currently being worked, but
      no stable material [0]. Hence revert is best option for stable. After the
      revert, the original reported program gets rejected as follows:
      
        1: (7f) r0 >>= r0
        2: (14) w0 -= 808464432
        3: (07) r0 += 808464432
        4: (b7) r1 = 808464432
        5: (de) if w1 s<= w0 goto pc+0
         R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
        6: (07) r0 += -2144337872
        7: (14) w0 -= -1607454672
        8: (25) if r0 > 0x30303030 goto pc+0
         R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0
        9: (76) if w0 s>= 0x303030 goto pc+2
         R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0
        10: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 11 insns (limit 1000000) [...]
      
        [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/
      
      Fixes: 581738a6 ("bpf: Provide better register bounds after jmp32 instructions")
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net
      f2d67fec
  2. 29 Mar, 2020 12 commits
  3. 28 Mar, 2020 15 commits
  4. 26 Mar, 2020 2 commits