    bpf, sockmap: Fix race in ingress receive verdict with redirect to self
    A socket in a sockmap may have different combinations of programs attached
    depending on configuration. There can be no programs, in which case the
    socket acts as a sink only. There can be a TX program only, in which case a
    BPF program is attached to the sending side but no RX program is attached.
    There can be an RX program only, where sends have no BPF program attached
    but receives are hooked with BPF. And finally, both TX and RX programs may
    be attached. This gives us the permutations:
    
     None, Tx, Rx, and TxRx
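
    Sketched as code, and assuming the config-indexed prot table sockmap
    already uses for the base/TX split (the names of the two RX entries are
    illustrative), the permutations map to:

      /* Illustrative sketch only: sockmap selects the sk->sk_prot callbacks
       * from a table indexed by a config value like this.
       */
      enum {
              TCP_BPF_BASE,   /* None: no BPF program, sink only */
              TCP_BPF_TX,     /* TX program only                 */
              TCP_BPF_RX,     /* RX program only                 */
              TCP_BPF_TXRX,   /* both TX and RX programs         */
              TCP_BPF_NUM_CFGS,
      };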
    
    To date most of our use cases have been the TX case, used as a fast
    datapath to copy directly between a local application and a userspace
    proxy, or the Rx and TxRx cases, used by applications operating an
    in-kernel proxy. The traffic in the first case, where we hook an
    application into a userspace proxy, looks like this:
    
      AppA  redirect   AppB
       Tx <-----------> Rx
       |                |
       +                +
       TCP <--> lo <--> TCP
    
    In this case all traffic from AppA (after the TCP 3-way handshake) is
    copied into the AppB ingress queue and no traffic is ever on the TCP
    receive_queue.
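
    For reference, this fast path is driven by an SK_MSG program redirecting
    into a sockmap. A minimal sketch; the map layout and the slot holding
    AppB's socket are assumptions for the example:

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      struct {
              __uint(type, BPF_MAP_TYPE_SOCKMAP);
              __uint(max_entries, 2);
              __type(key, __u32);
              __type(value, __u64);
      } sock_map SEC(".maps");

      SEC("sk_msg")
      int appa_to_appb(struct sk_msg_md *msg)
      {
              /* Assume slot 1 holds AppB's socket. BPF_F_INGRESS places the
               * data on AppB's ingress queue, so it never touches the TCP
               * receive_queue.
               */
              return bpf_msg_redirect_map(msg, &sock_map, 1, BPF_F_INGRESS);
      }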
    
    In the second case the application never receives traffic on the actual
    user space socket, except in some rare error cases. Instead the send
    happens in the kernel.
    
               AppProxy       socket pool
           sk0 ------------->{sk1,sk2, skn}
            ^                      |
            |                      |
            |                      v
           ingress              lb egress
           TCP                  TCP
    
    Here, because traffic is never read off the socket with userspace recv()
    APIs, there is only ever one reader of the sk receive_queue, namely the
    BPF programs.
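
    The forwarding in this mode is done by an SK_SKB verdict program
    redirecting into the socket pool. A minimal sketch, reusing the sock_map
    declared above; the backend selection is a placeholder:

      SEC("sk_skb/stream_verdict")
      int lb_ingress(struct __sk_buff *skb)
      {
              /* Hypothetical backend pick over {sk1, sk2, ... skn}; a real
               * load balancer would hash the flow here.
               */
              __u32 backend = 0;

              return bpf_sk_redirect_map(skb, &sock_map, backend, 0);
      }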
    
    However, we've started to introduce a third configuration where the BPF
    program on receive should process the data, but the normal case is then
    to push the data into the receive queue of AppB:
    
           AppB
           recv()                (userspace)
         -----------------------
           tcp_bpf_recvmsg()     (kernel)
             |             |
             |             |
             |             |
           ingress_msgQ    |
             |             |
           RX_BPF          |
             |             |
             v             v
           sk->receive_queue
    
    This is different from the App{A,B} redirect because traffic is first received
    on the sk->receive_queue.
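
    In this configuration the verdict program does its processing and then
    returns SK_PASS, which enqueues the data on the ingress msgQ for
    tcp_bpf_recvmsg() instead of redirecting it. A minimal sketch:

      SEC("sk_skb/stream_verdict")
      int rx_process(struct __sk_buff *skb)
      {
              /* Inspect/account the data here (e.g. bump a counter), then
               * let SK_PASS hand it to AppB's recv() via the ingress msgQ.
               */
              return SK_PASS;
      }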
    
    Now for the issue. The tcp_bpf_recvmsg() handler first checks the
    ingress_msg queue for any data handled by the BPF rx program and returned
    with the PASS code, meaning it was enqueued on the ingress msg queue.
    Then, if no data exists on that queue, it checks the socket receive
    queue. Unfortunately, this is the same receive_queue the BPF program is
    reading data off of, so we get a race. It's possible for the recvmsg()
    hook to pull data off the receive_queue before the BPF hook has a chance
    to read it. It typically happens when an application is banging on recv()
    and getting EAGAINs, until it manages to race with the RX BPF program.
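
    Abbreviated from the pre-patch tcp_bpf_recvmsg() (simplified, not
    verbatim), the racy shortcut is the direct fallthrough to tcp_recvmsg():

      /* Pre-patch flow, simplified: if the psock queue is empty but the TCP
       * receive_queue is not, recv() reads the receive_queue directly, i.e.
       * the same queue the RX BPF program consumes. That is the race window.
       */
      if (!skb_queue_empty(&sk->sk_receive_queue) &&
          sk_psock_queue_empty(psock)) {
              sk_psock_put(sk, psock);
              return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
      }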
    
    To fix this, note that before this patch, at attach time when the socket
    is loaded into the map, we checked whether it needs a TX program or just
    the base set of proto bpf hooks, and then used the above general RX hook
    regardless of whether a BPF program is attached at rx or not. This patch
    extends that check to handle all the cases enumerated above: TX, RX,
    TXRX, and none. And to fix the above race when an RX program is attached,
    we use a new hook that is nearly identical to the old one, except now we
    do not let the recv() call skip the RX BPF program. Now only the BPF
    program pulls data from sk->receive_queue, and recv() only pulls data
    from the ingress msgQ post BPF program handling.
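
    Conceptually (a sketch of the idea, not the verbatim patch; names follow
    the enum sketch above), the attach-time selection and the new RX hook
    behave like:

      /* Attach time: pick the full config instead of just BASE vs TX. */
      if (psock->progs.stream_verdict || psock->progs.skb_verdict)
              config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;

      /* New RX/TXRX recvmsg hook: never falls back to sk->receive_queue,
       * only drains the ingress msgQ filled by the BPF program.
       */
      msg_bytes_ready:
              copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
              if (!copied) {
                      /* Wait for the BPF program to queue more data, then
                       * retry; timeout/error handling elided.
                       */
                      if (tcp_msg_wait_data(sk, psock, timeo))
                              goto msg_bytes_ready;
                      copied = -EAGAIN;
              }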
    
    With this resolved, our AppB from above has been up and running for many
    hours without detecting any errors. We verify this by correlating
    counters in RX BPF events and in AppB to ensure data never skips the BPF
    program. Selftests were not able to detect this because we only run them
    for a short period of time on well-ordered send/recvs, so we don't get
    any of the noise we see in real application environments.
    
    Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
    Signed-off-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Tested-by: Jussi Maki <joamaki@gmail.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com