    bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
    Daniel Borkmann authored
    As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
    for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
    transparent to applications. In Cilium we make use of these hooks [0] in
    order to enable E-W load balancing for existing Kubernetes service types
    for all Cilium managed nodes in the cluster. Those backends can be local
    or remote. The main advantage of this approach is that it operates as close
    as possible to the socket and therefore avoids packet-based NAT, given that
    in the connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
    
    This also allows exposing NodePort services on loopback addresses in the
    host namespace, for example. As another advantage, this also efficiently
    blocks bind requests for applications in the host namespace for exposed
    ports. However, one missing item is that we also need to perform reverse
    xlation for inet{,6}_getname() hooks such that we can return the service
    IP/port tuple back to the application instead of the remote peer address.
    
    The vast majority of applications do not care about getpeername(), but
    on a few occasions we've seen breakage when validating the peer's address
    since it unexpectedly returns the backend tuple instead of the service one.
    Therefore, this trivial patch adds a customisable getpeername() as well as
    getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address
    this situation.
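
    To make the failure mode concrete, the following is a minimal userspace
    sketch of the kind of peer validation an application might perform. The
    function names (make_addr, same_endpoint, connect_and_verify_peer) are
    hypothetical and not taken from any real application; only connect() and
    getpeername() are standard socket calls. Without reverse xlation, such a
    check would compare the service tuple against the backend tuple and fail.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Build an IPv4 socket address from dotted-quad text and a host-order port. */
struct sockaddr_in make_addr(const char *ip, unsigned short port)
{
    struct sockaddr_in sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    inet_pton(AF_INET, ip, &sa.sin_addr);
    return sa;
}

/* Return 1 if both addresses name the same IPv4 endpoint, else 0. */
int same_endpoint(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    return a->sin_addr.s_addr == b->sin_addr.s_addr &&
           a->sin_port == b->sin_port;
}

/*
 * Hypothetical application-side check: connect to dst, then verify that
 * getpeername() reports the endpoint we asked for.
 * Returns 1 on match, 0 on mismatch, -1 on socket error.
 */
int connect_and_verify_peer(int fd, const struct sockaddr_in *dst)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    if (connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0)
        return -1;
    memset(&peer, 0, sizeof(peer));
    if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
        return -1;
    return same_endpoint(&peer, dst);
}
```

    With the service 1.2.3.4:80 and the backend 10.0.0.10:80 from the example
    in this message, same_endpoint() reports a mismatch between the two, which
    is what an application sees when getpeername() returns the backend tuple.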
    
    Simple example:
    
      # ./cilium/cilium service list
      ID   Frontend     Service Type   Backend
      1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
    
    Before; curl's verbose output example, no getpeername() reverse xlation:
    
      # curl --verbose 1.2.3.4
      * Rebuilt URL to: 1.2.3.4/
      *   Trying 1.2.3.4...
      * TCP_NODELAY set
      * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
      > GET / HTTP/1.1
      > Host: 1.2.3.4
      > User-Agent: curl/7.58.0
      > Accept: */*
      [...]
    
    After; with getpeername() reverse xlation:
    
      # curl --verbose 1.2.3.4
      * Rebuilt URL to: 1.2.3.4/
      *   Trying 1.2.3.4...
      * TCP_NODELAY set
      * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
      > GET / HTTP/1.1
      > Host: 1.2.3.4
      > User-Agent: curl/7.58.0
      > Accept: */*
      [...]
    
    Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
    peer to the context, similar to how inet{,6}_getname() handles it, but API-wise
    this is suboptimal as it always forces programs to test for ctx->peer,
    which can easily be missed, hence the BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
    Similarly, the checked return code is on tnum_range(1, 1), but if a use case
    comes up in future, it can easily be changed to return an error code instead.
    Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
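
    As an illustration of what a getpeername hook does, here is a plain
    userspace C model, not an actual BPF program. The struct sock_addr_model
    type, the rev_nat_entry table and getpeername4_model() are assumptions made
    for this sketch; a real program would operate on the kernel's
    struct bpf_sock_addr and look up a BPF map instead of scanning an array.
    The model rewrites a backend tuple to its service tuple and always returns
    1, mirroring the tnum_range(1, 1) restriction on the return code.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the two IPv4 fields of the hook context
 * that this sketch touches. Both values are in network byte order.
 */
struct sock_addr_model {
    uint32_t user_ip4;
    uint32_t user_port;
};

/* Hypothetical reverse-NAT entry mapping a backend tuple to a service tuple. */
struct rev_nat_entry {
    uint32_t backend_ip4;
    uint32_t backend_port;
    uint32_t service_ip4;
    uint32_t service_port;
};

/*
 * Model of a getpeername hook for IPv4: if the peer address in ctx is a
 * known backend, replace it with the service address the application
 * originally connected to. Unknown peers are left untouched.
 */
int getpeername4_model(struct sock_addr_model *ctx,
                       const struct rev_nat_entry *tbl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ctx->user_ip4 == tbl[i].backend_ip4 &&
            ctx->user_port == tbl[i].backend_port) {
            ctx->user_ip4 = tbl[i].service_ip4;
            ctx->user_port = tbl[i].service_port;
            break;
        }
    }
    /* The return code is checked against tnum_range(1, 1), so always 1. */
    return 1;
}
```

    Because getpeername and getsockname are separate attach types, a model
    like this needs no ctx->peer test: the attach point already determines
    which address is being rewritten.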
    
      [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Andrii Nakryiko <andriin@fb.com>
    Acked-by: Andrey Ignatov <rdna@fb.com>
    Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
syscall.c 97.1 KB