1. 26 Jan, 2023 2 commits
    • inet: Add IP_LOCAL_PORT_RANGE socket option · 91d0b78c
      Jakub Sitnicki authored
      Users who want to share a single public IP address for outgoing connections
      between several hosts traditionally reach for SNAT. However, SNAT requires
      state keeping on the node(s) performing the NAT.
      
      A stateless alternative exists, where a single IP address used for egress
      can be shared between several hosts by partitioning the available ephemeral
      port range. In such a setup:
      
      1. Each host gets assigned a disjoint range of ephemeral ports.
      2. Applications open connections from the host-assigned port range.
      3. Return traffic gets routed to the host based on both the destination IP
         and the destination port.
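
The partitioning in step 1 can be sketched in a few lines. This is only an illustration: the helper name `partition_port_range`, the example bounds, and the host count are hypothetical and not part of the patch.

```python
def partition_port_range(lo, hi, n_hosts):
    """Split the inclusive port range [lo, hi] into n_hosts disjoint ranges.

    Hypothetical helper for illustration; how ranges are assigned to hosts
    in a real deployment is outside the scope of this commit.
    """
    total = hi - lo + 1
    if n_hosts <= 0 or total < n_hosts:
        raise ValueError("not enough ports for the requested number of hosts")
    size, extra = divmod(total, n_hosts)
    ranges = []
    start = lo
    for i in range(n_hosts):
        # The first `extra` hosts receive one additional port each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For example, splitting 60000-61023 between two hosts yields 60000-60511 for the first and 60512-61023 for the second.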
      
      An application which wants to open an outgoing connection (connect) from a
      given port range today can choose between two solutions:
      
      1. Manually pick the source port by bind()'ing to it before connect()'ing
         the socket.
      
         This approach has a couple of downsides:
      
         a) The search for a free port has to be implemented in user space. If
            the chosen 4-tuple happens to be busy, the application needs to retry
            from a different local port number.
      
            Detecting whether the 4-tuple is busy can be either easy (TCP) or hard
            (UDP). In the TCP case, the application simply has to check whether
            connect() returned an error (EADDRNOTAVAIL). This assumes that local
            port sharing was enabled (REUSEADDR) by all the sockets.
      
              # Assume desired local port range is 60_000-60_511
              s = socket(AF_INET, SOCK_STREAM)
              s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
              s.bind(("192.0.2.1", 60_000))
              s.connect(("1.1.1.1", 53))
              # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
              # Application must retry with another local port
      
            In the case of UDP, the network stack allows binding more than one
            socket to the same 4-tuple when local port sharing is enabled
            (REUSEADDR). Hence detecting the conflict is much harder and involves
            querying sock_diag and toggling the REUSEADDR flag [1].
      
         b) For TCP, bind()-ing to a port within the ephemeral port range means
            that connecting sockets, that is, those which leave it to the
            network stack to find a free local port at connect() time, cannot
            use this port.
      
            IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
            will be skipped during the free port search at connect() time.
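
The user-space retry loop that approach (1) forces on a TCP application might look roughly like the sketch below. The function name `connect_from_range` and the `make_socket` parameter are hypothetical, introduced only so the loop can be exercised without a network. Treating EADDRINUSE from bind() as retryable is also an assumption beyond what the text above states.

```python
import errno
import socket


def connect_from_range(local_ip, ports, remote, make_socket=socket.socket):
    """Try each local port in turn; return the first connected TCP socket.

    Hypothetical sketch of the manual free-port search described above.
    """
    for port in ports:
        s = make_socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Local port sharing must be enabled by all sockets involved.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((local_ip, port))
            s.connect(remote)
            return s
        except OSError as e:
            s.close()
            # A busy 4-tuple surfaces as EADDRNOTAVAIL from connect().
            # EADDRINUSE from bind() is retried too (an assumption).
            if e.errno not in (errno.EADDRNOTAVAIL, errno.EADDRINUSE):
                raise
    raise OSError(errno.EADDRNOTAVAIL, "no free local port in range")
```

A caller would invoke it as `connect_from_range("192.0.2.1", range(60_000, 60_512), ("1.1.1.1", 53))`. Note that this sketch does nothing about downside (b).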
      
      2. Isolate the app in a dedicated netns and use the per-netns
         ip_local_port_range sysctl to adjust the ephemeral port range bounds.
      
         The per-netns setting affects all sockets, so this approach can be used
         only if:
      
         - there is just one egress IP address, or
         - the desired egress port range is the same for all egress IP addresses
           used by the application.
      
         For TCP, this approach avoids the downsides of (1). Free port search and
         4-tuple conflict detection are done by the network stack:
      
           system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
      
           s = socket(AF_INET, SOCK_STREAM)
           s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
           s.bind(("192.0.2.1", 0))
           s.connect(("1.1.1.1", 53))
           # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
      
         For UDP, this approach has limited applicability. Setting the
         IP_BIND_ADDRESS_NO_PORT socket option does not result in the local
         source port being shared with other connected UDP sockets.
      
         Hence relying on the network stack to find a free source port limits the
         number of outgoing UDP flows from a single IP address to the number of
         available ephemeral ports.
      
      To put it another way, partitioning the ephemeral port range between hosts
      using the existing Linux networking API is cumbersome.
      
      To address this use case, add a new socket option at the SOL_IP level,
      named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
      ephemeral port range for each socket individually.
      
      The option can be used only to narrow down the per-netns local port
      range. If the per-socket range lies outside of the per-netns range, the
      latter takes precedence.
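
One way to picture the narrowing rule is the small model below. It is a sketch under stated assumptions, not the kernel code: it assumes each per-socket bound is checked independently against the per-netns range, and that 0 means "bound not set". The function name `effective_port_range` is hypothetical.

```python
def effective_port_range(netns_lo, netns_hi, sk_lo=0, sk_hi=0):
    """Model of how a per-socket range narrows the per-netns range.

    Assumptions (not confirmed by the text above): each bound is applied
    independently, and a bound of 0 is treated as unset.
    """
    lo, hi = netns_lo, netns_hi
    # A per-socket bound is honored only if it falls inside the netns range;
    # otherwise the per-netns bound takes precedence.
    if lo <= sk_lo <= hi:
        lo = sk_lo
    if lo <= sk_hi <= hi:
        hi = sk_hi
    return lo, hi
```

Under this model, a socket asking for 40000-40511 inside a netns range of 32768-60999 gets 40000-40511, while a request for 61000-65000 leaves the netns range untouched.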
      
      UAPI-wise, the low and high range bounds are passed to the kernel as a pair
      of u16 values in host byte order packed into a u32. This avoids pointer
      passing.
      
        PORT_LO = 40_000
        PORT_HI = 40_511
      
        s = socket(AF_INET, SOCK_STREAM)
        v = struct.pack("I", PORT_HI << 16 | PORT_LO)
        s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
        s.bind(("127.0.0.1", 0))
        s.getsockname()
        # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
        # if there is a free port. EADDRINUSE otherwise.
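
The u32 packing can be wrapped in a pair of small helpers. This is a minimal sketch: `pack_port_range` and `unpack_port_range` are hypothetical names. The constant is defined locally on the assumption that Python's socket module may not export it, and its value (51) is as recalled from the Linux UAPI header, so it should be verified against linux/in.h. The input validation is a convenience of the sketch, not a statement about what the kernel accepts.

```python
import struct

# Assumption: value from the Linux UAPI header <linux/in.h>; verify before use.
IP_LOCAL_PORT_RANGE = 51


def pack_port_range(lo, hi):
    """Pack (lo, hi) as two u16 values in a host-byte-order u32."""
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("port bounds must fit in 16 bits")
    if lo and hi and lo > hi:
        raise ValueError("low bound must not exceed high bound")
    # High bound in the upper 16 bits, low bound in the lower 16 bits.
    return struct.pack("I", (hi << 16) | lo)


def unpack_port_range(raw):
    """Inverse of pack_port_range(); returns (lo, hi)."""
    (value,) = struct.unpack("I", raw)
    return value & 0xFFFF, value >> 16


# Usage, on a kernel that supports the option:
#   s.setsockopt(socket.SOL_IP, IP_LOCAL_PORT_RANGE, pack_port_range(40_000, 40_511))
```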
      
      [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

      Reviewed-by: Marek Majkowski <marek@cloudflare.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Kconfig: fix spellos · 6a7a2c18
      Randy Dunlap authored
      Fix spelling in net/ Kconfig files.
      (reported by codespell)
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: coreteam@netfilter.org
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 25 Jan, 2023 17 commits
  3. 24 Jan, 2023 21 commits