• Daniel Borkmann's avatar
    ipvlan: disallow userns cap_net_admin to change global mode/flags · 1a0db05a
    Daniel Borkmann authored
    BugLink: https://bugs.launchpad.net/bugs/1822271
    
    [ Upstream commit 7cc9f700 ]
    
    When running Docker with userns isolation e.g. --userns-remap="default"
    and spawning up some containers with CAP_NET_ADMIN under this realm, I
    noticed that link changes on ipvlan slave device inside that container
    can affect all devices from this ipvlan group which are in other net
    namespaces where the container should have no permission to make changes
    to, such as the init netns, for example.
    
    This effectively allows to undo ipvlan private mode and switch globally to
    bridge mode where slaves can communicate directly without going through
    hostns, or it allows to switch between global operation mode (l2/l3/l3s)
    for everyone bound to the given ipvlan master device. libnetwork plugin
    here is creating an ipvlan master and ipvlan slave in hostns and a slave
    each that is moved into the container's netns upon creation event.
    
    * In hostns:
    
      # ip -d a
      [...]
      8: cilium_host@bond0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
         link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
         ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
         inet 10.41.0.1/32 scope link cilium_host
           valid_lft forever preferred_lft forever
      [...]
    
    * Spawn container & change ipvlan mode setting inside of it:
    
      # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
      9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c
    
      # docker exec -ti client ip -d a
      [...]
      10: cilium0@if4: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
             valid_lft forever preferred_lft forever
    
      # docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2
    
      # docker exec -ti client ip -d a
      [...]
      10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
             valid_lft forever preferred_lft forever
    
    * In hostns (mode switched to l2):
    
      # ip -d a
      [...]
      8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.41.0.1/32 scope link cilium_host
             valid_lft forever preferred_lft forever
      [...]
    
    Same l3 -> l2 switch would also happen by creating another slave inside
    the container's network namespace when specifying the existing cilium0
    link to derive the actual (bond0) master:
    
      # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2
    
      # docker exec -ti client ip -d a
      [...]
      2: cilium1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
             valid_lft forever preferred_lft forever
    
    * In hostns:
    
      # ip -d a
      [...]
      8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
          ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
          inet 10.41.0.1/32 scope link cilium_host
             valid_lft forever preferred_lft forever
      [...]
    
    One way to mitigate it is to check CAP_NET_ADMIN permissions of
    the ipvlan master device's ns, and only then allow to change
    mode or flags for all devices bound to it. Above two cases are
    then disallowed after the patch.
    Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
    Acked-by: default avatarMahesh Bandewar <maheshb@google.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: default avatarStefan Bader <stefan.bader@canonical.com>
    Acked-by: default avatarJuerg Haefliger <juerg.haefliger@canonical.com>
    Signed-off-by: default avatarKleber Sacilotto de Souza <kleber.souza@canonical.com>
    1a0db05a
ipvlan_main.c 20.3 KB