    ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
    David Ahern authored
    Nik reported a bug with the pcpu dst cache when nexthop objects are
    used, illustrated by the following:
        $ ip netns add foo
        $ ip -netns foo li set lo up
        $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
        $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
        $ ip li add veth1 type veth peer name veth2
        $ ip li set veth1 up
        $ ip addr add 2001:db8:10::1/64 dev veth1
        $ ip li set dev veth2 netns foo
        $ ip -netns foo li set veth2 up
        $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
        $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
        $ ip -6 route add 2001:db8:11::1/128 nhid 100
    
        Create a pcpu entry on cpu 0:
        $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
    
        Re-add the route entry:
        $ ip -6 ro del 2001:db8:11::1
        $ ip -6 route add 2001:db8:11::1/128 nhid 100
    
        Route get on cpu 0 returns the stale pcpu:
        $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
        RTNETLINK answers: Network is unreachable
    
        While cpu 1 works:
        $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
        2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
    
    Conversion of FIB entries to work with external nexthop objects
    missed an important difference between IPv4 and IPv6 - how dst
    entries are invalidated when the FIB changes. IPv4 has a per-network
    namespace generation id (rt_genid) that is bumped on changes to the FIB.
    Checking if a dst_entry is still valid means comparing rt_genid in the
    rtable to the current value of rt_genid for the namespace.
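
    The IPv4 scheme described above can be modeled roughly as follows. This
    is a user-space sketch, not kernel code: the toy_* types and functions
    are hypothetical stand-ins, and only the rt_genid field name is taken
    from the description.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a network namespace (not the kernel's struct net). */
struct toy_net {
    unsigned int rt_genid;  /* generation id, bumped on every FIB change */
};

/* Hypothetical stand-in for a cached IPv4 route (rtable). */
struct toy_rtable {
    unsigned int rt_genid;  /* namespace genid at the time of creation */
};

/* Creating a cached entry snapshots the namespace generation id. */
static struct toy_rtable toy_rtable_create(const struct toy_net *net)
{
    struct toy_rtable rt = { .rt_genid = net->rt_genid };
    return rt;
}

/* Any FIB change bumps the namespace-wide counter. */
static void toy_fib_change(struct toy_net *net)
{
    net->rt_genid++;
}

/* An entry is valid only while its snapshot matches the current genid. */
static bool toy_rtable_valid(const struct toy_net *net,
                             const struct toy_rtable *rt)
{
    return rt->rt_genid == net->rt_genid;
}
```

    In this model a single FIB change invalidates every cached entry in the
    namespace, regardless of which route was modified.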
    
    IPv6 also has a per-network namespace counter, fib6_sernum, but the
    count is saved per fib6_node. With the per-node counter, only dst_entries
    based on fib entries under the node are invalidated when changes are
    made to the routes, which limits the scope of invalidations. IPv6 uses a
    reference in the rt6_info, 'from', to track the corresponding fib entry
    used to create the dst_entry. When validating a dst_entry, 'from' is
    used to backtrack to the fib6_node and compare its sernum to the
    cookie passed to the dst_check operation.
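
    A simplified user-space model of the per-node check, for illustration
    only. The toy_* names and the fn_sernum field are assumptions of this
    sketch; the real lookup involves RCU and locking that are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the kernel's real structures. */
struct toy_net6      { int fib6_sernum; };              /* namespace counter */
struct toy_fib6_node { int fn_sernum; };                /* per-node copy     */
struct toy_fib6_info { struct toy_fib6_node *node; };   /* fib entry         */
struct toy_rt6_info  { struct toy_fib6_info *from; };   /* dst entry         */

/* A change to routes under a node stamps only that node with a new count. */
static void toy_fib6_change(struct toy_net6 *net, struct toy_fib6_node *fn)
{
    fn->fn_sernum = ++net->fib6_sernum;
}

/* The cookie is the sernum of the node reached by backtracking via 'from'. */
static bool toy_rt6_get_cookie(const struct toy_rt6_info *rt, int *cookie)
{
    if (!rt->from || !rt->from->node)
        return false;
    *cookie = rt->from->node->fn_sernum;
    return true;
}

/* dst_check: the entry is valid while the node's sernum matches the cookie. */
static bool toy_rt6_check(const struct toy_rt6_info *rt, int cookie)
{
    int current;

    return toy_rt6_get_cookie(rt, &current) && current == cookie;
}
```

    Here a change under one node leaves dst entries that backtrack to a
    different node untouched, which is the narrower invalidation scope
    described above.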
    
    With the inline format (nexthop definition inline with the fib6_info),
    dst_entries cached in the fib6_nh have a 1:1 correlation between fib
    entries, nexthop data and dst_entries. With external nexthops, IPv6
    looks more like IPv4, which means multiple fib entries across disparate
    fib6_nodes can all reference the same fib6_nh. That means validation
    of dst_entries based on external nexthops needs to use the IPv4 approach
    - the per-network namespace counter.
    
    Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
    rt6_get_cookie to return sernum if it is set, and update dst_check for
    IPv6 to check whether sernum is set and, if so, base the check on it.
    Finally, rt6_get_pcpu_route needs to validate the cached entry before
    returning a pcpu entry (similar to the rt_cache_valid calls in
    __mkroute_input and __mkroute_output for IPv4).
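
    The three changes can be sketched in user space as below. This is an
    illustration under stated assumptions, not the patch itself: the toy_*
    names, the genid field, and the collapsed from_node pointer are
    hypothetical, and a sernum of 0 is assumed to mean "not set".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the namespace, fib6_node and rt6_info. */
struct toy_net       { unsigned int genid; };   /* namespace-wide counter */
struct toy_fib6_node { unsigned int sernum; };  /* per-node counter       */

struct toy_rt6_info {
    const struct toy_net *net;
    const struct toy_fib6_node *from_node;  /* 'from' -> node, collapsed   */
    unsigned int sernum;                    /* 0 = unset (inline nexthop)  */
};

/* Cookie: prefer the dst's own sernum, else fall back to the node's. */
static unsigned int toy_rt6_get_cookie(const struct toy_rt6_info *rt)
{
    if (rt->sernum)
        return rt->sernum;
    return rt->from_node ? rt->from_node->sernum : 0;
}

/* dst_check: with sernum set, compare against the namespace counter. */
static bool toy_rt6_dst_check(const struct toy_rt6_info *rt,
                              unsigned int cookie)
{
    if (rt->sernum)
        return rt->sernum == rt->net->genid;
    return rt->from_node && rt->from_node->sernum == cookie;
}

/* pcpu lookup: validate the cached entry and drop it if it is stale. */
static struct toy_rt6_info *toy_get_pcpu_route(struct toy_rt6_info **slot)
{
    struct toy_rt6_info *rt = *slot;

    if (rt && rt->sernum && rt->sernum != rt->net->genid) {
        *slot = NULL;   /* caller is expected to create a fresh entry */
        return NULL;
    }
    return rt;
}
```

    In terms of the reproducer, the stale entry on cpu 0 would fail the
    check in toy_get_pcpu_route after the route is re-added, so a fresh
    entry is created instead of returning "Network is unreachable".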
    
    This problem only affects routes using the new, external nexthops.
    
    Thanks to the kbuild test robot for catching the IS_ENABLED needed
    around rt_genid_ipv6 before I sent this out.
    
    Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
    Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
    Signed-off-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
    Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
net_namespace.h 12.1 KB