• Jakub Kicinski's avatar
    net: store netdevs in an xarray · 759ab1ed
    Jakub Kicinski authored
    Iterating over the netdev hash table for netlink dumps is hard.
    Dumps are done in "chunks" so we need to save the position
    after each chunk, so we know where to restart from. Because
    netdevs are stored in a hash table we remember which bucket
    we were in and how many devices we dumped.
    
    Since we don't hold any locks across the "chunks" - devices may
    come and go while we're dumping. If that happens we may miss
    a device (if device is deleted from the bucket we were in).
    We indicate to user space that this may have happened by setting
    NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
    if it sees that. Somehow I doubt most user space gets this right..
    
    To illustrate let's look at an example:
    
                   System state:
      start:       # [A, B, C]
      del:  B      # [A, C]
    
    with the hash table we may dump [A, B], missing C completely even
    tho it existed both before and after the "del B".
    
    Add an xarray and use it to allocate ifindexes. This way we
    can iterate ifindexes in order, without the worry that we'll
    skip one. We may still generate a dump of a state which "never
    existed", for example for a set of values and sequence of ops:
    
                   System state:
      start:       # [A, B]
      add:  C      # [A, C, B]
      del:  B      # [A, C]
    
    we may generate a dump of [A], if C got an index between A and B.
    System has never been in such state. But I'm 90% sure that's perfectly
    fine, important part is that we can't _miss_ devices which exist before
    and after. User space which wants to mirror kernel's state subscribes
    to notifications and does periodic dumps so it will know that C exists
    from the notification about its creation or from the next dump
    (next dump is _guaranteed_ to include C, if it doesn't get removed).
    
    To avoid any perf regressions keep the hash table for now. Most
    net namespaces have very few devices and microbenchmarking 1M lookups
    on Skylake I get the following results (not counting loopback
    to number of devs):
    
     #devs | hash |  xa  | delta
        2  | 18.3 | 20.1 | + 9.8%
       16  | 18.3 | 20.1 | + 9.5%
       64  | 18.3 | 26.3 | +43.8%
      128  | 20.4 | 26.3 | +28.6%
      256  | 20.0 | 26.4 | +32.1%
     1024  | 26.6 | 26.7 | + 0.2%
     8192  |541.3 | 33.5 | -93.8%
    
    No surprises since the hash table has 256 entries.
    The microbenchmark scans indexes in order, if the pattern is more
    random xa starts to win at 512 devices already. But that's a lot
    of devices, in practice.
    Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
    Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
    759ab1ed
dev.c 290 KB