1. 20 Jul, 2021 32 commits
    • Merge branch 'veth-flexible-channel-numbers' · 542bb396
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      veth: more flexible channels number configuration
      
      XDP setups can benefit from multiple veth RX/TX queues. Currently
      veth allows setting the number of queues only at creation time via the
      'numrxqueues' and 'numtxqueues' parameters.
      
      This series introduces support for the ethtool set_channel operation
      and allows configuring the queue number via a new module parameter.
      
      The veth default configuration is not changed.
      
      Finally self-tests are updated to check the new features, with both
      valid and invalid arguments.
      
      This iteration is a rebase of the most recent RFC. It does not provide
      a module parameter to configure the default number of queues, but I
      think one could be worthwhile.
      
      RFC v1 -> RFC v2:
       - report more consistent 'combined' count
       - make set_channel as resilient as possible to errors
       - drop module parameter - but I would still consider it.
       - more self-tests
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-vlan-multicast' · 2c080404
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: multicast: add vlan support
      
      This patchset adds initial per-vlan multicast support, most of the code
      deals with moving to multicast context pointers from bridge/port pointers.
      That allows us to switch them with the per-vlan contexts when a multicast
      packet is being processed and vlan multicast snooping has been enabled.
      That is controlled by a global bridge option added in patch 06 which is
      off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
      that this option can change only under RTNL and doesn't require
      multicast_lock, so we need to be careful when retrieving mcast contexts
      in parallel. For packet processing they are switched only once in
      br_multicast_rcv() and then used until the packet has been processed.
      For the most part we need these contexts only to read config values and
      check if they are disabled. The global mcast state which is maintained
      consists of querier and router timers, the rest are config options.
      The port mcast state which is maintained consists of query timer and
      link to router port list if it's ever marked as a router port. Port
      multicast contexts _must_ be used only with their respective global
      contexts, that is a bridge port's mcast context must be used only with
      bridge's global mcast context and a vlan/port's mcast context must be
      used only with that vlan's global mcast context due to the router port
      lists. This way a bridge port can be marked as a router in multiple
      vlans, but might not be a router in some other vlan. Also this allows us
      to have per-vlan querier elections, per-vlan queries and basically the
      whole multicast state becomes per-vlan when the option is enabled.
      One of the hardest parts is synchronization with vlan's memory
      management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
      which is changed only under multicast_lock. When a vlan is being
      destroyed first that flag is removed under the lock, then the multicast
      context is torn down which includes waiting for any outstanding context
      timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
      it must be checked first if the contexts are vlan and the multicast_lock
      has been acquired. That is done by all IGMP/MLD packet processing
      functions and timers. When processing a packet we have RCU so the vlan
      memory won't be freed, but if the flag is missing we must not process it.
      The timers are synchronized in the same way with the addition of waiting
      for them to finish in case they are running after removing the flag
      under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
      snooping requires vlan filtering to be enabled, if it's disabled then
      snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
      controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
      vlan disabled checks. We need both flags because one is controlled by
      user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
      for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
      Since the latter is also used for synchronization between the multicast
      and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
      rely on it when checking if a vlan context is disabled. The multicast
      fast-path has 3 new bit tests on the cache-hot bridge flags field, but I
      didn't observe any measurable difference. I haven't forced either
      context's options to be always disabled when the other type is enabled
      because the state consists of timers which either expire (router) or
      don't affect the normal operation. Some options, like the mcast querier
      one, won't be allowed to change for the disabled context type; that will
      come with a future patch-set which adds per-vlan querier control.
      
      Another important addition is the global vlan options, so far we had
      only per bridge/port vlan options but in order to control vlan multicast
      snooping globally we need to add a new type of global vlan options.
      They can be changed only on the bridge device and are dumped only when a
      special flag is set in the dump request. The first global option is vlan
      mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
      private flag. It can be set only on master vlan entries. There will be
      many more global vlan options in the future both for multicast config
      and other per-vlan options (e.g. STP).
      
      There's a lot of room for improvements, I'll do some of the initial
      ones but splitting the state to different contexts opens the door
      for a lot more. Also any new multicast options become vlan-supported with
      very little to no effort by using the same contexts.
      
      Short patch description:
        patches 01-04: initial mcast context add, no functional changes
        patch      05: adds vlan mcast init and control helpers and uses them on
                       vlan create/destroy
        patch      06: adds a global bridge mcast vlan snooping knob (default
                       off)
        patches 07-08: add a helper for users which must derive the contexts
                       based on current bridge and vlan options (e.g. timers)
        patch      09: adds checks for disabled vlan contexts in packet
                       processing and timers
        patch      10: adds support for per-vlan querier and tagged queries
        patch      11: adds router port vlan id in the notifications
        patches 12-14: add global vlan options support (change, dump, notify)
        patch      15: adds per-vlan global mcast snooping control
      
      Future patch-sets which build on this one (in order):
       - vlan state mcast handling
       - user-space mdb contexts (currently only the bridge contexts are used
         there)
       - all bridge multicast config options added per-vlan global and per
         vlan/port
       - iproute2 support for all the new uAPIs
       - selftests
      
      This set has been stress-tested (deleting/adding ports/vlans while changing
      vlan mcast snooping while processing IGMP/MLD packets), and also has
      passed all bridge self-tests. I'm sending this set as early as possible
      since there're a few more related sets that should go in the same
      release to get proper and full mcast vlan snooping support.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: veth: add tests for set_channel · 1ec2230f
      Paolo Abeni authored
      Simple functional test for the newly exposed features.
      
      Also add an optional stress test for the channel number
      update under flood.
      
      RFC v1 -> RFC v2:
       - add the stress test
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: create by default nr_possible_cpus queues · 9d3684c2
      Paolo Abeni authored
      This allows easier XDP usage. The number of default active
      queues is not changed: 1 RX and 1 TX so that this does
      not introduce overhead on the datapath for queue selection.
      
      v1 -> v2:
       - drop the module parameter, force default to nr_possible_cpus - Toke
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: implement support for set_channel ethtool op · 4752eeb3
      Paolo Abeni authored
      This change implements the set_channel() ethtool operation,
      preserving the current default values and allowing the number
      of queues to be set within the range fixed at device creation
      time.

      The update operation tries hard to leave the device in a
      consistent state in case of errors.
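
The roll-back behaviour described above can be sketched in plain C. This is a minimal userspace model, not the veth driver code: `struct fake_veth`, `enable_queue()`, `disable_queue()`, `set_channels()` and the `fail_at` test hook are all hypothetical names introduced for illustration. It assumes, as the series states for its helpers, that disabling a queue cannot fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUES 8

/* Hypothetical device model, not the veth driver's real structures. */
struct fake_veth {
	bool enabled[MAX_QUEUES];   /* per-queue "enabled" state */
	size_t active;              /* number of currently active queues */
	size_t fail_at;             /* test hook: enabling this index fails */
};

/* Enabling a queue may fail (think of an allocation failure). */
static int enable_queue(struct fake_veth *dev, size_t i)
{
	if (i == dev->fail_at)
		return -1;
	dev->enabled[i] = true;
	return 0;
}

/* Disabling never fails, which is what makes the roll-back safe. */
static void disable_queue(struct fake_veth *dev, size_t i)
{
	dev->enabled[i] = false;
}

/* Change the active queue count; on error restore the previous state. */
static int set_channels(struct fake_veth *dev, size_t new_count)
{
	size_t old = dev->active, i;

	assert(new_count >= 1 && new_count <= MAX_QUEUES);

	/* Growing: enable the additional queues one by one. */
	for (i = old; i < new_count; i++) {
		if (enable_queue(dev, i) != 0) {
			/* Undo only the queues enabled by this call. */
			while (i > old)
				disable_queue(dev, --i);
			return -1;
		}
	}

	/* Shrinking: disable the queues that are no longer needed. */
	for (i = new_count; i < old; i++)
		disable_queue(dev, i);

	dev->active = new_count;
	return 0;
}
```

A failed call leaves `active` and the `enabled[]` array exactly as they were before, so the caller sees either the new configuration or the old one.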
      
      RFC v1 -> RFC v2:
       - don't flip device status on set_channel()
       - roll-back the changes if possible on error - Jakub
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: factor out initialization helper · dedd53c5
      Paolo Abeni authored
      Extract into simpler helpers the code to enable and disable a
      range of xdp/napi instances, with the common property that
      "disable" helpers can't fail.
      
      Will be used by the next patch. No functional change intended.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: always report zero combined channels · f7918b79
      Paolo Abeni authored
      veth get_channel currently reports channels as being both RX/TX and
      combined. As Jakub noted:
      
      """
      ethtool man page is relatively clear, unfortunately the kernel code
      is not and few read the man page. A channel is approximately an IRQ,
      not a queue, and IRQ can't be dedicated and combined simultaneously
      """
      
      This patch changes the information exposed by veth_get_channels,
      setting max_combined to zero, being more consistent with the above
      statement. The ethtool_channels is always cleared by the caller, we just
      need to avoid setting the 'combined' fields.
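
As an illustration of the reporting change, here is a small userspace sketch. `struct channels` is a simplified stand-in for the kernel's `struct ethtool_channels`, and `struct fake_veth_queues`, `get_channels()` and `query_channels()` are hypothetical names, so this shows the idea rather than the actual `veth_get_channels()` code.

```c
#include <string.h>

/* Simplified stand-in for the kernel's struct ethtool_channels. */
struct channels {
	unsigned int max_rx, max_tx, max_combined;
	unsigned int rx_count, tx_count, combined_count;
};

/* Hypothetical device with separately counted RX and TX queues. */
struct fake_veth_queues {
	unsigned int num_rx, num_tx;           /* allocated at creation */
	unsigned int real_num_rx, real_num_tx; /* currently active */
};

/* Report dedicated RX/TX channels only; never touch 'combined'. */
static void get_channels(const struct fake_veth_queues *dev,
			 struct channels *ch)
{
	ch->max_rx = dev->num_rx;
	ch->max_tx = dev->num_tx;
	ch->rx_count = dev->real_num_rx;
	ch->tx_count = dev->real_num_tx;
	/* max_combined and combined_count stay at the caller's zeroes. */
}

/* Models the caller, which clears the structure before the callback. */
static struct channels query_channels(const struct fake_veth_queues *dev)
{
	struct channels ch;

	memset(&ch, 0, sizeof(ch));
	get_channels(dev, &ch);
	return ch;
}
```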
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: enable accounting for scm_fp_list objects · 2c6ad20b
      Vasily Averin authored
      Unix sockets allow sending file descriptors via SCM_RIGHTS type messages.
      Each such send call forces the kernel to allocate up to 2KB of memory for
      struct scm_fp_list.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation · 1b51d827
      Vasily Averin authored
      Author: Andrey Ryabinin <aryabinin@virtuozzo.com>
      
      The size of the ip_tunnel_prl structs allocation is controllable from
      user-space, thus it's better to avoid spam in dmesg if the allocation fails.
      Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
      accounting. The allocation is temporary and limited to 4GB.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: enable accounting for VLAN group array · a89893dd
      Vasily Averin authored
      The vlan array consumes up to 8 pages of memory per net device.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: enable accounting for inet_bin_bucket cache · 990c74e3
      Vasily Averin authored
      A net namespace can create up to 64K tcp and dccp ports and force the
      kernel to allocate up to several megabytes of memory per netns
      for inet_bind_bucket objects.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: enable accounting for IP address and routing-related objects · 6126891c
      Vasily Averin authored
      A netadmin inside a container can use 'ip a a' and 'ip r a'
      to assign a large number of ipv4/ipv6 addresses and routing entries
      and force the kernel to allocate megabytes of unaccounted memory
      for long-lived per-netdevice related kernel objects:
      'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
      'struct rt6_info', 'struct fib_rules' and ip_fib caches.

      These objects can be manually removed, though they usually live
      in memory until their net namespace is destroyed.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      One such object is 'struct fib6_node', mostly allocated in
      net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
      
       write_lock_bh(&table->tb6_lock);
       err = fib6_add(&table->tb6_root, rt, info, mxc);
       write_unlock_bh(&table->tb6_lock);
      
      In this case it is not enough to simply add SLAB_ACCOUNT to the
      corresponding kmem cache. The proper memory cgroup still cannot be found
      due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
      
      The obsolete in_interrupt() does not describe the real execution context
      properly. From include/linux/preempt.h:
      
       The following macros are deprecated and should not be used in new code:
       in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled
      
      To verify the current execution context, a new macro should be used instead:
       in_task()	- We're in task context
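
The difference between the two checks can be illustrated with a simplified userspace model of the preempt counter. The bit layout and constants below are assumptions made for this sketch (they follow my reading of include/linux/preempt.h but are not copied from it), and `model_in_interrupt()` / `model_in_task()` are hypothetical names. The point is that a section with bottom halves disabled, such as the write_lock_bh() region above, looks like "interrupt" context to the old check while still being task context.

```c
#include <stdbool.h>

/* Assumed, simplified layout of the preempt counter for this model. */
#define SOFTIRQ_SHIFT   8
#define HARDIRQ_SHIFT   16
#define NMI_SHIFT       20

#define SOFTIRQ_MASK    (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (0xfu << HARDIRQ_SHIFT)
#define NMI_MASK        (0xfu << NMI_SHIFT)

/* Added once while a softirq handler is actually being served. */
#define SOFTIRQ_OFFSET          (1u << SOFTIRQ_SHIFT)
/* Added when task code merely disables bottom halves. */
#define SOFTIRQ_DISABLE_OFFSET  (2u * SOFTIRQ_OFFSET)
/* Added while a hard interrupt handler runs. */
#define HARDIRQ_OFFSET          (1u << HARDIRQ_SHIFT)

/* Old check: true in NMI/IRQ/softirq context or with BH disabled. */
static bool model_in_interrupt(unsigned int count)
{
	return (count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* New check: false only when really serving an NMI, IRQ or softirq. */
static bool model_in_task(unsigned int count)
{
	return (count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)) == 0;
}
```

With `count == SOFTIRQ_DISABLE_OFFSET`, both functions return true in this model, which is why a bypass decision based on the old check skips accounting for allocations made under a `_bh` lock.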
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • memcg: enable accounting for net_device and Tx/Rx queues · c948f51c
      Vasily Averin authored
      A container netadmin can create a lot of fake net devices,
      then create a new net namespace and repeat it again and again.
      A net device can request the creation of up to 4096 tx and rx queues,
      and force the kernel to allocate up to several tens of megabytes of
      memory per net device.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-vlan-multicast' · 2967eed9
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: multicast: add vlan support
      
      This patchset adds initial per-vlan multicast support, most of the code
      deals with moving to multicast context pointers from bridge/port pointers.
      That allows us to switch them with the per-vlan contexts when a multicast
      packet is being processed and vlan multicast snooping has been enabled.
      That is controlled by a global bridge option added in patch 06 which is
      off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
      that this option can change only under RTNL and doesn't require
      multicast_lock, so we need to be careful when retrieving mcast contexts
      in parallel. For packet processing they are switched only once in
      br_multicast_rcv() and then used until the packet has been processed.
      For the most part we need these contexts only to read config values and
      check if they are disabled. The global mcast state which is maintained
      consists of querier and router timers, the rest are config options.
      The port mcast state which is maintained consists of query timer and
      link to router port list if it's ever marked as a router port. Port
      multicast contexts _must_ be used only with their respective global
      contexts, that is a bridge port's mcast context must be used only with
      bridge's global mcast context and a vlan/port's mcast context must be
      used only with that vlan's global mcast context due to the router port
      lists. This way a bridge port can be marked as a router in multiple
      vlans, but might not be a router in some other vlan. Also this allows us
      to have per-vlan querier elections, per-vlan queries and basically the
      whole multicast state becomes per-vlan when the option is enabled.
      One of the hardest parts is synchronization with vlan's memory
      management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
      which is changed only under multicast_lock. When a vlan is being
      destroyed first that flag is removed under the lock, then the multicast
      context is torn down which includes waiting for any outstanding context
      timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
      it must be checked first if the contexts are vlan and the multicast_lock
      has been acquired. That is done by all IGMP/MLD packet processing
      functions and timers. When processing a packet we have RCU so the vlan
      memory won't be freed, but if the flag is missing we must not process it.
      The timers are synchronized in the same way with the addition of waiting
      for them to finish in case they are running after removing the flag
      under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
      snooping requires vlan filtering to be enabled, if it's disabled then
      snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
      controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
      vlan disabled checks. We need both flags because one is controlled by
      user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
      for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
      Since the latter is also used for synchronization between the multicast
      and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
      rely on it when checking if a vlan context is disabled. The multicast
      fast-path has 3 new bit tests on the cache-hot bridge flags field, but I
      didn't observe any measurable difference. I haven't forced either
      context's options to be always disabled when the other type is enabled
      because the state consists of timers which either expire (router) or
      don't affect the normal operation. Some options, like the mcast querier
      one, won't be allowed to change for the disabled context type; that will
      come with a future patch-set which adds per-vlan querier control.
      
      Another important addition is the global vlan options, so far we had
      only per bridge/port vlan options but in order to control vlan multicast
      snooping globally we need to add a new type of global vlan options.
      They can be changed only on the bridge device and are dumped only when a
      special flag is set in the dump request. The first global option is vlan
      mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
      private flag. It can be set only on master vlan entries. There will be
      many more global vlan options in the future both for multicast config
      and other per-vlan options (e.g. STP).
      
      There's a lot of room for improvements, I'll do some of the initial
      ones but splitting the state to different contexts opens the door
      for a lot more. Also any new multicast options become vlan-supported with
      very little to no effort by using the same contexts.
      
      Short patch description:
        patches 01-04: initial mcast context add, no functional changes
        patch      05: adds vlan mcast init and control helpers and uses them on
                       vlan create/destroy
        patch      06: adds a global bridge mcast vlan snooping knob (default
                       off)
        patches 07-08: add a helper for users which must derive the contexts
                       based on current bridge and vlan options (e.g. timers)
        patch      09: adds checks for disabled vlan contexts in packet
                       processing and timers
        patch      10: adds support for per-vlan querier and tagged queries
        patch      11: adds router port vlan id in the notifications
        patches 12-14: add global vlan options support (change, dump, notify)
        patch      15: adds per-vlan global mcast snooping control
      
      Future patch-sets which build on this one (in order):
       - vlan state mcast handling
       - user-space mdb contexts (currently only the bridge contexts are used
         there)
       - all bridge multicast config options added per-vlan global and per
         vlan/port
       - iproute2 support for all the new uAPIs
       - selftests
      
      This set has been stress-tested (deleting/adding ports/vlans while changing
      vlan mcast snooping while processing IGMP/MLD packets), and also has
      passed all bridge self-tests. I'm sending this set as early as possible
      since there're a few more related sets that should go in the same
      release to get proper and full mcast vlan snooping support.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: vlan: add mcast snooping control · 9dee572c
      Nikolay Aleksandrov authored
      Add a new global vlan option which controls whether multicast snooping
      is enabled or disabled for a single vlan. It controls the vlan private
      flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: vlan: notify when global options change · 9aba624d
      Nikolay Aleksandrov authored
      Add support for global options notifications. They use only RTM_NEWVLAN
      since global options can only be set and are contained in a separate
      vlan global options attribute. Notifications are compressed in ranges
      where possible, i.e. the sequential vlan options are equal.
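
Range compression of this kind can be sketched as follows. This is a generic illustration rather than the bridge code: `struct vlan_opts`, `struct vlan_range` and `compress_ranges()` are hypothetical names, and a single boolean stands in for the set of global options being compared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vlan entry: id plus one global option. */
struct vlan_opts {
	unsigned short vid;
	bool mcast_snooping;
};

/* One compressed range of consecutive vlans with equal options. */
struct vlan_range {
	unsigned short start, end;
	bool mcast_snooping;
};

/*
 * Collapse entries (sorted by vid) into ranges. A vlan extends the
 * current range only if its id directly follows the range end and its
 * options are equal. 'out' must have room for n ranges. Returns the
 * number of ranges written.
 */
static size_t compress_ranges(const struct vlan_opts *v, size_t n,
			      struct vlan_range *out)
{
	size_t nranges = 0, i;

	for (i = 0; i < n; i++) {
		if (nranges > 0) {
			struct vlan_range *last = &out[nranges - 1];

			if (v[i].vid == last->end + 1 &&
			    v[i].mcast_snooping == last->mcast_snooping) {
				last->end = v[i].vid;
				continue;
			}
		}
		/* Start a new range for this vlan. */
		out[nranges].start = v[i].vid;
		out[nranges].end = v[i].vid;
		out[nranges].mcast_snooping = v[i].mcast_snooping;
		nranges++;
	}
	return nranges;
}
```

Each resulting range would correspond to one notification entry instead of one per vlan.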
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: vlan: add support for dumping global vlan options · 743a53d9
      Nikolay Aleksandrov authored
      Add a new vlan options dump flag which causes only global vlan options
      to be dumped. The dumps are done only with bridge devices, ports are
      ignored. They support vlan compression if the options in sequential
      vlans are equal (currently always true).
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: vlan: add support for global options · 47ecd2db
      Nikolay Aleksandrov authored
      We can have two types of vlan options depending on context:
       - per-device vlan options (split in per-bridge and per-port)
       - global vlan options
      
      The second type wasn't supported in the bridge until now, but we need
      them for per-vlan multicast support, per-vlan STP support and other
      options which require global vlan context. They are contained in the global
      bridge vlan context even if the vlan is not configured on the bridge device
      itself. This patch adds initial netlink attributes and support for setting
      these global vlan options, they can only be set (RTM_NEWVLAN) and the
      operation must use the bridge device. Since there are no such options yet
      it shouldn't have any functional effect.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: include router port vlan id in notifications · 1e9ca456
      Nikolay Aleksandrov authored
      Use the port multicast context to check if the router port is a vlan and
      in case it is include its vlan id in the notification.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: add vlan querier and query support · 615cc23e
      Nikolay Aleksandrov authored
      Add basic vlan context querier support, if the contexts passed to
      multicast_alloc_query are vlan then the query will be tagged. Also
      handle querier start/stop of vlan contexts.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: check if should use vlan mcast ctx · 4cdd0d10
      Nikolay Aleksandrov authored
      Add helpers which check if the current bridge/port multicast context
      should be used (i.e. they're not disabled) and use them for Rx IGMP/MLD
      processing, timers and new group addition. It is important for vlans to
      disable processing of timer/packet after the multicast_lock is obtained
      if the vlan context doesn't have BR_VLFLAG_MCAST_ENABLED. There are two
      cases when that flag is missing:
       - if the vlan is getting destroyed it will be removed and timers will
         be stopped
       - if the vlan mcast snooping is being disabled
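
The flag check described above can be modelled in a few lines. This is a deliberately reduced sketch: `VLFLAG_MCAST_ENABLED`, `struct fake_vlan`, `struct fake_mcast_ctx` and `ctx_should_use()` are hypothetical stand-ins, locking is only mentioned in comments, and the real helpers take more state into account than this model does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BR_VLFLAG_MCAST_ENABLED. */
#define VLFLAG_MCAST_ENABLED 0x1u

struct fake_vlan {
	unsigned int priv_flags;
};

/* A multicast context; 'vlan' is NULL for bridge/port-level contexts. */
struct fake_mcast_ctx {
	const struct fake_vlan *vlan;
};

/*
 * Decide whether a context may be used for packet or timer processing.
 * In the scheme described above this check happens only after the
 * multicast lock has been taken, because the flag changes under it.
 */
static bool ctx_should_use(const struct fake_mcast_ctx *ctx)
{
	if (ctx->vlan == NULL)
		return true;    /* not a vlan context: no flag to check */

	/* Flag missing: vlan is being destroyed or snooping disabled. */
	return (ctx->vlan->priv_flags & VLFLAG_MCAST_ENABLED) != 0;
}
```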
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: use the port group to port context helper · eb1593a0
      Nikolay Aleksandrov authored
      We need to use the new port group to port context helper in places where
      we cannot pass down the proper context (i.e. functions that can be
      called by timers or outside the packet snooping paths).
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: add helper to get port mcast context from port group · 74edfd48
      Nikolay Aleksandrov authored
      Add br_multicast_pg_to_port_ctx() which returns the proper port multicast
      context from either port or vlan based on bridge option and vlan flags.
      As the comment inside explains the locking is a bit tricky, we rely on
      the fact that BR_VLFLAG_MCAST_ENABLED requires multicast_lock to change
      and we also require it to be held to call that helper. If we find the
      vlan under rcu and it still has the flag then we can be sure it will be
      alive until we unlock multicast_lock which should be enough.
      Note that the context might change from vlan to bridge between different
      calls to this helper as the mcast vlan knob requires only rtnl so it should
      be used carefully and for read-only/check purposes.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: add vlan mcast snooping knob · f4b7002a
      Nikolay Aleksandrov authored
      Add a global knob that controls if vlan multicast snooping is enabled.
      The proper contexts (vlan or bridge-wide) will be chosen based on the knob
      when processing packets and changing bridge device state. Note that
      vlans have their individual mcast snooping enabled by default, but this
      knob is needed to turn on bridge vlan snooping. It is disabled by
      default. To enable the knob vlan filtering must also be enabled, it
      doesn't make sense to have vlan mcast snooping without vlan filtering
      since that would lead to inconsistencies. Disabling vlan filtering will
      also automatically disable vlan mcast snooping.
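
The dependency between the two settings can be summarised with a tiny state model. `struct fake_bridge`, `set_mcast_vlan_snooping()` and `set_vlan_filtering()` are hypothetical names for this sketch and do not mirror the bridge's actual option handling code.

```c
#include <stdbool.h>

/* Hypothetical bridge state holding only the two related knobs. */
struct fake_bridge {
	bool vlan_filtering;
	bool mcast_vlan_snooping;   /* off by default */
};

/* Enabling vlan mcast snooping is refused without vlan filtering. */
static int set_mcast_vlan_snooping(struct fake_bridge *br, bool on)
{
	if (on && !br->vlan_filtering)
		return -1;
	br->mcast_vlan_snooping = on;
	return 0;
}

/* Disabling vlan filtering also turns vlan mcast snooping off. */
static void set_vlan_filtering(struct fake_bridge *br, bool on)
{
	br->vlan_filtering = on;
	if (!on)
		br->mcast_vlan_snooping = false;
}
```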
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: multicast: add vlan state initialization and control · 7b54aaaf
      Nikolay Aleksandrov authored
      Add helpers to enable/disable vlan multicast based on its flags, we need
      two flags because we need to know if the vlan has multicast enabled
      globally (user-controlled) and if it has it enabled on the specific device
      (bridge or port). The new private vlan flags are:
       - BR_VLFLAG_MCAST_ENABLED: locally enabled multicast on the device, used
         when removing a vlan, toggling vlan mcast snooping and controlling
         single vlan (kernel-controlled, valid under RTNL and multicast_lock)
       - BR_VLFLAG_GLOBAL_MCAST_ENABLED: globally enabled multicast for the
         vlan, used to control the bridge-wide vlan mcast snooping for a
         single vlan (user-controlled, can be checked under any context)
      
      Bridge vlan contexts are created with multicast snooping enabled by
      default to be in line with the current bridge snooping defaults. In
      order to actually activate per vlan snooping and context usage a
      bridge-wide knob will be added later which will default to disabled.
      If that knob is enabled then automatically all vlan snooping will be
      enabled. All vlan contexts are initialized with the current bridge
      multicast context defaults.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b54aaaf
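The two-flag scheme described in this commit can be pictured with a small standalone sketch. Only the two flag names come from the commit message; the bit values, the `vlan_sketch` struct, the helper names, and the rule that the local flag is "global flag set AND bridge-wide knob on" are assumptions made for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag names are from the commit message; the bit values are hypothetical. */
enum {
	BR_VLFLAG_MCAST_ENABLED        = 1u << 0, /* kernel-controlled, per device */
	BR_VLFLAG_GLOBAL_MCAST_ENABLED = 1u << 1, /* user-controlled, per vlan */
};

/* Hypothetical stand-in for a bridge vlan entry. */
struct vlan_sketch {
	uint16_t vid;
	uint8_t priv_flags;
};

/* User request: turn snooping for this vlan on or off globally. */
static inline void vlan_set_global_mcast(struct vlan_sketch *v, bool on)
{
	if (on)
		v->priv_flags |= BR_VLFLAG_GLOBAL_MCAST_ENABLED;
	else
		v->priv_flags &= (uint8_t)~BR_VLFLAG_GLOBAL_MCAST_ENABLED;
}

/*
 * Kernel-side decision (assumed rule for this sketch): the local flag is
 * set only when the user enabled the vlan globally and the bridge-wide
 * vlan snooping knob is on.
 */
static inline void vlan_update_local_mcast(struct vlan_sketch *v,
					   bool bridge_knob_on)
{
	bool want = bridge_knob_on &&
		    (v->priv_flags & BR_VLFLAG_GLOBAL_MCAST_ENABLED) != 0;

	if (want)
		v->priv_flags |= BR_VLFLAG_MCAST_ENABLED;
	else
		v->priv_flags &= (uint8_t)~BR_VLFLAG_MCAST_ENABLED;
}

/* True when multicast snooping is active for this vlan on this device. */
static inline bool vlan_mcast_active(const struct vlan_sketch *v)
{
	return (v->priv_flags & BR_VLFLAG_MCAST_ENABLED) != 0;
}
```

A vlan created with the global flag set stays inactive until `vlan_update_local_mcast()` is called with the knob on, which mirrors the "enabled by default, but gated by a bridge-wide knob" behaviour the message describes.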
    • Nikolay Aleksandrov's avatar
      net: bridge: vlan: add global and per-port multicast context · 613d61db
      Nikolay Aleksandrov authored
      Add global and per-port vlan multicast contexts, which are initialized
      but not yet used. No functional changes intended.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      613d61db
    • Nikolay Aleksandrov's avatar
      net: bridge: multicast: use multicast contexts instead of bridge or port · adc47037
      Nikolay Aleksandrov authored
      Pass multicast context pointers to multicast functions instead of bridge/port.
      This will make it easier to switch these contexts to their per-vlan
      versions later. The patch is basically search and replace; no functional changes.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      adc47037
    • Nikolay Aleksandrov's avatar
      net: bridge: multicast: factor out bridge multicast context · d3d065c0
      Nikolay Aleksandrov authored
      Factor out the bridge's global multicast context into a separate
      structure which will later be used for per-vlan global context.
      No functional changes intended.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3d065c0
    • Nikolay Aleksandrov's avatar
      net: bridge: multicast: factor out port multicast context · 9632233e
      Nikolay Aleksandrov authored
      Factor out the port's multicast context into a separate structure which
      will later be shared for per-port,vlan context. No functional changes
      intended. We need the structure even if bridge multicast is not defined
      so that it can be passed down as a pointer to forwarding functions.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9632233e
    • Kurt Kanzenbach's avatar
      Revert "igc: Export LEDs" · edd2e9d5
      Kurt Kanzenbach authored
      This reverts commit cf833182.
      
      There are better Linux interfaces to export the different LED modes
      and blinking reasons.
      
      Revert this patch for now and come up with a better solution later.
      Suggested-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
      Link: https://lore.kernel.org/r/20210719101640.16047-1-kurt@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      edd2e9d5
    • Eric Dumazet's avatar
      net/tcp_fastopen: remove tcp_fastopen_ctx_lock · e93abb84
      Eric Dumazet authored
      Remove the (per netns) spinlock in favor of xchg() atomic operations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Link: https://lore.kernel.org/r/20210719101107.3203943-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e93abb84
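The idea of replacing a lock with an atomic exchange can be shown in userspace with C11 atomics. This is only a sketch of the general technique: `fastopen_ctx_sketch`, `current_ctx`, and `ctx_swap()` are hypothetical names, and `atomic_exchange()` stands in for the kernel's `xchg()`. The sketch leaves out how concurrent readers are protected while an old context is released.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a context object that gets replaced at runtime. */
struct fastopen_ctx_sketch {
	int key_id;
};

/* Shared pointer to the active context; starts out as NULL. */
static _Atomic(struct fastopen_ctx_sketch *) current_ctx;

/*
 * Publish new_ctx and return the previously published context in one
 * atomic step. No lock is needed because the swap itself is indivisible:
 * two concurrent callers can never both receive the same old pointer.
 */
static inline struct fastopen_ctx_sketch *
ctx_swap(struct fastopen_ctx_sketch *new_ctx)
{
	return atomic_exchange(&current_ctx, new_ctx);
}
```

Each caller becomes the sole owner of whatever pointer `ctx_swap()` returns, so it is the only one responsible for releasing it.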
    • Yajun Deng's avatar
      netlink: Deal with ESRCH error in nlmsg_notify() · fef773fc
      Yajun Deng authored
      Yonghong Song reports:
      The bpf selftest tc_bpf failed with latest bpf-next.
      The following is the command to run and the result:
      $ ./test_progs -n 132
      [   40.947571] bpf_testmod: loading out-of-tree module taints kernel.
      test_tc_bpf:PASS:test_tc_bpf__open_and_load 0 nsec
      test_tc_bpf:PASS:bpf_tc_hook_create(BPF_TC_INGRESS) 0 nsec
      test_tc_bpf:PASS:bpf_tc_hook_create invalid hook.attach_point 0 nsec
      test_tc_bpf_basic:PASS:bpf_obj_get_info_by_fd 0 nsec
      test_tc_bpf_basic:PASS:bpf_tc_attach 0 nsec
      test_tc_bpf_basic:PASS:handle set 0 nsec
      test_tc_bpf_basic:PASS:priority set 0 nsec
      test_tc_bpf_basic:PASS:prog_id set 0 nsec
      test_tc_bpf_basic:PASS:bpf_tc_attach replace mode 0 nsec
      test_tc_bpf_basic:PASS:bpf_tc_query 0 nsec
      test_tc_bpf_basic:PASS:handle set 0 nsec
      test_tc_bpf_basic:PASS:priority set 0 nsec
      test_tc_bpf_basic:PASS:prog_id set 0 nsec
      libbpf: Kernel error message: Failed to send filter delete notification
      test_tc_bpf_basic:FAIL:bpf_tc_detach unexpected error: -3 (errno 3)
      test_tc_bpf:FAIL:test_tc_internal ingress unexpected error: -3 (errno 3)
      
      The failure seems to be due to the commit
          cfdf0d9a ("rtnetlink: use nlmsg_notify() in rtnetlink_send()")
      
      Deal with ESRCH error in nlmsg_notify() even if the report variable is zero.
      Reported-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Link: https://lore.kernel.org/r/20210719051816.11762-1-yajun.deng@linux.dev
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fef773fc
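The error handling this commit describes can be modelled in a few lines. The sketch below is based only on the commit description, not on the kernel source: `notify_result_sketch()` and its parameters are hypothetical, with `mcast_err` and `ucast_err` standing in for the results of the multicast and unicast sends. The point is that `-ESRCH` (errno 3 in the log above, meaning no listener was found) is cleared whether or not `report` is set.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Simplified model of a notify path: an optional multicast to a group,
 * followed by an optional unicast back to the requester when report is set.
 */
static inline int notify_result_sketch(bool has_group, int mcast_err,
				       bool report, int ucast_err)
{
	int err = 0;

	if (has_group) {
		err = mcast_err;
		/* Nobody listening on the group is not a failure. */
		if (err == -ESRCH)
			err = 0;
	}

	/* The unicast result matters only if the multicast step did not fail. */
	if (report && !err)
		err = ucast_err;

	return err;
}
```

With `report` equal to zero and no multicast listeners, the function returns 0 rather than `-ESRCH`, which is the case the failing selftest exercised.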
  2. 18 Jul, 2021 1 commit
  3. 17 Jul, 2021 7 commits
    • Liu Jian's avatar
      igmp: Add ip_mc_list lock in ip_check_mc_rcu · 23d2b940
      Liu Jian authored
      I got the panic below when running a fuzz test:
      
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 4056 Comm: syz-executor.3 Tainted: G    B             5.14.0-rc1-00195-gcff5c4254439-dirty #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      Call Trace:
      dump_stack_lvl+0x7a/0x9b
      panic+0x2cd/0x5af
      end_report.cold+0x5a/0x5a
      kasan_report+0xec/0x110
      ip_check_mc_rcu+0x556/0x5d0
      __mkroute_output+0x895/0x1740
      ip_route_output_key_hash_rcu+0x2d0/0x1050
      ip_route_output_key_hash+0x182/0x2e0
      ip_route_output_flow+0x28/0x130
      udp_sendmsg+0x165d/0x2280
      udpv6_sendmsg+0x121e/0x24f0
      inet6_sendmsg+0xf7/0x140
      sock_sendmsg+0xe9/0x180
      ____sys_sendmsg+0x2b8/0x7a0
      ___sys_sendmsg+0xf0/0x160
      __sys_sendmmsg+0x17e/0x3c0
      __x64_sys_sendmmsg+0x9e/0x100
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x462eb9
      Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8
       48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48>
       3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f3df5af1c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9
      RDX: 0000000000000312 RSI: 0000000020001700 RDI: 0000000000000007
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3df5af26bc
      R13: 00000000004c372d R14: 0000000000700b10 R15: 00000000ffffffff
      
      This is a use-after-free in ip_check_mc_rcu.
      In ip_mc_del_src, the ip_sf_list of pmc is freed under the protection of pmc->lock,
      but the access to ip_sf_list in ip_check_mc_rcu is not protected by the lock.
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23d2b940
    • David S. Miller's avatar
      Merge branch 'vmxnet3-version-6' · ab0441b4
      David S. Miller authored
      Ronak Doshi says:
      
      ====================
      vmxnet3: upgrade to version 6
      
      vmxnet3 emulation has recently added several new features, including an
      increase in the number of supported queues, removal of the power-of-2
      limitation on queues, RSS for ESP IPv6, etc. This patch series extends
      the vmxnet3 driver to leverage these new features.
      
      Compatibility is maintained using the existing vmxnet3 versioning
      mechanism as follows:
      - new features added to vmxnet3 emulation are associated with a new
        vmxnet3 version, viz. vmxnet3 version 6.
      - emulation advertises all the versions it supports to the driver.
      - during initialization, the vmxnet3 driver picks the highest version
        number supported by both the emulation and the driver and configures
        emulation to run at that version.
      
      In particular, the following changes are introduced:
      
      Patch 1:
        This patch introduces utility macros for vmxnet3 version 6 comparison
        and updates Copyright information.
      
      Patch 2:
        This patch adds support to increase maximum Tx/Rx queues from 8 to 32.
      
      Patch 3:
        This patch removes the limitation of power of 2 on the queues.
      
      Patch 4:
        Uses existing get_rss_hash_opts and set_rss_hash_opts methods to add
        support for ESP IPv6 RSS.
      
      Patch 5:
        This patch reports correct RSS hash type based on the type of RSS
        performed.
      
      Patch 6:
        This patch updates maximum configurable mtu to 9190.
      
      Patch 7:
        With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
        this patch lets the driver configure emulation to run at vmxnet3
        version 6.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab0441b4
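The version selection described in the cover letter (pick the highest version supported by both sides) can be sketched as a bitmask intersection. The encoding is an assumption of this sketch: bit `n - 1` set means "version `n` is supported", and `pick_version_sketch()` is a hypothetical helper, not a driver function.

```c
#include <stdint.h>

/*
 * Return the highest version number present in both masks, or 0 when the
 * two sides have no version in common. Bit (n - 1) represents version n.
 */
static inline int pick_version_sketch(uint32_t emu_mask, uint32_t drv_mask)
{
	uint32_t common = emu_mask & drv_mask;
	int ver;

	/* Scan from the highest possible version down to version 1. */
	for (ver = 32; ver >= 1; ver--) {
		if (common & (UINT32_C(1) << (ver - 1)))
			return ver;
	}

	return 0;
}
```

For example, an emulation advertising versions 1 to 4 (`0x0f`) paired with a driver that knows versions 1 to 6 (`0x3f`) would settle on version 4, so a newer driver keeps working on older emulation.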
    • Ronak Doshi's avatar
      vmxnet3: update to version 6 · ce2639ad
      Ronak Doshi authored
      With all vmxnet3 version 6 changes incorporated in the vmxnet3 driver,
      the driver can configure emulation to run at vmxnet3 version 6, provided
      the emulation advertises support for version 6.
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Guolin Yang <gyang@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce2639ad
    • Ronak Doshi's avatar
      vmxnet3: increase maximum configurable mtu to 9190 · 8c5663e4
      Ronak Doshi authored
      This patch increases the maximum configurable mtu to 9190
      to accommodate jumbo packets of overlay traffic.
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Guolin Yang <gyang@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c5663e4
    • Ronak Doshi's avatar
      vmxnet3: set correct hash type based on rss information · b3973bb4
      Ronak Doshi authored
      As vmxnet3 supports IP/TCP/UDP RSS, this patch sets the appropriate
      hash type based on the type of RSS performed.
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Guolin Yang <gyang@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b3973bb4
    • Ronak Doshi's avatar
      vmxnet3: add support for ESP IPv6 RSS · 79d124bb
      Ronak Doshi authored
      Vmxnet3 version 4 added support for ESP RSS. However, only IPv4 was
      supported. With vmxnet3 version 6, this patch enables RSS for ESP
      IPv6 packets as well.
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Guolin Yang <gyang@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79d124bb
    • Ronak Doshi's avatar
      vmxnet3: remove power of 2 limitation on the queues · 15ccf2f4
      Ronak Doshi authored
      With version 6, vmxnet3 relaxes the restriction that the number of
      queues be a power of two. This is helpful in cases (Edge VM) where there
      are fewer than 8 vcpus and the device requires more than 4 queues.
      Signed-off-by: Ronak Doshi <doshir@vmware.com>
      Acked-by: Guolin Yang <gyang@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15ccf2f4
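The effect of dropping the power-of-two restriction can be illustrated with a small calculation. The helpers below are hypothetical and only model the rule stated in the commit message: before version 6 the queue count is rounded down to a power of two, from version 6 on it is used as requested, and in both cases it is capped by a caller-supplied maximum.

```c
/* Largest power of two that is <= n (returns 1 for n == 0 or n == 1). */
static inline unsigned int rounddown_pow2_sketch(unsigned int n)
{
	unsigned int p = 1;

	while (p <= n / 2)
		p *= 2;

	return p;
}

/*
 * Number of queues actually used for a requested count, given the device
 * maximum and the negotiated version (assumed rule for this sketch).
 */
static inline unsigned int usable_queues_sketch(unsigned int wanted,
						unsigned int max_queues,
						int version)
{
	unsigned int n = wanted < max_queues ? wanted : max_queues;

	if (n == 0)
		n = 1; /* always keep at least one queue */

	if (version < 6)
		n = rounddown_pow2_sketch(n);

	return n;
}
```

Under these assumptions, a VM with 6 vcpus that requests 6 queues ends up with 4 queues before version 6 and with all 6 queues at version 6, which is the Edge VM case the message mentions.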