1. 12 Dec, 2022 40 commits
    • Merge branch 'bridge-mcast-extensions-for-evpn' · 8150f0cf
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      bridge: mcast: Extensions for EVPN
      
      tl;dr
      =====
      
      This patchset creates feature parity between user space and the kernel
      and allows the former to install and replace MDB port group entries with
      a source list and associated filter mode. This is required for EVPN use
      cases where multicast state is not derived from snooped IGMP/MLD
      packets, but instead derived from EVPN routes exchanged by the control
      plane in user space.
      
      Background
      ==========
      
      IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols
      in that they add support for source-specific multicast. That is, hosts
      can advertise interest in listening to a particular multicast address
      only from specific source addresses or from all sources except for
      specific source addresses.
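
      As an illustration of the two filter modes described above, here is
      a minimal Python sketch of how a listener's state could be checked
      against the source address of a packet. The names FilterMode and
      listener_wants are hypothetical and do not correspond to kernel
      symbols.

```python
from enum import Enum


class FilterMode(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude"


def listener_wants(filter_mode, source_list, src):
    """Return True if the listener state accepts traffic sent by src."""
    listed = src in source_list
    # INCLUDE: only the listed sources are wanted.
    # EXCLUDE: every source except the listed ones is wanted.
    return listed if filter_mode is FilterMode.INCLUDE else not listed


sources = {"192.0.2.1", "192.0.2.2"}
for mode in FilterMode:
    for src in ("192.0.2.1", "192.0.2.99"):
        print(mode.value, src, listener_wants(mode, sources, src))
```

      An EXCLUDE state with an empty source list accepts every source,
      which is the only state IGMPv2/MLDv1 can express.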
      
      In kernel 5.10 [3][4], the bridge driver gained the ability to snoop
      IGMPv3/MLDv2 packets and install corresponding MDB port group entries.
      For example, a snooped IGMPv3 Membership Report that contains a single
      MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1,
      192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these
      entries:
      
       # bridge -d mdb show
       dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel  blocked
       dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel  blocked
       dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel  blocked
       dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel  blocked
       dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel
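
      The relationship between the (*, G) entry and the per-source rows in
      the output above can be modelled with a short Python sketch. It is a
      simplification that only covers the case shown here, where every
      listed source of an EXCLUDE entry has a zero source timer and is
      therefore blocked. The function name and the dict layout are
      hypothetical.

```python
def expand_port_group(group, filter_mode, sources):
    """Return the rows a detailed MDB dump would list, as plain dicts."""
    # Assumption: in EXCLUDE mode every listed source is blocked,
    # in INCLUDE mode every listed source is forwarded.
    blocked = filter_mode == "exclude"
    rows = [
        {"grp": group, "src": src, "filter_mode": "include", "blocked": blocked}
        for src in sources
    ]
    # The (*, G) row carries the filter mode and the full source list.
    rows.append({"grp": group, "src": None, "filter_mode": filter_mode,
                 "source_list": list(sources)})
    return rows


rows = expand_port_group(
    "239.10.10.10", "exclude",
    ["192.0.2.21", "192.0.2.20", "192.0.2.2", "192.0.2.1"],
)
for row in rows:
    print(row)
```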
      
      While the kernel can install and replace entries with a filter mode and
      source list, user space cannot. It can only add EXCLUDE entries with an
      empty source list, which is sufficient for IGMPv2/MLDv1, but not for
      IGMPv3/MLDv2.
      
      Use cases where the multicast state is not derived from snooped packets,
      but instead derived from routes exchanged by the user space control
      plane require feature parity between user space and the kernel in terms
      of MDB configuration. Such a use case is detailed in the next section.
      
      Motivation
      ==========
      
      RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that
      allows NVE switches in the EVPN network to advertise and learn
      reachability information for unicast MAC addresses. Traffic destined to
      a unicast MAC address can therefore be selectively forwarded to a single
      NVE switch behind which the MAC is located.
      
      The same is not true for IP multicast traffic. Such traffic is simply
      flooded as BUM to all NVE switches in the broadcast domain (BD),
      regardless of whether a switch has interested receivers for the
      multicast stream. This is especially problematic for overlay networks
      that make heavy use of multicast.
      
      The issue is addressed by RFC 9251 [7] that defines a "Selective
      Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in
      the EVPN network to advertise multicast streams that they are interested
      in. This is done by having each switch suppress IGMP/MLD packets from
      being transmitted to the NVE network and instead communicate the
      information over BGP to other switches.
      
      As far as the bridge driver is concerned, the above means that the
      multicast state (i.e., {multicast address, group timer, filter-mode,
      (source records)}) for the VXLAN bridge port is not populated by the
      kernel from snooped IGMP/MLD packets (they are suppressed), but instead
      by user space. Specifically, by the routing daemon that is exchanging
      EVPN routes with other NVE switches.
      
      Changes are obviously also required in the VXLAN driver, but they are
      the subject of future patchsets. See the "Future work" section.
      
      Implementation
      ==============
      
      The user interface is extended to allow user space to specify the filter
      mode of the MDB port group entry and its source list. Replace support is
      also added so that user space would not need to remove an entry and
      re-add it only to edit its source list or filter mode, as that would
      result in packet loss. Example usage:
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \
      	source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00
      
      The netlink interface is extended with a few new attributes in the
      RTM_NEWMDB request message:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_SET_ENTRY ]
      	struct br_mdb_entry
      [ MDBA_SET_ENTRY_ATTRS ]
      	[ MDBE_ATTR_SOURCE ]
      		struct in_addr / struct in6_addr
      	[ MDBE_ATTR_SRC_LIST ]		// new
      		[ MDBE_SRC_LIST_ENTRY ]
      			[ MDBE_SRCATTR_ADDRESS ]
      				struct in_addr / struct in6_addr
      		[ ...]
      	[ MDBE_ATTR_GROUP_MODE ]	// new
      		u8
      	[ MDBE_ATTR_RTPROT ]		// new
      		u8
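
      To make the nesting above concrete, the following Python sketch packs
      the MDBA_SET_ENTRY_ATTRS part of a request for IPv4 sources. It is
      not a complete RTM_NEWMDB message: the netlink header, struct
      br_port_msg and MDBA_SET_ENTRY are left out. The numeric attribute
      values are assumptions taken from memory of the uapi headers and
      should be checked against include/uapi/linux/if_bridge.h; the helper
      names nla(), nested() and src_list() are hypothetical.

```python
import socket
import struct

# Assumed numeric values; verify against the kernel uapi headers.
MDBA_SET_ENTRY_ATTRS = 2
MDBE_ATTR_SRC_LIST = 2
MDBE_ATTR_GROUP_MODE = 3
MDBE_ATTR_RTPROT = 4
MDBE_SRC_LIST_ENTRY = 1
MDBE_SRCATTR_ADDRESS = 1
MCAST_EXCLUDE = 0
RTPROT_ZEBRA = 11
NLA_F_NESTED = 0x8000


def nla(attr_type, payload):
    """Encode one netlink attribute: u16 length, u16 type, padded payload."""
    length = 4 + len(payload)        # the length field excludes padding
    pad = (-length) % 4              # attributes are aligned to 4 bytes
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def nested(attr_type, *children):
    """Encode a nested attribute whose payload is other attributes."""
    return nla(attr_type | NLA_F_NESTED, b"".join(children))


def src_list(addrs):
    """Encode MDBE_ATTR_SRC_LIST with one entry per IPv4 address."""
    entries = [
        nested(MDBE_SRC_LIST_ENTRY,
               nla(MDBE_SRCATTR_ADDRESS, socket.inet_aton(addr)))
        for addr in addrs
    ]
    return nested(MDBE_ATTR_SRC_LIST, *entries)


attrs = nested(
    MDBA_SET_ENTRY_ATTRS,
    src_list(["192.0.2.1", "192.0.2.3"]),
    nla(MDBE_ATTR_GROUP_MODE, struct.pack("=B", MCAST_EXCLUDE)),
    nla(MDBE_ATTR_RTPROT, struct.pack("=B", RTPROT_ZEBRA)),
)
print(len(attrs), attrs.hex())
```

      Netlink uses host byte order for attribute headers, hence the "="
      format prefix; the addresses themselves stay in network byte order.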
      
      No changes are required in RTM_NEWMDB responses and notifications, as
      all the information can already be dumped by the kernel today.
      
      Testing
      =======
      
      Tested with existing bridge multicast selftests: bridge_igmp.sh,
      bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh,
      bridge_vlan_mcast.sh.
      
      In addition, added many new test cases for existing as well as for new
      MDB functionality.
      
      Patchset overview
      =================
      
      Patches #1-#8 are non-functional preparations for the core changes in
      later patches.
      
      Patches #9-#10 allow user space to install (*, G) entries with a source
      list and associated filter mode. Specifically, patch #9 adds the
      necessary kernel plumbing and patch #10 exposes the new functionality to
      user space via a few new attributes.
      
      Patch #11 allows user space to specify the routing protocol of new MDB
      port group entries so that a routing daemon could differentiate between
      entries installed by it and those installed by an administrator.
      
      Patch #12 allows user space to replace MDB port group entries. This is
      useful, for example, when user space wants to add a new source to a
      source list. Instead of deleting a (*, G) entry and re-adding it with an
      extended source list (which would result in packet loss), user space can
      simply replace the current entry.
      
      Patches #13-#14 add tests for existing MDB functionality as well as for
      all new functionality added in this patchset.
      
      Future work
      ===========
      
      The VXLAN driver will need to be extended with an MDB so that it could
      selectively forward IP multicast traffic to NVE switches with interested
      receivers instead of simply flooding it to all switches as BUM.
      
      The idea is to reuse the existing MDB interface for the VXLAN driver in
      a similar way to how the FDB interface is shared between the bridge and
      VXLAN drivers.
      
      From a command line perspective, configuration will look as follows:
      
       # bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \
      	filter_mode exclude source_list 198.50.100.1,198.50.100.2
      
       # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
      	filter_mode include source_list 198.50.100.3,198.50.100.4 \
      	dst 192.0.2.1 dst_port 4789 src_vni 2
      
       # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
      	filter_mode exclude source_list 198.50.100.1,198.50.100.2 \
      	dst 192.0.2.2 dst_port 4789 src_vni 2
      
      Where the first command is enabled by this set, but the next two will be
      the subject of future work.
      
      From a netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages will
      be extended to the VXLAN driver. This means that a few new attributes
      will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for
      these messages will need to move to net/core/rtnetlink.c. The rtnetlink
      code will call into the appropriate driver based on the ifindex
      specified in the ancillary header.
      
      iproute2 patches can be found here [9].
      
      Changelog
      =========
      
      Since v1 [10]:
      
      * Patch #12: Remove extack from br_mdb_replace_group_sg().
      * Patch #12: Change 'nlflags' to u16 and move it after 'filter_mode' to
        pack the structure.
      
      Since RFC [11]:
      
      * Patch #6: New patch.
      * Patch #9: Use an array instead of a list to store source entries.
      * Patch #10: Use an array instead of a list to store source entries.
      * Patch #10: Drop br_mdb_config_attrs_fini().
      * Patch #11: Reject protocol for host entries.
      * Patch #13: New patch.
      * Patch #14: New patch.
      
      [1] https://datatracker.ietf.org/doc/html/rfc3376
      [2] https://www.rfc-editor.org/rfc/rfc3810
      [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8
      [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87
      [5] https://datatracker.ietf.org/doc/html/rfc7432
      [6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
      [7] https://datatracker.ietf.org/doc/html/rfc9251
      [8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
      [9] https://github.com/idosch/iproute2/commits/submit/mdb_v1
      [10] https://lore.kernel.org/netdev/20221208152839.1016350-1-idosch@nvidia.com/
      [11] https://lore.kernel.org/netdev/20221018120420.561846-1-idosch@nvidia.com/
      ====================
      
      Link: https://lore.kernel.org/r/20221210145633.1328511-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: forwarding: Add bridge MDB test · b6d00da0
      Ido Schimmel authored
      Add a selftest that includes the following test cases:
      
      1. Configuration tests. Both valid and invalid configurations are
         tested across all entry types (e.g., L2, IPv4).
      
      2. Forwarding tests. Both host and port group entries are tested across
         all entry types.
      
      3. Interaction between user installed MDB entries and IGMP / MLD control
         packets.
      
      Example output:
      
      INFO: # Host entries configuration tests
      TEST: Common host entries configuration tests (IPv4)                [ OK ]
      TEST: Common host entries configuration tests (IPv6)                [ OK ]
      TEST: Common host entries configuration tests (L2)                  [ OK ]
      
      INFO: # Port group entries configuration tests - (*, G)
      TEST: Common port group entries configuration tests (IPv4 (*, G))   [ OK ]
      TEST: Common port group entries configuration tests (IPv6 (*, G))   [ OK ]
      TEST: IPv4 (*, G) port group entries configuration tests            [ OK ]
      TEST: IPv6 (*, G) port group entries configuration tests            [ OK ]
      
      INFO: # Port group entries configuration tests - (S, G)
      TEST: Common port group entries configuration tests (IPv4 (S, G))   [ OK ]
      TEST: Common port group entries configuration tests (IPv6 (S, G))   [ OK ]
      TEST: IPv4 (S, G) port group entries configuration tests            [ OK ]
      TEST: IPv6 (S, G) port group entries configuration tests            [ OK ]
      
      INFO: # Port group entries configuration tests - L2
      TEST: Common port group entries configuration tests (L2 (*, G))     [ OK ]
      TEST: L2 (*, G) port group entries configuration tests              [ OK ]
      
      INFO: # Forwarding tests
      TEST: IPv4 host entries forwarding tests                            [ OK ]
      TEST: IPv6 host entries forwarding tests                            [ OK ]
      TEST: L2 host entries forwarding tests                              [ OK ]
      TEST: IPv4 port group "exclude" entries forwarding tests            [ OK ]
      TEST: IPv6 port group "exclude" entries forwarding tests            [ OK ]
      TEST: IPv4 port group "include" entries forwarding tests            [ OK ]
      TEST: IPv6 port group "include" entries forwarding tests            [ OK ]
      TEST: L2 port entries forwarding tests                              [ OK ]
      
      INFO: # Control packets tests
      TEST: IGMPv3 MODE_IS_INCLUE tests                                   [ OK ]
      TEST: MLDv2 MODE_IS_INCLUDE tests                                   [ OK ]
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: forwarding: Rename bridge_mdb test · f9923a67
      Ido Schimmel authored
      The test is only concerned with host MDB entries and not with MDB
      entries as a whole. Rename the test to reflect that.
      
      Subsequent patches will add a more general test that will contain the
      test cases for host MDB entries and remove the current test.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Support replacement of MDB port group entries · 61f21835
      Ido Schimmel authored
      Now that user space can specify additional attributes of port group
      entries such as filter mode and source list, it makes sense to allow
      user space to atomically modify these attributes by replacing entries
      instead of forcing user space to delete the entries and add them back.
      
      Replace MDB port group entries when the 'NLM_F_REPLACE' flag is
      specified in the netlink message header.
      
      When a (*, G) entry is replaced, update the following attributes: Source
      list, state, filter mode, protocol and flags. If the entry is temporary
      and in EXCLUDE mode, reset the group timer to the group membership
      interval. If the entry is temporary and in INCLUDE mode, reset the
      source timers of associated sources to the group membership interval.
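
      The replace rules above can be sketched in Python as follows. This is
      a simplified model, not the kernel implementation: the entry state is
      reduced to a 'permanent' boolean, flags are omitted, timers that the
      text does not mention are assumed to be left at zero, and the
      260 second group membership interval is an assumed default. All names
      are hypothetical.

```python
from dataclasses import dataclass, field

GROUP_MEMBERSHIP_INTERVAL = 260.0  # seconds; assumed default value


@dataclass
class StarG:
    filter_mode: str = "exclude"
    permanent: bool = True
    protocol: str = "static"
    sources: list = field(default_factory=list)
    group_timer: float = 0.0
    source_timers: dict = field(default_factory=dict)


def replace_star_g(entry, filter_mode, permanent, protocol, sources,
                   interval=GROUP_MEMBERSHIP_INTERVAL):
    """Overwrite the attributes of an existing (*, G) entry in place."""
    entry.filter_mode = filter_mode
    entry.permanent = permanent
    entry.protocol = protocol
    entry.sources = list(sources)
    # Assumption: timers not mentioned by the rules stay at zero.
    entry.group_timer = 0.0
    entry.source_timers = {src: 0.0 for src in sources}
    if not permanent:
        if filter_mode == "exclude":
            # Temporary EXCLUDE entry: reset the group timer.
            entry.group_timer = interval
        else:
            # Temporary INCLUDE entry: reset the source timers.
            entry.source_timers = {src: interval for src in sources}
    return entry


entry = StarG()
replace_star_g(entry, "include", False, "bgp", ["192.0.2.4", "192.0.2.3"])
print(entry)
```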
      
      Examples:
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static     0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static     0.00
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static     0.00
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp     0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp     0.00
       dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp     0.00
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Allow user space to specify MDB entry routing protocol · 1d7b66a7
      Ido Schimmel authored
      Add the 'MDBE_ATTR_RTPROT' attribute to allow user space to specify the
      routing protocol of the MDB port group entry. Enforce a minimum value of
      'RTPROT_STATIC' to prevent user space from using protocol values that
      should only be set by the kernel (e.g., 'RTPROT_KERNEL'). Maintain
      backward compatibility by defaulting to 'RTPROT_STATIC'.
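
      The validation and defaulting described above can be sketched in
      Python as follows. The function name resolve_protocol is
      hypothetical, the numeric RTPROT_* values are assumptions to be
      checked against include/uapi/linux/rtnetlink.h, and the upper bound
      simply reflects that the attribute is a u8.

```python
# Assumed numeric values; verify against the kernel uapi headers.
RTPROT_KERNEL = 2
RTPROT_STATIC = 4
RTPROT_ZEBRA = 11


def resolve_protocol(requested=None):
    """Return the protocol to store for a new MDB port group entry."""
    if requested is None:
        # Attribute absent: keep the behavior of older user space.
        return RTPROT_STATIC
    if not (RTPROT_STATIC <= requested <= 255):
        # Values below RTPROT_STATIC are reserved for the kernel.
        raise ValueError("integer out of range")
    return requested


print(resolve_protocol())              # default
print(resolve_protocol(RTPROT_ZEBRA))  # accepted as is
try:
    resolve_protocol(RTPROT_KERNEL)
except ValueError as err:
    print("Error:", err)
```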
      
      The protocol is already visible to user space in RTM_NEWMDB responses
      and notifications via the 'MDBA_MDB_EATTR_RTPROT' attribute.
      
      The routing protocol allows a routing daemon to distinguish between
      entries configured by it and those configured by the administrator. Once
      MDB flush is supported, the protocol can be used as a criterion
      according to which the flush is performed.
      
      Examples:
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto kernel
       Error: integer out of range.
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto static
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent proto zebra
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent source_list 198.51.100.1,198.51.100.2 filter_mode include proto 250
      
       # bridge -d mdb show
       dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.2 permanent filter_mode include proto 250
       dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.1 permanent filter_mode include proto 250
       dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode include source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto 250
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto static
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Allow user space to add (*, G) with a source list and filter mode · 6afaae6d
      Ido Schimmel authored
      Add new netlink attributes to the RTM_NEWMDB request that allow user
      space to add (*, G) with a source list and filter mode.
      
      The RTM_NEWMDB message can already dump such entries (created by the
      kernel) so there is no need to add dump support. However, the message
      contains a different set of attributes depending on whether it is a
      request or a response. The naming and structure of the new attributes
      try to follow the existing ones used in the response.
      
      Request:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_SET_ENTRY ]
      	struct br_mdb_entry
      [ MDBA_SET_ENTRY_ATTRS ]
      	[ MDBE_ATTR_SOURCE ]
      		struct in_addr / struct in6_addr
      	[ MDBE_ATTR_SRC_LIST ]		// new
      		[ MDBE_SRC_LIST_ENTRY ]
      			[ MDBE_SRCATTR_ADDRESS ]
      				struct in_addr / struct in6_addr
      		[ ...]
      	[ MDBE_ATTR_GROUP_MODE ]	// new
      		u8
      
      Response:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_MDB ]
      	[ MDBA_MDB_ENTRY ]
      		[ MDBA_MDB_ENTRY_INFO ]
      			struct br_mdb_entry
      		[ MDBA_MDB_EATTR_TIMER ]
      			u32
      		[ MDBA_MDB_EATTR_SOURCE ]
      			struct in_addr / struct in6_addr
      		[ MDBA_MDB_EATTR_RTPROT ]
      			u8
      		[ MDBA_MDB_EATTR_SRC_LIST ]
      			[ MDBA_MDB_SRCLIST_ENTRY ]
      				[ MDBA_MDB_SRCATTR_ADDRESS ]
      					struct in_addr / struct in6_addr
      				[ MDBA_MDB_SRCATTR_TIMER ]
      					u32
      			[...]
      		[ MDBA_MDB_EATTR_GROUP_MODE ]
      			u8
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Add support for (*, G) with a source list and filter mode · b1c8fec8
      Ido Schimmel authored
      In preparation for allowing user space to add (*, G) entries with a
      source list and associated filter mode, add the necessary plumbing to
      handle such requests.
      
      Extend the MDB configuration structure with a currently empty source
      array and filter mode that is currently hard coded to EXCLUDE.
      
      Add the source entries and the corresponding (S, G) entries before
      making the new (*, G) port group entry visible to the data path.
      
      Handle the creation of each source entry in a similar fashion to how it
      is created from the data path in response to received Membership
      Reports: Create the source entry, arm the source timer (if needed), add
      a corresponding (S, G) forwarding entry and finally mark the source
      entry as installed (by user space).
      
      Add the (S, G) entry by populating an MDB configuration structure and
      calling br_mdb_add_group_sg() as if a new entry is created by user
      space, with the sole difference that the 'src_entry' field is set to
      make sure that the group timer of such entries is never armed.
      
      Note that it is not currently possible to add more than 32 source
      entries to a port group entry. If this proves to be a problem we can
      either increase 'PG_SRC_ENT_LIMIT' or avoid forcing a limit on entries
      created by user space.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source · 079afd66
      Ido Schimmel authored
      User space will soon be able to install a (*, G) with a source list,
      prompting the creation of a (S, G) entry for each source.
      
      In this case, the group timer of the (S, G) entry should never be set.
      
      Solve this by adding a new field to the MDB configuration structure that
      denotes whether the (S, G) corresponds to a source or not.
      
      The field will be set in a subsequent patch where br_mdb_add_group_sg()
      is called in order to create a (S, G) entry for each user provided
      source.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Add a flag for user installed source entries · a01ecb17
      Ido Schimmel authored
      There are a few places where the bridge driver differentiates between
      (S, G) entries installed by the kernel (in response to Membership
      Reports) and those installed by user space. One of them is when deleting
      an (S, G) entry corresponding to a source entry that is being deleted.
      
      While user space cannot currently add a source entry to a (*, G), it can
      add an (S, G) entry that later corresponds to a source entry created by
      the reception of a Membership Report. If this source entry is later
      deleted because its source timer expired or because the (*, G) entry is
      being deleted, the bridge driver will not delete the corresponding (S,
      G) entry if it was added by user space as permanent.
      
      This is going to be a problem when the ability to install a (*, G) with
      a source list is exposed to user space. In this case, when user space
      installs the (*, G) as permanent, then all the (S, G) entries
      corresponding to its source list will also be installed as permanent.
      When user space deletes the (*, G), all the source entries will be
      deleted and the expectation is that the corresponding (S, G) entries
      will be deleted as well.
      
      Solve this by introducing a new source entry flag denoting that the
      entry was installed by user space. When the entry is deleted, delete the
      corresponding (S, G) entry even if it was installed by user space as
      permanent, as the flag tells us that it was installed in response to the
      source entry being created.
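
      The resulting decision can be summarized with a small Python sketch.
      It only restates the rule described above; the function and parameter
      names are hypothetical and the real check in the bridge driver looks
      at entry flags rather than booleans.

```python
def should_delete_sg(sg_added_by_user, sg_permanent, src_user_installed):
    """Decide whether the (S, G) entry goes away with its source entry."""
    if src_user_installed:
        # The (S, G) entry only exists because user space installed the
        # source entry, so it is removed together with it.
        return True
    # Otherwise keep (S, G) entries that user space added as permanent.
    return not (sg_added_by_user and sg_permanent)


# Source entry learned from a Membership Report, (S, G) added by the kernel.
print(should_delete_sg(False, False, False))
# Same source entry, but user space added a permanent (S, G) beforehand.
print(should_delete_sg(True, True, False))
# Source entry installed by user space as part of a permanent (*, G).
print(should_delete_sg(True, True, True))
```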
      
      The flag will be set in a subsequent patch where source entries are
      created in response to user requests.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Expose __br_multicast_del_group_src() · 083e3534
      Ido Schimmel authored
      Expose __br_multicast_del_group_src() which is symmetric to
      br_multicast_new_group_src() and does not remove the installed (S, G)
      forwarding entry, unlike br_multicast_del_group_src().
      
      The function will be used in the error path when user space was able to
      add a new source entry, but failed to install a corresponding forwarding
      entry.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Expose br_multicast_new_group_src() · fd0c6961
      Ido Schimmel authored
      Currently, new group source entries are only created in response to
      received Membership Reports. Subsequent patches are going to allow user
      space to install (*, G) entries with a source list.
      
      As a preparatory step, expose br_multicast_new_group_src() so that it
      could later be invoked from the MDB code (i.e., br_mdb.c) that handles
      RTM_NEWMDB messages.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Add a centralized error path · 160dd931
      Ido Schimmel authored
      Subsequent patches will add memory allocations in br_mdb_config_init()
      as the MDB configuration structure will include a linked list of source
      entries. This memory will need to be freed regardless of whether
      br_mdb_add() succeeded or failed.
      
      As a preparation for this change, add a centralized error path where the
      memory will be freed.
      
      Note that br_mdb_del() already has one error path and therefore does not
      require any changes.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Place netlink policy before validation functions · 1870a2d3
      Ido Schimmel authored
      Subsequent patches are going to add additional validation functions and
      netlink policies. Some of these functions will need to perform parsing
      using nla_parse_nested() and the new policies.
      
      In order to keep all the policies next to each other, move the current
      policy to before the validation functions.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Split (*, G) and (S, G) addition into different functions · 6ff1e68e
      Ido Schimmel authored
      When the bridge is using IGMP version 3 or MLD version 2, it handles the
      addition of (*, G) and (S, G) entries differently.
      
      When a new (S, G) port group entry is added, all the (*, G) EXCLUDE
      ports need to be added to the port group of the new entry. Similarly,
      when a new (*, G) EXCLUDE port group entry is added, the port needs to
      be added to the port group of all the matching (S, G) entries.
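
      The two-way propagation described above can be sketched with a toy
      Python model. It only tracks which ports belong to each (S, G) port
      group and ignores timers, flags and VLANs; the class Mdb and its
      methods are hypothetical and unrelated to the kernel data structures.

```python
from collections import defaultdict


class Mdb:
    def __init__(self):
        self.star_g_exclude = defaultdict(set)  # group -> ports
        self.sg = defaultdict(set)              # (src, group) -> ports

    def add_star_g_exclude(self, group, port):
        """Add a (*, G) EXCLUDE port and push it to matching (S, G) entries."""
        self.star_g_exclude[group].add(port)
        for (_src, grp), ports in self.sg.items():
            if grp == group:
                ports.add(port)

    def add_sg(self, src, group, port):
        """Add an (S, G) port and pull in all (*, G) EXCLUDE ports."""
        ports = self.sg[(src, group)]
        ports.add(port)
        ports |= self.star_g_exclude.get(group, set())


mdb = Mdb()
mdb.add_star_g_exclude("239.1.1.1", "swp1")
mdb.add_sg("192.0.2.1", "239.1.1.1", "swp2")
mdb.add_star_g_exclude("239.1.1.1", "swp3")
print(sorted(mdb.sg[("192.0.2.1", "239.1.1.1")]))
```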
      
      Subsequent patches will create more differences between both entry
      types. Namely, filter mode and source list can only be specified for (*,
      G) entries.
      
      Given the current and future differences between both entry types,
      handle the addition of each entry type in a different function, thereby
      avoiding the creation of one complex function.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: mcast: Do not derive entry type from its filter mode · b63e3065
      Ido Schimmel authored
      Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries
      cannot be set from user space. Instead, it is set by the kernel
      according to the entry type: (*, G) entries are treated as EXCLUDE and
      (S, G) entries are treated as INCLUDE. This allows the kernel to derive
      the entry type from its filter mode.
      
      Subsequent patches will allow user space to set the filter mode of (*,
      G) entries, making the current assumption incorrect.
      
      As a preparation, remove the current assumption and instead determine
      the entry type from its key, which is a more direct way.
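
      The difference between the two approaches can be shown with a short
      Python sketch. MdbKey and both helper functions are hypothetical
      stand-ins for the kernel structures, with an unset source modelled as
      None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MdbKey:
    group: str
    src: Optional[str] = None  # None means no source, i.e. (*, G)


def is_star_g_by_mode(filter_mode):
    """Old assumption: only (*, G) entries are in EXCLUDE mode."""
    return filter_mode == "exclude"


def is_star_g(key):
    """Direct check: a (*, G) entry has no source in its key."""
    return key.src is None


# A (*, G) entry that user space installed in INCLUDE mode.
key = MdbKey(group="239.1.1.1")
print(is_star_g_by_mode("include"))  # misclassified by the old assumption
print(is_star_g(key))                # classified from the key
```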
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • qlcnic: Clean up some inconsistent indenting · 02abf84a
      Jiapeng Chong authored
      No functional modification involved.
      
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c:714 qlcnic_validate_ring_count() warn: inconsistent indenting.
      
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3419
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20221212055813.91154-1-jiapeng.chong@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Tirthendu Sarkar's avatar
      i40e: allow toggling loopback mode via ndo_set_features callback · b1746fba
      Tirthendu Sarkar authored
      Add support for NETIF_F_LOOPBACK. This feature can be set via:
      $ ethtool -K eth0 loopback <on|off>
      
      This sets the MAC Tx->Rx loopback.
      
      This feature is used for the xsk selftests, and might have other uses
      too.
      Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Tested-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20221209185553.2520088-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b1746fba
    • Jakub Kicinski's avatar
      Merge branch 'net-add-iff_no_addrconf-to-prevent-ipv6-addrconf' · 2a78dd22
      Jakub Kicinski authored
      Xin Long says:
      
      ====================
      net: add IFF_NO_ADDRCONF to prevent ipv6 addrconf
      
      This patchset adds the IFF_NO_ADDRCONF flag to dev->priv_flags
      to prevent ipv6 addrconf, as Jiri Pirko suggested.
      
      Patch 1 changes Bonding to use this flag instead of the IFF_SLAVE
      flag, and Patches 2 and 3 make Teaming and Net Failover set
      this flag before calling dev_open().
      ====================
      
      Link: https://lore.kernel.org/r/cover.1670599241.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2a78dd22
    • Xin Long's avatar
      net: failover: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf · cb54d392
      Xin Long authored
      As with Bonding and Team, net failover also needs to set
      IFF_NO_ADDRCONF in slave_dev->priv_flags to prevent ipv6
      addrconf on slave ports.
      
      Note that dev_open(slave_dev) is called in .slave_register,
      which is called after the IFF_NO_ADDRCONF flag is set in
      failover_slave_register().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cb54d392
    • Xin Long's avatar
      net: team: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf · 0aa64df3
      Xin Long authored
      This patch uses the IFF_NO_ADDRCONF flag to prevent ipv6 addrconf
      for Team ports. The flag is set in team_port_enter(), which
      is called before dev_open(), and cleared in team_port_leave(),
      which is called after dev_close() and in the error path of
      team_port_add().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0aa64df3
    • Xin Long's avatar
      net: add IFF_NO_ADDRCONF and use it in bonding to prevent ipv6 addrconf · 8a321cf7
      Xin Long authored
      Currently, bonding reuses the IFF_SLAVE flag, which ipv6 addrconf
      checks, to prevent ipv6 addrconf.
      
      However, it is not a proper flag for disabling ipv6 addrconf: bonding
      has to move the IFF_SLAVE flag setting ahead of dev_open() in
      bond_enslave(). Also, IFF_MASTER/SLAVE are historical flags used in
      bonding and eql; as Jiri mentioned, newer devices like Team and
      Failover do not use them.
      
      So, as Jiri suggested, this patch adds IFF_NO_ADDRCONF in priv_flags
      of the device to indicate no ipv6 addrconf, uses it in bonding,
      and moves the IFF_SLAVE flag setting back to its original place.
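
      The ordering matters because addrconf runs when the device is opened.
      The following is a rough Python model of that ordering, not kernel
      code: Dev, addrconf_on_open(), enslave() and release() are
      hypothetical names, and the bit value of IFF_NO_ADDRCONF here is made
      up for the sketch.

```python
IFF_NO_ADDRCONF = 1 << 0  # illustrative bit value, not the kernel's


class Dev:
    """Hypothetical stand-in for a net device."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.priv_flags = 0
        self.has_autoconf_addr = False


def addrconf_on_open(dev: Dev) -> None:
    """Model of ipv6 addrconf reacting to a device coming up."""
    if dev.priv_flags & IFF_NO_ADDRCONF:
        return  # the device opted out, so no address is configured
    dev.has_autoconf_addr = True


def enslave(dev: Dev) -> None:
    """Set the flag first, then open the device."""
    dev.priv_flags |= IFF_NO_ADDRCONF
    addrconf_on_open(dev)  # stands in for dev_open()


def release(dev: Dev) -> None:
    """Clear the flag once the device has left its master."""
    dev.priv_flags &= ~IFF_NO_ADDRCONF
```

      If the flag were set only after the open step, the device would
      already have been autoconfigured by the time the flag took effect.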
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a321cf7
    • Uladzislau Koshchanka's avatar
      lib: packing: replace bit_reverse() with bitrev8() · 1280d4b7
      Uladzislau Koshchanka authored
      Remove bit_reverse() function.  Instead use bitrev8() from linux/bitrev.h +
      bitshift.  Reduces code-repetition.
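
      As an illustration of the "bitrev8() + bitshift" idea, here is a
      Python model: reversing a field narrower than a byte can be done by
      reversing the whole byte and shifting the result back down. The
      function reverse_field() is a hypothetical helper for this sketch;
      the exact expression used in lib/packing.c is not quoted in this
      commit message.

```python
def bitrev8(byte: int) -> int:
    """Reverse the bit order of one byte (a Python model of the kernel helper)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result


def reverse_field(value: int, width: int) -> int:
    """Reverse the low `width` bits of `value`, for 1 <= width <= 8."""
    if not 1 <= width <= 8:
        raise ValueError("width must be between 1 and 8")
    # After a full-byte reversal the field sits in the top `width` bits,
    # so shift it back down to bit 0.
    return bitrev8(value & 0xFF) >> (8 - width)
```

      For example, reverse_field(0b110, 3) reverses the byte 0b00000110 to
      0b01100000 and shifts right by 5, giving 0b011.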
      Signed-off-by: Uladzislau Koshchanka <koshchanka@gmail.com>
      Link: https://lore.kernel.org/r/20221210004423.32332-1-koshchanka@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1280d4b7
    • Kurt Kanzenbach's avatar
      dt-bindings: net: dsa: hellcreek: Sync DSA maintainers · 93e637a3
      Kurt Kanzenbach authored
      The current DSA maintainers are Florian Fainelli, Andrew Lunn and Vladimir
      Oltean. Update the hellcreek binding accordingly.
      Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20221212081546.6916-1-kurt@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      93e637a3
    • Yunsheng Lin's avatar
      net: tso: inline tso_count_descs() · d7b061b8
      Yunsheng Lin authored
      tso_count_descs() is a small function doing simple calculation,
      and tso_count_descs() is used in fast path, so inline it to
      reduce the overhead of calls.
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Link: https://lore.kernel.org/r/20221212032426.16050-1-linyunsheng@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d7b061b8
    • Vladimir Oltean's avatar
      net: dsa: don't call ptp_classify_raw() if switch doesn't provide RX timestamping · 8f18655c
      Vladimir Oltean authored
      ptp_classify_raw() is not exactly cheap, since it invokes a BPF program
      for every skb in the receive path. For switches which do not provide
      ds->ops->port_rxtstamp(), running ptp_classify_raw() provides precisely
      nothing, so check for the presence of the function pointer first, since
      that is much cheaper.
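
      The reordering can be sketched as follows, in Python rather than
      kernel C. Switch, rx_timestamp() and PTP_CLASS_NONE as defined here
      are hypothetical stand-ins; the point is only that the cheap
      "is the hook present?" test runs before the expensive classifier.

```python
from typing import Callable, Optional

PTP_CLASS_NONE = 0  # illustrative "not a PTP packet" value


class Switch:
    """Hypothetical switch with an optional RX timestamping hook."""

    def __init__(
        self, port_rxtstamp: Optional[Callable[[bytes, int], bool]] = None
    ) -> None:
        self.port_rxtstamp = port_rxtstamp


def rx_timestamp(
    switch: Switch, packet: bytes, classify: Callable[[bytes], int]
) -> bool:
    """Return True if the switch took the packet for RX timestamping."""
    if switch.port_rxtstamp is None:
        return False  # cheap check first: the classifier is never invoked
    ptp_type = classify(packet)  # the expensive step
    if ptp_type == PTP_CLASS_NONE:
        return False
    return switch.port_rxtstamp(packet, ptp_type)
```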
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Link: https://lore.kernel.org/r/20221209175840.390707-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8f18655c
    • Jakub Kicinski's avatar
      Merge branch 'trace-points-for-mv88e6xxx' · cd2aafa2
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Trace points for mv88e6xxx
      
      While testing Hans Schultz' attempt at offloading MAB on mv88e6xxx:
      https://patchwork.kernel.org/project/netdevbpf/cover/20221205185908.217520-1-netdev@kapio-technology.com/
      I noticed that he still didn't get rid of the huge log spam caused by
      ATU and VTU violations, even though we discussed this:
      https://patchwork.kernel.org/project/netdevbpf/cover/20221112203748.68995-1-netdev@kapio-technology.com/#25091076
      
      It seems unlikely he's going to ever do this, so here is my own stab at
      converting those messages to trace points. This is IMO an improvement
      regardless of whether Hans' work with MAB lands or not, especially the
      VTU violations which were quite annoying to me as well.
      
      A small sample of before:
      
      $ ./bridge_locked_port.sh lan1 lan2 lan3 lan4
      [  114.465272] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
      [  119.550508] mv88e6xxx_g1_vtu_prob_irq_thread_fn: 34 callbacks suppressed
      [  120.369586] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
      [  120.473658] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
      [  125.535209] mv88e6xxx_g1_vtu_prob_irq_thread_fn: 21 callbacks suppressed
      [  125.535243] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
      [  126.174558] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
      [  130.234055] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 fid 3 portvec 4 spid 2
      [  130.338193] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 fid 3 portvec 4 spid 2
      [  134.626099] mv88e6xxx_g1_atu_prob_irq_thread_fn: 38 callbacks suppressed
      [  134.626132] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 fid 3 portvec 4 spid 2
      
      and after:
      
      $ trace-cmd record -e mv88e6xxx ./bridge_locked_port.sh lan1 lan2 lan3 lan4
      $ trace-cmd report
         irq/35-moxtet-60    [001]    93.929734: mv88e6xxx_vtu_miss_violation: dev d0032004.mdio-mii:10 spid 9 vid 100
         irq/35-moxtet-60    [001]    94.183209: mv88e6xxx_vtu_miss_violation: dev d0032004.mdio-mii:10 spid 9 vid 100
         irq/35-moxtet-60    [001]   101.865545: mv88e6xxx_vtu_miss_violation: dev d0032004.mdio-mii:10 spid 9 vid 100
         irq/35-moxtet-60    [001]   121.831261: mv88e6xxx_vtu_member_violation: dev d0032004.mdio-mii:10 spid 9 vid 100
         irq/35-moxtet-60    [001]   122.371238: mv88e6xxx_vtu_member_violation: dev d0032004.mdio-mii:10 spid 9 vid 100
         irq/35-moxtet-60    [001]   148.452932: mv88e6xxx_atu_miss_violation: dev d0032004.mdio-mii:10 spid 2 portvec 0x4 addr 00:01:02:03:04:01 fid 0
      
      v1 at:
      https://patchwork.kernel.org/project/netdevbpf/cover/20221207233954.3619276-1-vladimir.oltean@nxp.com/
      ====================
      
      Link: https://lore.kernel.org/r/20221209172817.371434-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cd2aafa2
    • Vladimir Oltean's avatar
      net: dsa: mv88e6xxx: replace VTU violation prints with trace points · 9e3d9ae5
      Vladimir Oltean authored
      It is possible to trigger these VTU violation messages very easily,
      it's only necessary to send packets with an unknown VLAN ID to a port
      that belongs to a VLAN-aware bridge.
      
      Do a similar thing as for ATU violation messages, and hide them in the
      kernel's trace buffer.
      
      New usage model:
      
      $ trace-cmd list | grep mv88e6xxx
      mv88e6xxx
      mv88e6xxx:mv88e6xxx_vtu_miss_violation
      mv88e6xxx:mv88e6xxx_vtu_member_violation
      $ trace-cmd report
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Saeed Mahameed <saeed@kernel.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9e3d9ae5
    • Vladimir Oltean's avatar
      net: dsa: mv88e6xxx: replace ATU violation prints with trace points · 8646384d
      Vladimir Oltean authored
      In applications where the switch ports must perform 802.1X based
      authentication and are therefore locked, ATU violation interrupts are
      quite to be expected as part of normal operation. The problem is that
      they currently spam the kernel log, even if rate limited.
      
      Create a series of trace points, all derived from the same event class,
      which log these violations to the kernel's trace buffer, which is both
      much faster and much easier to ignore than printing to a serial console.
      
      New usage model:
      
      $ trace-cmd list | grep mv88e6xxx
      mv88e6xxx
      mv88e6xxx:mv88e6xxx_atu_full_violation
      mv88e6xxx:mv88e6xxx_atu_miss_violation
      mv88e6xxx:mv88e6xxx_atu_member_violation
      $ trace-cmd record -e mv88e6xxx sleep 10
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Saeed Mahameed <saeed@kernel.org>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8646384d
    • Hans J. Schultz's avatar
      net: dsa: mv88e6xxx: read FID when handling ATU violations · 4bf24ad0
      Hans J. Schultz authored
      When an ATU violation occurs, the switch uses the ATU FID register to
      report the FID of the MAC address that incurred the violation. It would
      be good for the driver to know the FID value for purposes such as
      logging and CPU-based authentication.
      
      Up until now, the driver has been calling the mv88e6xxx_g1_atu_op()
      function to read ATU violations, but that doesn't do exactly what we
      want, namely it calls mv88e6xxx_g1_atu_fid_write() with FID 0.
      (side note, the documentation for the ATU Get/Clear Violation command
      says that writes to the ATU FID register have no effect before the
      operation starts, it's only that we disregard the value that this
      register provides once the operation completes)
      
      So mv88e6xxx_g1_atu_fid_write() is not what we want, but rather
      mv88e6xxx_g1_atu_fid_read(). However, the latter doesn't exist, we need
      to write it.
      
      The remainder of mv88e6xxx_g1_atu_op() except for
      mv88e6xxx_g1_atu_fid_write() is still needed, namely to send a
      GET_CLR_VIOLATION command to the ATU. In principle we could have still
      kept calling mv88e6xxx_g1_atu_op(), since the MDIO writes to the ATU
      FID register are merely pointless, but in the interest of doing less
      CPU work per interrupt, write a new function called
      mv88e6xxx_g1_read_atu_violation() and call it.
      
      The FID will be the port default FID as set by mv88e6xxx_port_set_fid()
      if the VID from the packet cannot be found in the VTU. Otherwise it is
      the FID derived from the VTU entry associated with that VID.
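
      The rule in the last paragraph amounts to a lookup with a fallback.
      A small Python model, where violation_fid() and the dict-based VTU
      are hypothetical and only illustrate which FID value to expect:

```python
from typing import Dict


def violation_fid(vid: int, vtu: Dict[int, int], port_default_fid: int) -> int:
    """Return the FID expected for a violating packet.

    `vtu` maps VID -> FID for the VLANs present in the VTU; a VID that is
    absent falls back to the port's default FID.
    """
    return vtu.get(vid, port_default_fid)
```

      With vtu = {100: 3} and a port default FID of 0, a packet on VID 100
      yields FID 3, while a packet on VID 200 yields FID 0.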
      Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4bf24ad0
    • Vladimir Oltean's avatar
      net: dsa: mv88e6xxx: remove ATU age out violation print · 8a1786b7
      Vladimir Oltean authored
      Currently, the MV88E6XXX_PORT_ASSOC_VECTOR_INT_AGE_OUT bit (interrupt on
      age out) is not enabled by the driver, and as a result, the print for
      age out violations is dead code.
      
      Remove it until there is some way for this to be triggered.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a1786b7
    • Jakub Kicinski's avatar
      Merge tag 'for-net-next-2022-12-12' of... · 4cc58a08
      Jakub Kicinski authored
      Merge tag 'for-net-next-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth-next pull request for net-next:
      
       - Add a new VID/PID 0489/e0f2 for MT7922
       - Add Realtek RTL8852BE support ID 0x0cb8:0xc559
       - Add a new PID/VID 13d3/3549 for RTL8822CU
       - Add support for broadcom BCM43430A0 & BCM43430A1
       - Add CONFIG_BT_HCIBTUSB_POLL_SYNC
       - Add CONFIG_BT_LE_L2CAP_ECRED
       - Add support for CYW4373A0
       - Add support for RTL8723DS
       - Add more device IDs for WCN6855
       - Add Broadcom BCM4377 family PCIe Bluetooth
      
      * tag 'for-net-next-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (51 commits)
        Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to complete
        Bluetooth: ISO: Avoid circular locking dependency
        Bluetooth: RFCOMM: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: hci_core: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: hci_bcsp: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: hci_h5: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: hci_ll: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: hci_qca: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: btusb: don't call kfree_skb() under spin_lock_irqsave()
        Bluetooth: btintel: Fix missing free skb in btintel_setup_combined()
        Bluetooth: hci_conn: Fix crash on hci_create_cis_sync
        Bluetooth: btintel: Fix existing sparce warnings
        Bluetooth: btusb: Fix existing sparce warning
        Bluetooth: btusb: Fix new sparce warnings
        Bluetooth: btusb: Add a new PID/VID 13d3/3549 for RTL8822CU
        Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0cb8:0xc559
        dt-bindings: net: realtek-bluetooth: Add RTL8723DS
        Bluetooth: btusb: Add a new VID/PID 0489/e0f2 for MT7922
        dt-bindings: bluetooth: broadcom: add BCM43430A0 & BCM43430A1
        Bluetooth: hci_bcm4377: Fix missing pci_disable_device() on error in bcm4377_probe()
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20221212222322.1690780-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4cc58a08
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · 95d1815f
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.
      
      2) Add DATA_SENT state to SCTP connection tracking helper, from
         Sriram Yagnaraman.
      
      3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.
      
      4) Add bitmask support for ipset, from Vishwanath Pai.
      
      5) Handle icmpv6 redirects as RELATED, from Florian Westphal.
      
      6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
         from Li Qiong.
      
      7) A large batch of IPVS updates to replace timer-based estimators by
         kthreads to scale up wrt. CPUs and workload (millions of estimators).
      
      Julian Anastasov says:
      
      	This patchset implements stats estimation in kthread context.
      It replaces the code that runs on single CPU in timer context every 2
      seconds and causing latency splats as shown in reports [1], [2], [3].
      The solution targets setups with thousands of IPVS services,
      destinations and multi-CPU boxes.
      
      	Spread the estimation on multiple (configured) CPUs and multiple
      time slots (timer ticks) by using multiple chains organized under RCU
      rules.  When stats are not needed, it is recommended to use
      run_estimation=0 as already implemented before this change.
      
      RCU Locking:
      
      - As stats are now RCU-locked, tot_stats, svc and dest which
      hold estimator structures are now always freed from RCU
      callback. This ensures RCU grace period after the
      ip_vs_stop_estimator() call.
      
      Kthread data:
      
      - every kthread works over its own data structure and all
      such structures are attached to array. For now we limit
      kthreads depending on the number of CPUs.
      
      - even while there can be a kthread structure, its task
      may not be running, eg. before first service is added or
      while the sysctl var is set to an empty cpulist or
      when run_estimation is set to 0 to disable the estimation.
      
      - the allocated kthread context may grow from 1 to 50
      allocated structures for timer ticks which saves memory for
      setups with small number of estimators
      
      - a task and its structure may be released if all
      estimators are unlinked from its chains, leaving the
      slot in the array empty
      
      - every kthread data structure allows limited number
      of estimators. Kthread 0 is also used to initially
      calculate the max number of estimators to allow in every
      chain considering a sub-100 microsecond cond_resched
      rate. This number can be from 1 to hundreds.
      
      - kthread 0 has an additional job of optimizing the
      adding of estimators: they are first added in
      temp list (est_temp_list) and later kthread 0
      distributes them to other kthreads. The optimization
      is based on the fact that newly added estimator
      should be estimated after 2 seconds, so we have the
      time to offload the adding to chain from controlling
      process to kthread 0.
      
      - to add new estimators we use the last added kthread
      context (est_add_ktid). The new estimators are linked to
      the chains just before the estimated one, based on add_row.
      This ensures their estimation will start after 2 seconds.
      If estimators are added in bursts, common case if all
      services and dests are initially configured, we may
      spread the estimators to more chains and as result,
      reducing the initial delay below 2 seconds.
      
      Many thanks to Jiri Wiesner for his valuable comments
      and for spending a lot of time reviewing and testing
      the changes on different platforms with 48-256 CPUs and
      1-8 NUMA nodes under different cpufreq governors.
      
      The new IPVS estimators do not use workqueue infrastructure
      because:
      
      - The estimation can take long time when using multiple IPVS rules (eg.
        millions estimator structures) and especially when box has multiple
        CPUs due to the for_each_possible_cpu usage that expects packets from
        any CPU. With est_nice sysctl we have more control how to prioritize the
        estimation kthreads compared to other processes/kthreads that have
        latency requirements (such as servers). As a benefit, we can see these
        kthreads in top and decide if we will need some further control to limit
        their CPU usage (max number of structure to estimate per kthread).
      
      - with kthreads we run code that is read-mostly, no write/lock
        operations to process the estimators in 2-second intervals.
      
      - work items are one-shot: as estimators are processed every
        2 seconds, they need to be re-added every time. This again
        loads the timers (add_timer) if we use delayed works, as there are
        no kthreads to do the timings.
      
      [1] Report from Yunhong Jiang:
          https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
      [2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
      [3] Report from Dust:
          https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
        ipvs: run_estimation should control the kthread tasks
        ipvs: add est_cpulist and est_nice sysctl vars
        ipvs: use kthreads for stats estimation
        ipvs: use u64_stats_t for the per-cpu counters
        ipvs: use common functions for stats allocation
        ipvs: add rcu protection to stats
        netfilter: flowtable: add a 'default' case to flowtable datapath
        netfilter: conntrack: set icmpv6 redirects as RELATED
        netfilter: ipset: Add support for new bitmask parameter
        netfilter: conntrack: merge ipv4+ipv6 confirm functions
        netfilter: conntrack: add sctp DATA_SENT state
        netfilter: nft_inner: fix IS_ERR() vs NULL check
      ====================
      
      Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      95d1815f
    • Luiz Augusto von Dentz's avatar
      Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to complete · 7aca0ac4
      Luiz Augusto von Dentz authored
      This makes sure HCI_OP_WRITE_AUTH_PAYLOAD_TO completes before notifying
      the encryption change, just as is done with HCI_OP_READ_ENC_KEY_SIZE.
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      7aca0ac4
    • Luiz Augusto von Dentz's avatar
      Bluetooth: ISO: Avoid circular locking dependency · 241f5193
      Luiz Augusto von Dentz authored
      This attempts to avoid circular locking dependency between sock_lock
      and hdev_lock:
      
      WARNING: possible circular locking dependency detected
      6.0.0-rc7-03728-g18dd8ab0a783 #3 Not tainted
      ------------------------------------------------------
      kworker/u3:2/53 is trying to acquire lock:
      ffff888000254130 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}, at:
      iso_conn_del+0xbd/0x1d0
      but task is already holding lock:
      ffffffff9f39a080 (hci_cb_list_lock){+.+.}-{3:3}, at:
      hci_le_cis_estabilished_evt+0x1b5/0x500
      which lock already depends on the new lock.
      the existing dependency chain (in reverse order) is:
      -> #2 (hci_cb_list_lock){+.+.}-{3:3}:
             __mutex_lock+0x10e/0xfe0
             hci_le_remote_feat_complete_evt+0x17f/0x320
             hci_event_packet+0x39c/0x7d0
             hci_rx_work+0x2bf/0x950
             process_one_work+0x569/0x980
             worker_thread+0x2a3/0x6f0
             kthread+0x153/0x180
             ret_from_fork+0x22/0x30
      -> #1 (&hdev->lock){+.+.}-{3:3}:
             __mutex_lock+0x10e/0xfe0
             iso_connect_cis+0x6f/0x5a0
             iso_sock_connect+0x1af/0x710
             __sys_connect+0x17e/0x1b0
             __x64_sys_connect+0x37/0x50
             do_syscall_64+0x43/0x90
             entry_SYSCALL_64_after_hwframe+0x62/0xcc
      -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
             __lock_acquire+0x1b51/0x33d0
             lock_acquire+0x16f/0x3b0
             lock_sock_nested+0x32/0x80
             iso_conn_del+0xbd/0x1d0
             iso_connect_cfm+0x226/0x680
             hci_le_cis_estabilished_evt+0x1ed/0x500
             hci_event_packet+0x39c/0x7d0
             hci_rx_work+0x2bf/0x950
             process_one_work+0x569/0x980
             worker_thread+0x2a3/0x6f0
             kthread+0x153/0x180
             ret_from_fork+0x22/0x30
      other info that might help us debug this:
      Chain exists of:
        sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> &hdev->lock --> hci_cb_list_lock
       Possible unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(hci_cb_list_lock);
                                     lock(&hdev->lock);
                                     lock(hci_cb_list_lock);
        lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
       *** DEADLOCK ***
      4 locks held by kworker/u3:2/53:
       #0: ffff8880021d9130 ((wq_completion)hci0#2){+.+.}-{0:0}, at:
       process_one_work+0x4ad/0x980
       #1: ffff888002387de0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0},
       at: process_one_work+0x4ad/0x980
       #2: ffff888001ac0070 (&hdev->lock){+.+.}-{3:3}, at:
       hci_le_cis_estabilished_evt+0xc3/0x500
       #3: ffffffff9f39a080 (hci_cb_list_lock){+.+.}-{3:3}, at:
       hci_le_cis_estabilished_evt+0x1b5/0x500
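
      The report above boils down to a cycle in the lock-order graph. The
      Python sketch below models that reasoning only; it is not how lockdep
      is implemented, and reachable() and would_deadlock() are hypothetical
      helpers. An edge A -> B means "B has been taken while holding A".

```python
from typing import Dict, Set


def reachable(graph: Dict[str, Set[str]], start: str, goal: str) -> bool:
    """Depth-first search: is `goal` reachable from `start`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def would_deadlock(graph: Dict[str, Set[str]], held: str, wanted: str) -> bool:
    """Taking `wanted` while holding `held` adds the edge held -> wanted.

    That closes a cycle if `held` is already reachable from `wanted`.
    """
    return reachable(graph, wanted, held)


# Existing chain from the report:
# sk_lock --> &hdev->lock --> hci_cb_list_lock
order = {
    "sk_lock": {"hdev->lock"},
    "hdev->lock": {"hci_cb_list_lock"},
}
# iso_conn_del() wants sk_lock while hci_cb_list_lock is held.
print(would_deadlock(order, held="hci_cb_list_lock", wanted="sk_lock"))
```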
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      241f5193
    • Yang Yingliang's avatar
      Bluetooth: RFCOMM: don't call kfree_skb() under spin_lock_irqsave() · 0ba18967
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 81be03e0 ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      0ba18967
    • Yang Yingliang's avatar
      Bluetooth: hci_core: don't call kfree_skb() under spin_lock_irqsave() · 39c1eb6f
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 9238f36a ("Bluetooth: Add request cmd_complete and cmd_status functions")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      39c1eb6f
    • Yang Yingliang's avatar
      Bluetooth: hci_bcsp: don't call kfree_skb() under spin_lock_irqsave() · 7b503e33
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      7b503e33
    • Yang Yingliang's avatar
      Bluetooth: hci_h5: don't call kfree_skb() under spin_lock_irqsave() · 383630cc
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 43eb12d7 ("Bluetooth: Fix/implement Three-wire reliable packet sending")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      383630cc
    • Yang Yingliang's avatar
      Bluetooth: hci_ll: don't call kfree_skb() under spin_lock_irqsave() · 8f458f78
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 166d2f6a ("[Bluetooth] Add UART driver for Texas Instruments' BRF63xx chips")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      8f458f78
    • Yang Yingliang's avatar
      Bluetooth: hci_qca: don't call kfree_skb() under spin_lock_irqsave() · df4cfc91
      Yang Yingliang authored
      It is not allowed to call kfree_skb() from hardware interrupt
      context or with interrupts being disabled. So replace kfree_skb()
      with dev_kfree_skb_irq() under spin_lock_irqsave().
      
      Fixes: 0ff252c1 ("Bluetooth: hciuart: Add support QCA chipset for UART")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      df4cfc91