- 09 Feb, 2021 5 commits
-
-
Amit Cohen authored
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. To avoid such cases, previous patch set added the ability to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, this behavior is controlled by sysctl. With the above mentioned behavior, it is possible to know from user-space if the route was offloaded, but if the offload fails there is no indication to user-space. Following a failure, a routing daemon will wait indefinitely for a notification that will never come. This patch adds an "offload_failed" indication to IPv6 routes, so that users will have better visibility into the offload process. 'struct fib6_info' is extended with new field that indicates if route offload failed. Note that the new field is added using unused bit and therefore there is no need to increase struct size. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
Add the value '2' to 'fib_notify_on_flag_change' to allow sending notifications only for failed route installation. Separate value is added for such notifications because there are less of them, so they do not impact performance and some users will find them more important. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware. The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware. To avoid such cases, previous patch set added the ability to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, this behavior is controlled by sysctl. With the above mentioned behavior, it is possible to know from user-space if the route was offloaded, but if the offload fails there is no indication to user-space. Following a failure, a routing daemon will wait indefinitely for a notification that will never come. This patch adds an "offload_failed" indication to IPv4 routes, so that users will have better visibility into the offload process. 'struct fib_alias', and 'struct fib_rt_info' are extended with new field that indicates if route offload failed. Note that the new field is added using unused bit and therefore there is no need to increase structs size. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
The flag indicates to user space that route offload failed. Previous patch set added the ability to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, but if the offload fails there is no indication to user-space. The flag will be used in subsequent patches by netdevsim and mlxsw to indicate to user space that route offload failed, so that users will have better visibility into the offload process. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller authored
mlx5-updates-2021-02-04 Vlad Buslov says: ================= Implement support for VF tunneling Abstract Currently, mlx5 only supports configuration with tunnel endpoint IP address on uplink representor. Remove implicit and explicit assumptions of tunnel always being terminated on uplink and implement necessary infrastructure for configuring tunnels on VF representors and updating rules on such tunnels according to routing changes. SW TC model From TC perspective VF tunnel configuration requires two rules in both directions: TX rules 1. Rule that redirects packets from UL to VF rep that has the tunnel endpoint IP address: $ tc -s filter show dev enp8s0f0 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 16:c9:a0:2d:69:2c src_mac 0c:42:a1:58:ab:e4 eth_type ipv4 ip_flags nofrag in_hw in_hw_count 1 action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen index 3 ref 1 bind 1 installed 377 sec used 0 sec Action statistics: Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 114096 bytes 952 pkt backlog 0b 0p requeues 0 cookie 878fa48d8c423fc08c3b6ca599b50a97 no_percpu used_hw_stats delayed 2. Rule that decapsulates the tunneled flow and redirects to destination VF representor: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed RX rules 1. Rule that encapsulates the tunneled flow and redirects packets from source VF rep to tunnel device: $ tc -s filter show dev enp8s0f0_1 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 0a:40:bd:30:89:99 src_mac ca:2e:a7:3f:f5:0f eth_type ipv4 ip_tos 0/0x3 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 key_id 98 dst_port 4789 nocsum ttl 64 pipe index 1 ref 1 bind 1 installed 411 sec used 411 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 no_percpu used_hw_stats delayed action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen index 1 ref 1 bind 1 installed 411 sec used 0 sec Action statistics: Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 5615833 bytes 4028 pkt backlog 0b 0p requeues 0 cookie bb406d45d343bf7ade9690ae80c7cba4 no_percpu used_hw_stats delayed 2. Rule that redirects from tunnel device to UL rep: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed HW offloads model For hardware offload the goal is to mach packet on both rules without exposing it to software on tunnel endpoint VF. In order to achieve this for tx, TC implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification rule to overwrite packet source port to the value of tunnel VF. Eswitch code is modified to recirculate such packets after source port value is changed, which allows second tx rules to match. For rx path indirect table infrastructure is used to allow fully processing VF tunnel traffic in hardware. To implement such pipeline driver needs to program the hardware after matching on UL rule to overwrite source vport from UL to tunnel VF and recirculate the packet to the root table to allow matching on the rule installed on tunnel VF. For this, indirect table matches all encapsulated traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by the miss rule. Such configuration will cause packet to appear on VF representor instead of VF itself if packet has been matches by indirect table rule based on tunnel parameters but missed on second rule (after recirculation). Handle such case by marking packets processed by indirect table with special 0xFFF value in reg_c1 and extending slow table with additional flow group that matches on reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install one rule per VF vport to match on recirculated miss packets and redirect them to appropriate VF vport. Routing events In order to support routing changes and migration of tunnel device between different endpoint VFs, implement routing infrastructure and update it with FIB events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel rule is attached to a routing entry, which is shared for rules of same tunnel. On FIB event the work is scheduled to delete/recreate all rules of affected tunnel. Note: only vxlan tunnel type is supported by this series. =================
-
- 08 Feb, 2021 9 commits
-
-
Heiner Kallweit authored
It is likely that this is a leftover from T3 driver heritage. cxgb4 uses the PCI core VPD access code that handles detection of VPD capabilities. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Looking through patchwork I don't see that there was any consensus to use switchdev notifiers only in case of netlink provided port flags but not sysfs (as a sort of deprecation, punishment or anything like that), so we should probably keep the user interface consistent in terms of functionality. http://patchwork.ozlabs.org/project/netdev/patch/20170605092043.3523-3-jiri@resnulli.us/ http://patchwork.ozlabs.org/project/netdev/patch/20170608064428.4785-3-jiri@resnulli.us/ Fixes: 3922285d ("net: bridge: Add support for offloading port attributes") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Phil Sutter authored
Kernel's key folding basically consists of shifting away least significant zero bits in mask and masking the resulting value with (divisor - 1). Test for u32's 'sample' option to behave identical. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Xin Long authored
In rxrpc_open_socket(), now it's using sock_create_kern() and kernel_bind() to create a udp tunnel socket, and other kernel APIs to set up it. These code can be replaced with udp tunnel APIs udp_sock_create() and setup_udp_tunnel_sock(), and it'll simplify rxrpc_open_socket(). Note that with this patch, the udp tunnel socket will always bind to a random port if transport is not provided by users, which is suggested by David Howells, thanks! Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Duyck authored
In order to access the suboordinate dev for a device we should be holding the rtnl_lock when outside of the transmit path. The existing code was not doing that for the sysfs dump function and as a result we were open to a possible race. To resolve that take the rtnl lock prior to accessing the sb_dev field of the Tx queue and release it after we have retrieved the tc for the queue. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wengjianfeng authored
The variable r is defined at the beginning and initialized to 0 until the function returns r, and the variable r is not reassigned.Therefore, we do not need to define the variable r, just return 0 directly at the end of the function. Signed-off-by: wengjianfeng <wengjianfeng@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yang Li authored
Eliminate the following coccicheck warning: ./tools/testing/selftests/net/so_txtime.c:199:3-4: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Andrea Mayer authored
The set of required attributes for a given SRv6 behavior is identified using a bitmap stored in an unsigned long, since the initial design of SRv6 networking in Linux. Recently the same approach has been used for identifying the optional attributes. However, the number of attributes supported by SRv6 behaviors depends on the size of the unsigned long type which changes with the architecture. Indeed, on a 64-bit architecture, an SRv6 behavior can support up to 64 attributes while on a 32-bit architecture it can support at most 32 attributes. To fool-proof the processing of SRv6 behaviors we verify, at compile time, that the set of all supported SRv6 attributes can be encoded into a bitmap stored in an unsigned long. Otherwise, kernel build fails forcing developers to reconsider adding a new attribute or extend the total number of supported attributes by the SRv6 behaviors. Moreover, we replace all patterns (1 << i) with the macro SEG6_F_ATTR(i) in order to address potential overflow issues caused by 32-bit signed arithmetic. Thanks to Colin Ian King for catching the overflow problem, providing a solution and inspiring this patch. Thanks to Jakub Kicinski for his useful suggestions during the design of this patch. v2: - remove the SEG6_LOCAL_MAX_SUPP which is not strictly needed: it can be derived from the unsigned long type. Thanks to David Ahern for pointing it out. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210206170934.5982-1-andrea.mayer@uniroma2.itSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.open-mesh.org/linux-mergeJakub Kicinski authored
Simon Wunderlich says: ==================== This feature/cleanup patchset is an updated version of the pull request of Feb 2nd (batadv-next-pullrequest-20210202) and includes the following patches: - Bump version strings, by Simon Wunderlich (added commit log) - Drop publication years from copyright info, by Sven Eckelmann (replaced the previous patch which updated copyright years, as per our discussion) - Avoid sizeof on flexible structure, by Sven Eckelmann (unchanged) - Fix names for kernel-doc blocks, by Sven Eckelmann (unchanged) * tag 'batadv-next-pullrequest-20210208' of git://git.open-mesh.org/linux-merge: batman-adv: Fix names for kernel-doc blocks batman-adv: Avoid sizeof on flexible structure batman-adv: Drop publication years from copyright info batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20210208165938.13262-1-sw@simonwunderlich.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 07 Feb, 2021 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueJakub Kicinski authored
Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-02-05 This series contains updates to ice driver only. Jake adds adds reporting of timeout length during devlink flash and implements support to report devlink info regarding the version of firmware that is stored (downloaded) to the device, but is not yet active. ice_devlink_info_get will report "stored" versions when there is no pending flash update. Version info includes the UNDI Option ROM, the Netlist module, and the fw.bundle_id. Gustavo A. R. Silva replaces a one-element array to flexible-array member. Bruce utilizes flex_array_size() helper and removes dead code on a check for a condition that can't occur. v2: * removed security revision implementation, and re-ordered patches to account for this removal * squashed patches implementing ice_read_flash_module to avoid patches refactoring the implementation of a previous patch in the series * modify ice_devlink_info_get to always report "stored" versions instead of only reporting them when a pending flash update is ready. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: remove dead code ice: use flex_array_size where possible ice: Replace one-element array with flexible-array member ice: display stored UNDI firmware version via devlink info ice: display stored netlist versions via devlink info ice: display some stored NVM versions via devlink info ice: introduce function for reading from flash modules ice: cache NVM module bank information ice: introduce context struct for info report ice: create flash_info structure and separate NVM version ice: report timeout length for erasing during devlink flash ==================== Link: https://lore.kernel.org/r/20210206044101.636242-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 06 Feb, 2021 25 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextJakub Kicinski authored
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Remove indirection and use nf_ct_get() instead from nfnetlink_log and nfnetlink_queue, from Florian Westphal. 2) Add weighted random twos choice least-connection scheduling for IPVS, from Darby Payne. 3) Add a __hash placeholder in the flow tuple structure to identify the field to be included in the rhashtable key hash calculation. 4) Add a new nft_parse_register_load() and nft_parse_register_store() to consolidate register load and store in the core. 5) Statify nft_parse_register() since it has no more module clients. 6) Remove redundant assignment in nft_cmp, from Colin Ian King. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: remove redundant assignment of variable err netfilter: nftables: statify nft_parse_register() netfilter: nftables: add nft_parse_register_store() and use it netfilter: nftables: add nft_parse_register_load() and use it netfilter: flowtable: add hash offset field to tuple ipvs: add weighted random twos choice algorithm netfilter: ctnetlink: remove get_ct indirection ==================== Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Heiner Kallweit authored
There's no benefit in trying to disable interrupts if NAPI is scheduled already. This allows us to save a PCI write in this case. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/78c7f2fb-9772-1015-8c1d-632cbdff253f@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Xie He authored
The "dev_has_header" function, recently added in commit d5496990 ("net/packet: fix packet receive on L3 devices without visible hard header"), is more accurate as criteria for determining whether a device exposes the LL header to upper layers, because in addition to dev->header_ops, it also checks for dev->header_ops->create. When transmitting an skb on a device, dev_hard_header can be called to generate an LL header. dev_hard_header will only generate a header if dev->header_ops->create is present. Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20210205224124.21345-1-xie.he.0141@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Alex Elder says: ==================== net: ipa: a mix of small improvements Version 2 of this series restructures a couple of the changed functions (in patches 1 and 2) to avoid blocks of indented code by returning early when possible, as suggested by Jakub. The description of the first patch was changed as a result, to better reflect what the updated patch does. It also fixes one spot I identified when updating the code, where gsi_channel_stop() was doing the wrong thing on error. The original description for this series is below. This series contains a sort of unrelated set of code cleanups. The first two are things I wanted to do in a series that updated some NAPI code recently. I didn't want to change things in a way that affected existing testing so I set these aside for later (i.e., now). The third makes a change to event ring handling that's similar to what was done a while back for channels. There's little benefit to cacheing the current state of an event ring, so with this we'll just fetch the state from hardware whenever we need it. The fourth patch removes the definitions of two unused symbols. The fifth replaces a count that is always 0 or 1 with a Boolean. The sixth removes a build-time validation check that doesn't really provide benefit. And the last one fixes a problem (in two spots) that could cause a build-time check to fail "bogusly". ==================== Link: https://lore.kernel.org/r/20210205221100.1738-1-elder@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
It's possible that the length passed to ipa_header_size_encoded() is larger than what can be represented by the HDR_LEN field alone (starting with IPA v4.5). If we attempted that, u32_encode_bits() would trigger a build-time error. Avoid this problem by masking off high-order bits of the value encoded as the lower portion of the header length. The same sort of problem exists in ipa_metadata_offset_encoded(), so implement the same fix there. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
There is a build-time check that the packet status structure is a multiple of 4 bytes in size. It's not clear where that constraint comes from, but the structure defines what hardware provides so its definition won't change. Get rid of the check; it adds no value. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
The count argument to ipa_endpoint_replenish() is only ever 0 or 1, and always will be (because we always handle each receive buffer in a single transaction). Rename the argument to be add_one and change it to be Boolean. Update the function description to reflect the current code. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
We do not support inter-EE channel or event ring commands. Inter-EE interrupts are disabled (and never re-enabled) for all channels and event rings, so we have no need for the GSI registers that clear those interrupt conditions. So remove their definitions. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
An event ring's state only needs to be known when it is allocated, reset, or deallocated. We check an event ring's state both before and after performing an event ring control command that changes its state. These are only issued at startup and shutdown, so there is very little value in caching the state. Stop recording a copy of the channel's last known state, and instead fetch the true state from hardware whenever it's needed. In such cases, *do* record the state in a local variable, in case an error message reports it (so the value reported is the value seen). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
When stopping a channel, gsi_channel_stop() will ensure NAPI polling is complete when it calls napi_disable(). So there is no need to call napi_synchronize() in that case. Move the call to napi_synchronize() out of __gsi_channel_stop() and into gsi_channel_suspend(), so it's only used where needed. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alex Elder authored
Move the mutex calls out of gsi_channel_stop_retry() and into __gsi_channel_stop(), to make the latter more semantically similar to __gsi_channel_start(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Vladimir Oltean says: ==================== LAG offload for Ocelot DSA switches This patch series reworks the ocelot switchdev driver such that it could share the same implementation for LAG offload as the felix DSA driver. Testing has been done in the following topology: +----------------------------------+ | Board 1 br0 | | +---------+ | | / \ | | | | | | | bond0 | | | +-----+ | | | / \ | | eno0 swp0 swp1 swp2 | +---|--------|-------|-------|-----+ | | | | +--------+ | | Cable | | Cable| |Cable Cable | | +--------+ | | | | | | +---|--------|-------|-------|-----+ | eno0 swp0 swp1 swp2 | | | \ / | | | +-----+ | | | bond0 | | | | | | \ / | | +---------+ | | Board 2 br0 | +----------------------------------+ The same script can be run on both Board 1 and Board 2 to set this up: ip link del bond0 ip link add bond0 type bond mode balance-xor miimon 1 OR ip link add bond0 type bond mode 802.3ad ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up ip link del br0 ip link add br0 type bridge ip link set bond0 master br0 ip link set swp0 master br0 Then traffic can be tested between eno0 of Board 1 and eno0 of Board 2. ==================== Link: https://lore.kernel.org/r/20210205220221.255646-1-olteanv@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The ocelot switch has been supporting LAG offload since its initial commit, however felix could not make use of that, due to lack of a LAG abstraction in DSA. Now that we have that, let's forward DSA's calls towards the ocelot library, who will deal with setting up the bonding. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Given the following topology, and focusing only on Box A: Box A +----------------------------------+ | Board 1 br0 | | +---------+ | | / \ | | | | | | | bond0 | | | +-----+ | |192.168.1.1 | / \ | | eno0 swp0 swp1 swp2 | +---|--------|-------|-------|-----+ | | | | +--------+ | | Cable | | Cable| |Cable Cable | | +--------+ | | | | | | +---|--------|-------|-------|-----+ | eno0 swp0 swp1 swp2 | |192.168.1.2 | \ / | | | +-----+ | | | bond0 | | | | | | \ / | | +---------+ | | Board 2 br0 | +----------------------------------+ Box B The assisted_learning_on_cpu_port logic will see that swp0 is bridged with a "foreign interface" (bond0) and will therefore install all addresses learnt by the software bridge towards bond0 (including the address of eno0 on Box B) as static addresses towards the CPU port. But that's not what we want - bond0 is not really a "foreign interface" but one we can offload including L2 forwarding from/towards it. So we need to refine our logic for assisted learning such that, whenever we see an address learnt on a non-DSA interface, we search through the tree for any port that offloads that non-DSA interface. Some confusion might arise as to why we search through the whole tree instead of just the local switch returned by dsa_slave_dev_lower_find. Or a different angle of the same confusion: why does dsa_slave_dev_lower_find(br_dev) return a single dp that's under br_dev instead of the whole list of bridged DSA ports? To answer the second question, it should be enough to install the static FDB entry on the CPU port of a single switch in the tree, because dsa_port_fdb_add uses DSA_NOTIFIER_FDB_ADD which ensures that all other switches in the tree get notified of that address, and add the entry themselves using dsa_towards_port(). This should help understand the answer to the first question: the port returned by dsa_slave_dev_lower_find may not be on the same switch as the ports that offload the LAG. Nonetheless, if the driver implements .crosschip_lag_join and .crosschip_bridge_join as mv88e6xxx does, there still isn't any reason for trapping addresses learnt on the remote LAG towards the CPU, and we should prevent that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
At present there is an issue when ocelot is offloading a bonding interface, but one of the links of the physical ports goes down. Traffic keeps being hashed towards that destination, and of course gets dropped on egress. Monitor the netdev notifier events emitted by the bonding driver for changes in the physical state of lower interfaces, to determine which ports are active and which ones are no longer. Then extend ocelot_get_bond_mask to return either the configured bonding interfaces, or the active ones, depending on a boolean argument. The code that does rebalancing only needs to do so among the active ports, whereas the bridge forwarding mask and the logical port IDs still need to look at the permanently bonded ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
It makes it a bit easier to read and understand the code that deals with balancing the 16 aggregation codes among the ports in a certain LAG. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
We can now simplify the implementation by always using ocelot_get_bond_mask to look up the other ports that are offloading the same bonding interface as us. In ocelot_set_aggr_pgids, the code had a way to uniquely iterate through LAGs. We need to achieve the same behavior by marking each LAG as visited, which we do now by using a temporary 32-bit "visited" bitmask. This is ok and we do not need dynamic memory allocation, because we know that this switch architecture will not have more than 32 ports (the PGID port masks are 32-bit anyway). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The setup of logical port IDs is done in two places: from the inconclusively named ocelot_setup_lag and from ocelot_port_lag_leave, a function that also calls ocelot_setup_lag (which apparently does an incomplete setup of the LAG). To improve this situation, we can rename ocelot_setup_lag into ocelot_setup_logical_port_ids, and drop the "lag" argument. It will now set up the logical port IDs of all switch ports, which may be just slightly more inefficient but more maintainable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The index of the LAG is equal to the logical port ID that all the physical port members have, which is further equal to the index of the first physical port that is a member of the LAG. The code gets a bit carried away with logic like this: if (a == b) c = a; else c = b; which can be simplified, of course, into: c = b; (with a being port, b being lp, c being lag) This further makes the "lp" variable redundant, since we can use "lag" everywhere where "lp" (logical port) was used. So instead of a "c = b" assignment, we can do a complete deletion of b. Only one comment here: if (bond_mask) { lp = __ffs(bond_mask); ocelot->lags[lp] = 0; } lp was clobbered before, because it was used as a temporary variable to hold the new smallest port ID from the bond. Now that we don't have "lp" any longer, we'll just avoid the temporary variable and zeroize the bonding mask directly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Since this code should be called from pure switchdev as well as from DSA, we must find a way to determine the bonding mask not by looking directly at the net_device lowers of the bonding interface, since those could have different private structures. We keep a pointer to the bonding upper interface, if present, in struct ocelot_port. Then the bonding mask becomes the bitwise OR of all ports that have the same bonding upper interface. This adds a duplication of functionality with the current "lags" array, but the duplication will be short-lived, since further patches will remove the latter completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
IPv6 header information is not currently part of the entropy source for the 4-bit aggregation code used for LAG offload, even though it could be. The hardware reference manual says about these fields: ANA::AGGR_CFG.AC_IP6_TCPUDP_PORT_ENA Use IPv6 TCP/UDP port when calculating aggregation code. Configure identically for all ports. Recommended value is 1. ANA::AGGR_CFG.AC_IP6_FLOW_LBL_ENA Use IPv6 flow label when calculating AC. Configure identically for all ports. Recommended value is 1. Integration with the xmit_hash_policy of the bonding interface is TBD. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Since switchdev/DSA exposes network interfaces that fulfill many of the same user space expectations that dedicated NICs do, it makes sense to not deny bonding interfaces with a bonding policy that we cannot offload, but instead allow the bonding driver to select the egress interface in software. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Make ocelot's net device event handler more streamlined by structuring it in a similar way with others. The inspiration here was dsa_slave_netdevice_event. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
ocelot_netdevice_port_event treats a single event, NETDEV_CHANGEUPPER. So we can remove the check for the type of event, and rename the function to be more suggestive, since there already is a function with a very similar name of ocelot_netdevice_event. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Vladimir Oltean says: ==================== Automatically manage DSA master interface state This patch series adds code that makes DSA open the master interface automatically whenever one user interface gets opened, either by the user, or by various networking subsystems: netconsole, nfsroot. With that in place, we can remove some of the places in the network stack where DSA-specific code was sprinkled. ==================== Link: https://lore.kernel.org/r/20210205133713.4172846-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-