- 26 Oct, 2021 40 commits
-
-
David S. Miller authored
Russell King says: ==================== Introduce supported interfaces bitmap This series introduces a new bitmap to allow us to indicate which phy_interface_t modes are supported. Currently, phylink will call ->validate with PHY_INTERFACE_MODE_NA to request all link mode capabilities from the MAC driver before choosing an interface to use. This leads in some cases to some rather hairly code. This can be simplified if phylink is aware of the interface modes that the MAC supports, and it can instead walk those modes, calling ->validate for each one, and combining the results. This series merely introduces the support; there is no change of behaviour until MAC drivers populate their supported_interfaces bitmap. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King (Oracle) authored
If the network device supplies a supported interface bitmap, we can use that during phylink's validation to simplify MAC drivers in two ways by using the supported_interfaces bitmap to: 1. reject unsupported interfaces before calling into the MAC driver. 2. generate the set of all supported link modes across all supported interfaces (used mainly for SFP, but also some 10G PHYs.) Suggested-by: Sean Anderson <sean.anderson@seco.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King authored
Add a phy_interface_t bitmap so the MAC driver can specifiy which PHY interface modes it supports. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King (Oracle) authored
Add support for a bitmap for phy interface modes, which includes: - a macro to declare the interface bitmap - an inline helper to zero the interface bitmap - an inline helper to detect an empty interface bitmap - inline helpers to do a bitwise AND and OR operations on two interface bitmaps Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vladimir Oltean says: ==================== DSA preparations for FDB isolation between bridges This series makes 2 small changes to DSA's SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, which will make it possible to offer switch drivers a stable association between a FDB entry and a bridge device in a future series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Now that we guarantee that SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events have finished executing by the time we leave our bridge upper interface, we've established a stronger boundary condition for how long the dsa_slave_switchdev_event_work() might run. As such, it is no longer possible for DSA slave interfaces to become unregistered, since they are still bridge ports. So delete the unnecessary dev_hold() and dev_put(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
DSA is preparing to offer switch drivers an API through which they can associate each FDB entry with a struct net_device *bridge_dev. This can be used to perform FDB isolation (the FDB lookup performed on the ingress of a standalone, or bridged port, should not find an FDB entry that is present in the FDB of another bridge). In preparation of that work, DSA needs to ensure that by the time we call the switch .port_fdb_add and .port_fdb_del methods, the dp->bridge_dev pointer is still valid, i.e. the port is still a bridge port. This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API requires drivers that must have sleepable context to handle those events to schedule the deferred work themselves. DSA does this through the dsa_owq. It can happen that a port leaves a bridge, del_nbp() flushes the FDB on that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context, DSA schedules its deferred work, but del_nbp() finishes unlinking the bridge as a master from the port before DSA's deferred work is run. Fundamentally, the port must not be unlinked from the bridge until all FDB deletion deferred work items have been flushed. The bridge must wait for the completion of these hardware accesses. An attempt has been made to address this issue centrally in switchdev by making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev level, which would offer implicit synchronization with del_nbp: https://patchwork.kernel.org/project/netdevbpf/cover/20210820115746.3701811-1-vladimir.oltean@nxp.com/ but it seems that any attempt to modify switchdev's behavior and make the events blocking there would introduce undesirable side effects in other switchdev consumers. The most undesirable behavior seems to be that switchdev_deferred_process_work() takes the rtnl_mutex itself, which would be worse off than having the rtnl_mutex taken individually from drivers which is what we have now (except DSA which has removed that lock since commit 0faf890f ("net: dsa: drop rtnl_lock from dsa_slave_switchdev_event_work")). So to offer the needed guarantee to DSA switch drivers, I have come up with a compromise solution that does not require switchdev rework: we already have a hook at the last moment in time when the bridge is still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush the dsa_owq manually from there, which makes all FDB deletions synchronous. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lukas Wunner authored
IFB originally depended on NET_CLS_ACT for traffic redirection. But since v4.5, that may be achieved with NFT_FWD_NETDEV as well. Fixes: 39e6dea2 ("netfilter: nf_tables: add forward expression to the netdev family") Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: <stable@vger.kernel.org> # v4.5+: bcfabee1: netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jeremy Kerr authored
This change allows an extended address struct - struct sockaddr_mctp_ext - to be passed to sendmsg/recvmsg. This allows userspace to specify output ifindex and physical address information (for sendmsg) or receive the input ifindex/physaddr for incoming messages (for recvmsg). This is typically used by userspace for MCTP address discovery and assignment operations. The extended addressing facility is conditional on a new sockopt: MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before the kernel will consume/populate the extended address data. Includes a fix for an uninitialised var: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nathan Chancellor authored
Clang warns: drivers/net/ethernet/asix/ax88796c_main.c:851:24: error: address of array 'ax_local->phydev->advertising' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion] if (ax_local->phydev->advertising && ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~ ~~ advertising cannot be NULL here if ax_local is not NULL, which cannot happen due to the check in ax88796c_probe(). Remove the check. Link: https://github.com/ClangBuiltLinux/linux/issues/1492Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nathan Chancellor authored
Clang warns: drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] case SPEED_10: ^ drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through case SPEED_10: ^ break; drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] case DUPLEX_HALF: ^ drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through case DUPLEX_HALF: ^ break; Clang is a little more pedantic than GCC, which permits implicit fallthroughs to cases that contain just break or return. Clang's version is more in line with the kernel's own stance in deprecated.rst, which states that all switch/case blocks must end in either break, fallthrough, continue, goto, or return. Add the missing breaks to fix the warning. Link: https://github.com/ClangBuiltLinux/linux/issues/1491Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Haiyang Zhang authored
The existing code doesn't allow setting the number of queues while the NIC is down. Update the ethtool handler functions to support setting the number of queues while the NIC is at down state. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Andreas Oetken authored
added support for the redbox supervision frames as defined in the IEC-62439-3:2018. Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== tcp: tcp_stream_alloc_skb() changes sk_stream_alloc_skb() is only used by TCP. Rename it to tcp_stream_alloc_skb() and apply small optimizations. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Aligning @size argument to 4 bytes is not needed. The header alignment has nothing to do with @size. It really depends on skb->head alignment and MAX_TCP_HEADER. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both IPv4 and IPv6 uses same reserve, no need risking cache line misses to fetch its value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sk_stream_alloc_skb() is only used by TCP. Rename it to make this clear, and move its declaration to include/net/tcp.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
neigh_output() reads n->nud_state and hh->hh_len locklessly. This is fine, but we need to add annotations and document this. We evaluate skip_cache first to avoid reading these fields if the cache has to by bypassed. syzbot report: BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2 write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1: __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128 neigh_event_send include/net/neighbour.h:444 [inline] neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476 neigh_output include/net/neighbour.h:510 [inline] ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423 dst_output include/net/dst.h:450 [inline] ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline] tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline] tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline] tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421 expire_timers+0x135/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x430 kernel/time/timer.c:1734 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747 __do_softirq+0x12c/0x26e kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline] arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline] acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline] acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline] acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x1a3/0x250 kernel/sched/idle.c:306 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403 secondary_startup_64_no_verify+0xb1/0xbb read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0: neigh_output include/net/neighbour.h:507 [inline] ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221 ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423 dst_output include/net/dst.h:450 [inline] ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126 __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525 ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539 __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405 tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline] tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline] tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064 tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079 tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline] tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626 tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642 call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421 expire_timers+0x135/0x240 kernel/time/timer.c:1466 __run_timers+0x368/0x430 kernel/time/timer.c:1734 run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747 __do_softirq+0x12c/0x26e kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu kernel/softirq.c:636 [inline] irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648 sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097 asm_sysvec_apic_timer_interrupt+0x12/0x20 native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline] arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline] acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline] acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline] acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688 cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x1a3/0x250 kernel/sched/idle.c:306 cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403 rest_init+0xee/0x100 init/main.c:734 arch_call_rest_init+0xa/0xb start_kernel+0x5e4/0x669 init/main.c:1142 secondary_startup_64_no_verify+0xb1/0xbb value changed: 0x20 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ido Schimmel says: ==================== mlxsw: Support multiple RIF MAC prefixes Currently, mlxsw enforces that all the netdevs used as router interfaces (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1). Otherwise, an error is returned to user space with extack. This patchset relaxes the limitation through the use of RIF MAC profiles. A RIF MAC profile is a hardware entity that represents a particular MAC prefix which multiple RIFs can reference. Therefore, the number of possible MAC prefixes is no longer one, but the number of profiles supported by the device. The ability to change the MAC of a particular netdev is useful, for example, for users who use the netdev to connect to an upstream provider that performs MAC filtering. Currently, such users are either forced to negotiate with the provider or change the MAC address of all other netdevs so that they share the same prefix. Patchset overview: Patches #1-#3 are preparations. Patch #4 adds actual support for RIF MAC profiles. Patch #5 exposes RIF MAC profiles as a devlink resource, so that user space has visibility into the maximum number of profiles and current occupancy. Useful for debugging and testing (next 3 patches). Patches #6-#8 add both scale and functional tests. Patch #9 removes tests that validated the previous limitation. It is now covered by patch #6 for devices that support a single profile. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
After adding the previous patches, the constraint that all the router interface MAC addresses have the same prefix is no longer relevant. Remove the test cases that validated that this constraint is honored. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
When all the RIF MAC profiles are in use, test that it is possible to change the MAC of a netdev (i.e., a RIF) when its MAC profile is not shared with other RIFs. Test that replacement fails when the MAC profile is shared. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Verify that MAC profile changes are indeed applied and that packets are forwarded with the correct source MAC. Output example: $ ./rif_mac_profiles.sh TEST: h1->h2: new mac profile [ OK ] TEST: h2->h1: new mac profile [ OK ] TEST: h1->h2: edit mac profile [ OK ] TEST: h2->h1: edit mac profile [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Query the maximum number of supported RIF MAC profiles using devlink-resource and verify that all available MAC profiles can be utilized and that an error is generated when user space tries to exceed this number. Output example in Spectrum-2: $ TESTS='rif_mac_profile' ./resource_scale.sh TEST: 'rif_mac_profile' 4 [ OK ] TEST: 'rif_mac_profile' overflow 5 [ OK ] Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Expose via devlink-resource the maximum number of RIF MAC profiles and their current occupancy, so it can be used for debug and writing generic tests, like in the next patch. Example for Spectrum-2 output: $ devlink resource show pci/0000:06:00.0 ... name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Currently, mlxsw enforces that all the router interfaces (RIFs) have the same MAC prefix. Relax this limitation by using RIF MAC profiles. Each profile is associated with a particular MAC prefix and multiple RIFs can use the same profile. Therefore, the number of possible MAC prefixes is no longer one, but the number of profiles supported by the device. Store the profiles in an IDR and reference count them according to the number of RIFs using them. Associate a RIF with a profile when the RIF is created and remove the association when the RIF is deleted. Change the association following 'NETDEV_CHANGEADDR' events, except when only one RIF is using the profile. In which case, change the MAC prefix of the profile itself instead of associating the RIF with a new profile. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
The next patch will set the MAC profile of a router interface (RIF) as part of its configure() callback. The operation can fail in case the maximum number of profiles was exceeded. Add extack to mlxsw_sp_rif_ops::configure() in order to communicate such failures to user space. In addition, the MAC profile of a RIF can change following a 'NETDEV_CHANGEADDR' notification. Propagate extack to mlxsw_sp_router_port_change_event() so that failures could be communicated in this path as well. No functional changes intended. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Add a resource identifier for maximum RIF MAC profiles so that it could be later used to query the information from firmware. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Danielle Ratson authored
Add MAC profile ID field to RITR register so that it could be used for associating a RIF with a MAC profile ID by a later patch. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Florian Westphal says: ==================== vrf: rework interaction with netfilter/conntrack V2: - fix 'plain integer as null pointer' warning - reword commit message in patch 2 to clarify loss of 'ct set untracked' This patch series aims to solve the to-be-reverted change 09e856d5 ("vrf: Reset skb conntrack connection on VRF rcv") in a different way. Rather than have skbs pass through conntrack and nat hooks twice, suppress conntrack invocation if the conntrack/nat hook is called from the vrf driver. First patch deals with 'incoming connection' case: 1. suppress NAT transformations 2. skip conntrack confirmation NAT and conntrack confirmation is done when ip/ipv6 stack calls the postrouting hook. Second patch deals with local packets: in vrf driver, mark the skbs as 'untracked', so conntrack output hook ignores them. This skips all nat hooks as well. Afterwards, remove the untracked state again so the second round will pick them up. One alternative to the chosen implementation would be to add a 'caller id' field to 'struct nf_hook_state' and then use that, these patches use the more straightforward check of VRF flag on the state->out device. The two patches apply to both net and net-next, i am targeting -next because I think that since snat did not work correctly for so long that we can take the longer route. If you disagree, apply to net at your discretion. The patches apply both with 09e856d5 reverted or still in-place, but only with the revert in place ingress conntrack settings (zone, notrack etc) start working again. I've already submitted selftests for vrf+nfqueue and conntrack+vrf. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. This is a problem when NAT rules are configured. To avoid any conntrack involvement in round 1, tag skbs as 'untracked' to prevent conntrack from picking them up. This gets cleared before the packet gets handed to the ip stack so conntrack will be active on the second iteration. One remaining issue is that a rule like output ... oif $vrfname notrack won't propagate to the second round because we can't tell 'notrack set via ruleset' and 'notrack set by vrf driver' apart. However, this isn't a regression: the 'notrack' removal happens instead of unconditional nf_reset_ct(). I'd also like to avoid leaking more vrf specific conditionals into the netfilter infra. For ingress, conntrack has already been done before the packet makes it to the vrf driver, with this patch egress does connection tracking with lower/physical device as well. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. Afterwards, ip stack calls those hooks again. This is a problem when conntrack is used with IP masquerading. masquerading has an internal check that re-validates the output interface to account for route changes. This check will trigger in the vrf case. If the -j MASQUERADE rule matched on the first iteration, then round 2 finds state->out->ifindex != nat->masq_index: the latter is the vrf index, but out->ifindex is the lower device. The packet gets dropped and the conntrack entry is invalidated. This change makes conntrack postrouting skip the nat hooks. Also skip confirmation. This allows the second round (postrouting invocation from ipv4/ipv6) to create nat bindings. This also prevents the second round from seeing packets that had their source address changed by the nat hook. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller authored
Saeed Mahameed says: ==================== mlx5-updates-2021-10-25 Misc updates for mlx5 driver: 1) Misc updates and cleanups: - Don't write directly to netdev->dev_addr, From Jakub Kicinski - Remove unnecessary checks for slow path flag in tc module - Fix unused function warning of mlx5i_flow_type_mask - Bridge, support replacing existing FDB entry 2) Sub Functions, Reduction in memory usage: - Reduce flow counters bulk query buffer size - Implement max_macs devlink parameter - Add devlink vendor params to control Event Queue sizes - Added SF life cycle trace points by Parav/ 3) From Aya, Firmware health buffer reporting improvements - Print health buffer by log level and more missing information - Periodic update of host time to firmware ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jon Maxwell authored
v1: Implement a more general statement as recommended by Eric Dumazet. The sequence number will be advanced, so this check will fix the FIN case and other cases. A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that the write_queue was not empty as determined by tcp_write_queue_empty() but the sk_buff containing the FIN flag had been freed and the socket was zombied in that state. Corresponding pcaps show no FIN from the Linux kernel on the wire. Some instrumentation was added to the kernel and it was found that there is a timing window where tcp_sendmsg() can run after tcp_send_fin(). tcp_sendmsg() will hit an error, for example: 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
↩ 1270 ▹ ▹ goto do_error;↩ tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent. If the other side sends a FIN packet the socket will transition to CLOSING and remain that way until the system is rebooted. Fix this by checking for the FIN flag in the sk_buff and don't free it if that is the case. Testing confirmed that fixed the issue. Fixes: fdfc5c85 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz> Reported-by: Simon Stier <simon.stier@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> -
Jakub Kicinski authored
Jean Sacren says: ==================== Small fixes for true expression checks This series fixes checks of true !rc expression. ==================== Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jean Sacren authored
Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it is always true. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jean Sacren authored
Remove the check of !rc in (!rc && !params.b_granted) since it is always true. We should also use constant 0 for return. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Eric Dumazet says: ==================== tcp: receive path optimizations This series aims to reduce cache line misses in RX path. I am still working on better cache locality in tcp_sock but this will wait few more weeks. ==================== Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
Two kfree_skb() calls must be replaced by consume_skb() for skbs that are not technically dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
RFC 5082 IP_MINTTL option is rarely used on hosts. Add a static key to remove from TCP fast path useless code, and potential cache line miss to fetch inet_sk(sk)->min_ttl Note that once ip4_min_ttl static key has been enabled, it stays enabled until next boot. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
No report yet from KCSAN, yet worth documenting the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-