- 09 Feb, 2023 7 commits
-
-
Jakub Kicinski authored
Merge tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-next-netdev-deadlock This series from Jiri solves a deadlock when removing a network namespace with mlx5 devlink instance being in it. The deadlock is between: 1) mlx5_ib->unregister_netdevice_notifier() AND 2) mlx5_core->devlink_reload->cleanup_net() To slove this introduced mlx5 netdev added/removed events to track uplink netdev to be used for register_netdevice_notifier_dev_net() purposes. * tag 'mlx5-next-netdev-deadlock' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister net/mlx5e: Propagate an internal event in case uplink netdev changes net/mlx5e: Fix trap event handling net/mlx5: Introduce CQE error syndrome ==================== Link: https://lore.kernel.org/r/20230208005626.72930-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yang Li authored
./drivers/net/ethernet/wangxun/libwx/wx_lib.c:683:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3976Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20230208004959.47553-1-yang.lee@linux.alibaba.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yang Li authored
drivers/net/ethernet/wangxun/libwx/wx_lib.c:1835 wx_setup_all_rx_resources() warn: inconsistent indenting Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3981Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20230208013227.111605-1-yang.lee@linux.alibaba.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jiri Pirko authored
When removing a network namespace with mlx5 devlink instance being in it, following callchain is performed: cleanup_net (takes down_read(&pernet_ops_rwsem) devlink_pernet_pre_exit() devlink_reload() mlx5_devlink_reload_down() mlx5_unload_one_devl_locked() mlx5_detach_device() del_adev() mlx5r_remove() __mlx5_ib_remove() mlx5_ib_roce_cleanup() mlx5_remove_netdev_notifier() unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem) This deadlocks. Resolve this by converting to register_netdevice_notifier_dev_net() which does not take pernet_ops_rwsem and moves the notifier block around according to netdev it takes as arg. Use previously introduced netdev added/removed events to track uplink netdev to be used for register_netdevice_notifier_dev_net() purposes. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Jiri Pirko authored
Whenever uplink netdev is set/cleared, propagate newly introduced event to inform notifier blocks netdev was added/removed. Move the set() helper to core.c from header, introduce clear() and netdev_added_event_replay() helpers. The last one is going to be called from rdma driver, so export it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Jiri Pirko authored
Current code does not return correct return value from event handler. Fix it by returning NOTIFY_* and propagate err over newly introduce ctx structure. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxJakub Kicinski authored
Saeed Mahameed says: ==================== mlx5-updates-2023-02-07 1) Minor and trivial code Cleanups 2) Minor fixes for net-next 3) From Shay: dynamic FW trace strings update. * tag 'mlx5-updates-2023-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: fw_tracer, Add support for unrecognized string net/mlx5: fw_tracer, Add support for strings DB update event net/mlx5: fw_tracer, allow 0 size string DBs net/mlx5: fw_tracer: Fix debug print net/mlx5: fs, Remove redundant assignment of size net/mlx5: fs_core, Remove redundant variable err net/mlx5: Fix memory leak in error flow of port set buffer net/mlx5e: Remove incorrect debugfs_create_dir NULL check in TLS net/mlx5e: Remove incorrect debugfs_create_dir NULL check in hairpin net/mlx5: fs, Remove redundant vport_number assignment net/mlx5e: Remove redundant code for handling vlan actions net/mlx5e: Don't listen to remove flows event net/mlx5: fw reset: Skip device ID check if PCI link up failed net/mlx5: Remove redundant health work lock mlx5: reduce stack usage in mlx5_setup_tc ==================== Link: https://lore.kernel.org/r/20230208003712.68386-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 08 Feb, 2023 33 commits
-
-
David S. Miller authored
Vladimir Oltean says: ==================== taprio automatic queueMaxSDU and new TXQ selection procedure This patch set addresses 2 design limitations in the taprio software scheduler: 1. Software scheduling fundamentally prioritizes traffic incorrectly, in a way which was inspired from Intel igb/igc drivers and does not follow the inputs user space gives (traffic classes and TC to TXQ mapping). Patch 05/15 handles this, 01/15 - 04/15 are preparations for this work. 2. Software scheduling assumes that the gate for a traffic class closes as soon as the next interval begins. But this isn't true. If consecutive schedule entries have that traffic class gate open, there is no "gate close" event and taprio should keep dequeuing from that TC without interruptions. Patches 06/15 - 15/15 handle this. Patch 10/15 is a generic Qdisc change required for this to work. Future development directions which depend on this patch set are: - Propagating the automatic queueMaxSDU calculation down to offloading device drivers, instead of letting them calculate this, as vsc9959_tas_guard_bands_update() does today. - A software data path for tc-taprio with preemptible traffic and Hold/Release events. v1 at: https://patchwork.kernel.org/project/netdevbpf/cover/20230128010719.2182346-1-vladimir.oltean@nxp.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Improve commit 497cc002 ("taprio: Handle short intervals and large packets") to only perform segmentation when skb->len exceeds what taprio_dequeue() expects. In practice, this will make the biggest difference when a traffic class gate is always open in the schedule. This is because the max_frm_len will be U32_MAX, and such large skb->len values as Kurt reported will be sent just fine unsegmented. What I don't seem to know how to handle is how to make sure that the segmented skbs themselves are smaller than the maximum frame size given by the current queueMaxSDU[tc]. Nonetheless, we still need to drop those, otherwise the Qdisc will hang. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
The majority of the taprio_enqueue()'s function is spent doing TCP segmentation, which doesn't look right to me. Compilers shouldn't have a problem in inlining code no matter how we write it, so move the segmentation logic to a separate function. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
taprio today has a huge problem with small TC gate durations, because it might accept packets in taprio_enqueue() which will never be sent by taprio_dequeue(). Since not much infrastructure was available, a kludge was added in commit 497cc002 ("taprio: Handle short intervals and large packets"), which segmented large TCP segments, but the fact of the matter is that the issue isn't specific to large TCP segments (and even worse, the performance penalty in segmenting those is absolutely huge). In commit a54fc09e ("net/sched: taprio: allow user input of per-tc max SDU"), taprio gained support for queueMaxSDU, which is precisely the mechanism through which packets should be dropped at qdisc_enqueue() if they cannot be sent. After that patch, it was necessary for the user to manually limit the maximum MTU per TC. This change adds the necessary logic for taprio to further limit the values specified (or not specified) by the user to some minimum values which never allow oversized packets to be sent. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
I have one practical reason for doing this and one concerning correctness. The practical reason has to do with a follow-up patch, which aims to mix 2 sources of max_sdu (one coming from the user and the other automatically calculated based on TC gate durations @current link speed). Among those 2 sources of input, we must always select the smaller max_sdu value, but this can change at various link speeds. So the max_sdu coming from the user must be kept separated from the value that is operationally used (the minimum of the 2), because otherwise we overwrite it and forget what the user asked us to do. To solve that, this patch proposes that struct sched_gate_list contains the operationally active max_frm_len, and q->max_sdu contains just what was requested by the user. The reason having to do with correctness is based on the following observation: the admin sched_gate_list becomes operational at a given base_time in the future. Until then, it is inactive and applies no shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't apply either (this is a mechanism to ensure that packets smaller than the largest gate duration for that TC don't hang the port; clearly it makes little sense if the gates are always open). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Vinicius intended taprio to take the L1 overhead into account when estimating packet transmission time through user input, specifically through the qdisc size table (man tc-stab). Something like this: tc qdisc replace dev $eth root stab overhead 24 taprio \ num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ base-time 0 \ sched-entry S 0x7e 9000000 \ sched-entry S 0x82 1000000 \ max-sdu 0 0 0 0 0 0 0 200 \ flags 0x0 clockid CLOCK_TAI Without the overhead being specified, transmission times will be underestimated and will cause late transmissions. For an offloading driver, it might even cause TX hangs if there is no open gate large enough to send the maximum sized packets for that TC (including L1 overhead). Properly knowing the L1 overhead will ensure that we are able to auto-calculate the queueMaxSDU per traffic class just right, and avoid these hangs due to head-of-line blocking. We can't make the stab mandatory due to existing setups, but we can warn the user that it's important with a warning netlink extack. Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Some qdiscs like taprio turn out to be actually pretty reliant on a well configured stab, to not underestimate the skb transmission time (by properly accounting for L1 overhead). In a future change, taprio will need the stab, if configured by the user, to be available at ops->init() time. It will become even more important in upcoming work, when the overhead will be used for the queueMaxSDU calculation that is passed to an offloading driver. However, rcu_assign_pointer(sch->stab, stab) is called right after ops->init(), making it unavailable, and I don't really see a good reason for that. Move it earlier, which nicely seems to simplify the error handling path as well. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
taprio_dequeue_from_txq() looks at the entry->end_time to determine whether the skb will overrun its traffic class gate, as if at the end of the schedule entry there surely is a "gate close" event for it. Hint: maybe there isn't. For each schedule entry, introduce an array of kernel times which actually tracks when in the future will there be an *actual* gate close event for that traffic class, and use that in the guard band overrun calculation. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Currently taprio assumes that the budget for a traffic class expires at the end of the current interval as if the next interval contains a "gate close" event for this traffic class. This is, however, an unfounded assumption. Allow schedule entry intervals to be fused together for a particular traffic class by calculating the budget until the gate *actually* closes. This means we need to keep budgets per traffic class, and we also need to update the budget consumption procedure. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
There is a confusion in terms in taprio which makes what is called "close_time" to be actually used for 2 things: 1. determining when an entry "closes" such that transmitted skbs are never allowed to overrun that time (?!) 2. an aid for determining when to advance and/or restart the schedule using the hrtimer It makes more sense to call this so-called "close_time" "end_time", because it's not clear at all to me what "closes". Future patches will hopefully make better use of the term "to close". This is an absolutely mechanical change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Current taprio code operates on a very simplistic (and incorrect) assumption: that egress scheduling for a traffic class can only take place for the duration of the current interval, or i.o.w., it assumes that at the end of each schedule entry, there is a "gate close" event for all traffic classes. As an example, traffic sent with the schedule below will be jumpy, even though all 8 TC gates are open, so there is absolutely no "gate close" event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in consecutive schedule entries): tc qdisc replace dev veth0 parent root taprio \ num_tc 2 \ map 0 1 \ queues 1@0 1@1 \ base-time 0 \ sched-entry S 0xff 4000000000 \ clockid CLOCK_TAI \ flags 0x0 This qdisc simply does not have what it takes in terms of logic to *actually* compute the durations of traffic classes. Also, it does not recognize the need to use this information on a per-traffic-class basis: it always looks at entry->interval and entry->close_time. This change proposes that each schedule entry has an array called tc_gate_duration[tc]. This holds the information: "for how long will this traffic class gate remain open, starting from *this* schedule entry". If the traffic class gate is always open, that value is equal to the cycle time of the schedule. We'll also need to keep track, for the purpose of queueMaxSDU[tc] calculation, what is the maximum time duration for a traffic class having an open gate. This gives us directly what is the maximum sized packet that this traffic class will have to accept. For everything else it has to qdisc_drop() it in qdisc_enqueue(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Current taprio software implementation is haunted by the shadow of the igb/igc hardware model. It iterates over child qdiscs in increasing order of TXQ index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N. According to discussions with Vinicius, that is the default (perhaps even unchangeable) prioritization scheme used for the NICs that taprio was first written for (igb, igc), and we have a case of two bugs canceling out, resulting in a functional setup on igb/igc, but a less sane one on other NICs. To the best of my understanding, taprio should prioritize based on the traffic class, so it should really dequeue starting with the highest traffic class and going down from there. We get to the TXQ using the tc_to_txq[] netdev property. TXQs within the same TC have the same (strict) priority, so we should pick from them as fairly as we can. We can achieve that by implementing something very similar to q->curband from multiq_dequeue(). Since igb/igc really do have TXQ 0 of higher hardware priority than TXQ 1 etc, we need to preserve the behavior for them as well. We really have no choice, because in txtime-assist mode, taprio is essentially a software scheduler towards offloaded child tc-etf qdiscs, so the TXQ selection really does matter (not all igb TXQs support ETF/SO_TXTIME, says Kurt Kanzenbach). To preserve the behavior, we need a capability bit so that taprio can determine if it's running on igb/igc, or on something else. Because igb doesn't offload taprio at all, we can't piggyback on the qdisc_offload_query_caps() call from taprio_enable_offload(), but instead we need a separate call which is also made for software scheduling. Introduce two static keys to minimize the performance penalty on systems which only have igb/igc NICs, and on systems which only have other NICs. For mixed systems, taprio will have to dynamically check whether to dequeue using one prioritization algorithm or using the other. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Simplify taprio_dequeue_from_txq() by noticing that we can goto one call earlier than the previous skb_found label. This is possible because we've unified the treatment of the child->ops->dequeue(child) return call, we always try other TXQs now, instead of abandoning the root dequeue completely if we failed in the peek() case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Future changes will refactor the TXQ selection procedure, and a lot of stuff will become messy, the indentation of the bulk of the dequeue procedure would increase, etc. Break out the bulk of the function into a new one, which knows the TXQ (child qdisc) we should perform a dequeue from. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
This changes the handling of an unlikely condition to not stop dequeuing if taprio failed to dequeue the peeked skb in taprio_dequeue(). I've no idea when this can happen, but the only side effect seems to be that the atomic_sub_return() call right above will have consumed some budget. This isn't a big deal, since either that made us remain without any budget (and therefore, we'd exit on the next peeked skb anyway), or we could send some packets from other TXQs. I'm making this change because in a future patch I'll be refactoring the dequeue procedure to simplify it, and this corner case will have to go away. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
There isn't any code in the network stack which calls taprio_peek(). We only see qdisc->ops->peek() being called on child qdiscs of other classful qdiscs, never from the generic qdisc code. Whereas taprio is never a child qdisc, it is always root. This snippet of a comment from qdisc_peek_dequeued() seems to confirm: /* we can reuse ->gso_skb because peek isn't called for root qdiscs */ Since I've been known to be wrong many times though, I'm not completely removing it, but leaving a stub function in place which emits a warning. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Horatiu Vultur says: ==================== net: micrel: Add support for lan8841 PHY Add support for lan8841 PHY. The first patch add the support for lan8841 PHY which can run at 10/100/1000Mbit. It also has support for other features, but they are not added in this series. The second patch updates the documentation for the dt-bindings which is similar to the ksz9131. v3->v4: - add space between defines and function names - inside lan8841_config_init use only ret variable v2->v3: - reuse ksz9131_config_init - allow only open-drain configuration - change from single patch to a patch series v1->v2: - Remove hardcoded values - Fix typo in commit message ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
The lan8841 has the same bindings as ksz9131, so just reuse the entire section of ksz9131. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
The LAN8841 is completely integrated triple-speed (10BASE-T/ 100BASE-TX/ 1000BASE-T) Ethernet physical layer transceivers for transmission and reception of data on standard CAT-5, as well as CAT-5e and CAT-6, unshielded twisted pair (UTP) cables. The LAN8841 offers the industry-standard GMII/MII as well as the RGMII. Some of the features of the PHY are: - Wake on LAN - Auto-MDIX - IEEE 1588-2008 (V2) - LinkMD Capable diagnosis Currently the patch offers support only for link configuration. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Add flower filter packet statistics. This will just read the TCAM counter of the rule, which mention how many packages were hit by this rule. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueJakub Kicinski authored
Tony Nguyen says: ==================== ice: various virtualization cleanups Jacob Keller says: This series contains a variety of refactors and cleanups in the VF code for the ice driver. Its primary focus is cleanup and simplification of the VF operations and addition of a few new operations that will be required by Scalable IOV, as well as some other refactors needed for the handling of VF subfunctions. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: remove unnecessary virtchnl_ether_addr struct use ice: introduce .irq_close VF operation ice: introduce clear_reset_state operation ice: convert vf_ops .vsi_rebuild to .create_vsi ice: introduce ice_vf_init_host_cfg function ice: add a function to initialize vf entry ice: Pull common tasks into ice_vf_post_vsi_rebuild ice: move ice_vf_vsi_release into ice_vf_lib.c ice: move vsi_type assignment from ice_vsi_alloc to ice_vsi_cfg ice: refactor VSI setup to use parameter structure ice: drop unnecessary VF parameter from several VSI functions ice: fix function comment referring to ice_vsi_alloc ice: Add more usage of existing function ice_get_vf_vsi(vf) ==================== Link: https://lore.kernel.org/r/20230206214813.20107-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Moshe Shemesh authored
The callback devlink_nl_cmd_health_reporter_diagnose_doit() miss devlink_fmsg_free(), which leads to memory leak. Fix it by adding devlink_fmsg_free(). Fixes: e994a75f ("devlink: remove reporter reference counting") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/1675698976-45993-1-git-send-email-moshe@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
James Hershaw authored
Move the nfp_net_get_port_mac_by_hwinfo() check to ahead in the get/set_eeprom() functions to in order to check for a VF netdev, which this function does not support. It is debatable if this is a fix or an enhancement, and we have chosen to go for the latter. It does address a problem introduced by commit 74b4f173 ("nfp: flower: change get/set_eeprom logic and enable for flower reps"). However, the ethtool->len == 0 check avoids the problem manifesting as a run-time bug (NULL pointer dereference of app). Signed-off-by: James Hershaw <james.hershaw@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230206154836.2803995-1-simon.horman@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Petr Machata says: ==================== mlxsw: Misc devlink changes This patchset adjusts mlxsw to recent devlink changes in net-next. Patch #1 removes a devl_param_driverinit_value_set() call that was unnecessary, but now additionally triggers a WARN_ON. Patches #2-#4 are non-functional preparations for the following patches. Patch #5 fixes a use-after-free that is triggered while changing network namespaces. Patch #6 makes mlxsw consistent with netdevsim by having mlxsw register its devlink instance before its sub-objects. It helps us avoid a warning described in the commit message. ==================== Link: https://lore.kernel.org/r/cover.1675692666.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Recent changes made it possible to register the devlink instance before its sub-objects and under the instance lock. Among other things, it allows us to avoid warnings such as this one [1]. The warning is generated because a buggy firmware is generating a health event during driver initialization, before the devlink instance is registered. Move the registration of the devlink instance to the beginning of the initialization flow to avoid such problems. A similar change was implemented in netdevsim in commit 82a3aef2 ("netdevsim: move devlink registration under the instance lock"). [1] WARNING: CPU: 3 PID: 49 at net/devlink/leftover.c:7509 devlink_recover_notify.constprop.0+0xaf/0xc0 [...] Call Trace: <TASK> devlink_health_report+0x45/0x1d0 mlxsw_core_health_event_work+0x24/0x30 [mlxsw_core] process_one_work+0x1db/0x390 worker_thread+0x49/0x3b0 kthread+0xe5/0x110 ret_from_fork+0x1f/0x30 </TASK> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Cited commit added 'DEVLINK_CMD_PARAM_DEL' notifications whenever the network namespace of the devlink instance is changed. Specifically, the notifications are generated after calling reload_down(), but before calling reload_up(). At this stage, the data structures accessed while reading the value of the "acl_region_rehash_interval" devlink parameter are uninitialized, resulting in a use-after-free [1]. Fix by moving the registration and unregistration of the devlink parameter to the TCAM code where it is actually used. This means that the parameter is unregistered during reload_down() and then re-registered during reload_up(), avoiding the use-after-free between these two operations. Reproducer: # ip netns add test123 # devlink dev reload pci/0000:06:00.0 netns test123 [1] BUG: KASAN: use-after-free in mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0 Read of size 4 at addr ffff888162fd37d8 by task devlink/1323 [...] Call Trace: <TASK> dump_stack_lvl+0x95/0xbd print_report+0x181/0x4a1 kasan_report+0xdb/0x200 mlxsw_sp_acl_tcam_vregion_rehash_intrvl_get+0xb2/0xd0 mlxsw_sp_params_acl_region_rehash_intrvl_get+0x32/0x80 devlink_nl_param_fill.constprop.0+0x29a/0x11e0 devlink_param_notify.constprop.0+0xb9/0x250 devlink_notify_unregister+0xbc/0x470 devlink_reload+0x1aa/0x440 devlink_nl_cmd_reload+0x559/0x11b0 genl_family_rcv_msg_doit.isra.0+0x1f8/0x2e0 genl_rcv_msg+0x558/0x7f0 netlink_rcv_skb+0x170/0x440 genl_rcv+0x2d/0x40 netlink_unicast+0x53f/0x810 netlink_sendmsg+0x961/0xe80 __sys_sendto+0x2a4/0x420 __x64_sys_sendto+0xe5/0x1c0 do_syscall_64+0x38/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 7d7e9169 ("devlink: move devlink reload notifications back in between _down() and _up() calls") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Move the initialization and de-initialization code further below in order to avoid forward declarations in the next patch. No functional changes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Move mutex_destroy() to the end to make the function symmetric with mlxsw_sp_acl_tcam_init(). No functional changes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Pair mutex_init() with a mutex_destroy() in the error path. Found during code review. No functional changes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Danielle Ratson authored
The "acl_region_rehash_interval" devlink parameter is a "runtime" parameter, making the call to devl_param_driverinit_value_set() pointless. Before cited commit the function simply returned an error (that was not checked), but now it emits a WARNING [1]. Fix by removing the function call. [1] WARNING: CPU: 0 PID: 7 at net/devlink/leftover.c:10974 devl_param_driverinit_value_set+0x8c/0x90 [...] Call Trace: <TASK> mlxsw_sp2_params_register+0x83/0xb0 [mlxsw_spectrum] __mlxsw_core_bus_device_register+0x5e5/0x990 [mlxsw_core] mlxsw_core_bus_device_register+0x42/0x60 [mlxsw_core] mlxsw_pci_probe+0x1f0/0x230 [mlxsw_pci] local_pci_probe+0x1a/0x40 work_for_cpu_fn+0xf/0x20 process_one_work+0x1db/0x390 worker_thread+0x1d5/0x3b0 kthread+0xe5/0x110 ret_from_fork+0x1f/0x30 </TASK> Fixes: 85fe0b32 ("devlink: make devlink_param_driverinit_value_set() return void") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Add PF driver support for the following: - Viewing the standardized MAC Merge layer counters. - Viewing the standardized Ethernet MAC and RMON counters associated with the pMAC. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230206094531.444988-2-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Add PF driver support for viewing and changing the MAC Merge sublayer parameters, and seeing the verification state machine's current state. The verification handshake with the link partner is driven by hardware. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230206094531.444988-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Yury Norov says: ==================== sched: cpumask: improve on cpumask_local_spread() locality cpumask_local_spread() currently checks local node for presence of i'th CPU, and then if it finds nothing makes a flat search among all non-local CPUs. We can do it better by checking CPUs per NUMA hops. This has significant performance implications on NUMA machines, for example when using NUMA-aware allocated memory together with NUMA-aware IRQ affinity hints. Performance tests from patch 8 of this series for mellanox network driver show: TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on). Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121 +-------------------------+-----------+------------------+------------------+ | | BW (Gbps) | TX side CPU util | RX side CPU util | +-------------------------+-----------+------------------+------------------+ | Baseline | 52.3 | 6.4 % | 17.9 % | +-------------------------+-----------+------------------+------------------+ | Applied on TX side only | 52.6 | 5.2 % | 18.5 % | +-------------------------+-----------+------------------+------------------+ | Applied on RX side only | 94.9 | 11.9 % | 27.2 % | +-------------------------+-----------+------------------+------------------+ | Applied on both sides | 95.1 | 8.4 % | 27.3 % | +-------------------------+-----------+------------------+------------------+ Bottleneck in RX side is released, reached linerate (~1.8x speedup). ~30% less cpu util on TX. ==================== Link: https://lore.kernel.org/r/20230121042436.2661843-1-yury.norov@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-