- 18 Nov, 2021 7 commits
-
-
Yunsheng Lin authored
This reverts commit d00e60ee. As reported by Guillaume in [1]: Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT in 32-bit systems, which breaks the bootup proceess when a ethernet driver is using page pool with PP_FLAG_DMA_MAP flag. As we were hoping we had no active consumers for such system when we removed the dma mapping support, and LPAE seems like a common feature for 32 bits system, so revert it. 1. https://www.spinics.net/lists/netdev/msg779890.html Fixes: d00e60ee ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reported-by: "kernelci.org bot" <bot@kernelci.org> Tested-by: "kernelci.org bot" <bot@kernelci.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Teng Qi authored
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port() The if statement: if (port >= DSAF_GE_NUM) return; limits the value of port less than DSAF_GE_NUM (i.e., 8). However, if the value of port is 6 or 7, an array overflow could occur: port_rst_off = dsaf_dev->mac_cb[port]->port_rst_off; because the length of dsaf_dev->mac_cb is DSAF_MAX_PORT_NUM (i.e., 6). To fix this possible array overflow, we first check port and if it is greater than or equal to DSAF_MAX_PORT_NUM, the function returns. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Teng Qi <starmiku1207184332@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Dan Carpenter authored
The user supplies the "count" value to say how big its read buffer is. The rvu_dbg_lmtst_map_table_display() function does not take the "count" into account but instead just copies the whole table, potentially corrupting the user's data. Introduce the "ret" variable to store how many bytes we can copy. Also I changed the type of "off" to size_t to make using min() simpler. Fixes: 0daa55d0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20211117073454.GD5237@kiliSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lin Ma authored
There are two sites that calls queue_work() after the destroy_workqueue() and lead to possible UAF. The first site is nci_send_cmd(), which can happen after the nci_close_device as below nfcmrvl_nci_unregister_dev | nfc_genl_dev_up nci_close_device | flush_workqueue | del_timer_sync | nci_unregister_device | nfc_get_device destroy_workqueue | nfc_dev_up nfc_unregister_device | nci_dev_up device_del | nci_open_device | __nci_request | nci_send_cmd | queue_work !!! Another site is nci_cmd_timer, awaked by the nci_cmd_work from the nci_send_cmd. ... | ... nci_unregister_device | queue_work destroy_workqueue | nfc_unregister_device | ... device_del | nci_cmd_work | mod_timer | ... | nci_cmd_timer | queue_work !!! For the above two UAF, the root cause is that the nfc_dev_up can race between the nci_unregister_device routine. Therefore, this patch introduce NCI_UNREG flag to easily eliminate the possible race. In addition, the mutex_lock in nci_close_device can act as a barrier. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 6a2968aa ("NFC: basic NCI protocol implementation") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211116152732.19238-1-linma@zju.edu.cnSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lin Ma authored
There is a potential UAF between the unregistration routine and the NFC netlink operations. The race that cause that UAF can be shown as below: (FREE) | (USE) nfcmrvl_nci_unregister_dev | nfc_genl_dev_up nci_close_device | nci_unregister_device | nfc_get_device nfc_unregister_device | nfc_dev_up rfkill_destory | device_del | rfkill_blocked ... | ... The root cause for this race is concluded below: 1. The rfkill_blocked (USE) in nfc_dev_up is supposed to be placed after the device_is_registered check. 2. Since the netlink operations are possible just after the device_add in nfc_register_device, the nfc_dev_up() can happen anywhere during the rfkill creation process, which leads to data race. This patch reorder these actions to permit 1. Once device_del is finished, the nfc_dev_up cannot dereference the rfkill object. 2. The rfkill_register need to be placed after the device_add of nfc_dev because the parent device need to be created first. So this patch keeps the order but inject device_lock to prevent the data race. Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: be055b2f ("NFC: RFKILL support") Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211116152652.19217-1-linma@zju.edu.cnSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lin Ma authored
There is a possible data race as shown below: thread-A in nci_request() | thread-B in nci_close_device() | mutex_lock(&ndev->req_lock); test_bit(NCI_UP, &ndev->flags); | ... | test_and_clear_bit(NCI_UP, &ndev->flags) mutex_lock(&ndev->req_lock); | | This race will allow __nci_request() to be awaked while the device is getting removed. Similar to commit e2cb6b89 ("bluetooth: eliminate the potential race condition when removing the HCI controller"). this patch alters the function sequence in nci_request() to prevent the data races between the nci_close_device(). Signed-off-by: Lin Ma <linma@zju.edu.cn> Fixes: 6a2968aa ("NFC: basic NCI protocol implementation") Link: https://lore.kernel.org/r/20211115145600.8320-1-linma@zju.edu.cnSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tadeusz Struk authored
kmemdup can return a null pointer so need to check for it, otherwise the null key will be dereferenced later in tipc_crypto_key_xmit as can be seen in the trace [1]. Cc: tipc-discussion@lists.sourceforge.net Cc: stable@vger.kernel.org # 5.15, 5.14, 5.10 [1] https://syzkaller.appspot.com/bug?id=bca180abb29567b189efdbdb34cbf7ba851c2a58Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20211115160143.5099-1-tadeusz.struk@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 17 Nov, 2021 15 commits
-
-
Łukasz Stelmach authored
Change the values of EVENT_* constants from bit masks to bit numbers as accepted by {clear,set,test}_bit() functions. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jonathan Davies authored
virtio_net_hdr_to_skb does not set the skb's gso_size and gso_type correctly for UFO packets received via virtio-net that are a little over the GSO size. This can lead to problems elsewhere in the networking stack, e.g. ovs_vport_send dropping over-sized packets if gso_size is not set. This is due to the comparison if (skb->len - p_off > gso_size) not properly accounting for the transport layer header. p_off includes the size of the transport layer header (thlen), so skb->len - p_off is the size of the TCP/UDP payload. gso_size is read from the virtio-net header. For UFO, fragmentation happens at the IP level so does not need to include the UDP header. Hence the calculation could be comparing a TCP/UDP payload length with an IP payload length, causing legitimate virtio-net packets to have lack gso_type/gso_size information. Example: a UDP packet with payload size 1473 has IP payload size 1481. If the guest used UFO, it is not fragmented and the virtio-net header's flags indicate that it is a GSO frame (VIRTIO_NET_HDR_GSO_UDP), with gso_size = 1480 for an MTU of 1500. skb->len will be 1515 and p_off will be 42, so skb->len - p_off = 1473. Hence the comparison fails, and shinfo->gso_size and gso_type are not set as they should be. Instead, add the UDP header length before comparing to gso_size when using UFO. In this way, it is the size of the IP payload that is compared to gso_size. Fixes: 6dd912f8 ("net: check untrusted gso_size at kernel entry") Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pavel Skripkin authored
Access to netdev after free_netdev() will cause use-after-free bug. Move debug log before free_netdev() call to avoid it. Fixes: 7472dd9f ("staging: fsl-dpaa2/eth: Move print message") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Aaron Ma authored
Like ThinkaPad Thunderbolt 4 Dock, more Lenovo docks start to use the original Realtek USB ethernet chip ID 0bda:8153. Lenovo Docks always use their own IDs for usb hub, even for older Docks. If parent hub is from Lenovo, then r8152 should try MAC passthrough. Verified on Lenovo TBT3 dock too. Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller authored
Saeed Mahameed says: ==================== mlx5-fixes-2021-11-16 Please pull this mlx5 fixes series, or let me know in case of any problem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Gleixner authored
The recent addition of timestamp correction to compensate the CDC error introduced a subtle signed/unsigned bug in stmmac_get_tx_hwtstamp() while it managed for some obscure reason to avoid that in stmmac_get_rx_hwtstamp(). The issue is: s64 adjust = 0; u64 ns; adjust += -(2 * (NSEC_PER_SEC / priv->plat->clk_ptp_rate)); ns += adjust; works by chance on 64bit, but falls apart on 32bit because the compiler knows that adjust fits into 32bit and then treats the addition as a u64 + u32 resulting in an off by ~2 seconds failure. The RX variant uses an u64 for adjust and does the adjustment via ns -= adjust; because consistency is obviously overrated. Get rid of the pointless zero initialized adjust variable and do: ns -= (2 * NSEC_PER_SEC) / priv->plat->clk_ptp_rate; which is obviously correct and spares the adjust obfuscation. Aside of that it yields a more accurate result because the multiplication takes place before the integer divide truncation and not afterwards. Stick the calculation into an inline so it can't be accidentally disimproved. Return an u32 from that inline as the result is guaranteed to fit which lets the compiler optimize the substraction. Cc: stable@vger.kernel.org Fixes: 3600be5f ("net: stmmac: add timestamp correction to rid CDC sync error") Reported-by: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Benedikt Spranger <b.spranger@linutronix.de> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> # Intel EHL Link: https://lore.kernel.org/r/87mtm578cs.ffs@tglxSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Xin Long says: ==================== net: fix the mirred packet drop due to the incorrect dst This issue was found when using OVS HWOL on OVN-k8s. These packets dropped on rx path were seen with output dst, which should've been dropped from the skbs when redirecting them. The 1st patch is to the fix and the 2nd is a selftest to reproduce and verify it. ==================== Link: https://lore.kernel.org/r/cover.1636734751.git.lucien.xin@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Davide Caratti authored
add a selftest that verifies the correct behavior of TC act_mirred egress to ingress: in particular, it checks if the dst_entry is removed from skb before redirect egress -> ingress. The correct behavior is: an ICMP 'echo request' generated by ping will be received and generate a reply the same way as the one generated by mausezahn. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Xin Long authored
Without dropping dst, the packets sent from local mirred/redirected to ingress will may still use the old dst. ip_rcv() will drop it as the old dst is for output and its .input is dst_discard. This patch is to fix by also dropping dst for those packets that are mirred or redirected from egress to ingress in act_mirred. Note that we don't drop it for the direction change from ingress to egress, as on which there might be a user case attaching a metadata dst by act_tunnel_key that would be used later. Fixes: b57dc7c1 ("net/sched: Introduce action ct") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Cong Wang <cong.wang@bytedance.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Taehee Yoo authored
When the amt module is being removed, it calls cancel_delayed_work() to cancel pending delayed_work. But this function doesn't wait for canceling delayed_work. So, workers can be still doing after module delete. In order to avoid this, cancel_delayed_work_sync() should be used instead. Suggested-by: Jakub Kicinski <kuba@kernel.org> Fixes: bc54e49c ("amt: add multicast(IGMP) report message handler") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20211116160923.25258-1-ap420073@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Pavel Skripkin authored
I've sent a patch to GR-everest-linux-l2@marvell.com few days ago and got a reply from postmaster@marvell.com: Delivery has failed to these recipients or groups: gr-everest-linux-l2@marvell.com<mailto:gr-everest-linux-l2@marvell.com> The email address you entered couldn't be found. Please check the recipient's email address and try to resend the message. If the problem continues, please contact your helpdesk. As requested by Alok Prasad, replacing GR-everest-linux-l2@marvell.com with Manish Chopra's email address. [0] Link: https://lore.kernel.org/all/20211116081601.11208-1-palok@marvell.com/ [0] Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20211116141303.32180-1-paskripkin@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Michael Chan authored
bp->sriov_cfg is not defined when CONFIG_BNXT_SRIOV is not set. Fix it by adding a helper function bnxt_sriov_cfg() to handle the logic with or without the config option. Fixes: 46d08f55 ("bnxt_en: extend RTNL to VF check in devlink driver_reinit") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/1637090770-22835-1-git-send-email-michael.chan@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Marcin Wojtas authored
The kernel test robot reported a following issue: >> drivers/net/ethernet/marvell/mvmdio.c:426:36: warning: unused variable 'orion_mdio_acpi_match' [-Wunused-const-variable] static const struct acpi_device_id orion_mdio_acpi_match[] = { ^ 1 warning generated. Fix that by surrounding the variable by appropriate ifdef. Fixes: c54da4c1 ("net: mvmdio: add ACPI support") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Marcin Wojtas <mw@semihalf.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20211115153024.209083-1-mw@semihalf.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Merge tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Couple of fixes: * bad dont-reorder check * throughput LED trigger for various new(ish) paths * radiotap header generation * locking assertions in mac80211 with monitor mode * radio statistics * don't try to access IV when not present * call stop_ap for P2P_GO as well as we should * tag 'mac80211-for-net-2021-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: fix throughput LED trigger mac80211: fix monitor_sdata RCU/locking assertions mac80211: drop check for DONT_REORDER in __ieee80211_select_queue mac80211: fix radiotap header generation mac80211: do not access the IV when it was stripped nl80211: fix radio statistics in survey dump cfg80211: call cfg80211_stop_ap when switch from P2P_GO type ==================== Link: https://lore.kernel.org/r/20211116160845.157214-1-johannes@sipsolutions.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf 2021-11-16 We've added 12 non-merge commits during the last 5 day(s) which contain a total of 23 files changed, 573 insertions(+), 73 deletions(-). The main changes are: 1) Fix pruning regression where verifier went overly conservative rejecting previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer. 2) Fix verifier TOCTOU bug when using read-only map's values as constant scalars during verification, from Daniel Borkmann. 3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson. 4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker. 5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing programs due to deadlock possibilities, from Dmitrii Banshchikov. 6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang. 7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin. 8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: udp: Validate checksum in udp_read_sock() bpf: Fix toctou on read-only map's constant scalar tracking samples/bpf: Fix build error due to -isystem removal selftests/bpf: Add tests for restricted helpers bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs libbpf: Perform map fd cleanup for gen_loader in case of error samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu tools/runqslower: Fix cross-build samples/bpf: Fix summary per-sec stats in xdp_sample_user selftests/bpf: Check map in map pruning bpf: Fix inner map state pruning regression. xsk: Fix crash on double free in buffer pool ==================== Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 16 Nov, 2021 18 commits
-
-
Raed Salem authored
On regular ConnectX HCAs getting encap mode isn't supported when the E-Switch is in NONE mode. Current code would return no error code when trying to get encap mode in such case which is wrong. Fix by returning error value to indicate failure to caller in such case. Fixes: 8e0aa4bc ("net/mlx5: E-switch, Protect eswitch mode changes") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Maher Sanalla authored
Currently, In NETDEV_CHANGELOWERSTATE/NETDEV_CHANGEUPPERSTATE events handling, tracking is not fully completed if the LAG device is not ready at the time the events occur. But, we must keep track of the upper and lower states after receiving the events because RoCE needs this info in mlx5_lag_get_roce_netdev() - in order to return the corresponding port that its running on. Returning the wrong (not most recent) port will lead to gids table being incorrect. For example: If during the attachment of a slave to the bond, the other non-attached port performs pci_reload, then the LAG device is not ready, but that should not result in dismissing attached slave tracker update automatically (which is performed in mlx5_handle_changelowerstate()), Since these events might not come later, which can lead to both bond ports having tx_enabled=0 - which is not a valid state of LAG bond. Fixes: 9b412cc3 ("net/mlx5e: Add LAG warning if bond slave is not lag master") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Roi Dayan authored
CT clear action offload adds additional mod hdr actions to the flow's original mod actions in order to clear the registers which hold ct_state. When such flow also includes encap action, a neigh update event can cause the driver to unoffload the flow and then reoffload it. Each time this happens, the ct clear handling adds that same set of mod hdr actions to reset ct_state until the max of mod hdr actions is reached. Also the driver never releases the allocated mod hdr actions and causing a memleak. Fix above two issues by moving CT clear mod acts allocation into the parsing actions phase and only use it when offloading the rule. The release of mod acts will be done in the normal flow_put(). backtrace: [<000000007316e2f3>] krealloc+0x83/0xd0 [<00000000ef157de1>] mlx5e_mod_hdr_alloc+0x147/0x300 [mlx5_core] [<00000000970ce4ae>] mlx5e_tc_match_to_reg_set_and_get_id+0xd7/0x240 [mlx5_core] [<0000000067c5fa17>] mlx5e_tc_match_to_reg_set+0xa/0x20 [mlx5_core] [<00000000d032eb98>] mlx5_tc_ct_entry_set_registers.isra.0+0x36/0xc0 [mlx5_core] [<00000000fd23b869>] mlx5_tc_ct_flow_offload+0x272/0x1f10 [mlx5_core] [<000000004fc24acc>] mlx5e_tc_offload_fdb_rules.part.0+0x150/0x620 [mlx5_core] [<00000000dc741c17>] mlx5e_tc_encap_flows_add+0x489/0x690 [mlx5_core] [<00000000e92e49d7>] mlx5e_rep_update_flows+0x6e4/0x9b0 [mlx5_core] [<00000000f60f5602>] mlx5e_rep_neigh_update+0x39a/0x5d0 [mlx5_core] Fixes: 1ef3018f ("net/mlx5e: CT: Support clear action") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Avihai Horon authored
When doing a flow counters bulk query, the number of counters to query must be aligned to 4. Current SF bulk query len is not aligned to 4, which leads to an error when trying to query more than 4 counters. Fix it by aligning SF bulk query len to 4. Fixes: 2fdeb4f4 ("net/mlx5: Reduce flow counters bulk query buffer size for SFs") Signed-off-by: Avihai Horon <avihaih@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Mark Bloch authored
A user can enable VFs without changing E-Switch mode, this can happen when a user moves straight to switchdev mode and only once in switchdev VFs are enabled via the sysfs interface. The cited commit assumed this isn't possible and exposed a single API function where the E-switch calls into the lag code, breaks the lag and prevents any other lag operations to take place until the E-switch update has ended. Breaking the hardware lag when it isn't needed can make it such that hardware lag can't be enabled again. In the sysfs call path check if the current E-Switch mode is NONE, in the context of the function it can only mean the E-Switch is moving out of NONE mode and the hardware lag should be disabled and enabled once the mode change has ended. If the mode isn't NONE it means VFs are about to be enabled and such operation doesn't require toggling the hardware lag. Fixes: cac1eb2c ("net/mlx5: Lag, properly lock eswitch if needed") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Neta Ostrovsky authored
In the fast unload flow, the device state is set to internal error, which indicates that the driver started the destroy process. In this case, when a destroy command is being executed, it should return MLX5_CMD_STAT_OK. Fix MLX5_CMD_OP_DESTROY_UCTX and MLX5_CMD_OP_DESTROY_UMEM to return OK instead of EIO. This fixes a call trace in the umem release process - [ 2633.536695] Call Trace: [ 2633.537518] ib_uverbs_remove_one+0xc3/0x140 [ib_uverbs] [ 2633.538596] remove_client_context+0x8b/0xd0 [ib_core] [ 2633.539641] disable_device+0x8c/0x130 [ib_core] [ 2633.540615] __ib_unregister_device+0x35/0xa0 [ib_core] [ 2633.541640] ib_unregister_device+0x21/0x30 [ib_core] [ 2633.542663] __mlx5_ib_remove+0x38/0x90 [mlx5_ib] [ 2633.543640] auxiliary_bus_remove+0x1e/0x30 [auxiliary] [ 2633.544661] device_release_driver_internal+0x103/0x1f0 [ 2633.545679] bus_remove_device+0xf7/0x170 [ 2633.546640] device_del+0x181/0x410 [ 2633.547606] mlx5_rescan_drivers_locked.part.10+0x63/0x160 [mlx5_core] [ 2633.548777] mlx5_unregister_device+0x27/0x40 [mlx5_core] [ 2633.549841] mlx5_uninit_one+0x21/0xc0 [mlx5_core] [ 2633.550864] remove_one+0x69/0xe0 [mlx5_core] [ 2633.551819] pci_device_remove+0x3b/0xc0 [ 2633.552731] device_release_driver_internal+0x103/0x1f0 [ 2633.553746] unbind_store+0xf6/0x130 [ 2633.554657] kernfs_fop_write+0x116/0x190 [ 2633.555567] vfs_write+0xa5/0x1a0 [ 2633.556407] ksys_write+0x4f/0xb0 [ 2633.557233] do_syscall_64+0x5b/0x1a0 [ 2633.558071] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 2633.559018] RIP: 0033:0x7f9977132648 [ 2633.559821] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 55 6f 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55 [ 2633.562332] RSP: 002b:00007fffb1a83888 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 2633.563472] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f9977132648 [ 2633.564541] RDX: 000000000000000c RSI: 000055b90546e230 RDI: 0000000000000001 [ 2633.565596] RBP: 000055b90546e230 R08: 00007f9977406860 R09: 00007f9977a54740 [ 2633.566653] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f99774056e0 [ 2633.567692] R13: 000000000000000c R14: 00007f9977400880 R15: 000000000000000c [ 2633.568725] ---[ end trace 10b4fe52945e544d ]--- Fixes: 6a6fabbf ("net/mlx5: Update pci error handler entries and command translation") Signed-off-by: Neta Ostrovsky <netao@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Yevgeny Kliteynik authored
The existing loop doesn't cast the buffer while scanning it, which results in out-of-bounds read and failure to create the matcher. Fixes: 941f1979 ("net/mlx5: DR, Add check for unsupported fields in match param") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Yevgeny Kliteynik authored
When querying eswitch manager vport capabilities as "other = 1", we encounter a FW compatibility issue with older FW versions. To maintain backward compatibility, eswitch manager vport should be queried as "other = 0" vport both for ECPF and non-ECPF cases. This patch fixes these queries and improves the code readability by handling eswitch manager and uplink vports separately, avoiding the excessive 'if' conditions. Also, uplink caps are stored similar to esw manager and not as part of xarray. Fixes: dd4acb2a ("net/mlx5: DR, Add missing query for vport 0") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Valentine Fatiev authored
Prior to this patch in case mlx5_core_destroy_cq() failed it proceeds to rest of destroy operations. mlx5_core_destroy_cq() could be called again by user and cause additional call of mlx5_debug_cq_remove(). cq->dbg was not nullify in previous call and cause the crash. Fix it by nullify cq->dbg pointer after removal. Also proceed to destroy operations only if FW return 0 for MLX5_CMD_OP_DESTROY_CQ command. general protection fault, probably for non-canonical address 0x2000300004058: 0000 [#1] SMP PTI CPU: 5 PID: 1228 Comm: python Not tainted 5.15.0-rc5_for_upstream_min_debug_2021_10_14_11_06 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:lockref_get+0x1/0x60 Code: 5d e9 53 ff ff ff 48 8d 7f 70 e8 0a 2e 48 00 c7 85 d0 00 00 00 02 00 00 00 c6 45 70 00 fb 5d c3 c3 cc cc cc cc cc cc cc cc 53 <48> 8b 17 48 89 fb 85 d2 75 3d 48 89 d0 bf 64 00 00 00 48 89 c1 48 RSP: 0018:ffff888137dd7a38 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff888107d5f458 RCX: 00000000fffffffe RDX: 000000000002c2b0 RSI: ffffffff8155e2e0 RDI: 0002000300004058 RBP: ffff888137dd7a88 R08: 0002000300004058 R09: ffff8881144a9f88 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881141d4000 R13: ffff888137dd7c68 R14: ffff888137dd7d58 R15: ffff888137dd7cc0 FS: 00007f4644f2a4c0(0000) GS:ffff8887a2d40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b4500f4380 CR3: 0000000114f7a003 CR4: 0000000000170ea0 Call Trace: simple_recursive_removal+0x33/0x2e0 ? debugfs_remove+0x60/0x60 debugfs_remove+0x40/0x60 mlx5_debug_cq_remove+0x32/0x70 [mlx5_core] mlx5_core_destroy_cq+0x41/0x1d0 [mlx5_core] devx_obj_cleanup+0x151/0x330 [mlx5_ib] ? __pollwait+0xd0/0xd0 ? xas_load+0x5/0x70 ? xa_load+0x62/0xa0 destroy_hw_idr_uobject+0x20/0x80 [ib_uverbs] uverbs_destroy_uobject+0x3b/0x360 [ib_uverbs] uobj_destroy+0x54/0xa0 [ib_uverbs] ib_uverbs_cmd_verbs+0xaf2/0x1160 [ib_uverbs] ? uverbs_finalize_object+0xd0/0xd0 [ib_uverbs] ib_uverbs_ioctl+0xc4/0x1b0 [ib_uverbs] __x64_sys_ioctl+0x3e4/0x8e0 Fixes: 94b960b9 ("net/mlx5e: Fix memory leak in mlx5_core_destroy_cq() error path") Signed-off-by: Valentine Fatiev <valentinef@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Paul Blakey authored
E-Switch encap mode is relevant only when in switchdev mode. The RDMA driver can query the encap configuration via mlx5_eswitch_get_encap_mode(). Make sure it returns the currently used mode and not the set one. This reverts the cited commit which reset the encap mode on entering switchdev and fixes the original issue properly. Fixes: 9a64144d ("net/mlx5: E-Switch, Fix default encap mode") Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Vlad Buslov authored
Function mlx5e_take_tmp_flow() skips flows with zero reference count. This can cause syndrome 0x179e84 when the called from neigh or route update code and the skipped flow is not removed from the hardware by the time underlying encap/decap resource is deleted. Add new completion 'del_hw_done' that is completed when flow is unoffloaded. This is safe to do because flow with reference count zero needs to be detached from encap/decap entry before its memory is deallocated, which requires taking the encap_tbl_lock mutex that is held by the event handlers code. Fixes: 8914add2 ("net/mlx5e: Handle FIB events to update tunnel endpoint device") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Tariq Toukan authored
For the TLS RX resync flow, we maintain a list of TLS contexts that require some attention, to communicate their resync information to the HW. Here we fix list corruptions, by protecting the entries against movements coming from resync_handle_seq_match(), until their resync handling in napi is fully completed. Fixes: e9ce991b ("net/mlx5e: kTLS, Add resiliency to RX resync failures") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queueDavid S. Miller authored
Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2021-11-15 This series contains updates to iavf driver only. Mateusz adds a wait for reset completion when changing queue count which could otherwise cause issues with VF reset. Nick adds a null check for vf_res in iavf_fix_features(), corrects ordering of function calls to resolve dependency issues, and prevents possible freeing of a lock which isn't being held. Piotr fixes logic that did not allow setting all multicast mode without promiscuous mode. Jake prevents possible accidental freeing of filter structure. Mitch adds null checks for key and indir parameters in iavf_get_rxfh(). Surabhi adds an additional check that would, previously, cause the driver to print a false error due to values obtained while the VF is in reset. Grzegorz prevents a queue request of 0 which would cause queue count to reset to default values. Akeem restores VLAN filters when bringing the interface back up. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
It turns out the skb's in sock receive queue could have bad checksums, as both ->poll() and ->recvmsg() validate checksums. We have to do the same for ->read_sock() path too before they are redirected in sockmap. Fixes: d7f57118 ("udp: Implement ->read_sock() for sockmap") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211115044006.26068-1-xiyou.wangcong@gmail.com
-
Daniel Borkmann authored
Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is checking whether maps are read-only both from BPF program side and user space side, and then, given their content is constant, reading out their data via map->ops->map_direct_value_addr() which is then subsequently used as known scalar value for the register, that is, it is marked as __mark_reg_known() with the read value at verification time. Before a23740ec, the register content was marked as an unknown scalar so the verifier could not make any assumptions about the map content. The current implementation however is prone to a TOCTOU race, meaning, the value read as known scalar for the register is not guaranteed to be exactly the same at a later point when the program is executed, and as such, the prior made assumptions of the verifier with regards to the program will be invalid which can cause issues such as OOB access, etc. While the BPF_F_RDONLY_PROG map flag is always fixed and required to be specified at map creation time, the map->frozen property is initially set to false for the map given the map value needs to be populated, e.g. for global data sections. Once complete, the loader "freezes" the map from user space such that no subsequent updates/deletes are possible anymore. For the rest of the lifetime of the map, this freeze one-time trigger cannot be undone anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_* cmd calls which would update/delete map entries will be rejected with -EPERM since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also means that pending update/delete map entries must still complete before this guarantee is given. This corner case is not an issue for loaders since they create and prepare such program private map in successive steps. However, a malicious user is able to trigger this TOCTOU race in two different ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is used to expand the competition interval, so that map_update_elem() can modify the contents of the map after map_freeze() and bpf_prog_load() were executed. This works, because userfaultfd halts the parallel thread which triggered a map_update_elem() at the time where we copy key/value from the user buffer and this already passed the FMODE_CAN_WRITE capability test given at that time the map was not "frozen". Then, the main thread performs the map_freeze() and bpf_prog_load(), and once that had completed successfully, the other thread is woken up to complete the pending map_update_elem() which then changes the map content. For ii) the idea of the batched update is similar, meaning, when there are a large number of updates to be processed, it can increase the competition interval between the two. It is therefore possible in practice to modify the contents of the map after executing map_freeze() and bpf_prog_load(). One way to fix both i) and ii) at the same time is to expand the use of the map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with the rationale to make a writable mmap()'ing of a map mutually exclusive with read-only freezing. The counter indicates writable mmap() mappings and then prevents/fails the freeze operation. Its semantics can be expanded beyond just mmap() by generally indicating ongoing write phases. This would essentially span any parallel regular and batched flavor of update/delete operation and then also have map_freeze() fail with -EBUSY. For the check_mem_access() in the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all last pending writes have completed via bpf_map_write_active() test. Once the map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0 only then we are really guaranteed to use the map's data as known constants. For map->frozen being set and pending writes in process of still being completed we fall back to marking that register as unknown scalar so we don't end up making assumptions about it. With this, both TOCTOU reproducers from i) and ii) are fixed. Note that the map->writecnt has been converted into a atomic64 in the fix in order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning the freeze_mutex over entire map update/delete operations in syscall side would not be possible due to then causing everything to be serialized. Similarly, something like synchronize_rcu() after setting map->frozen to wait for update/deletes to complete is not possible either since it would also have to span the user copy which can sleep. On the libbpf side, this won't break d66562fb ("libbpf: Add BPF object skeleton support") as the anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed mmap()-ed memory where for .rodata it's non-writable. Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars") Reported-by: w1tcher.bupt@gmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexander Lobakin authored
Since recent Kbuild updates we no longer include files from compiler directories. However, samples/bpf/hbm_kern.h hasn't been tuned for this (LLVM 13): CLANG-bpf samples/bpf/hbm_out_kern.o In file included from samples/bpf/hbm_out_kern.c:55: samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found ^~~~~~~~~~ 1 error generated. CLANG-bpf samples/bpf/hbm_edt_kern.o In file included from samples/bpf/hbm_edt_kern.c:53: samples/bpf/hbm_kern.h:12:10: fatal error: 'stddef.h' file not found ^~~~~~~~~~ 1 error generated. It is enough to just drop both stdbool.h and stddef.h from includes to fix those. Fixes: 04e85bbf ("isystem: delete global -isystem compile option") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://lore.kernel.org/bpf/20211115130741.3584-1-alexandr.lobakin@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Dmitrii Banshchikov says: ==================== Various locking issues are possible with bpf_ktime_get_coarse_ns() and bpf_timer_* set of helpers. syzbot found a locking issue with bpf_ktime_get_coarse_ns() helper executed in BPF_PROG_TYPE_PERF_EVENT prog type - [1]. The issue is possible because the helper uses non fast version of time accessor that isn't safe for any context. The helper was added because it provided performance benefits in comparison to bpf_ktime_get_ns() helper. A similar locking issue is possible with bpf_timer_* set of helpers when used in tracing progs. The solution is to restrict use of the helpers in tracing progs. In the [1] discussion it was stated that bpf_spin_lock related helpers shall also be excluded for tracing progs. The verifier has a compatibility check between a map and a program. If a tracing program tries to use a map which value has struct bpf_spin_lock the verifier fails that is why bpf_spin_lock is already restricted. Patch 1 restricts helpers Patch 2 adds tests v1 -> v2: * Limit the helpers via func proto getters instead of allowed callback * Add note about helpers' restrictions to linux/bpf.h * Add Fixes tag * Remove extra \0 from btf_str_sec * Beside asm tests add prog tests * Trim CC 1. https://lore.kernel.org/all/00000000000013aebd05cff8e064@google.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dmitrii Banshchikov authored
This patch adds tests that bpf_ktime_get_coarse_ns(), bpf_timer_* and bpf_spin_lock()/bpf_spin_unlock() helpers are forbidden in tracing progs as their use there may result in various locking issues. Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211113142227.566439-3-me@ubique.spb.ru
-