1. 02 Oct, 2020 14 commits
    • Vlad Buslov's avatar
      net/mlx5e: Fix race condition on nhe->n pointer in neigh update · 1253935a
      Vlad Buslov authored
      Current neigh update event handler implementation takes reference to
      neighbour structure, assigns it to nhe->n, tries to schedule workqueue task
      and releases the reference if task was already enqueued. This results
      potentially overwriting existing nhe->n pointer with another neighbour
      instance, which causes double release of the instance (once in neigh update
      handler that failed to enqueue to workqueue and another one in neigh update
      workqueue task that processes updated nhe->n pointer instead of original
      one):
      
      [ 3376.512806] ------------[ cut here ]------------
      [ 3376.513534] refcount_t: underflow; use-after-free.
      [ 3376.521213] Modules linked in: act_skbedit act_mirred act_tunnel_key vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd
       grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_umad ib_ipoib ib_iser rdma_cm ib_cm iw_cm rfkill ib_uverbs ib_core sunrpc kvm_intel kvm iTCO_wdt iTCO_vendor_support virtio_net irqbypass net_failover crc32_pclmul lpc_ich i2c_i801 failover pcspkr i2c_smbus mfd_core ghash_clmulni_intel sch_fq_codel drm i2c
      _core ip_tables crc32c_intel serio_raw [last unloaded: mlxfw]
      [ 3376.529468] CPU: 8 PID: 22756 Comm: kworker/u20:5 Not tainted 5.9.0-rc5+ #6
      [ 3376.530399] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [ 3376.531975] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
      [ 3376.532820] RIP: 0010:refcount_warn_saturate+0xd8/0xe0
      [ 3376.533589] Code: ff 48 c7 c7 e0 b8 27 82 c6 05 0b b6 09 01 01 e8 94 93 c1 ff 0f 0b c3 48 c7 c7 88 b8 27 82 c6 05 f7 b5 09 01 01 e8 7e 93 c1 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
      [ 3376.536017] RSP: 0018:ffffc90002a97e30 EFLAGS: 00010286
      [ 3376.536793] RAX: 0000000000000000 RBX: ffff8882de30d648 RCX: 0000000000000000
      [ 3376.537718] RDX: ffff8882f5c28f20 RSI: ffff8882f5c18e40 RDI: ffff8882f5c18e40
      [ 3376.538654] RBP: ffff8882cdf56c00 R08: 000000000000c580 R09: 0000000000001a4d
      [ 3376.539582] R10: 0000000000000731 R11: ffffc90002a97ccd R12: 0000000000000000
      [ 3376.540519] R13: ffff8882de30d600 R14: ffff8882de30d640 R15: ffff88821e000900
      [ 3376.541444] FS:  0000000000000000(0000) GS:ffff8882f5c00000(0000) knlGS:0000000000000000
      [ 3376.542732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3376.543545] CR2: 0000556e5504b248 CR3: 00000002c6f10005 CR4: 0000000000770ee0
      [ 3376.544483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3376.545419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 3376.546344] PKRU: 55555554
      [ 3376.546911] Call Trace:
      [ 3376.547479]  mlx5e_rep_neigh_update.cold+0x33/0xe2 [mlx5_core]
      [ 3376.548299]  process_one_work+0x1d8/0x390
      [ 3376.548977]  worker_thread+0x4d/0x3e0
      [ 3376.549631]  ? rescuer_thread+0x3e0/0x3e0
      [ 3376.550295]  kthread+0x118/0x130
      [ 3376.550914]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3376.551675]  ret_from_fork+0x1f/0x30
      [ 3376.552312] ---[ end trace d84e8f46d2a77eec ]---
      
      Fix the bug by moving work_struct to dedicated dynamically-allocated
      structure. This enabled every event handler to work on its own private
      neighbour pointer and removes the need for handling the case when task is
      already enqueued.
      
      Fixes: 232c0013 ("net/mlx5e: Add support to neighbour update flow")
      Signed-off-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1253935a
    • Aya Levin's avatar
      net/mlx5e: Fix VLAN create flow · d4a16052
      Aya Levin authored
      When interface is attached while in promiscuous mode and with VLAN
      filtering turned off, both configurations are not respected and VLAN
      filtering is performed.
      There are 2 flows which add the any-vid rules during interface attach:
      VLAN creation table and set rx mode. Each is relaying on the other to
      add any-vid rules, eventually non of them does.
      
      Fix this by adding any-vid rules on VLAN creation regardless of
      promiscuous mode.
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      d4a16052
    • Aya Levin's avatar
      net/mlx5e: Fix VLAN cleanup flow · 8c7353b6
      Aya Levin authored
      Prior to this patch unloading an interface in promiscuous mode with RX
      VLAN filtering feature turned off - resulted in a warning. This is due
      to a wrong condition in the VLAN rules cleanup flow, which left the
      any-vid rules in the VLAN steering table. These rules prevented
      destroying the flow group and the flow table.
      
      The any-vid rules are removed in 2 flows, but none of them remove it in
      case both promiscuous is set and VLAN filtering is off. Fix the issue by
      changing the condition of the VLAN table cleanup flow to clean also in
      case of promiscuous mode.
      
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 20 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 19 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_table:2112:(pid 28729): Flow table 262149 wasn't destroyed, refcount > 1
      ...
      ...
      ------------[ cut here ]------------
      FW pages counter is 11560 after reclaiming all pages
      WARNING: CPU: 1 PID: 28729 at
      drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:660
      mlx5_reclaim_startup_pages+0x178/0x230 [mlx5_core]
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
      rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      Call Trace:
        mlx5_function_teardown+0x2f/0x90 [mlx5_core]
        mlx5_unload_one+0x71/0x110 [mlx5_core]
        remove_one+0x44/0x80 [mlx5_core]
        pci_device_remove+0x3e/0xc0
        device_release_driver_internal+0xfb/0x1c0
        device_release_driver+0x12/0x20
        pci_stop_bus_device+0x68/0x90
        pci_stop_and_remove_bus_device+0x12/0x20
        hv_eject_device_work+0x6f/0x170 [pci_hyperv]
        ? __schedule+0x349/0x790
        process_one_work+0x206/0x400
        worker_thread+0x34/0x3f0
        ? process_one_work+0x400/0x400
        kthread+0x126/0x140
        ? kthread_park+0x90/0x90
        ret_from_fork+0x22/0x30
         ---[ end trace 6283bde8d26170dc ]---
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: default avatarAya Levin <ayal@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      8c7353b6
    • Aya Levin's avatar
      net/mlx5e: Fix return status when setting unsupported FEC mode · 2608a2f8
      Aya Levin authored
      Verify the configured FEC mode is supported by at least a single link
      mode before applying the command. Otherwise fail the command and return
      "Operation not supported".
      Prior to this patch, the command was successful, yet it falsely set all
      link modes to FEC auto mode - like configuring FEC mode to auto. Auto
      mode is the default configuration if a link mode doesn't support the
      configured FEC mode.
      
      Fixes: b5ede32d ("net/mlx5e: Add support for FEC modes based on 50G per lane links")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Reviewed-by: default avatarEran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      2608a2f8
    • Aya Levin's avatar
      net/mlx5e: Fix driver's declaration to support GRE offload · 3d093bc2
      Aya Levin authored
      Declare GRE offload support with respect to the inner protocol. Add a
      list of supported inner protocols on which the driver can offload
      checksum and GSO. For other protocols, inform the stack to do the needed
      operations. There is no noticeable impact on GRE performance.
      
      Fixes: 27299841 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      3d093bc2
    • Maor Dickman's avatar
      net/mlx5e: CT, Fix coverity issue · 2b021989
      Maor Dickman authored
      The cited commit introduced the following coverity issue at function
      mlx5_tc_ct_rule_to_tuple_nat:
      - Memory - corruptions (OVERRUN)
        Overrunning array "tuple->ip.src_v6.in6_u.u6_addr32" of 4 4-byte
        elements at element index 7 (byte offset 31) using index
        "ip6_offset" (which evaluates to 7).
      
      In case of IPv6 destination address rewrite, ip6_offset values are
      between 4 to 7, which will cause memory overrun of array
      "tuple->ip.src_v6.in6_u.u6_addr32" to array
      "tuple->ip.dst_v6.in6_u.u6_addr32".
      
      Fixed by writing the value directly to array
      "tuple->ip.dst_v6.in6_u.u6_addr32" in case ip6_offset values are
      between 4 to 7.
      
      Fixes: bc562be9 ("net/mlx5e: CT: Save ct entries tuples in hashtables")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      2b021989
    • Aya Levin's avatar
      net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU · c3c94023
      Aya Levin authored
      Prior to this fix, in Striding RQ mode the driver was vulnerable when
      receiving packets in the range (stride size - headroom, stride size].
      Where stride size is calculated by mtu+headroom+tailroom aligned to the
      closest power of 2.
      Usually, this filtering is performed by the HW, except for a few cases:
      - Between 2 VFs over the same PF with different MTUs
      - On bluefield, when the host physical function sets a larger MTU than
        the ARM has configured on its representor and uplink representor.
      
      When the HW filtering is not present, packets that are larger than MTU
      might be harmful for the RQ's integrity, in the following impacts:
      1) Overflow from one WQE to the next, causing a memory corruption that
      in most cases is unharmful: as the write happens to the headroom of next
      packet, which will be overwritten by build_skb(). In very rare cases,
      high stress/load, this is harmful. When the next WQE is not yet reposted
      and points to existing SKB head.
      2) Each oversize packet overflows to the headroom of the next WQE. On
      the last WQE of the WQ, where addresses wrap-around, the address of the
      remainder headroom does not belong to the next WQE, but it is out of the
      memory region range. This results in a HW CQE error that moves the RQ
      into an error state.
      
      Solution:
      Add a page buffer at the end of each WQE to absorb the leak. Actually
      the maximal overflow size is headroom but since all memory units must be
      of the same size, we use page size to comply with UMR WQEs. The increase
      in memory consumption is of a single page per RQ. Initialize the mkey
      with all MTTs pointing to a default page. When the channels are
      activated, UMR WQEs will redirect the RX WQEs to the actual memory from
      the RQ's pool, while the overflow MTTs remain mapped to the default page.
      
      Fixes: 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      c3c94023
    • Aya Levin's avatar
      net/mlx5e: Fix error path for RQ alloc · 08a762ce
      Aya Levin authored
      Increase granularity of the error path to avoid unneeded free/release.
      Fix the cleanup to be symmetric to the order of creation.
      
      Fixes: 0ddf5432 ("xdp/mlx5: setup xdp_rxq_info")
      Fixes: 422d4c40 ("net/mlx5e: RX, Split WQ objects for different RQ types")
      Signed-off-by: default avatarAya Levin <ayal@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      08a762ce
    • Maor Gottlieb's avatar
      net/mlx5: Fix request_irqs error flow · 732ebfab
      Maor Gottlieb authored
      Fix error flow handling in request_irqs which try to free irq
      that we failed to request.
      It fixes the below trace.
      
      WARNING: CPU: 1 PID: 7587 at kernel/irq/manage.c:1684 free_irq+0x4d/0x60
      CPU: 1 PID: 7587 Comm: bash Tainted: G        W  OE    4.15.15-1.el7MELLANOXsmp-x86_64 #1
      Hardware name: Advantech SKY-6200/SKY-6200, BIOS F2.00 08/06/2020
      RIP: 0010:free_irq+0x4d/0x60
      RSP: 0018:ffffc9000ef47af0 EFLAGS: 00010282
      RAX: ffff88001476ae00 RBX: 0000000000000655 RCX: 0000000000000000
      RDX: ffff88001476ae00 RSI: ffffc9000ef47ab8 RDI: ffff8800398bb478
      RBP: ffff88001476a838 R08: ffff88001476ae00 R09: 000000000000156d
      R10: 0000000000000000 R11: 0000000000000004 R12: ffff88001476a838
      R13: 0000000000000006 R14: ffff88001476a888 R15: 00000000ffffffe4
      FS:  00007efeadd32740(0000) GS:ffff88047fc40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc9cc010008 CR3: 00000001a2380004 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       mlx5_irq_table_create+0x38d/0x400 [mlx5_core]
       ? atomic_notifier_chain_register+0x50/0x60
       mlx5_load_one+0x7ee/0x1130 [mlx5_core]
       init_one+0x4c9/0x650 [mlx5_core]
       pci_device_probe+0xb8/0x120
       driver_probe_device+0x2a1/0x470
       ? driver_allows_async_probing+0x30/0x30
       bus_for_each_drv+0x54/0x80
       __device_attach+0xa3/0x100
       pci_bus_add_device+0x4a/0x90
       pci_iov_add_virtfn+0x2dc/0x2f0
       pci_enable_sriov+0x32e/0x420
       mlx5_core_sriov_configure+0x61/0x1b0 [mlx5_core]
       ? kstrtoll+0x22/0x70
       num_vf_store+0x4b/0x70 [mlx5_core]
       kernfs_fop_write+0x102/0x180
       __vfs_write+0x26/0x140
       ? rcu_all_qs+0x5/0x80
       ? _cond_resched+0x15/0x30
       ? __sb_start_write+0x41/0x80
       vfs_write+0xad/0x1a0
       SyS_write+0x42/0x90
       do_syscall_64+0x60/0x110
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 24163189 ("net/mlx5: Separate IRQ request/free from EQ life cycle")
      Signed-off-by: default avatarMaor Gottlieb <maorg@nvidia.com>
      Reviewed-by: default avatarEran Ben Elisha <eranbe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      732ebfab
    • Saeed Mahameed's avatar
      net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible · b898ce7b
      Saeed Mahameed authored
      In case of pci is offline reclaim_pages_cmd() will still try to call
      the FW to release FW pages, cmd_exec() in this case will return a silent
      success without actually calling the FW.
      
      This is wrong and will cause page leaks, what we should do is to detect
      pci offline or command interface un-available before tying to access the
      FW and manually release the FW pages in the driver.
      
      In this patch we share the code to check for FW command interface
      availability and we call it in sensitive places e.g. reclaim_pages_cmd().
      
      Alternative fix:
       1. Remove MLX5_CMD_OP_MANAGE_PAGES form mlx5_internal_err_ret_value,
          command success simulation list.
       2. Always Release FW pages even if cmd_exec fails in reclaim_pages_cmd().
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      b898ce7b
    • Eran Ben Elisha's avatar
      net/mlx5: Add retry mechanism to the command entry index allocation · 410bd754
      Eran Ben Elisha authored
      It is possible that new command entry index allocation will temporarily
      fail. The new command holds the semaphore, so it means that a free entry
      should be ready soon. Add one second retry mechanism before returning an
      error.
      
      Patch "net/mlx5: Avoid possible free of command entry while timeout comp
      handler" increase the possibility to bump into this temporarily failure
      as it delays the entry index release for non-callback commands.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      410bd754
    • Eran Ben Elisha's avatar
      net/mlx5: poll cmd EQ in case of command timeout · 1d5558b1
      Eran Ben Elisha authored
      Once driver detects a command interface command timeout, it warns the
      user and returns timeout error to the caller. In such case, the entry of
      the command is not evacuated (because only real event interrupt is allowed
      to clear command interface entry). If the HW event interrupt
      of this entry will never arrive, this entry will be left unused forever.
      Command interface entries are limited and eventually we can end up without
      the ability to post a new command.
      
      In addition, if driver will not consume the EQE of the lost interrupt and
      rearm the EQ, no new interrupts will arrive for other commands.
      
      Add a resiliency mechanism for manually polling the command EQ in case of
      a command timeout. In case resiliency mechanism will find non-handled EQE,
      it will consume it, and the command interface will be fully functional
      again. Once the resiliency flow finished, wait another 5 seconds for the
      command interface to complete for this command entry.
      
      Define mlx5_cmd_eq_recover() to manage the cmd EQ polling resiliency flow.
      Add an async EQ spinlock to avoid races between resiliency flows and real
      interrupts that might run simultaneously.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1d5558b1
    • Eran Ben Elisha's avatar
      net/mlx5: Avoid possible free of command entry while timeout comp handler · 50b2412b
      Eran Ben Elisha authored
      Upon command completion timeout, driver simulates a forced command
      completion. In a rare case where real interrupt for that command arrives
      simultaneously, it might release the command entry while the forced
      handler might still access it.
      
      Fix that by adding an entry refcount, to track current amount of allowed
      handlers. Command entry to be released only when this refcount is
      decremented to zero.
      
      Command refcount is always initialized to one. For callback commands,
      command completion handler is the symmetric flow to decrement it. For
      non-callback commands, it is wait_func().
      
      Before ringing the doorbell, increment the refcount for the real completion
      handler. Once the real completion handler is called, it will decrement it.
      
      For callback commands, once the delayed work is scheduled, increment the
      refcount. Upon callback command completion handler, we will try to cancel
      the timeout callback. In case of success, we need to decrement the callback
      refcount as it will never run.
      
      In addition, gather the entry index free and the entry free into a one
      flow for all command types release.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      50b2412b
    • Eran Ben Elisha's avatar
      net/mlx5: Fix a race when moving command interface to polling mode · 432161ea
      Eran Ben Elisha authored
      As part of driver unload, it destroys the commands EQ (via FW command).
      As the commands EQ is destroyed, FW will not generate EQEs for any command
      that driver sends afterwards. Driver should poll for later commands status.
      
      Driver commands mode metadata is updated before the commands EQ is
      actually destroyed. This can lead for double completion handle by the
      driver (polling and interrupt), if a command is executed and completed by
      FW after the mode was changed, but before the EQ was destroyed.
      
      Fix that by using the mlx5_cmd_allowed_opcode mechanism to guarantee
      that only DESTROY_EQ command can be executed during this time period.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      432161ea
  2. 30 Sep, 2020 12 commits
  3. 29 Sep, 2020 14 commits