1. 02 Oct, 2020 27 commits
    • Merge tag 'mlx5-fixes-2020-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · ab0faf5f
      David S. Miller authored
      From: Saeed Mahameed <saeedm@nvidia.com>
      
      ====================
      This series introduces some fixes to mlx5 driver.
      
      v1->v2:
       - Patch #1 Don't return while mutex is held. (Dave)
      
      v2->v3:
       - Drop patch #1, will consider a better approach (Jakub)
       - use cpu_relax() instead of cond_resched() (Jakub)
       - while(i--) to reverse a loop (Jakub)
       - Drop old mellanox email sign-off and change the committer email
         (Jakub)
      
      Please pull and let me know if there is any problem.
      
      For -stable v4.15
       ('net/mlx5e: Fix VLAN cleanup flow')
       ('net/mlx5e: Fix VLAN create flow')
      
      For -stable v4.16
       ('net/mlx5: Fix request_irqs error flow')
      
      For -stable v5.4
       ('net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU')
       ('net/mlx5: Avoid possible free of command entry while timeout comp handler')
      
      For -stable v5.7
       ('net/mlx5e: Fix return status when setting unsupported FEC mode')
      
      For -stable v5.8
       ('net/mlx5e: Fix race condition on nhe->n pointer in neigh update')
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix syn cookied MPTCP request socket leak · 9d8c05ad
      Paolo Abeni authored
      If a syn-cookies request socket doesn't pass the MPTCP-level
      validation done in syn_recv_sock(), we need to release
      it immediately, or it will be leaked.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/89
      Fixes: 9466a1cc ("mptcp: enable JOIN requests even if cookies are in use")
      Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Introduce-sendpage_ok-to-detect-misused-sendpage-in-network-related-drivers' · e7d4005d
      David S. Miller authored
      Coly Li says:
      
      ====================
      Introduce sendpage_ok() to detect misused sendpage in network related drivers
      
      As Sagi Grimberg suggested, the original fix is refined into a more
      common inline routine:
          static inline bool sendpage_ok(struct page *page)
          {
              return (!PageSlab(page) && page_count(page) >= 1);
          }
      If sendpage_ok() returns true, the page being checked can be handled by
      the concrete zero-copy sendpage method in the network layer.
      
      The v10 series has 7 patches and fixes a WARN_ONCE() usage from the v9
      series:
      - The 1st patch in this series introduces sendpage_ok() in header file
        include/linux/net.h.
      - The 2nd patch adds WARN_ONCE() for improper zero-copy send in
        kernel_sendpage().
      - The 3rd patch fixes the page checking issue in nvme-over-tcp driver.
      - The 4th patch adds page_count check by using sendpage_ok() in
        do_tcp_sendpages() as Eric Dumazet suggested.
      - The 5th, 6th and 7th patches just replace existing open coded checks
        with the inline sendpage_ok() routine.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libceph: use sendpage_ok() in ceph_tcp_sendpage() · 40efc4dc
      Coly Li authored
      In libceph, ceph_tcp_sendpage() does the following check before handing
      the page to the network layer's zero-copy sendpage method:
      	if (page_count(page) >= 1 && !PageSlab(page))
      
      This check is exactly what sendpage_ok() does. This patch replaces the
      open coded check with sendpage_ok() as a code cleanup.
      Signed-off-by: Coly Li <colyli@suse.de>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map() · 6aa25c73
      Coly Li authored
      In the iSCSI driver, iscsi_tcp_segment_map() uses the following code to
      check whether the page should be handled by sendpage:
          if (!recv && page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg)))
      
      The "page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg))" part makes
      sure the page can be sent to the network layer's zero-copy path. This
      part is exactly what sendpage_ok() does.
      
      This patch uses sendpage_ok() in iscsi_tcp_segment_map() to replace
      the original open coded checks.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage() · fb25ebe1
      Coly Li authored
      In _drbd_send_page(), a page is checked by the following code before
      it is sent by kernel_sendpage():
              (page_count(page) < 1) || PageSlab(page)
      If the check is true, the page won't be sent by kernel_sendpage() and is
      handled by sock_no_sendpage() instead.
      
      This check is the exact negation of the inline helper sendpage_ok(),
      which was introduced into include/linux/net.h to solve a similar send
      page issue in the nvme-tcp code.
      
      This patch uses sendpage_ok() to replace the open coded checks of
      page type and refcount in _drbd_send_page(), as a code cleanup.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use sendpage_ok() to detect misused .sendpage · cf83a17e
      Coly Li authored
      Commit a10674bf ("tcp: detecting the misuse of .sendpage for Slab
      objects") added the check for Slab pages, but pages with a zero
      page_count are still missing from the check.
      
      The network layer's sendpage method is not designed to send page_count 0
      pages either, therefore both PageSlab() and page_count() should be
      checked for the page being sent. This is exactly what sendpage_ok()
      does.
      
      This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused
      .sendpage, to make the code more robust.
      
      Fixes: a10674bf ("tcp: detecting the misuse of .sendpage for Slab objects")
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Coly Li authored
      Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
      send slab pages. But pages allocated by __get_free_pages() without
      __GFP_COMP, which can also have a refcount of 0, are still sent by
      kernel_sendpage() to the remote end, which is problematic.
      
      The newly introduced helper sendpage_ok() checks both the PageSlab tag
      and the page_count counter, and returns true if the page being checked
      is OK to be sent by kernel_sendpage().
      
      This patch fixes the page checking issue of nvme_tcp_try_send_data()
      with sendpage_ok(). If sendpage_ok() returns true, send this page by
      kernel_sendpage(), otherwise use sock_no_sendpage() to handle this page.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send · 7b62d31d
      Coly Li authored
      If a page sent into kernel_sendpage() is a slab page or it doesn't have
      a ref_count, this page is improper to send by the zero-copy sendpage()
      method. Otherwise such a page might be unexpectedly released in the
      network code path and cause an unpredictable panic due to corruption of
      kernel memory management data structures.
      
      This patch adds a WARN_ONCE() on the page being sent before passing it
      into the concrete zero-copy sendpage() method. If the page is improper
      for the zero-copy sendpage() method, a warning message can be observed
      before the consequent unpredictable kernel panic.
      
      This patch does not change the existing kernel_sendpage() behavior for
      an improper zero-copy page send; it just provides a warning message as a
      hint for the potential panic that follows from kernel memory heap
      corruption.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce helper sendpage_ok() in include/linux/net.h · c381b079
      Coly Li authored
      The original problem was in the nvme-over-tcp code, which mistakenly
      uses kernel_sendpage() to send pages allocated by __get_free_pages()
      without the __GFP_COMP flag. Such pages don't have a refcount
      (page_count is 0) on tail pages, so sending them by kernel_sendpage()
      may trigger a kernel panic from a corrupted kernel heap, because these
      pages are incorrectly freed in the network stack as page_count 0 pages.
      
      This patch introduces a helper sendpage_ok(), which returns true if the
      page being checked
      - is not a slab page: PageSlab(page) is false.
      - has a page refcount: page_count(page) is not zero.
      
      All drivers that want to send a page to the remote end by
      kernel_sendpage() may use this helper to check whether the page is OK.
      If the helper does not return true, the driver should try another
      non-sendpage method (e.g. sock_no_sendpage()) to handle the page.
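The check-then-fall-back pattern described above can be sketched in userspace C. This is only a model under stated assumptions: `struct page_model`, `sendpage_ok_model()` and `choose_send_path()` are hypothetical stand-ins, not kernel APIs; only the predicate itself mirrors the helper quoted in the series cover letter.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct page. */
struct page_model {
    bool slab;      /* models PageSlab(page) */
    int  refcount;  /* models page_count(page) */
};

enum send_path { SEND_ZERO_COPY, SEND_COPY };

/* Same predicate as sendpage_ok(): not a slab page and refcount >= 1. */
static inline bool sendpage_ok_model(const struct page_model *page)
{
    return !page->slab && page->refcount >= 1;
}

/* Take the zero-copy path only for pages that pass the check;
 * everything else goes to the copying fallback. */
static enum send_path choose_send_path(const struct page_model *page)
{
    return sendpage_ok_model(page) ? SEND_ZERO_COPY : SEND_COPY;
}
```

In this model, a slab page or a page with refcount 0 always lands on the copying path, which corresponds to the sock_no_sendpage() fallback named in the commit message.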
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: usb: pegasus: Proper error handing when setting pegasus' MAC address · f30e25a9
      Petko Manolov authored
      v2:
      
      If reading the MAC address from the EEPROM fails, don't throw an error;
      use a randomly generated MAC instead.  Either way the adapter will
      soldier on and the return type of set_ethernet_addr() can be reverted to
      void.
      
      v1:
      
      Fix a bug in set_ethernet_addr() which does not take into account possible
      errors (or partial reads) returned by its helpers.  This can potentially lead to
      writing random data into device's MAC address registers.
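The v2 behaviour (fall back to a random MAC when the read fails, never write a partially read address) can be sketched in userspace C. All names here (`mac_reader_fn`, `random_mac()`, `set_mac_addr()`, the `demo_*` readers) are hypothetical and are not the driver's code; the random generator is a plain `rand()` stand-in.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical reader: returns 0 and fills buf on success, negative on failure. */
typedef int (*mac_reader_fn)(uint8_t buf[ETH_ALEN]);

/* Build a random unicast, locally administered address. */
static void random_mac(uint8_t mac[ETH_ALEN])
{
    for (int i = 0; i < ETH_ALEN; i++)
        mac[i] = (uint8_t)(rand() & 0xff);
    mac[0] &= 0xfe; /* clear the multicast bit */
    mac[0] |= 0x02; /* set the locally administered bit */
}

/* Returns void, as in v2: a failed read is not an error, it only
 * switches to a random address. The read goes through a temporary
 * buffer so a failed read never leaks into mac. */
static void set_mac_addr(uint8_t mac[ETH_ALEN], mac_reader_fn read_mac)
{
    uint8_t tmp[ETH_ALEN];

    if (read_mac(tmp) == 0)
        memcpy(mac, tmp, ETH_ALEN);
    else
        random_mac(mac);
}

/* Demo readers used only to exercise the sketch. */
static int demo_read_ok(uint8_t buf[ETH_ALEN])
{
    static const uint8_t fixed[ETH_ALEN] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
    memcpy(buf, fixed, ETH_ALEN);
    return 0;
}

static int demo_read_fail(uint8_t buf[ETH_ALEN])
{
    (void)buf;
    return -1;
}
```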
      Signed-off-by: Petko Manolov <petko.manolov@konsulko.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: document two new elements of struct net_device · a93bdcb9
      Mauro Carvalho Chehab authored
      As warned by "make htmldocs", there are two new struct elements
      that aren't documented:
      
      	../include/linux/netdevice.h:2159: warning: Function parameter or member 'unlink_list' not described in 'net_device'
      	../include/linux/netdevice.h:2159: warning: Function parameter or member 'nested_level' not described in 'net_device'
      
      Fixes: 1fc70edb ("net: core: add nested_level variable in net_device")
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: fix policy dump leak · a95bc734
      Johannes Berg authored
      If userspace doesn't complete the policy dump, we leak the
      allocated state. Fix this.
      
      Fixes: d07dcf9a ("netlink: add infrastructure to expose policies to userspace")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Fix race condition on nhe->n pointer in neigh update · 1253935a
      Vlad Buslov authored
      The current neigh update event handler implementation takes a reference
      to the neighbour structure, assigns it to nhe->n, tries to schedule a
      workqueue task and releases the reference if the task was already
      enqueued. This can overwrite the existing nhe->n pointer with another
      neighbour instance, which causes a double release of that instance (once
      in the neigh update handler that failed to enqueue to the workqueue and
      again in the neigh update workqueue task that processes the updated
      nhe->n pointer instead of the original one):
      
      [ 3376.512806] ------------[ cut here ]------------
      [ 3376.513534] refcount_t: underflow; use-after-free.
      [ 3376.521213] Modules linked in: act_skbedit act_mirred act_tunnel_key vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_umad ib_ipoib ib_iser rdma_cm ib_cm iw_cm rfkill ib_uverbs ib_core sunrpc kvm_intel kvm iTCO_wdt iTCO_vendor_support virtio_net irqbypass net_failover crc32_pclmul lpc_ich i2c_i801 failover pcspkr i2c_smbus mfd_core ghash_clmulni_intel sch_fq_codel drm i2c_core ip_tables crc32c_intel serio_raw [last unloaded: mlxfw]
      [ 3376.529468] CPU: 8 PID: 22756 Comm: kworker/u20:5 Not tainted 5.9.0-rc5+ #6
      [ 3376.530399] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [ 3376.531975] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
      [ 3376.532820] RIP: 0010:refcount_warn_saturate+0xd8/0xe0
      [ 3376.533589] Code: ff 48 c7 c7 e0 b8 27 82 c6 05 0b b6 09 01 01 e8 94 93 c1 ff 0f 0b c3 48 c7 c7 88 b8 27 82 c6 05 f7 b5 09 01 01 e8 7e 93 c1 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
      [ 3376.536017] RSP: 0018:ffffc90002a97e30 EFLAGS: 00010286
      [ 3376.536793] RAX: 0000000000000000 RBX: ffff8882de30d648 RCX: 0000000000000000
      [ 3376.537718] RDX: ffff8882f5c28f20 RSI: ffff8882f5c18e40 RDI: ffff8882f5c18e40
      [ 3376.538654] RBP: ffff8882cdf56c00 R08: 000000000000c580 R09: 0000000000001a4d
      [ 3376.539582] R10: 0000000000000731 R11: ffffc90002a97ccd R12: 0000000000000000
      [ 3376.540519] R13: ffff8882de30d600 R14: ffff8882de30d640 R15: ffff88821e000900
      [ 3376.541444] FS:  0000000000000000(0000) GS:ffff8882f5c00000(0000) knlGS:0000000000000000
      [ 3376.542732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3376.543545] CR2: 0000556e5504b248 CR3: 00000002c6f10005 CR4: 0000000000770ee0
      [ 3376.544483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3376.545419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 3376.546344] PKRU: 55555554
      [ 3376.546911] Call Trace:
      [ 3376.547479]  mlx5e_rep_neigh_update.cold+0x33/0xe2 [mlx5_core]
      [ 3376.548299]  process_one_work+0x1d8/0x390
      [ 3376.548977]  worker_thread+0x4d/0x3e0
      [ 3376.549631]  ? rescuer_thread+0x3e0/0x3e0
      [ 3376.550295]  kthread+0x118/0x130
      [ 3376.550914]  ? kthread_create_worker_on_cpu+0x70/0x70
      [ 3376.551675]  ret_from_fork+0x1f/0x30
      [ 3376.552312] ---[ end trace d84e8f46d2a77eec ]---
      
      Fix the bug by moving work_struct to a dedicated dynamically-allocated
      structure. This enables every event handler to work on its own private
      neighbour pointer and removes the need to handle the case when the task
      is already enqueued.
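The idea of the fix, one privately owned work item per event, can be illustrated with a single-threaded userspace model. Everything here is hypothetical (`struct neigh_model`, `struct neigh_update_work`, `neigh_event()`, `neigh_work_run()`); it only shows that each item takes exactly one reference and drops exactly one, so no shared pointer can be overwritten between the two.

```c
#include <stdlib.h>

/* Hypothetical model of a refcounted neighbour entry. */
struct neigh_model {
    int refcnt;
};

/* One work item per event; it owns its own reference to n. */
struct neigh_update_work {
    struct neigh_model *n;
};

static void neigh_hold(struct neigh_model *n)    { n->refcnt++; }
static void neigh_release(struct neigh_model *n) { n->refcnt--; }

/* Event handler: allocate a private work item holding one reference.
 * Returns NULL (and takes no reference) if the allocation fails. */
static struct neigh_update_work *neigh_event(struct neigh_model *n)
{
    struct neigh_update_work *w = malloc(sizeof(*w));

    if (!w)
        return NULL;
    neigh_hold(n);
    w->n = n;
    return w;
}

/* Work function: process w->n, then drop the reference this item
 * took and free the item itself. */
static void neigh_work_run(struct neigh_update_work *w)
{
    neigh_release(w->n);
    free(w);
}
```

Because every hold is paired with the release inside the same work item, two events for different neighbours cannot cause the double release shown in the trace above.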
      
      Fixes: 232c0013 ("net/mlx5e: Add support to neighbour update flow")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix VLAN create flow · d4a16052
      Aya Levin authored
      When an interface is attached while in promiscuous mode and with VLAN
      filtering turned off, neither configuration is respected and VLAN
      filtering is performed.
      There are 2 flows which add the any-vid rules during interface attach:
      VLAN table creation and set rx mode. Each relies on the other to
      add the any-vid rules, so eventually neither of them does.
      
      Fix this by adding any-vid rules on VLAN creation regardless of
      promiscuous mode.
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix VLAN cleanup flow · 8c7353b6
      Aya Levin authored
      Prior to this patch, unloading an interface in promiscuous mode with the
      RX VLAN filtering feature turned off resulted in a warning. This is due
      to a wrong condition in the VLAN rules cleanup flow, which left the
      any-vid rules in the VLAN steering table. These rules prevented
      destroying the flow group and the flow table.
      
      The any-vid rules are removed in 2 flows, but neither of them removes
      them when promiscuous mode is set and VLAN filtering is off. Fix the
      issue by changing the condition of the VLAN table cleanup flow to also
      clean up in case of promiscuous mode.
      
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 20 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_group:2123:(pid 28729): Flow group 19 wasn't destroyed, refcount > 1
      mlx5_core 0000:00:08.0: mlx5_destroy_flow_table:2112:(pid 28729): Flow table 262149 wasn't destroyed, refcount > 1
      ...
      ...
      ------------[ cut here ]------------
      FW pages counter is 11560 after reclaiming all pages
      WARNING: CPU: 1 PID: 28729 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:660 mlx5_reclaim_startup_pages+0x178/0x230 [mlx5_core]
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      Call Trace:
        mlx5_function_teardown+0x2f/0x90 [mlx5_core]
        mlx5_unload_one+0x71/0x110 [mlx5_core]
        remove_one+0x44/0x80 [mlx5_core]
        pci_device_remove+0x3e/0xc0
        device_release_driver_internal+0xfb/0x1c0
        device_release_driver+0x12/0x20
        pci_stop_bus_device+0x68/0x90
        pci_stop_and_remove_bus_device+0x12/0x20
        hv_eject_device_work+0x6f/0x170 [pci_hyperv]
        ? __schedule+0x349/0x790
        process_one_work+0x206/0x400
        worker_thread+0x34/0x3f0
        ? process_one_work+0x400/0x400
        kthread+0x126/0x140
        ? kthread_park+0x90/0x90
        ret_from_fork+0x22/0x30
         ---[ end trace 6283bde8d26170dc ]---
      
      Fixes: 9df30601 ("net/mlx5e: Restore vlan filter after seamless reset")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix return status when setting unsupported FEC mode · 2608a2f8
      Aya Levin authored
      Verify that the configured FEC mode is supported by at least one link
      mode before applying the command. Otherwise fail the command and return
      "Operation not supported".
      Prior to this patch, the command succeeded, yet it wrongly set all
      link modes to FEC auto mode, as if the FEC mode had been configured to
      auto. Auto mode is the default configuration if a link mode doesn't
      support the configured FEC mode.
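The validation step described above (reject the request unless at least one link mode supports the FEC mode) could look roughly like this. The function name, the per-link-mode capability array and the bit values are assumptions made for illustration; they are not the driver's data structures.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: caps[i] is a bitmask of FEC modes supported by
 * link mode i. Returns 0 if any link mode supports fec_mode, otherwise
 * -EOPNOTSUPP ("Operation not supported"). */
static int check_fec_supported(const uint32_t *caps, size_t n_link_modes,
                               uint32_t fec_mode)
{
    for (size_t i = 0; i < n_link_modes; i++) {
        if (caps[i] & fec_mode)
            return 0;
    }
    return -EOPNOTSUPP;
}
```

A caller would run this check first and apply the configuration only on a 0 return, instead of silently falling back to auto for every link mode.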
      
      Fixes: b5ede32d ("net/mlx5e: Add support for FEC modes based on 50G per lane links")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix driver's declaration to support GRE offload · 3d093bc2
      Aya Levin authored
      Declare GRE offload support with respect to the inner protocol. Add a
      list of supported inner protocols on which the driver can offload
      checksum and GSO. For other protocols, inform the stack to do the needed
      operations. There is no noticeable impact on GRE performance.
      
      Fixes: 27299841 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: CT, Fix coverity issue · 2b021989
      Maor Dickman authored
      The cited commit introduced the following Coverity issue in function
      mlx5_tc_ct_rule_to_tuple_nat:
      - Memory - corruptions (OVERRUN)
        Overrunning array "tuple->ip.src_v6.in6_u.u6_addr32" of 4 4-byte
        elements at element index 7 (byte offset 31) using index
        "ip6_offset" (which evaluates to 7).
      
      In case of an IPv6 destination address rewrite, ip6_offset values are
      between 4 and 7, which overruns array
      "tuple->ip.src_v6.in6_u.u6_addr32" into array
      "tuple->ip.dst_v6.in6_u.u6_addr32".
      
      Fix this by writing the value directly to array
      "tuple->ip.dst_v6.in6_u.u6_addr32" when ip6_offset is between 4
      and 7.
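A minimal sketch of the indexing fix, assuming a simplified tuple with two 4-word arrays. `struct tuple_model` and `set_ipv6_word()` are hypothetical names; the point is only that offsets 4 to 7 are rebased onto the destination array instead of indexing past the end of the source array.

```c
#include <stdint.h>

/* Hypothetical simplified tuple: two IPv6 addresses as 4 x 32-bit words. */
struct tuple_model {
    uint32_t src_v6[4];
    uint32_t dst_v6[4];
};

/* ip6_offset 0..3 selects a source word, 4..7 a destination word.
 * Returns 0 on success, -1 for an out-of-range offset. */
static int set_ipv6_word(struct tuple_model *t, unsigned int ip6_offset,
                         uint32_t val)
{
    if (ip6_offset < 4)
        t->src_v6[ip6_offset] = val;
    else if (ip6_offset < 8)
        t->dst_v6[ip6_offset - 4] = val; /* rebase, never src_v6[4..7] */
    else
        return -1;
    return 0;
}
```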
      
      Fixes: bc562be9 ("net/mlx5e: CT: Save ct entries tuples in hashtables")
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU · c3c94023
      Aya Levin authored
      Prior to this fix, in Striding RQ mode the driver was vulnerable when
      receiving packets in the range (stride size - headroom, stride size],
      where stride size is calculated as mtu+headroom+tailroom aligned to the
      closest power of 2.
      Usually, this filtering is performed by the HW, except for a few cases:
      - Between 2 VFs over the same PF with different MTUs
      - On bluefield, when the host physical function sets a larger MTU than
        the ARM has configured on its representor and uplink representor.
      
      When the HW filtering is not present, packets that are larger than the
      MTU might harm the RQ's integrity in the following ways:
      1) Overflow from one WQE to the next, causing a memory corruption that
      in most cases is harmless, as the write happens to the headroom of the
      next packet, which will be overwritten by build_skb(). In very rare
      cases, under high stress/load, this is harmful: when the next WQE is not
      yet reposted and points to an existing SKB head.
      2) Each oversize packet overflows to the headroom of the next WQE. On
      the last WQE of the WQ, where addresses wrap around, the address of the
      remainder headroom does not belong to the next WQE, but is out of the
      memory region range. This results in a HW CQE error that moves the RQ
      into an error state.
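The stride-size rule and the vulnerable length range stated at the top of this message can be worked through in a few lines of C. This sketch assumes "aligned to the closest power of 2" means rounding up; the function names and the sample sizes used with it are hypothetical.

```c
#include <stdint.h>

/* Round v up to the next power of two (assumed meaning of "aligned"). */
static uint32_t roundup_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/* stride size = mtu + headroom + tailroom, rounded up to a power of 2. */
static uint32_t stride_size(uint32_t mtu, uint32_t headroom, uint32_t tailroom)
{
    return roundup_pow2(mtu + headroom + tailroom);
}

/* True if pkt_len lies in (stride - headroom, stride], the range the
 * commit message describes as vulnerable. */
static int in_overflow_range(uint32_t pkt_len, uint32_t stride,
                             uint32_t headroom)
{
    return pkt_len > stride - headroom && pkt_len <= stride;
}
```

For example, with made-up values mtu=1500, headroom=256 and tailroom=320, the sum is 2076 and the stride is 4096, so lengths from 3841 to 4096 fall in the range.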
      
      Solution:
      Add a page buffer at the end of each WQE to absorb the leak. Actually
      the maximal overflow size is headroom but since all memory units must be
      of the same size, we use page size to comply with UMR WQEs. The increase
      in memory consumption is of a single page per RQ. Initialize the mkey
      with all MTTs pointing to a default page. When the channels are
      activated, UMR WQEs will redirect the RX WQEs to the actual memory from
      the RQ's pool, while the overflow MTTs remain mapped to the default page.
      
      Fixes: 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix error path for RQ alloc · 08a762ce
      Aya Levin authored
      Increase granularity of the error path to avoid unneeded free/release.
      Fix the cleanup to be symmetric to the order of creation.
      
      Fixes: 0ddf5432 ("xdp/mlx5: setup xdp_rxq_info")
      Fixes: 422d4c40 ("net/mlx5e: RX, Split WQ objects for different RQ types")
      Signed-off-by: Aya Levin <ayal@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Fix request_irqs error flow · 732ebfab
      Maor Gottlieb authored
      Fix error flow handling in request_irqs, which tries to free an IRQ
      that we failed to request.
      This fixes the trace below.
      
      WARNING: CPU: 1 PID: 7587 at kernel/irq/manage.c:1684 free_irq+0x4d/0x60
      CPU: 1 PID: 7587 Comm: bash Tainted: G        W  OE    4.15.15-1.el7MELLANOXsmp-x86_64 #1
      Hardware name: Advantech SKY-6200/SKY-6200, BIOS F2.00 08/06/2020
      RIP: 0010:free_irq+0x4d/0x60
      RSP: 0018:ffffc9000ef47af0 EFLAGS: 00010282
      RAX: ffff88001476ae00 RBX: 0000000000000655 RCX: 0000000000000000
      RDX: ffff88001476ae00 RSI: ffffc9000ef47ab8 RDI: ffff8800398bb478
      RBP: ffff88001476a838 R08: ffff88001476ae00 R09: 000000000000156d
      R10: 0000000000000000 R11: 0000000000000004 R12: ffff88001476a838
      R13: 0000000000000006 R14: ffff88001476a888 R15: 00000000ffffffe4
      FS:  00007efeadd32740(0000) GS:ffff88047fc40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fc9cc010008 CR3: 00000001a2380004 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       mlx5_irq_table_create+0x38d/0x400 [mlx5_core]
       ? atomic_notifier_chain_register+0x50/0x60
       mlx5_load_one+0x7ee/0x1130 [mlx5_core]
       init_one+0x4c9/0x650 [mlx5_core]
       pci_device_probe+0xb8/0x120
       driver_probe_device+0x2a1/0x470
       ? driver_allows_async_probing+0x30/0x30
       bus_for_each_drv+0x54/0x80
       __device_attach+0xa3/0x100
       pci_bus_add_device+0x4a/0x90
       pci_iov_add_virtfn+0x2dc/0x2f0
       pci_enable_sriov+0x32e/0x420
       mlx5_core_sriov_configure+0x61/0x1b0 [mlx5_core]
       ? kstrtoll+0x22/0x70
       num_vf_store+0x4b/0x70 [mlx5_core]
       kernfs_fop_write+0x102/0x180
       __vfs_write+0x26/0x140
       ? rcu_all_qs+0x5/0x80
       ? _cond_resched+0x15/0x30
       ? __sb_start_write+0x41/0x80
       vfs_write+0xad/0x1a0
       SyS_write+0x42/0x90
       do_syscall_64+0x60/0x110
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
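The class of bug described here, releasing a resource whose request failed, is commonly avoided with a reverse `while (i--)` unwind. The sketch below is a generic userspace illustration, not the driver's code: `struct irq_ops`, `request_all()` and the `demo_*` helpers are hypothetical stand-ins for request/free pairs.

```c
#include <stddef.h>

/* Hypothetical hooks standing in for a request/free pair. */
struct irq_ops {
    int  (*request)(size_t idx);
    void (*release)(size_t idx);
};

/* Request n resources; on failure release only those that were
 * actually acquired, in reverse order. */
static int request_all(const struct irq_ops *ops, size_t n)
{
    size_t i;
    int err;

    for (i = 0; i < n; i++) {
        err = ops->request(i);
        if (err)
            goto unwind;
    }
    return 0;

unwind:
    /* i is the index that failed; it was never acquired, so the
     * loop starts at i - 1 and stops after index 0. */
    while (i--)
        ops->release(i);
    return err;
}

/* Demo state and hooks used only to exercise the sketch. */
static int demo_fail_at = -1;
static int demo_held[8];
static int demo_released;

static int demo_request(size_t idx)
{
    if ((int)idx == demo_fail_at)
        return -1;
    demo_held[idx] = 1;
    return 0;
}

static void demo_release(size_t idx)
{
    demo_held[idx] = 0;
    demo_released++;
}
```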
      
      Fixes: 24163189 ("net/mlx5: Separate IRQ request/free from EQ life cycle")
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible · b898ce7b
      Saeed Mahameed authored
      In case PCI is offline, reclaim_pages_cmd() will still try to call
      the FW to release FW pages; cmd_exec() in this case will return a silent
      success without actually calling the FW.
      
      This is wrong and will cause page leaks. What we should do is detect
      PCI offline or command interface unavailability before trying to access
      the FW, and manually release the FW pages in the driver.
      
      In this patch we share the code to check for FW command interface
      availability and we call it in sensitive places, e.g. reclaim_pages_cmd().
      
      Alternative fix:
       1. Remove MLX5_CMD_OP_MANAGE_PAGES from the mlx5_internal_err_ret_value
          command success simulation list.
       2. Always release FW pages even if cmd_exec fails in reclaim_pages_cmd().
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Add retry mechanism to the command entry index allocation · 410bd754
      Eran Ben Elisha authored
      It is possible that a new command entry index allocation will
      temporarily fail. The new command holds the semaphore, which means that
      a free entry should be ready soon. Add a one-second retry mechanism
      before returning an error.
      
      Patch "net/mlx5: Avoid possible free of command entry while timeout comp
      handler" increases the possibility of hitting this temporary failure,
      as it delays the entry index release for non-callback commands.
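A bounded retry of this kind can be sketched in portable C. This is a model only: the free-index bitmask, `try_alloc_index()` and `alloc_index_retry()` are hypothetical, and the deadline uses the standard `clock()` (CPU time, which tracks elapsed time closely in a busy loop) as a stand-in for the kernel's timekeeping. In the real scenario another context frees an entry while the loop spins; a single-threaded test can only show the immediate-success and timeout cases.

```c
#include <stdint.h>
#include <time.h>

/* Claim the lowest set bit of *free_mask; return its index or -1. */
static int try_alloc_index(uint32_t *free_mask)
{
    for (int i = 0; i < 32; i++) {
        if (*free_mask & (1u << i)) {
            *free_mask &= ~(1u << i);
            return i;
        }
    }
    return -1;
}

/* Keep retrying the allocation until it succeeds or timeout_ms of
 * CPU time has elapsed. Returns the index, or -1 on timeout. */
static int alloc_index_retry(uint32_t *free_mask, long timeout_ms)
{
    clock_t start = clock();
    int idx;

    do {
        idx = try_alloc_index(free_mask);
        if (idx >= 0)
            return idx;
    } while ((long)((clock() - start) * 1000 / CLOCKS_PER_SEC) < timeout_ms);

    return -1;
}
```

Passing 1000 as `timeout_ms` would correspond to the one-second window mentioned above.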
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      410bd754
    • Eran Ben Elisha's avatar
      net/mlx5: poll cmd EQ in case of command timeout · 1d5558b1
      Eran Ben Elisha authored
      Once the driver detects a command interface command timeout, it warns
      the user and returns a timeout error to the caller. In such a case, the
      entry of the command is not evacuated (because only a real event
      interrupt is allowed to clear a command interface entry). If the HW
      event interrupt of this entry never arrives, this entry will never be
      freed. Command interface entries are limited, and eventually we can end
      up without the ability to post a new command.

      In addition, if the driver does not consume the EQE of the lost
      interrupt and rearm the EQ, no new interrupts will arrive for other
      commands.

      Add a resiliency mechanism for manually polling the command EQ in case of
      a command timeout. If the resiliency mechanism finds a non-handled EQE,
      it will consume it, and the command interface will be fully functional
      again. Once the resiliency flow has finished, wait another 5 seconds for
      the command interface to complete for this command entry.
      
      Define mlx5_cmd_eq_recover() to manage the cmd EQ polling resiliency flow.
      Add an async EQ spinlock to avoid races between resiliency flows and real
      interrupts that might run simultaneously.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      1d5558b1
    • Eran Ben Elisha's avatar
      net/mlx5: Avoid possible free of command entry while timeout comp handler · 50b2412b
      Eran Ben Elisha authored
      Upon command completion timeout, the driver simulates a forced command
      completion. In a rare case where the real interrupt for that command
      arrives simultaneously, it might release the command entry while the
      forced handler might still access it.
      
      Fix that by adding an entry refcount to track the current number of
      allowed handlers. The command entry is released only when this refcount
      is decremented to zero.
      
      Command refcount is always initialized to one. For callback commands,
      command completion handler is the symmetric flow to decrement it. For
      non-callback commands, it is wait_func().
      
      Before ringing the doorbell, increment the refcount for the real completion
      handler. Once the real completion handler is called, it will decrement it.
      
      For callback commands, once the delayed work is scheduled, increment the
      refcount. Upon callback command completion handler, we will try to cancel
      the timeout callback. In case of success, we need to decrement the callback
      refcount as it will never run.
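
      The refcount scheme described above can be sketched in user-space C
      as follows. This is an illustration under stated assumptions, not the
      driver implementation: struct cmd_entry and the cmd_ent_*() helpers
      are hypothetical, and C11 atomics stand in for the kernel's refcount
      primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical command entry; only the fields the sketch needs. */
struct cmd_entry {
    atomic_int refcnt;
    int idx;
};

/* Allocate an entry with the refcount initialized to one. */
struct cmd_entry *cmd_ent_alloc(int idx)
{
    struct cmd_entry *ent = malloc(sizeof(*ent));

    if (!ent)
        return NULL;
    atomic_init(&ent->refcnt, 1);
    ent->idx = idx;
    return ent;
}

/* Take a reference for one more handler that may touch the entry,
 * e.g. before ringing the doorbell or scheduling the timeout work. */
void cmd_ent_get(struct cmd_entry *ent)
{
    atomic_fetch_add(&ent->refcnt, 1);
}

/* Drop a reference and free the entry when the last one goes away.
 * Returns true if this call released the entry. */
bool cmd_ent_put(struct cmd_entry *ent)
{
    if (atomic_fetch_sub(&ent->refcnt, 1) != 1)
        return false;
    free(ent);
    return true;
}
```

      Whichever handler drops the last reference frees the entry, so a late
      real completion and a forced completion cannot free it underneath
      each other.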
      
      In addition, gather the entry index free and the entry free into one
      flow used for the release of all command types.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      50b2412b
    • Eran Ben Elisha's avatar
      net/mlx5: Fix a race when moving command interface to polling mode · 432161ea
      Eran Ben Elisha authored
      As part of driver unload, the driver destroys the commands EQ (via a FW
      command). Once the commands EQ is destroyed, FW will not generate EQEs
      for any command that the driver sends afterwards. The driver should
      poll for the status of later commands.
      
      The driver's command mode metadata is updated before the commands EQ is
      actually destroyed. This can lead to double completion handling by the
      driver (polling and interrupt) if a command is executed and completed by
      FW after the mode was changed, but before the EQ was destroyed.
      
      Fix that by using the mlx5_cmd_allowed_opcode mechanism to guarantee
      that only the DESTROY_EQ command can be executed during this time period.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      432161ea
  2. 01 Oct, 2020 2 commits
  3. 30 Sep, 2020 11 commits
    • David S. Miller's avatar
      Merge branch 'Fix-bugs-in-Octeontx2-netdev-driver' · a59cf619
      David S. Miller authored
      Geetha sowjanya says:
      
      ====================
      Fix bugs in Octeontx2 netdev driver
      
      The existing Octeontx2 network driver code has issues such as
      stale entries in the broadcast replication list, a missing
      L3TYPE for IPv6 frames, tx queues left running on error, and
      a race condition in mbox reset.
      This patch set fixes the above issues.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a59cf619
    • Hariprasad Kelam's avatar
      octeontx2-pf: Fix synchronization issue in mbox · 66a5209b
      Hariprasad Kelam authored
      The mbox implementation in the octeontx2 driver has three states:
      alloc, send and reset in mbox response. The VF allocates and
      sends messages to the PF for processing; the PF ACKs them back and
      resets the mbox memory. In some cases we see a synchronization
      issue: after msgs_acked is incremented and before the
      mbox_reset API is called, the current execution may be scheduled
      out and a different thread scheduled in, which checks
      msgs_acked. Since the new thread sees msgs_acked == msgs_sent,
      it will try to allocate a new message and send a new mbox
      message to the PF. If mbox_reset is then scheduled in, the PF will
      see '0' in msgs_sent.
      This patch fixes the issue by calling mbox_reset before
      incrementing the msgs_acked flag for the last processed message, and
      by checking for a valid message size.
      
      Fixes: d424b6c0 ("octeontx2-pf: Enable SRIOV and added VF mbox handling")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66a5209b
    • Hariprasad Kelam's avatar
      octeontx2-pf: Fix the device state on error · 1ea0166d
      Hariprasad Kelam authored
      Currently in otx2_open, on failure of nix_lf_start, the
      transmit queues that were already started in link_event
      are not stopped. Since the tx queues are not stopped, the
      network stack still tries to send packets, leading to a
      driver crash while accessing the device resources.
      
      Fixes: 50fe6c02 ("octeontx2-pf: Register and handle link notifications")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ea0166d
    • Geetha sowjanya's avatar
      octeontx2-pf: Fix TCP/UDP checksum offload for IPv6 frames · 89eae5e8
      Geetha sowjanya authored
      The TCP/UDP checksum offload feature in Octeontx2
      expects L3TYPE to be set irrespective of whether the IP
      header checksum is being offloaded. Currently, for
      IPv6 frames, L3TYPE is not set, resulting in
      packet drops with checksum errors. This patch fixes
      this issue.
      
      Fixes: 3ca6c4c8 ("octeontx2-pf: Add packet transmission support")
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Subbaraya Sundeep's avatar
      octeontx2-af: Fix enable/disable of default NPC entries · e154b5b7
      Subbaraya Sundeep authored
      The packet replication feature present in Octeontx2
      is a hardware linked list of a PF and its VF
      interfaces, so that broadcast packets are sent
      to all interfaces present in the list. It is the
      driver's job to add and delete a PF/VF interface
      to/from the list when the interface is brought
      up and down. This patch fixes the
      npc_enadis_default_entries function to handle
      broadcast replication properly if the packet
      replication feature is present.
      
      Fixes: 40df309e ("octeontx2-af: Support to enable/disable default MCAM entries")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e154b5b7
    • David S. Miller's avatar
      Merge branch '100GbE' of https://github.com/anguy11/net-queue · 03e7e72c
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2020-09-30
      
      This series contains updates to ice driver only.
      
      Jake increases the wait time for firmware response as it can take longer
      than the current wait time. He also preserves the NVM capabilities of the
      device in safe mode so the device reports its NVM update capabilities
      properly when in this state.
      
      v2: Added cover letter
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03e7e72c
    • Jacob Keller's avatar
      ice: preserve NVM capabilities in safe mode · be49b1ad
      Jacob Keller authored
      If the driver initializes in safe mode, it will call
      ice_set_safe_mode_caps. This results in clearing the capabilities
      structures, in order to set them up for operating in safe mode, ensuring
      many features are disabled.
      
      This has a side effect of also clearing the capability bits that relate
      to NVM update. The result is that the device driver will not indicate
      support for unified update, even if the firmware is capable.
      
      Fix this by adding the relevant capability fields to the list of values
      we preserve. To simplify the code, use a common_cap structure instead of
      a handful of local variables. To reduce some duplication of the
      capability name, introduce a couple of macros used to restore the
      capabilities values from the cached copy.
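
      A rough sketch of the cache-clear-restore pattern described above.
      The structures, field names, RESTORE_CAP() macro and the choice of
      which fields are preserved are all hypothetical and only illustrate
      the idea; they are not the ice driver's definitions.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical capability structures; the field names are illustrative. */
struct common_caps {
    bool nvm_unified_update;
    unsigned int max_mtu;
    unsigned int num_msix_vectors;
    bool rss_supported;
};

struct func_caps {
    struct common_caps common_cap;
    unsigned int guar_num_vsi;
};

/* Restore one field of common_cap from the cached copy. */
#define RESTORE_CAP(caps, cached, name) \
    ((caps)->common_cap.name = (cached).name)

/* Clear all capabilities, then put back the few values that must
 * survive safe mode and set conservative minimums for the rest. */
void set_safe_mode_caps(struct func_caps *caps)
{
    struct common_caps cached = caps->common_cap;

    memset(caps, 0, sizeof(*caps));

    RESTORE_CAP(caps, cached, nvm_unified_update);
    RESTORE_CAP(caps, cached, max_mtu);

    /* Assumed safe-mode minimums. */
    caps->common_cap.num_msix_vectors = 1;
    caps->guar_num_vsi = 1;
}
```

      Caching the whole common structure once means that preserving an
      additional capability is a single extra RESTORE_CAP() line.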
      
      Fixes: de9b277e ("ice: Add support for unified NVM update flow capability")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Brijesh Behera <brijeshx.behera@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      be49b1ad
    • Jacob Keller's avatar
      ice: increase maximum wait time for flash write commands · 0ec86e8e
      Jacob Keller authored
      The ice driver needs to wait for a firmware response to each command to
      write a block of data to the scratch area used to update the device
      firmware. The driver currently waits for up to 1 second for this to be
      returned.
      
      It turns out that firmware might take longer than 1 second to return
      a completion in some cases. If this happens, the flash update will fail
      to complete.
      
      Fix this by increasing the maximum time that the driver will wait for
      both writing a block of data, and for activating the new NVM bank. The
      timeout for an erase command is already several minutes, as the firmware
      has to erase the entire bank, which is already expected to take a minute
      or more in the worst case.
      
      In the case where firmware really won't respond, we will now take longer
      to fail. However, this ensures that if the firmware is simply slow to
      respond, the flash update can still complete. This new maximum timeout
      should not adversely increase the update time, as the wait is
      implemented with wait_event_interruptible_timeout, which should wake
      very soon after we get a completion event. It is better for a flash
      update to be slow but still succeed than to fail because we gave up too
      quickly.
      
      Fixes: d69ea414 ("ice: implement device flash update via devlink")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Brijesh Behera <brijeshx.behera@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      0ec86e8e
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 1f25c9bb
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2020-09-29
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 7 non-merge commits during the last 14 day(s) which contain
      a total of 7 files changed, 28 insertions(+), 8 deletions(-).
      
      The main changes are:
      
      1) fix xdp loading regression in libbpf for old kernels, from Andrii.
      
      2) Do not discard packet when NETDEV_TX_BUSY, from Magnus.
      
      3) Fix corner cases in libbpf related to endianness and kconfig, from Tony.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f25c9bb
    • David S. Miller's avatar
      Merge branch 'mptcp-Fix-for-32-bit-DATA_FIN' · 2b3e981a
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: Fix for 32-bit DATA_FIN
      
      The main fix is contained in patch 2, and that commit message explains
      the issue with not properly converting truncated DATA_FIN sequence
      numbers sent by the peer.
      
      With patch 2 adding an unlocked read of msk->ack_seq, patch 1 cleans up
      access to that data with READ_ONCE/WRITE_ONCE.
      
      This does introduce two merge conflicts with net-next, but both have
      straightforward resolution. Patch 1 modifies a line that got removed in
      net-next so the modification can be dropped when merging. Patch 2 will
      require a trivial conflict resolution for a modified function
      declaration.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b3e981a
    • Mat Martineau's avatar
      mptcp: Handle incoming 32-bit DATA_FIN values · 1a49b2c2
      Mat Martineau authored
      The peer may send a DATA_FIN mapping with either a 32-bit or 64-bit
      sequence number. When a 32-bit sequence number is received for the
      DATA_FIN, it must be expanded to 64 bits before comparing it to the
      last acked sequence number. This expansion was missing.
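
      One way to picture the expansion, as a sketch rather than the
      kernel's helper: expand_seq32() is a hypothetical name, and the
      function picks the 64-bit value whose low 32 bits match the received
      sequence number and that lies closest to a known 64-bit reference
      (for example the last acked sequence number). It assumes the usual
      two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/* Expand a truncated 32-bit sequence number to 64 bits, using ref as
 * the nearby 64-bit reference point. */
uint64_t expand_seq32(uint64_t ref, uint32_t seq32)
{
    /* Signed distance from ref to seq32 in 32-bit sequence space:
     * positive if seq32 is ahead of ref, negative if it is behind. */
    int32_t delta = (int32_t)(uint32_t)(seq32 - (uint32_t)ref);

    /* Apply the distance in 64-bit space so wraparound of the low
     * 32 bits carries into (or borrows from) the high 32 bits. */
    return ref + (uint64_t)(int64_t)delta;
}
```

      For instance, with a reference of 0x1FFFFFFF0 a received value of
      0x10 expands to 0x200000010, since the low 32 bits have wrapped.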
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/93
      Fixes: 3721b9b6 ("mptcp: Track received DATA_FIN sequence number and add related helpers")
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a49b2c2