Commits · 987914ab841e2ec281a35b54348ab109b4c0bb4e · Kirill Smelkov / linux

26 Mar, 2020 1 commit

RDMA/cm: Update num_paths in cma_resolve_iboe_route error flow · 987914ab

Avihai Horon authored Mar 18, 2020

After a successful allocation of path_rec, num_paths is set to 1, but any
error after such allocation will leave num_paths uncleared.

This causes to de-referencing a NULL pointer later on. Hence, num_paths
needs to be set back to 0 if such an error occurs.

The following crash from syzkaller revealed it.

  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
  CPU: 0 PID: 357 Comm: syz-executor060 Not tainted 4.18.0+ #311
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:ib_copy_path_rec_to_user+0x94/0x3e0
  Code: f1 f1 f1 f1 c7 40 0c 00 00 f4 f4 65 48 8b 04 25 28 00 00 00 48 89
  45 c8 31 c0 e8 d7 60 24 ff 48 8d 7b 4c 48 89 f8 48 c1 e8 03 <42> 0f b6
  14 30 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
  RSP: 0018:ffff88006586f980 EFLAGS: 00010207
  RAX: 0000000000000009 RBX: 0000000000000000 RCX: 1ffff1000d5fe475
  RDX: ffff8800621e17c0 RSI: ffffffff820d45f9 RDI: 000000000000004c
  RBP: ffff88006586fa50 R08: ffffed000cb0df73 R09: ffffed000cb0df72
  R10: ffff88006586fa70 R11: ffffed000cb0df73 R12: 1ffff1000cb0df30
  R13: ffff88006586fae8 R14: dffffc0000000000 R15: ffff88006aff2200
  FS: 00000000016fc880(0000) GS:ffff88006d000000(0000)
  knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000020000040 CR3: 0000000063fec000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
  ? ib_copy_path_rec_from_user+0xcc0/0xcc0
  ? __mutex_unlock_slowpath+0xfc/0x670
  ? wait_for_completion+0x3b0/0x3b0
  ? ucma_query_route+0x818/0xc60
  ucma_query_route+0x818/0xc60
  ? ucma_listen+0x1b0/0x1b0
  ? sched_clock_cpu+0x18/0x1d0
  ? sched_clock_cpu+0x18/0x1d0
  ? ucma_listen+0x1b0/0x1b0
  ? ucma_write+0x292/0x460
  ucma_write+0x292/0x460
  ? ucma_close_id+0x60/0x60
  ? sched_clock_cpu+0x18/0x1d0
  ? sched_clock_cpu+0x18/0x1d0
  __vfs_write+0xf7/0x620
  ? ucma_close_id+0x60/0x60
  ? kernel_read+0x110/0x110
  ? time_hardirqs_on+0x19/0x580
  ? lock_acquire+0x18b/0x3a0
  ? finish_task_switch+0xf3/0x5d0
  ? _raw_spin_unlock_irq+0x29/0x40
  ? _raw_spin_unlock_irq+0x29/0x40
  ? finish_task_switch+0x1be/0x5d0
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70
  ? security_file_permission+0x172/0x1e0
  vfs_write+0x192/0x460
  ksys_write+0xc6/0x1a0
  ? __ia32_sys_read+0xb0/0xb0
  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
  ? do_syscall_64+0x1d/0x470
  do_syscall_64+0x9e/0x470
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 3c86aa70 ("RDMA/cm: Add RDMA CM support for IBoE devices")
Link: https://lore.kernel.org/r/20200318101741.47211-1-leon@kernel.orgSigned-off-by: Avihai Horon <avihaih@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

987914ab

24 Mar, 2020 6 commits

IB/mlx5: Generally use the WC auto detection test result · 1f3db161

Yishai Hadas authored Mar 18, 2020

Now that we have direct and reliable detection of WC support by the
system, use is broadly. The only case we have to worry about is when the
WC autodetector cannot run.

For this fringe case generally assume that that WC is available, except in
the well defined case of no PAT support on x86 which is tested by calling
arch_can_pci_mmap_wc().

If WC is wrongly assumed to be available then it causes a small
performance hit on paths in userspace that are tuned to the assumption
that WC is available. There is no functional loss.

It is very unlikely that any platforms exist that lack WC and also care
about the micro optimization of WC in the fringe case where autodetection
does not work.

By removing the fairly bogus CONFIG tests this makes WC work broadly on
all arches and all platforms.

Link: https://lore.kernel.org/r/20200318100323.46659-1-leon@kernel.orgSigned-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

1f3db161

RDMA/hns: Optimize mhop put flow for multi-hop addressing · 38dcb350

Xi Wang authored Mar 17, 2020

Optimizes hns_roce_table_mhop_get() by encapsulating code about clearing
hem into clear_mhop_hem(), which will make the code flow clearer.

Link: https://lore.kernel.org/r/1584417324-2255-3-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

38dcb350

RDMA/hns: Optimize mhop get flow for multi-hop addressing · 2f49de21

Xi Wang authored Mar 17, 2020

Splits hns_roce_table_mhop_get() into 4 sub-functions to make the code flow
clearer.

Link: https://lore.kernel.org/r/1584417324-2255-2-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

2f49de21

RDMA/bnxt_re: Wait for all the CQ events before freeing CQ data structures · b1d56fdc

Selvin Xavier authored Mar 13, 2020

Destroy CQ command to firmware returns the num_cnq_events as a
response. This indicates the driver about the number of CQ events
generated for this CQ. Driver should wait for all these events before
freeing the CQ host structures. Also, add routine to clean all the
pending notification for the CQs getting destroyed. This avoids the
possibility of accessing the CQ data structures after its freed.

Fixes: 1ac5a404 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1584120842-3200-1-git-send-email-selvin.xavier@broadcom.comSigned-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

b1d56fdc

IB/mlx5: Fix a NULL vs IS_ERR() check · a766fa84

Dan Carpenter authored Mar 20, 2020

The kzalloc() function returns NULL, not error pointers.

Fixes: 30f2fe40 ("IB/mlx5: Introduce UAPIs to manage packet pacing")
Link: https://lore.kernel.org/r/20200320132641.GF95012@mwandaSigned-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

a766fa84

RDMA/siw: Suppress uninitialized var warning · 5fb51863

Andrew Morton authored Mar 23, 2020

drivers/infiniband/sw/siw/siw_qp_rx.c: In function siw_proc_send:
./include/linux/spinlock.h:288:3: warning: flags may be used uninitialized in this function [-Wmaybe-uninitialized]
   _raw_spin_unlock_irqrestore(lock, flags); \
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/sw/siw/siw_qp_rx.c:335:16: note: flags was declared here
  unsigned long flags;

Link: https://lore.kernel.org/r/20200323184627.ZWPg91uin%akpm@linux-foundation.orgSigned-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

5fb51863

19 Mar, 2020 1 commit

RDMA/efa: Use in-kernel offsetofend() to check field availability · fa8a44f6

Leon Romanovsky authored Mar 10, 2020

Remove custom and duplicated variant of offsetofend().

Link: https://lore.kernel.org/r/20200310091438.248429-4-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

fa8a44f6

18 Mar, 2020 9 commits

IB/hfi1: Remove kobj from hfi1_devdata · 5ab17a24

Kaike Wan authored Mar 16, 2020

The field kobj was added to hfi1_devdata structure to manage the life time
of the hfi1_devdata structure for PSM accesses:

commit e11ffbd5 ("IB/hfi1: Do not free hfi1 cdev parent structure early")

Later another mechanism user_refcount/user_comp was introduced to provide
the same functionality:

commit acd7c8fe ("IB/hfi1: Fix an Oops on pci device force remove")

This patch will remove this kobj field, as it is no longer needed.

Link: https://lore.kernel.org/r/20200316210500.7753.4145.stgit@awfm-01.aw.intel.comReviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

5ab17a24

IB/rdmavt: Delete unused routine · d61ba1b9

Mike Marciniszyn authored Mar 16, 2020

This routine was obsoleted by the patch below.

Delete it.

Fixes: a2a074ef ("RDMA: Handle ucontext allocations by IB/core")
Link: https://lore.kernel.org/r/20200316210454.7753.94689.stgit@awfm-01.aw.intel.comReviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

d61ba1b9

RDMA/hns: Check if depth of qp is 0 before configure · 026ded37

Lang Cheng authored Mar 12, 2020

Depth of qp shouldn't be allowed to be set to zero, after ensuring that,
subsequent process can be simplified. And when qp is changed from reset to
reset, the capability of minimum qp depth was used to identify hardware of
hip06, it should be changed into a more readable form.

Link: https://lore.kernel.org/r/1584006624-11846-1-git-send-email-liweihang@huawei.comSigned-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

026ded37

i40iw: Report correct firmware version · 4b34e23f

Sindhu, Devale authored Mar 13, 2020

The driver uses a hard-coded value for FW version and reports an
inconsistent FW version between ibv_devinfo and
/sys/class/infiniband/i40iw/fw_ver.

Retrieve the FW version via a Control QP (CQP) operation and report it
consistently across sysfs and query device.

Fixes: d3749841 ("i40iw: add files for iwarp interface")
Link: https://lore.kernel.org/r/20200313214406.2159-1-shiraz.saleem@intel.comReported-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Sindhu, Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

4b34e23f

RDMA/hns: Optimize wqe buffer set flow for post send · d6a3627e

Xi Wang authored Mar 10, 2020

Splits hns_roce_v2_post_send() into three sub-functions: set_rc_wqe(),
set_ud_wqe() and update_sq_db() to simplify the code.

Link: https://lore.kernel.org/r/1583839084-31579-6-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

d6a3627e

RDMA/hns: Optimize base address table config flow for qp buffer · 11334014

Xi Wang authored Mar 10, 2020

Currently, before the qp is created, a page size needs to be calculated
for the base address table to store all base addresses in the mtr. As a
result, the parameter configuration of the mtr is complex. So integrate
the process of calculating the base table page size into the hem related
interface to simplify the process of using mtr.

Link: https://lore.kernel.org/r/1583839084-31579-5-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

11334014

RDMA/hns: Optimize the wr opcode conversion from ib to hns · e363f7de

Xi Wang authored Mar 10, 2020

Simplify the wr opcode conversion from ib to hns by using a map table
instead of the switch-case statement.

Link: https://lore.kernel.org/r/1583839084-31579-4-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

e363f7de

RDMA/hns: Optimize wqe buffer filling process for post send · 00a59d30

Xi Wang authored Mar 10, 2020

Encapsulates the wqe buffer process details for datagram seg, fast mr seg
and atomic seg.

Link: https://lore.kernel.org/r/1583839084-31579-3-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

00a59d30

RDMA/hns: Rename wqe buffer related functions · 6c6e3921

Xi Wang authored Mar 10, 2020

There are serval global functions related to wqe buffer in the hns driver
and are called in different files. These symbols cannot directly represent
the namespace they belong to. So add prefix 'hns_roce_' to 3 wqe buffer
related global functions: get_recv_wqe(), get_send_wqe(), and
get_send_extend_sge().

Link: https://lore.kernel.org/r/1583839084-31579-2-git-send-email-liweihang@huawei.comSigned-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

6c6e3921

17 Mar, 2020 18 commits

RDMA/bnxt_re: Remove unnecessary sched count · 4e88cef1

Selvin Xavier authored Mar 13, 2020

Since the lifetime of bnxt_re_task is controlled by the kref of device,
sched_count is no longer required. Remove it.

Link: https://lore.kernel.org/r/1584117207-2664-4-git-send-email-selvin.xavier@broadcom.comSigned-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

4e88cef1

RDMA/bnxt_re: Fix lifetimes in bnxt_re_task · 8a6c6170

Jason Gunthorpe authored Mar 13, 2020

A work queue cannot just rely on the ib_device not being freed, it must
hold a kref on the memory so that the BNXT_RE_FLAG_IBDEV_REGISTERED check
works.

Fixes: 1ac5a404 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1584117207-2664-3-git-send-email-selvin.xavier@broadcom.comSigned-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

8a6c6170

RDMA/bnxt_re: Use ib_device_try_get() · 3cae5804

Jason Gunthorpe authored Mar 13, 2020

There are a couple places in this driver running from a work queue that
need the ib_device to be registered. Instead of using a broken internal
bit rely on the new core code to guarantee device registration.

Link: https://lore.kernel.org/r/1584117207-2664-2-git-send-email-selvin.xavier@broadcom.comSigned-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

3cae5804

RDMA/cm: Make sure the cm_id is in the IB_CM_IDLE state in destroy · 67b3c8dc

Jason Gunthorpe authored Mar 10, 2020

The first switch statement in cm_destroy_id() tries to move the ID to
either IB_CM_IDLE or IB_CM_TIMEWAIT. Both states will block concurrent
MAD handlers from progressing.

Previous patches removed the unreliably lock/unlock sequences in this
flow, this patch removes the extra locking steps and adds the missing
parts to guarantee that destroy reaches IB_CM_IDLE. There is no point in
leaving the ID in the IB_CM_TIMEWAIT state the memory about to be kfreed.

Rework things to hold the lock across all the state transitions and
directly assert when done that it ended up in IB_CM_IDLE as expected.

This was accompanied by a careful audit of all the state transitions here,
which generally did end up in IDLE on their success and non-racy paths.

Link: https://lore.kernel.org/r/20200310092545.251365-16-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

67b3c8dc

RDMA/cm: Allow ib_send_cm_sidr_rep() to be done under lock · 6a8824a7

Jason Gunthorpe authored Mar 10, 2020

The first thing ib_send_cm_sidr_rep() does is obtain the lock, so use the
usual unlocked wrapper, locked actor pattern here.

Get rid of the cm_reject_sidr_req() wrapper so each call site can call the
locked or unlocked version as required.

This avoids a sketchy lock/unlock sequence (which could allow state to
change) during cm_destroy_id().

Link: https://lore.kernel.org/r/20200310092545.251365-15-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

6a8824a7

RDMA/cm: Allow ib_send_cm_rej() to be done under lock · 81ddb41f

Jason Gunthorpe authored Mar 10, 2020

The first thing ib_send_cm_rej() does is obtain the lock, so use the usual
unlocked wrapper, locked actor pattern here.

This avoids a sketchy lock/unlock sequence (which could allow state to
change) during cm_destroy_id().

While here simplify some of the logic in the implementation.

Link: https://lore.kernel.org/r/20200310092545.251365-14-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

81ddb41f

RDMA/cm: Allow ib_send_cm_drep() to be done under lock · 87cabf3e

Jason Gunthorpe authored Mar 10, 2020

The first thing ib_send_cm_drep() does is obtain the lock, so use the
usual unlocked wrapper, locked actor pattern here.

This avoids a sketchy lock/unlock sequence (which could allow state to
change) during cm_destroy_id().

Link: https://lore.kernel.org/r/20200310092545.251365-13-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

87cabf3e

RDMA/cm: Allow ib_send_cm_dreq() to be done under lock · e029fdc0

Jason Gunthorpe authored Mar 10, 2020

The first thing ib_send_cm_dreq() does is obtain the lock, so use the
usual unlocked wrapper, locked actor pattern here.

This avoids a sketchy lock/unlock sequence (which could allow state to
change) during cm_destroy_id().

Link: https://lore.kernel.org/r/20200310092545.251365-12-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

e029fdc0

RDMA/cm: Add some lockdep assertions for cm_id_priv->lock · 00777a68

Jason Gunthorpe authored Mar 10, 2020

These functions all touch state, so must be called under the lock.
Inspection shows this is currently true.

Link: https://lore.kernel.org/r/20200310092545.251365-11-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

00777a68

RDMA/cm: Add missing locking around id.state in cm_dup_req_handler · d1de9a88

Jason Gunthorpe authored Mar 10, 2020

All accesses to id.state must be done under the spinlock.

Fixes: a977049d ("[PATCH] IB: Add the kernel CM implementation")
Link: https://lore.kernel.org/r/20200310092545.251365-10-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

d1de9a88

RDMA/cm: Make it clearer how concurrency works in cm_req_handler() · c206f8ba

Jason Gunthorpe authored Mar 10, 2020

ib_crate_cm_id() immediately places the id in the xarray, and publishes it
into the remote_id and remote_qpn rbtrees. This makes it visible to other
threads before it is fully set up.

It appears the thinking here was that the states IB_CM_IDLE and
IB_CM_REQ_RCVD do not allow any MAD handler or lookup in the remote_id and
remote_qpn rbtrees to advance.

However, cm_rej_handler() does take an action on IB_CM_REQ_RCVD, which is
not really expected by the design.

Make the whole thing clearer:
 - Keep the new cm_id out of the xarray until it is completely set up.
   This directly prevents MAD handlers and all rbtree lookups from seeing
   the pointer.
 - Move all the trivial setup right to the top so it is obviously done
   before any concurrency begins
 - Move the mutation of the cm_id_priv out of cm_match_id() and into the
   caller so the state transition is obvious
 - Place the manipulation of the work_list at the end, under lock, after
   the cm_id is placed in the xarray. The work_count cannot change on an
   ID outside the xarray.
 - Add some comments

Link: https://lore.kernel.org/r/20200310092545.251365-9-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

c206f8ba

RDMA/cm: Make it clear that there is no concurrency in cm_sidr_req_handler() · 083bfdbf

Jason Gunthorpe authored Mar 10, 2020

ib_create_cm_id() immediately places the id in the xarray, so it is visible
to network traffic.

The state is initially set to IB_CM_IDLE and all the MAD handlers will
test this state under lock and refuse to advance from IDLE, so adding to
the xarray is harmless.

Further, the set to IB_CM_SIDR_REQ_RCVD also excludes all MAD handlers.

However, the local_id isn't even used for SIDR mode, and there will be no
input MADs related to the newly created ID.

So, make the whole flow simpler so it can be understood:
 - Do not put the SIDR cm_id in the xarray. This directly shows that there
   is no concurrency
 - Delete the confusing work_count and pending_list manipulations. This
   mechanism is only used by MAD handlers and timewait, neither of which
   apply to SIDR.
 - Add a few comments and rename 'cur_cm_id_priv' to 'listen_cm_id_priv'
 - Move other loose sets up to immediately after cm_id creation so that
   the cm_id is fully configured right away. This fixes an oversight where
   the service_id will not be returned back on a IB_SIDR_UNSUPPORTED
   reject.

Link: https://lore.kernel.org/r/20200310092545.251365-8-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

083bfdbf

RDMA/cm: Read id.state under lock when doing pr_debug() · 153a2e43

Jason Gunthorpe authored Mar 10, 2020

The lock should not be dropped before doing the pr_debug() print as it is
accessing data protected by the lock, such as id.state.

Fixes: 119bf817 ("IB/cm: Add debug prints to ib_cm")
Link: https://lore.kernel.org/r/20200310092545.251365-7-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

153a2e43

RDMA/cm: Simplify establishing a listen cm_id · 98f67156

Jason Gunthorpe authored Mar 10, 2020

Any manipulation of cm_id->state must be done under the cm_id_priv->lock,
the two routines that added listens did not follow this rule, because they
never participate in any concurrent access around the state.

However, since this exception makes the code hard to understand, simplify
the flow so that it can be fully locked:
 - Move manipulation of listen_sharecount into cm_insert_listen() so it is
   trivially under the cm.lock without having to expose the cm.lock to the
   caller.
 - Push the cm.lock down into cm_insert_listen() and have the function
   increment the reference count before returning an existing pointer.
 - Split ib_cm_listen() into an cm_init_listen() and do not call
   ib_cm_listen() from ib_cm_insert_listen()
 - Make both ib_cm_listen() and ib_cm_insert_listen() directly call
   cm_insert_listen() under their cm_id_priv->lock which does both a
   collision detect and, if needed, the insert (atomically)
 - Enclose all state manipulation within the cm_id_priv->lock, notice this
   set can be done safely after cm_insert_listen() as no reader is allowed
   to read the state without holding the lock.
 - Do not set the listen cm_id in the xarray, as it is never correct to
   look it up. This makes the concurrency simpler to understand.

Many needless error unwinds are removed in the process.

Link: https://lore.kernel.org/r/20200310092545.251365-6-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

98f67156

RDMA/cm: Make the destroy_id flow more robust · 2305d686

Jason Gunthorpe authored Mar 10, 2020

Too much of the destruction is very carefully sensitive to the state
and various other things. Move more code to the unconditional path and
add several WARN_ONs to check consistency.

Link: https://lore.kernel.org/r/20200310092545.251365-5-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

2305d686

RDMA/cm: Remove a race freeing timewait_info · bede86a3

Jason Gunthorpe authored Mar 10, 2020

When creating a cm_id during REQ the id immediately becomes visible to the
other MAD handlers, and shortly after the state is moved to IB_CM_REQ_RCVD

This allows cm_rej_handler() to run concurrently and free the work:

        CPU 0                                CPU1
 cm_req_handler()
  ib_create_cm_id()
  cm_match_req()
    id_priv->state = IB_CM_REQ_RCVD
                                       cm_rej_handler()
                                         cm_acquire_id()
                                         spin_lock(&id_priv->lock)
                                         switch (id_priv->state)
  					   case IB_CM_REQ_RCVD:
                                            cm_reset_to_idle()
                                             kfree(id_priv->timewait_info);
   goto destroy
  destroy:
    kfree(id_priv->timewait_info);
                                             id_priv->timewait_info = NULL

Causing a double free or worse.

Do not free the timewait_info without also holding the
id_priv->lock. Simplify this entire flow by making the free unconditional
during cm_destroy_id() and removing the confusing special case error
unwind during creation of the timewait_info.

This also fixes a leak of the timewait if cm_destroy_id() is called in
IB_CM_ESTABLISHED with an XRC TGT QP. The state machine will be left in
ESTABLISHED while it needed to transition through IB_CM_TIMEWAIT to
release the timewait pointer.

Also fix a leak of the timewait_info if the caller mis-uses the API and
does ib_send_cm_reqs().

Fixes: a977049d ("[PATCH] IB: Add the kernel CM implementation")
Link: https://lore.kernel.org/r/20200310092545.251365-4-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

bede86a3

RDMA/cm: Fix checking for allowed duplicate listens · ca21cb7f

Jason Gunthorpe authored Mar 10, 2020

The test here typod the cm_id_priv to use, it used the one that was
freshly allocated. By definition the allocated one has the matching
cm_handler and zero context, so the condition was always true.

Instead check that the existing listening ID is compatible with the
proposed handler so that it can be shared, as was originally intended.

Fixes: 067b171b ("IB/cm: Share listening CM IDs")
Link: https://lore.kernel.org/r/20200310092545.251365-3-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

ca21cb7f

RDMA/cm: Fix ordering of xa_alloc_cyclic() in ib_create_cm_id() · e8dc4e88

Jason Gunthorpe authored Mar 10, 2020

xa_alloc_cyclic() is a SMP release to be paired with some later acquire
during xa_load() as part of cm_acquire_id().

As such, xa_alloc_cyclic() must be done after the cm_id is fully
initialized, in particular, it absolutely must be after the
refcount_set(), otherwise the refcount_inc() in cm_acquire_id() may not
see the set.

As there are several cases where a reader will be able to use the
id.local_id after cm_acquire_id in the IB_CM_IDLE state there needs to be
an unfortunate split into a NULL allocate and a finalizing xa_store.

Fixes: a977049d ("[PATCH] IB: Add the kernel CM implementation")
Link: https://lore.kernel.org/r/20200310092545.251365-2-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

e8dc4e88

13 Mar, 2020 5 commits

RDMA/hns: Fix wrong judgments of udata->outlen · 9e57a9aa

Weihang Li authored Mar 10, 2020

These judgments were used to keep the compatibility with older versions of
userspace that don't have the field named "cap_flags" in structure
hns_roce_ib_create_cq_resp. But it will be wrong to compare outlen with
the size of resp if another new field were added in resp. oulen should be
compared with the end offset of cap_flags in resp.

Fixes: 4f8f0d5e ("RDMA/hns: Package the flow of creating cq")
Link: https://lore.kernel.org/r/1583845569-47257-1-git-send-email-liweihang@huawei.comSigned-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

9e57a9aa

Merge branch 'mlx5_mr_cache' into rdma.git for-next · d613bd64

Jason Gunthorpe authored Mar 13, 2020

Leon Romanovsky says:

====================
This series fixes various corner cases in the mlx5_ib MR cache
implementation, see specific commit messages for more information.
====================

Based on the mlx5-next branch at
 git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Due to dependencies

* branch 'mlx5_mr-cache':
  RDMA/mlx5: Allow MRs to be created in the cache synchronously
  RDMA/mlx5: Revise how the hysteresis scheme works for cache filling
  RDMA/mlx5: Fix locking in MR cache work queue
  RDMA/mlx5: Lock access to ent->available_mrs/limit when doing queue_work
  RDMA/mlx5: Fix MR cache size and limit debugfs
  RDMA/mlx5: Always remove MRs from the cache before destroying them
  RDMA/mlx5: Simplify how the MR cache bucket is located
  RDMA/mlx5: Rename the tracking variables for the MR cache
  RDMA/mlx5: Replace spinlock protected write with atomic var
  {IB,net}/mlx5: Move asynchronous mkey creation to mlx5_ib
  {IB,net}/mlx5: Assign mkey variant in mlx5_ib only
  {IB,net}/mlx5: Setup mkey variant before mr create command invocation

d613bd64

RDMA/mlx5: Allow MRs to be created in the cache synchronously · aad719dc

Jason Gunthorpe authored Mar 10, 2020

If the cache is completely out of MRs, and we are running in cache mode,
then directly, and synchronously, create an MR that is compatible with the
cache bucket using a sleeping mailbox command. This ensures that the
thread that is waiting for the MR absolutely will get one.

When a MR allocated in this way becomes freed then it is compatible with
the cache bucket and will be recycled back into it.

Deletes the very buggy ent->compl scheme to create a synchronous MR
allocation.

Link: https://lore.kernel.org/r/20200310082238.239865-13-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

aad719dc

RDMA/mlx5: Revise how the hysteresis scheme works for cache filling · 1c78a21a

Jason Gunthorpe authored Mar 10, 2020

Currently if the work queue is running then it is in 'hysteresis' mode and
will fill until the cache reaches the high water mark. This implicit state
is very tricky and doesn't interact with pending very well.

Instead of self re-scheduling the work queue after the add_keys() has
started to create the new MR, have the queue scheduled from
reg_mr_callback() only after the requested MR has been added.

This avoids the bad design of an in-rush of queue'd work doing back to
back add_keys() until EAGAIN then sleeping. The add_keys() will be paced
one at a time as they complete, slowly filling up the cache.

Also, fix pending to be only manipulated under lock.

Link: https://lore.kernel.org/r/20200310082238.239865-12-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

1c78a21a

RDMA/mlx5: Fix locking in MR cache work queue · b9358bdb

Jason Gunthorpe authored Mar 10, 2020

All of the members of mlx5_cache_ent must be accessed while holding the
spinlock, add the missing spinlock in the __cache_work_func().

Using cache->stopped and flush_workqueue() is an inherently racy way to
shutdown self-scheduling work on a queue. Replace it with ent->disabled
under lock, and always check disabled before queuing any new work. Use
cancel_work_sync() to shutdown the queue.

Use READ_ONCE/WRITE_ONCE for dev->last_add to manage concurrency as
coherency is less important here.

Split fill_delay from the bitfield. C bitfield updates are not atomic and
this is just a mess. Use READ_ONCE/WRITE_ONCE, but this could also use
test_bit()/set_bit().

Link: https://lore.kernel.org/r/20200310082238.239865-11-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

b9358bdb