Commits · 2a7cec538169ada48a7810e77abff2ca99dbb098 · Kirill Smelkov / linux

17 Sep, 2020 1 commit

RDMA/cma: Fix locking for the RDMA_CM_CONNECT state · 2a7cec53

Jason Gunthorpe authored Sep 02, 2020

It is currently a bit confusing, but the design is if the handler_mutex
is held, and the state is in RDMA_CM_CONNECT, then the state cannot leave
RDMA_CM_CONNECT without also serializing with the handler_mutex.

Make this clearer by adding a direct assertion, fixing the usage in
rdma_connect and generally using READ_ONCE to read the state value.

Link: https://lore.kernel.org/r/20200902081122.745412-2-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

2a7cec53

16 Sep, 2020 3 commits

RDMA/ipoib: Convert to use DEFINE_SEQ_ATTRIBUTE macro · 3cc30e8d

Liu Shixin authored Sep 16, 2020

Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Link: https://lore.kernel.org/r/20200916025022.3992627-1-liushixin2@huawei.comSigned-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

3cc30e8d

RDMA/i40iw: Avoid typecast from void to pci_dev · 1d7c9958

Parav Pandit authored Sep 14, 2020

hw_context stores a pci_device pointer. Use the correct type instead of
void * and avoid typecasts.

Link: https://lore.kernel.org/r/20200914123528.270382-1-parav@mellanox.comSigned-off-by: Parav Pandit <parav@nvidia.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

1d7c9958

RDMA/qedr: Add support for user mode XRC-SRQ's · 06e8d1df

Yuval Basson authored Jul 22, 2020

Implement the XRC specific verbs. The additional QP type introduced new
logic to the rest of the verbs that now require distinguishing whether a
QP has an "RQ" or an "SQ" or both.

Link: https://lore.kernel.org/r/20200722102339.30104-1-ybason@marvell.comSigned-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Yuval Basson <ybason@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

06e8d1df

11 Sep, 2020 21 commits

RDMA/qedr: Fix function prototype parameters alignment · 9e054b13

Michal Kalderon authored Sep 02, 2020

Alignment of parameters was off by one

Link: https://lore.kernel.org/r/20200902165741.8355-9-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

9e054b13

RDMA/qedr: Fix inline size returned for iWARP · fbf58026

Michal Kalderon authored Sep 02, 2020

commit 59e8970b ("RDMA/qedr: Return max inline data in QP query
result") changed query_qp max_inline size to return the max roce inline
size. When iwarp was introduced, this should have been modified to return
the max inline size based on protocol. This size is cached in the device
attributes

Fixes: 69ad0e7f ("RDMA/qedr: Add support for iWARP in user space")
Link: https://lore.kernel.org/r/20200902165741.8355-8-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

fbf58026

RDMA/qedr: Fix iWARP active mtu display · cc293f54

Michal Kalderon authored Sep 02, 2020

Currently iWARP does not support mtu-change. Notify user when MTU changes
that reload of qedr is required for mtu to change. Display the correct
active mtu.

Fixes: f5b1b177 ("RDMA/qedr: Add iWARP support in existing verbs")
Link: https://lore.kernel.org/r/20200902165741.8355-7-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

cc293f54

qede: Notify qedr when mtu has changed · 97fb3e33

Michal Kalderon authored Sep 02, 2020

MTU change on ethtool is currently not supported for iWARP. Notify qedr
driver so that appropriate logging can take place.

Link: https://lore.kernel.org/r/20200902165741.8355-6-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

97fb3e33

RDMA/qedr: Fix return code if accept is called on a destroyed qp · 8a5a10a1

Michal Kalderon authored Sep 02, 2020

In iWARP, accept could be called after a QP is already destroyed. In this
case an error should be returned and not success.

Fixes: 82af6d19 ("RDMA/qedr: Fix synchronization methods and memory leaks in qedr")
Link: https://lore.kernel.org/r/20200902165741.8355-5-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

8a5a10a1

RDMA/qedr: Fix use of uninitialized field · a379ad54

Michal Kalderon authored Sep 02, 2020

dev->attr.page_size_caps was used uninitialized when setting device
attributes

Fixes: ec72fce4 ("qedr: Add support for RoCE HW init")
Link: https://lore.kernel.org/r/20200902165741.8355-4-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

a379ad54

RDMA/qedr: Fix doorbell setting · 0b1eddc1

Michal Kalderon authored Sep 02, 2020

Change the doorbell setting so that the maximum value between the last and
current value is set. This is to avoid doorbells being lost.

Fixes: a7efd777 ("qedr: Add support for PD,PKEY and CQ verbs")
Link: https://lore.kernel.org/r/20200902165741.8355-3-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

0b1eddc1

RDMA/qedr: Fix qp structure memory leak · 098e345a

Michal Kalderon authored Sep 02, 2020

The qedr_qp structure wasn't freed when the protocol was RoCE.  kmemleak
output when running basic RoCE scenario.

unreferenced object 0xffff927ad7e22c00 (size 1024):
  comm "ib_send_bw", pid 7082, jiffies 4384133693 (age 274.698s)
  hex dump (first 32 bytes):
    00 b0 cd a2 79 92 ff ff 00 3f a1 a2 79 92 ff ff  ....y....?..y...
    00 ee 5c dd 80 92 ff ff 00 f6 5c dd 80 92 ff ff  ..\.......\.....
  backtrace:
    [<00000000b2ba0f35>] qedr_create_qp+0xb3/0x6c0 [qedr]
    [<00000000e85a43dd>] ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x555/0xad0 [ib_uverbs]
    [<00000000fee4d029>] ib_uverbs_cmd_verbs+0xa5a/0xb80 [ib_uverbs]
    [<000000005d622660>] ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs]
    [<00000000eb4cdc71>] ksys_ioctl+0x87/0xc0
    [<00000000abe6b23a>] __x64_sys_ioctl+0x16/0x20
    [<0000000046e7cef4>] do_syscall_64+0x4d/0x90
    [<00000000c6948f76>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1212767e ("qedr: Add wrapping generic structure for qpidr and adjust idr routines.")
Link: https://lore.kernel.org/r/20200902165741.8355-2-michal.kalderon@marvell.comSigned-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

098e345a

RDMA/core: Added missing WR and WC opcodes · b60b9c02

Bob Pearson authored Sep 03, 2020

Add work completion opcodes to a new ib_uverbs_wc_opcode enum in
ib_user_verbs.h. This plays the same role as ib_uverbs_wr_opcode
documenting the opcodes in the user space API.

Assigned the IB_WC_XXX opcodes in ib_verbs.h to the IB_UVERBS_WC_XXX
where they are defined. This follows the same pattern as the IB_WR_XXX
opcodes. This fixes an incorrect value for LSO that had crept in but
is not currently being used.

Added a missing IB_WR_BIND_MW opcode in ib_verbs.h.

Link: https://lore.kernel.org/r/20200903224039.437391-2-rpearson@hpe.comSigned-off-by: Bob Pearson <rpearson@hpe.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

b60b9c02

RDMA/ocrdma: Remove fbo from MR · 1d4299ed

Jason Gunthorpe authored Sep 04, 2020

This is always the same value as IOVA masked by the page size, just use
that clearer calculation directly.

It is unclear of ocrdma hardware can actually support a true fbo, if so it
could use a different algorithm to compute the best page size.

Link: https://lore.kernel.org/r/17-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

1d4299ed

RDMA/qedr: Remove fbo and zbva from the MR · b3003a74

Jason Gunthorpe authored Sep 04, 2020

zbva is always false, so fbo is never read.

A 'zero-based-virtual-address' is simply IOVA == 0, and the driver already
supports this.

Link: https://lore.kernel.org/r/16-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.com
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

b3003a74

RDMA/mlx4: Use ib_umem_num_dma_blocks() · 81655d3c

Jason Gunthorpe authored Sep 04, 2020

For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size() use
ib_umem_num_dma_blocks() inside the function, it is just some weird static
default.

All other places are just using it with PAGE_SIZE, switch to
ib_umem_num_dma_blocks().

As this is the last call site, remove ib_umem_num_count().

Link: https://lore.kernel.org/r/15-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

81655d3c

RDMA/pvrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() · 87aebd3f

Jason Gunthorpe authored Sep 04, 2020

This driver always uses PAGE_SIZE.

Link: https://lore.kernel.org/r/14-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

87aebd3f

RDMA/ocrdma: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() · b8387f81

Jason Gunthorpe authored Sep 04, 2020

This driver always uses a DMA array made up of PAGE_SIZE elements, so just
use ib_umem_num_dma_blocks().

Since rdma_for_each_dma_block() always iterates exactly
ib_umem_num_dma_blocks() there is no need for the early exit check in
build_user_pbes(), delete it.

Link: https://lore.kernel.org/r/13-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

b8387f81

RDMA/hns: Use ib_umem_num_dma_blocks() instead of opencoding · cf9ce3c8

Jason Gunthorpe authored Sep 04, 2020

mtr_umem_page_count() does the same thing, replace it with the core code.

Also, ib_umem_find_best_pgsz() should always be called to check that the
umem meets the page_size requirement. If there is a limited set of
page_sizes that work it the pgsz_bitmap should be set to that set. 0 is a
failure and the umem cannot be used.

Lightly tidy the control flow to implement this flow properly.

Link: https://lore.kernel.org/r/12-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comAcked-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

cf9ce3c8

RDMA/bnxt: Do not use ib_umem_page_count() or ib_umem_num_pages() · 84e71b4d

Jason Gunthorpe authored Sep 04, 2020

ib_umem_page_count() returns the number of 4k entries required for a DMA
map, but bnxt_re already computes a variable page size. The correct API to
determine the size of the page table array is ib_umem_num_dma_blocks().

Fix the overallocation of the page array in fill_umem_pbl_tbl() when
working with larger page sizes by using the right function. Lightly
re-organize this function to make it clearer.

Replace the other calls to ib_umem_num_pages().

Fixes: d8558251 ("RDMA/bnxt_re: Use core helpers to get aligned DMA address")
Link: https://lore.kernel.org/r/11-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comAcked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

84e71b4d

RDMA/qedr: Use ib_umem_num_dma_blocks() instead of ib_umem_page_count() · 901bca71

Jason Gunthorpe authored Sep 04, 2020

The length of the list populated by qedr_populate_pbls() should be
calculated using ib_umem_num_dma_blocks() with the same size/shift passed
to qedr_populate_pbls().

Link: https://lore.kernel.org/r/10-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comAcked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

901bca71

RDMA/qedr: Use rdma_umem_for_each_dma_block() instead of open-coding · 68363052

Jason Gunthorpe authored Sep 04, 2020

This loop is splitting the DMA SGL into pg_shift sized pages, use the core
code for this directly.

Link: https://lore.kernel.org/r/9-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comAcked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

68363052

RDMA/i40iw: Use ib_umem_num_dma_pages() · 22123a0e

Jason Gunthorpe authored Sep 04, 2020

If ib_umem_find_best_pgsz() returns > PAGE_SIZE then the equation here is
not correct. 'start' should be 'virt'. Change it to use the core code for
page_num and the canonical calculation of page_shift.

Fixes: eb52c033 ("RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size")
Link: https://lore.kernel.org/r/8-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

22123a0e

RDMA/efa: Use ib_umem_num_dma_pages() · 1f9b6827

Jason Gunthorpe authored Sep 04, 2020

Fixes: 40ddb3f0 ("RDMA/efa: Use API to get contiguous memory blocks aligned to device supported page size")
Link: https://lore.kernel.org/r/7-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comTested-by: Gal Pressman <galpress@amazon.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

1f9b6827

RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks() · a665aca8

Jason Gunthorpe authored Sep 04, 2020

ib_umem_num_pages() should only be used by things working with the SGL in
CPU pages directly.

Drivers building DMA lists should use the new ib_num_dma_blocks() which
returns the number of blocks rdma_umem_for_each_block() will return.

To make this general for DMA drivers requires a different implementation.
Computing DMA block count based on umem->address only works if the
requested page size is < PAGE_SIZE and/or the IOVA == umem->address.

Instead the number of DMA pages should be computed in the IOVA address
space, not umem->address. Thus the IOVA has to be stored inside the umem
so it can be used for these calculations.

For now set it to umem->address by default and fix it up if
ib_umem_find_best_pgsz() was called. This allows drivers to be converted
to ib_umem_num_dma_blocks() safely.

Link: https://lore.kernel.org/r/6-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

a665aca8

09 Sep, 2020 15 commits

RDMA/umem: Replace for_each_sg_dma_page with rdma_umem_for_each_dma_block · 89603f7e

Jason Gunthorpe authored Sep 04, 2020

Generally drivers should be using this core helper to split up the umem
into DMA pages.

These drivers are all probably wrong in some way to pass PAGE_SIZE in as
the HW page size. Either the driver doesn't support other page sizes and
it should use 4096, or the driver does support other page sizes and should
use ib_umem_find_best_pgsz() to select the best HW pages size of the HW
supported set.

The only case it could be correct is if the HW has a global setting for
PAGE_SIZE set at driver initialization time.

Link: https://lore.kernel.org/r/5-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

89603f7e

RDMA/umem: Add rdma_umem_for_each_dma_block() · ebc24096

Jason Gunthorpe authored Sep 04, 2020

This helper does the same as rdma_for_each_block(), except it works on a
umem. This simplifies most of the call sites.

Link: https://lore.kernel.org/r/4-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comAcked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

ebc24096

RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz() · 3361c29e

Jason Gunthorpe authored Sep 04, 2020

The calculation in rdma_find_pg_bit() is fairly complicated, and the
function is never called anywhere else. Inline a simpler version into
ib_umem_find_best_pgsz()

Link: https://lore.kernel.org/r/3-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comSigned-off-by: Jason Gunthorpe <jgg@nvidia.com>

3361c29e

RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz() · 10c75ccb

Jason Gunthorpe authored Sep 04, 2020

rdma_for_each_block() makes assumptions about how the SGL is constructed
that don't work if the block size is below the page size used to to build
the SGL.

The rules for umem SGL construction require that the SG's all be PAGE_SIZE
aligned and we don't encode the actual byte offset of the VA range inside
the SGL using offset and length. So rdma_for_each_block() has no idea
where the actual starting/ending point is to compute the first/last block
boundary if the starting address should be within a SGL.

Fixing the SGL construction turns out to be really hard, and will be the
subject of other patches. For now block smaller pages.

Fixes: 4a353399 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/2-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comReviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

10c75ccb

RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary · a40c20da

Jason Gunthorpe authored Sep 04, 2020

It is possible for a single SGL to span an aligned boundary, eg if the SGL
is

  61440 -> 90112

Then the length is 28672, which currently limits the block size to
32k. With a 32k page size the two covering blocks will be:

  32768->65536 and 65536->98304

However, the correct answer is a 128K block size which will span the whole
28672 bytes in a single block.

Instead of limiting based on length figure out which high IOVA bits don't
change between the start and end addresses. That is the highest useful
page size.

Fixes: 4a353399 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/1-v2-270386b7e60b+28f4-umem_1_jgg@nvidia.comReviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

a40c20da

RDMA: Make counters destroy symmetrical · 71ff3f62

Leon Romanovsky authored Sep 07, 2020

Change counters to return failure like any other verbs destroy, however
this flow shouldn't return error at all.

Link: https://lore.kernel.org/r/20200907120921.476363-10-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

71ff3f62

RDMA: Restore ability to return error for destroy WQ · add53535

Leon Romanovsky authored Sep 07, 2020

Make this interface symmetrical to other destroy paths.

Fixes: a49b1dc7 ("RDMA: Convert destroy_wq to be void")
Link: https://lore.kernel.org/r/20200907120921.476363-9-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

add53535

RDMA: Change XRCD destroy return value · d0c45c85

Leon Romanovsky authored Sep 07, 2020

Update XRCD destroy flow to allow command failure.

Fixes: 28ad5f65 ("RDMA: Move XRCD to be under ib_core responsibility")
Link: https://lore.kernel.org/r/20200907120921.476363-8-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

d0c45c85

RDMA: Allow fail of destroy CQ · 43d781b9

Leon Romanovsky authored Sep 07, 2020

Like any other verbs objects, CQ shouldn't fail during destroy, but
mlx5_ib didn't follow this contract with mixed IB verbs objects with
DEVX. Such mix causes to the situation where FW and kernel are fully
interdependent on the reference counting of each side.

Kernel verbs and drivers that don't have DEVX flows shouldn't fail.

Fixes: e39afe3d ("RDMA: Convert CQ allocations to be under core responsibility")
Link: https://lore.kernel.org/r/20200907120921.476363-7-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

43d781b9

RDMA/core: Delete function indirection for alloc/free kernel CQ · 7e3c66c9

Leon Romanovsky authored Sep 07, 2020

The ib_alloc_cq*() and ib_free_cq*() are solely kernel verbs to manage CQs
and doesn't need extra indirection just to call same functions with
constant parameter NULL as udata.

Link: https://lore.kernel.org/r/20200907120921.476363-6-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

7e3c66c9

RDMA: Restore ability to fail on SRQ destroy · 119181d1

Leon Romanovsky authored Sep 07, 2020

In similar way to other IB objects, restore the ability to return error on
SRQ destroy. Strictly speaking, this change is not necessary, and provided
here to ensure a symmetrical interface like other destroy functions.

Fixes: 68e326de ("RDMA: Handle SRQ allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

119181d1

RDMA/mlx5: Issue FW command to destroy SRQ on reentry · fd89099d

Leon Romanovsky authored Sep 07, 2020

The HW release can fail and leave the system in limbo state, where SRQ is
removed from the table, but can't be destroyed later. In every reentry,
the initial xa_erase_irq() check will fail.

Rewrite the erase logic to keep index, but don't store the entry
itself. By doing it, we can safely reinsert entry back in the case of
destroy failure.

Link: https://lore.kernel.org/r/20200907120921.476363-4-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

fd89099d

RDMA: Restore ability to fail on AH destroy · 9a9ebf8c

Leon Romanovsky authored Sep 07, 2020

Like any other IB verbs objects, AH are refcounted by ib_core. The release
of those objects are controlled by ib_core with promise that AH destroy
can't fail.

Being SW object for now, this change makes dealloc_ah() to behave like any
other destroy IB flows.

Fixes: d3456914 ("RDMA: Handle AH allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-3-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

9a9ebf8c

RDMA: Restore ability to fail on PD deallocate · 91a7c58f

Leon Romanovsky authored Sep 07, 2020

The IB verbs objects are counted by the kernel and ib_core ensures that
deallocate PD will success so it will be called once all other objects
that depends on PD will be released. This is achieved by managing various
reference counters on such objects.

The mlx5 driver didn't follow this standard flow when allowed DEVX objects
that are not managed by ib_core to be interleaved with the ones under
ib_core responsibility.

In such interleaved scenarios deallocate command can fail and ib_core will
leave uobject in internal DB and attempt to clean it later to free
resources anyway.

This change partially restores returned value from dealloc_pd() for all
drivers, but keeping in mind that non-DEVX devices and kernel verbs paths
shouldn't fail.

Fixes: 21a428a0 ("RDMA: Handle PD allocations by IB/core")
Link: https://lore.kernel.org/r/20200907120921.476363-2-leon@kernel.orgSigned-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

91a7c58f

RDMA/rtrs-srv: Incorporate ib_register_client into rtrs server init · 558d52b2

Md Haris Iqbal authored Sep 07, 2020

The rnbd_server module's communication manager (cm) initialization depends
on the registration of the "network namespace subsystem" of the RDMA CM
agent module. As such, when the kernel is configured to load the
rnbd_server and the RDMA cma module during initialization; and if the
rnbd_server module is initialized before RDMA cma module, a null ptr
dereference occurs during the RDMA bind operation.

Call trace:

Call Trace:
? xas_load+0xd/0x80
xa_load+0x47/0x80
cma_ps_find+0x44/0x70
rdma_bind_addr+0x782/0x8b0
? get_random_bytes+0x35/0x40
rtrs_srv_cm_init+0x50/0x80
rtrs_srv_open+0x102/0x180
? rnbd_client_init+0x6e/0x6e
rnbd_srv_init_module+0x34/0x84
? rnbd_client_init+0x6e/0x6e
do_one_initcall+0x4a/0x200
kernel_init_freeable+0x1f1/0x26e
? rest_init+0xb0/0xb0
kernel_init+0xe/0x100
ret_from_fork+0x22/0x30
Modules linked in:
CR2: 0000000000000015

All this happens cause the cm init is in the call chain of the module
init, which is not a preferred practice.

So remove the call to rdma_create_id() from the module init call chain.
Instead register rtrs-srv as an ib client, which makes sure that the
rdma_create_id() is called only when an ib device is added.

Fixes: 9cb83748 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20200907103106.104530-1-haris.iqbal@cloud.ionos.comReported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

558d52b2