1. 22 Feb, 2019 2 commits
    • bnxt_re: fix the regression due to changes in alloc_pbl · c50866e2
      Devesh Sharma authored
      While adding the use of the for_each_sg_dma_page iterator for Broadcom's
      rdma driver, a regression was introduced in the __alloc_pbl path. The
      change left bnxt_re in a DOA state in the for-next branch.
      
      Fix the regression to avoid the host crash when a user space object is
      created, by restricting the unconditional access to hwq.pg_arr to when
      hwq is initialized for user space objects.
      
      Fixes: 161ebe24 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
      Reported-by: Gal Pressman <galpress@amazon.com>
      Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
      Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      c50866e2
    • IB/mlx4: Increase the timeout for CM cache · 2612d723
      Håkon Bugge authored
      Using CX-3 virtual functions, either from a bare-metal machine or
      pass-through from a VM, MAD packets are proxied through the PF driver.
      
      Since the VF drivers have separate namespaces for MAD Transaction Ids
      (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
      in a cache.
      
      Following the RDMA Connection Manager (CM) protocol, it is clear when
      an entry has to be evicted from the cache. But life is not perfect;
      remote peers may die or be rebooted. Hence, there is a timeout for
      wiping out a cache entry, after which the PF driver assumes the remote
      peer has gone.
      
      During workloads where a high number of QPs are destroyed concurrently,
      an excessive number of CM DREQ retries has been observed.
      
      The problem can be demonstrated in a bare-metal environment, where two
      nodes have instantiated 8 VFs each. These are dual-ported HCAs, so we
      have 16 vPorts per physical server.
      
      64 processes are associated with each vPort, and each creates and
      destroys one QP towards each of the 64 remote processes. That is, 1024
      QPs per vPort, 16K QPs in all. The QPs are created/destroyed using the
      CM.
      
      When tearing down these 16K QPs, excessive CM DREQ retries (and
      duplicates) are observed. With some cat/paste/awk wizardry on the
      infiniband_cm sysfs counters, we observe the following sums over the 16
      vPorts on one of the nodes:
      
      cm_rx_duplicates:
            dreq  2102
      cm_rx_msgs:
            drep  1989
            dreq  6195
             rep  3968
             req  4224
             rtu  4224
      cm_tx_msgs:
            drep  4093
            dreq 27568
             rep  4224
             req  3968
             rtu  3968
      cm_tx_retries:
            dreq 23469
      
      Note that the active/passive side is equally distributed between the
      two nodes.
      
      Enabling pr_debug in cm.c gives tons of:
      
      [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
      1,sl_cm_id: 0xd393089f} is NULL!
      
      By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
      tear-down phase of the application is reduced from approximately 90 to
      50 seconds. Retries/duplicates are also significantly reduced:
      
      cm_rx_duplicates:
            dreq  2460
      []
      cm_tx_retries:
            dreq  3010
             req    47
      
      Increasing the timeout further didn't help, as these duplicates and
      retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
      on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
      numbers fell to about 10 for both of them.
      
      Adjustment of the CMA timeout is not part of this commit.
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      2612d723
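      A minimal sketch of the timeout bump described in this entry, assuming
      the constant lives in the mlx4 CM code and is expressed in jiffies (HZ);
      this is illustrative, not the verbatim patch:

        /* mlx4 CM id-map cleanup timeout: raised from 5 to 30 seconds so
         * that slow remote peers are not evicted from the TID re-mapping
         * cache while DREQ/DREP exchanges are still in flight. */
        #define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)     /* was (5 * HZ) */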
  2. 21 Feb, 2019 6 commits
    • IB/core: Abort page fault handler silently during owning process exit · 4438ee3f
      Moni Shoua authored
      It is possible that during page fault handling the process that owns the
      MR is terminating. The indication for this is a failure to get the
      task_struct or to take a reference on the mm_struct. In this case, just
      abort the page-fault handler with an error, but without a warning to the
      kernel log.
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      4438ee3f
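      A hedged sketch of the behaviour described in this entry; the variable
      and field names (odp->tgid, odp->umem.owning_mm) are assumptions for
      illustration, not the exact ODP code:

        /* Owning process is exiting: fail the page fault quietly, no WARN(). */
        owning_process = get_pid_task(odp->tgid, PIDTYPE_PID);
        if (!owning_process || !mmget_not_zero(odp->umem.owning_mm)) {
                ret = -EINVAL;          /* silent abort; caller just unwinds */
                goto out;
        }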
    • IB/mlx5: Validate correct PD before prefetch MR · 81dd4c4b
      Moni Shoua authored
      When prefetching an ODP MR it is required to verify that the PD of the
      MR is identical to the PD with which the advise_mr request arrived.
      
      This check was missing from the synchronous flow and is added now.
      
      Fixes: 813e90b1 ("IB/mlx5: Add advise_mr() support")
      Reported-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      81dd4c4b
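      A hedged sketch of the added check, assuming the MR has already been
      resolved from the advise_mr request; mr->pd is a real field of
      struct ib_mr, the surrounding code is illustrative:

        /* Reject a prefetch whose MR does not belong to the PD that the
         * advise_mr() request was issued on. */
        if (!mr || mr->pd != pd)
                return -EINVAL;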
    • IB/mlx5: Protect against prefetch of invalid MR · a6bc3875
      Moni Shoua authored
      When deferring a prefetch request we need to protect against MR or PD
      being destroyed while the request is still enqueued.
      
      The first step is to validate that the PD owns the lkey that describes
      the MR and that the MR the lkey refers to is owned by that PD.
      
      The second step is to dequeue all requests when MR is destroyed.
      
      Since a PD can't be destroyed while it owns MRs, it is guaranteed that
      when a worker wakes up, the request it refers to is still valid.
      
      Now, it is possible to refrain from taking a reference on the device,
      since it is assured to be present as long as the PD is.
      
      While at it, replace the dedicated ordered workqueue with the system
      unbound workqueue to reuse an existing resource and improve
      performance. This also fixes a bug of queueing to the wrong workqueue.
      
      Fixes: 813e90b1 ("IB/mlx5: Add advise_mr() support")
      Reported-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      a6bc3875
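      A hedged sketch of the workqueue part of this change: deferred prefetch
      work lands on the stock system unbound workqueue instead of a dedicated
      ordered one (the work-struct name is an assumption):

        /* Defer the prefetch; system_unbound_wq replaces the driver's
         * private ordered workqueue, so no dedicated resource is needed. */
        queue_work(system_unbound_wq, &prefetch_work->work);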
    • RDMA/uverbs: Store PD pointer before it is overwritten · 25fd08eb
      Leon Romanovsky authored
      The IB_MR_REREG_PD command rewrites mr->pd after a successful
      rereg_user_mr(); such a change causes the usecnt information to be lost
      and produces the following warning:
      
       WARNING: CPU: 1 PID: 1771 at drivers/infiniband/core/verbs.c:336 ib_dealloc_pd+0x4e/0x60 [ib_core]
       CPU: 1 PID: 1771 Comm: rereg_mr Tainted: G        W  OE 5.0.0-rc7-for-upstream-perf-2019-02-20_14-03-40-34 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
       RIP: 0010:ib_dealloc_pd+0x4e/0x60 [ib_core]
       RSP: 0018:ffffc90003923dc0 EFLAGS: 00010286
       RAX: 00000000ffffffff RBX: ffff88821f7f0400 RCX: ffff888236a40c00
       RDX: ffff88821f7f0400 RSI: 0000000000000001 RDI: 0000000000000000
       RBP: 0000000000000001 R08: ffff88835f665d80 R09: ffff8882209c90d8
       R10: ffff88835ec003e0 R11: 0000000000000000 R12: ffff888221680ba0
       R13: ffff888221680b00 R14: 00000000ffffffea R15: ffff88821f53c318
       FS:  00007f70db11e740(0000) GS:ffff88835f640000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000001dfd030 CR3: 000000029d9d8000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        uverbs_free_pd+0x2d/0x30 [ib_uverbs]
        destroy_hw_idr_uobject+0x16/0x40 [ib_uverbs]
        uverbs_destroy_uobject+0x28/0x170 [ib_uverbs]
        __uverbs_cleanup_ufile+0x6b/0x90 [ib_uverbs]
        uverbs_destroy_ufile_hw+0x8b/0x110 [ib_uverbs]
        ib_uverbs_close+0x1f/0x80 [ib_uverbs]
        __fput+0xb1/0x220
        task_work_run+0x7f/0xa0
        exit_to_usermode_loop+0x6b/0xb2
        do_syscall_64+0xc5/0x100
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f70dad00664
      
      Fixes: e278173f ("RDMA/core: Cosmetic change - move member initialization to correct block")
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Majd Dibbiny <majd@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      25fd08eb
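      A hedged sketch of the fix: capture the PD pointer before the rereg call
      may overwrite mr->pd, so the usecnt bookkeeping still sees the original
      PD. rereg_mr_locally() is a hypothetical stand-in for the driver call;
      mr->pd and pd->usecnt are real fields:

        struct ib_pd *old_pd = mr->pd;          /* save before it is rewritten */

        ret = rereg_mr_locally(mr, flags, new_pd);      /* may set mr->pd = new_pd */
        if (!ret && (flags & IB_MR_REREG_PD)) {
                atomic_dec(&old_pd->usecnt);            /* not mr->pd, which changed */
                atomic_inc(&new_pd->usecnt);
        }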
    • IB/hfi1: Add missing break in switch statement · 7264235e
      Gustavo A. R. Silva authored
      Fix the following warning by adding a missing break:
      
      drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
      drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
         switch (prev->wr.opcode) {
         ^~~~~~
      drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
        case IB_WR_RDMA_READ:
        ^~~~
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enable
      -Wimplicit-fallthrough.
      
      Fixes: c6c23117 ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      7264235e
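      A self-contained illustration of the class of bug -Wimplicit-fallthrough
      catches (not the hfi1 code itself): without the break, the first case
      silently falls into the second and its result is clobbered:

        enum wr_op { WR_TID_RDMA_WRITE, WR_RDMA_READ, WR_OTHER };

        static int classify(enum wr_op prev)
        {
                int ret = 0;

                switch (prev) {
                case WR_TID_RDMA_WRITE:
                        ret = 1;
                        break;  /* the missing break: without it, execution
                                 * falls through and ret is overwritten below */
                case WR_RDMA_READ:
                        ret = 2;
                        break;
                default:
                        break;
                }
                return ret;
        }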
    • Merge branch 'mlx5-next' into rdma.git for-next · 815f7480
      Jason Gunthorpe authored
      From
      git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
      
      To resolve conflicts with net-next and pick up the first patch.
      
      * branch 'mlx5-next':
        net/mlx5: Factor out HCA capabilities functions
        IB/mlx5: Add support for 50Gbps per lane link modes
        net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
        net/mlx5: Add new fields to Port Type and Speed register
        net/mlx5: Refactor queries to speed fields in Port Type and Speed register
        net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
        net/mlx5: Relocate vport macros to the vport header file
        net/mlx5: E-Switch, Normalize the name of uplink vport number
        net/mlx5: Provide an alternative VF upper bound for ECPF
        net/mlx5: Add host params change event
        net/mlx5: Add query host params command
        net/mlx5: Update enable HCA dependency
        net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
        IB/mlx5: Use unified register/load function for uplink and VF vports
        net/mlx5: Use consistent vport num argument type
        net/mlx5: Use void pointer as the type in address_of macro
        net/mlx5: Align ODP capability function with netdev coding style
        mlx5: use RCU lock in mlx5_eq_cq_get()
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      815f7480
  3. 20 Feb, 2019 18 commits
    • drivers/IB,qib: Fix pinned/locked limit check in qib_get_user_pages() · ec95e0fa
      Davidlohr Bueso authored
      The current check does not take into account the previous value of
      pinned_vm; thus it is quite bogus as is. Fix this by checking the
      new value after the (optimistic) atomic inc.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      ec95e0fa
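      A hedged sketch of the corrected ordering: account the new pages first,
      then compare the resulting total against the limit. It assumes pinned_vm
      is an atomic64_t in this kernel generation and trims error handling:

        static int qib_pinned_check_sketch(unsigned long num_pages)
        {
                unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
                unsigned long locked;

                /* The old code compared only num_pages against the limit,
                 * ignoring what was already pinned by this mm; comparing the
                 * post-increment total fixes that. */
                locked = atomic64_add_return(num_pages, &current->mm->pinned_vm);
                if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
                        atomic64_sub(num_pages, &current->mm->pinned_vm);
                        return -ENOMEM;
                }
                return 0;
        }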
    • RDMA/core: Verify that memory window type is legal · d0e02bf6
      Noa Osherovich authored
      Before calling the provider's alloc_mw function, verify that the
      given memory type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.
      Signed-off-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      d0e02bf6
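      A minimal sketch of the validation described in this entry, performed
      before the provider's alloc_mw is invoked; cmd.mw_type is assumed to be
      the uverbs command field carrying the requested type:

        /* Only the two architected memory window types are acceptable. */
        if (cmd.mw_type != IB_MW_TYPE_1 && cmd.mw_type != IB_MW_TYPE_2)
                return -EINVAL;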
    • RDMA/iwcm: Fix string truncation error · 1882ab86
      Leon Romanovsky authored
      The strlen() check at the beginning of iw_cm_map() ensures that the
      devname and ifname strings are shorter than the destinations to which
      they are supposed to be copied. Change the strncpy() call to strcpy(),
      because we are protected from overflow. Zero the entire string buffer to
      avoid copying uninitialized kernel stack memory to userspace.
      
      This fixes the compilation warning below:
      
      In file included from ./include/linux/dma-mapping.h:6,
                       from drivers/infiniband/core/iwcm.c:38:
      In function 'strncpy',
          inlined from 'iw_cm_map' at drivers/infiniband/core/iwcm.c:519:2:
      ./include/linux/string.h:253:9: warning: '__builtin_strncpy' specified
      bound 32 equals destination size [-Wstringop-truncation]
        return __builtin_strncpy(p, q, size);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fixes: d53ec8af ("RDMA/iwcm: Don't copy past the end of dev_name() string")
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      1882ab86
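      A hedged sketch of the pattern the fix settles on: the destination is
      zeroed up front and, since the lengths are validated first, a plain
      strcpy() no longer trips -Wstringop-truncation. Struct and field names
      are illustrative, not quoted from the patch:

        struct iwpm_dev_data pm_reg_msg = {};   /* zeroed: no stack leak to userspace */

        if (strlen(dev_name) >= sizeof(pm_reg_msg.dev_name) ||
            strlen(ifname) >= sizeof(pm_reg_msg.if_name))
                return -EINVAL;

        strcpy(pm_reg_msg.dev_name, dev_name);  /* lengths checked above */
        strcpy(pm_reg_msg.if_name, ifname);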
    • RDMA/core: Cosmetic change - move member initialization to correct block · e278173f
      Yuval Shaia authored
      old_pd is used only if the IB_MR_REREG_PD flag is set.
      For readability, move its initialization to where it is used.
      
      While there, rewrite the whole 'if-else' block so that on error we jump
      directly to the label, with no need for an 'else'.
      Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      e278173f
    • iw_cxgb4: Make function read_tcb() static · 3b8f8b95
      Wei Yongjun authored
      Fixes the following sparse warning:
      
      drivers/infiniband/hw/cxgb4/cm.c:658:6: warning:
       symbol 'read_tcb' was not declared. Should it be static?
      
      Fixes: 11a27e21 ("iw_cxgb4: complete the cached SRQ buffers")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Acked-by: Raju Rangoju <rajur@chelsio.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      3b8f8b95
    • RDMA/hns: Bugfix for set hem of SCC · 6ac16e40
      Yangyang Li authored
      The method of setting hem for the SCC context is different from that of
      other contexts. The driver should notify the hardware with the detailed
      idx in bt0 for SCC, while for other contexts it only needs to notify the
      bt step and the hardware will calculate the idx.
      
      This fixes the following error when unloading the hip08 driver:
      
      [  123.570768] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
      [  123.579023] {1}[Hardware Error]: event severity: recoverable
      [  123.584670] {1}[Hardware Error]:  Error 0, type: recoverable
      [  123.590317] {1}[Hardware Error]:   section_type: PCIe error
      [  123.595877] {1}[Hardware Error]:   version: 4.0
      [  123.600395] {1}[Hardware Error]:   command: 0x0006, status: 0x0010
      [  123.606562] {1}[Hardware Error]:   device_id: 0000:7d:00.0
      [  123.612034] {1}[Hardware Error]:   slot: 0
      [  123.616120] {1}[Hardware Error]:   secondary_bus: 0x00
      [  123.621245] {1}[Hardware Error]:   vendor_id: 0x19e5, device_id: 0xa222
      [  123.627847] {1}[Hardware Error]:   class_code: 000002
      [  123.632977] hns3 0000:7d:00.0: aer_status: 0x00000000, aer_mask: 0x00000000
      [  123.639928] hns3 0000:7d:00.0: aer_layer=Transaction Layer, aer_agent=Receiver ID
      [  123.647400] hns3 0000:7d:00.0: aer_uncor_severity: 0x00000000
      [  123.653136] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
      [  123.658959] hns3 0000:7d:00.0: ROCEE uncorrected RAS error identified
      [  123.665395] hns3 0000:7d:00.0: ROCEE RAS AXI rresp error
      [  123.670713] hns3 0000:7d:00.0: requesting reset due to PCI error
      [  123.676715] hns3 0000:7d:00.0: received reset event , reset type is 5
      [  123.683147] hns3 0000:7d:00.0: AER: Device recovery successful
      [  123.688978] hns3 0000:7d:00.0: PF Reset requested
      [  123.693684] hns3 0000:7d:00.0: PF failed(=-5) to send mailbox message to VF
      [  123.700633] hns3 0000:7d:00.0: inform reset to vf(1) failded -5!
      
      Fixes: 6a157f7d ("RDMA/hns: Add SCC context allocation support for hip08")
      Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
      Reviewed-by: Yixian Liu <liuyixian@huawei.com>
      Reviewed-by: Lijun Ou <oulijun@huawei.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      6ac16e40
    • RDMA/hns: Modify qp&cq&pd specification according to UM · 3e394f94
      Lijun Ou authored
      According to hip08's limitation, the qp&cq specification is 1M and the
      mtpt specification is 1M in kernel space.
      Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
      Signed-off-by: Lijun Ou <oulijun@huawei.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      3e394f94
    • lib/irq_poll: Support schedules in non-interrupt contexts · 4133b013
      Steve Wise authored
      Do not assume irq_poll_sched() is called only from interrupt context.
      Use raise_softirq_irqoff() instead of __raise_softirq_irqoff() so that
      ksoftirqd is kicked when the schedule comes from a non-interrupt context.
      
      This is required for RDMA drivers, like soft iwarp, that generate cq
      completion notifications in a workqueue or kthread context.  Without this
      change, siw completion notifications to the ULP can take several hundred
      usecs, depending on the system load.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      4133b013
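      A hedged sketch of irq_poll_sched() after the change; the per-CPU list
      name follows lib/irq_poll.c, but treat this as illustrative rather than
      the verbatim diff:

        void irq_poll_sched_sketch(struct irq_poll *iop)
        {
                unsigned long flags;

                local_irq_save(flags);
                list_add_tail(&iop->list, this_cpu_ptr(&blk_cpu_iopoll));
                /* raise_softirq_irqoff() (unlike __raise_softirq_irqoff())
                 * wakes ksoftirqd when we are not already in interrupt
                 * context, so completion notifications are not delayed. */
                raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
                local_irq_restore(flags);
        }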
    • rdma_rxe: Use netlink messages to add/delete links · 66920e1b
      Steve Wise authored
      Add support for the RDMA_NLDEV_CMD_NEWLINK/DELLINK messages which allow
      dynamically adding new RXE links.  Deprecate the old module options for
      now.
      
      Cc: Moni Shoua <monis@mellanox.com>
      Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      66920e1b
    • RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support · 3856ec4b
      Steve Wise authored
      Add support for new LINK messages to allow adding and deleting rdma
      interfaces.  This will be used initially for soft rdma drivers which
      instantiate device instances dynamically by the admin specifying a netdev
      device to use.  The rdma_rxe module will be the first user of these
      messages.
      
      The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
      with the rdma core if they provide link add/delete functions.  Each driver
      registers with a unique "type" string, that is used to dispatch messages
      coming from user space.  A new RDMA_NLDEV_ATTR is defined for the "type"
      string.  User mode will pass 3 attributes in a NEWLINK message:
      RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
      RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
      RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
      link.  The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
      the device to delete.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      3856ec4b
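      A hedged sketch of how a soft driver such as rxe would plug into the new
      LINK machinery; the ops structure and registration call follow the
      description above, but the exact field spelling may differ from the
      final API:

        static int rxe_newlink(const char *ibdev_name, struct net_device *ndev)
        {
                /* create an rxe device named ibdev_name on top of ndev */
                return 0;
        }

        static struct rdma_link_ops rxe_link_ops = {
                .type    = "rxe",       /* matched against RDMA_NLDEV_ATTR_LINK_TYPE */
                .newlink = rxe_newlink,
        };

        /* in module_init(): */
        rdma_link_register(&rxe_link_ops);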
    • IB/usnic: Fix deadlock · 5bb3c1e9
      Parvi Kaustubhi authored
      There is a deadlock in the usnic ib_register and netdev_notify paths.
      
      	usnic_ib_discover_pf()
      	| mutex_lock(&usnic_ib_ibdev_list_lock);
      	 | usnic_ib_device_add();
      	  | ib_register_device()
      	   | usnic_ib_query_port()
      	    | mutex_lock(&us_ibdev->usdev_lock);
      	     | ib_get_eth_speed()
      	      | rtnl_lock()
      
      order of lock: &usnic_ib_ibdev_list_lock -> usdev_lock -> rtnl_lock
      
      	rtnl_lock()
      	 | usnic_ib_netdevice_event()
      	  | mutex_lock(&usnic_ib_ibdev_list_lock);
      
      order of lock: rtnl_lock -> &usnic_ib_ibdev_list_lock
      
      The solution is to use the core's lock-free ib_device_get_by_netdev()
      scheme to look up the ib_dev while handling netdev & inet events.
      Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
      Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
      Reviewed-by: Tanmay Inamdar <tinamdar@cisco.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      5bb3c1e9
    • RDMA/rxe: Close a race after ib_register_device · ca22354b
      Jason Gunthorpe authored
      Since rxe allows unregistration from other threads, the rxe pointer can
      become invalid at any moment after ib_register_driver returns. This
      could cause a user-triggered use after free.
      
      Add another driver callback to be called right after the device becomes
      registered to complete any device setup required post-registration.  This
      callback has enough core locking to prevent the device from becoming
      unregistered.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      ca22354b
    • RDMA/rxe: Add ib_device_get_by_name() and use it in rxe · 6cc2c8e5
      Jason Gunthorpe authored
      rxe has an open coded version of this that is not as safe as the core
      version. This lets us eliminate the internal device list entirely from
      rxe.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      6cc2c8e5
    • RDMA/rxe: Use driver_unregister and new unregistration API · c367074b
      Jason Gunthorpe authored
      rxe does not have correct locking for its registration/unregistration
      paths, use the core code to handle it instead. In this mode
      ib_unregister_device will also do the dealloc, so rxe is required to do
      clean up from a callback.
      
      The core code ensures that unregistration is done only once, and generally
      takes care of locking and concurrency problems for rxe.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      c367074b
    • RDMA/device: Provide APIs from the core code to help unregistration · d0899892
      Jason Gunthorpe authored
      These APIs are intended to support drivers that exist outside the usual
      driver core probe()/remove() callbacks. Normally the driver core will
      prevent remove() from running concurrently with probe(); once this
      safety is lost, drivers need more support to get the locking and
      lifetimes right.
      
      ib_unregister_driver() is intended to be used during module_exit of a
      driver using these APIs. It unregisters all the associated ib_devices.
      
      ib_unregister_device_and_put() is to be used by a driver-specific
      removal function (i.e. removal by name, removal from a netdev notifier,
      removal from netlink).
      
      ib_unregister_device_queued() is to be used from netdev notifier chains
      where RTNL is held.
      
      The locking is tricky here since, once things become async, it is
      possible to race unregister with registration. This is largely solved by
      relying on the registration refcount: unregistration will only ever work
      on something that has a positive registration refcount - and then an
      unregistration mutex serializes all competing unregistrations of the
      same device.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      d0899892
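      A hedged sketch of how a driver's module_exit would use the new API, per
      the description above; the rxe names are illustrative:

        static void __exit rxe_module_exit(void)
        {
                /* Unregisters every ib_device this driver instantiated and
                 * waits for the unregistrations to complete. */
                ib_unregister_driver(RDMA_DRIVER_RXE);
        }
        module_exit(rxe_module_exit);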
    • RDMA/rxe: Use ib_device_get_by_netdev() instead of open coding · 4c173f59
      Jason Gunthorpe authored
      The core API handles the locking correctly and is faster if there are
      multiple devices.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      4c173f59
    • RDMA/device: Add ib_device_get_by_netdev() · 324e227e
      Jason Gunthorpe authored
      Several drivers need to find the ib_device from a given netdev. rxe
      needs this at speed in an unsleepable context, so choose to implement
      the translation using an RCU-safe hash table.
      
      The hash table can have a many-to-one mapping. This is intended to
      support some future case where multiple IB drivers (i.e. iWarp and RoCE)
      connect to the same netdevs. driver_ids will need to be different to
      support this.
      
      In the process this makes the struct ib_device and ib_port_data RCU safe
      by deferring their kfrees.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      324e227e
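      A hedged sketch of the lookup a driver would do from a netdev notifier
      using the new helper; error handling is trimmed and the surrounding
      notifier function is assumed:

        struct ib_device *ibdev;

        ibdev = ib_device_get_by_netdev(ndev, RDMA_DRIVER_RXE);
        if (!ibdev)
                return NOTIFY_DONE;     /* no RDMA device bound to this netdev */

        /* handle the event against ibdev, then drop the reference */
        ib_device_put(ibdev);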
    • RDMA/device: Add ib_device_set_netdev() as an alternative to get_netdev · c2261dd7
      Jason Gunthorpe authored
      The associated netdev should not actually be very dynamic, so for most
      drivers there is no reason for a callback like this. Provide an API to
      inform the core code about the net dev affiliation and use a core
      maintained data structure instead.
      
      This allows the core code to be more aware of the ndev relationship which
      will allow some new APIs based around this.
      
      This also uses locking that makes some kind of sense; many drivers had a
      confusing RCU lock or missing locking, which isn't right.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      c2261dd7
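      A hedged sketch of the registration-time call that replaces a driver's
      get_netdev() callback, per the description above; the rxe variable names
      are illustrative:

        /* Tell the core which net_device backs port 1 of this ib_device;
         * the core then answers netdev queries from its own data structure. */
        err = ib_device_set_netdev(&rxe->ib_dev, ndev, 1);
        if (err)
                return err;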
  4. 19 Feb, 2019 14 commits