1. 07 May, 2019 6 commits
    • Dennis Dalessandro's avatar
      IB/core, ipoib: Do not overreact to SM LID change event · ba7d8117
      Dennis Dalessandro authored
      When IPoIB receives an SM LID change event, it reacts by flushing its
      path record cache and rejoining multicast groups. This is the same
      behavior it performs when it receives a reregistration event. This
      behavior is unnecessary as an SM may have database backup or
      synchronization mechanisms which permit the SM location or LID to change
      without loss of multicast membership and without impact to path records.
      
      Both opensm and the OPA FM issue reregistration events if a new SM is
      started (or restarted with a new config) or an SM event occurs which
      results in loss of multicast membership records by the SM (such as
      opensm failover) or the SM encounters new nodes with Active ports (such
      as after joining 2 fabrics by connecting switches via ISLs). Hence this
      event can be depended on as the trigger for IPoIB cache and multicast
      flushing.
      
      It appears that some drivers, such as qib, and hfi1 issue the
      IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
      Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
      that Mellanox EDR HCAs do not generate SM change events and that opensm
      does generate reregistration.
      
      An SM LID change event is generated by the mentioned drivers to reflect
      that sm_lid and/or sm_sl in the local port info has changed. The intent
      of this event is to permit applications and ULPs which have a local copy
      of this information (or an address handle using it) to update their
      information.
      
      The intent is that the reregistration event (caused by the SM via a bit
      in Set(PortInfo)) be used to inform nodes that they need to rejoin
      multicast groups, resubscribe for notices and potentially update path
      records.
      
      When an SM migrates or fails over, a SM LID change event can occur. In
      response IPoIB discards path records and multicast membership and loses
      connectivity until these records are restored via SA requests. In very
      large fabrics, it may take minutes for the SM to be ready and for the SA
      responses to be supplied.  This can result in undesirable and
      unnecessary IPoIB connectivity impacts. It also can result in an
      unnecessary storm of SA queries from all nodes in a cluster potentially
      followed by yet another storm if the SM issues the reregistration
      request.
      
      The fact the Mellanox HCAs do not even generate this event, is further
      evidence that on modern IB fabrics there will be no ill side effects
      from the proposed changes below to reduce the reaction by 3 kernel
      components to this event. So these changes should be benign for Mellanox
      IB fabrics and will benefit OPA fabrics while also making ib_core and
      ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
      APIs.
      
      Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
      IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
      anything on SM LID change. IPoIB makes use of other ib_core components
      to issue SA requests for it and those components correctly track SM LID
      and SM LID changes.
      
      Also in ib_core multicast handling,  remove the test for
      IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
      error state, which will trigger rejoins. This code is used by IPoIB as
      well as the connection manager and other clients of multicast groups.
      This kernel module centralizes group membership status and joins since a
      node can only join a given group once but multiple ULPs or applications
      may want to join the same group. It makes use of the sa_query.c
      component in ib_core, which correctly trackes SM LID and SL. This
      component does not track SM LID nor SL itself and hence need not react
      to their changes.
      
      Similarly in the ib_core cache code remove the handling for the
      IB_EVENT_SM_CHANGE.  In this function. The ib_cache_update function
      which is ultimately called is updating local copies of the pkey table,
      gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
      such it does not need to be called on an SM LID change. It technically
      also does not need to be called on a reregistration. The LID_CHANGE,
      PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
      PORT_ACTICE) should be sufficient triggers.
      
      It is worth noting that the alternative of simply having the hfi1 and
      qib drivers not generate the SM LID change event was explored. While
      this would duplicate what Mellanox drivers do now, it is not the correct
      behavior and removes the ability for an SM to migrate without requiring
      reregistration. Since both opensm and OPA SM have mechanisms to backup
      or synchronize registration information, it is desirable to let them
      perform SM migrations (with LID or SL changes) without requiring
      reregistration when they deem it appropriate.
      Suggested-by: default avatarTodd Rimmer <todd.rimmer@intel.com>
      Tested-by: default avatarMichael Brooks <michael.brooks@intel.com>
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: default avatarTodd Rimmer <todd.rimmer@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      ba7d8117
    • Leon Romanovsky's avatar
      RDMA/device: Don't fire uevent before device is fully initialized · e7a5b4aa
      Leon Romanovsky authored
      When the refcount is 0 the device is invisible to netlink. However in the
      patch below the refcount = 1 was moved to after the device_add().  This
      creates a race where userspace can issue a netlink query after the
      device_add() event and not see the device as visible.
      
      Ensure that no uevent is fired before device is fully registered.
      
      Fixes: d79af724 ("RDMA/device: Expose ib_device_try_get(()")
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      e7a5b4aa
    • Gal Pressman's avatar
      lib/scatterlist: Remove leftover from sg_page_iter comment · d2c4ada1
      Gal Pressman authored
      Commit d901b276 ("lib/scatterlist: Provide a DMA page iterator") added
      the sg DMA iterator but a leftover remained in the sg_page_iter
      documentation as you cannot get the page dma address (only the page
      itself), fix it.
      Signed-off-by: default avatarGal Pressman <galpress@amazon.com>
      Reviewed-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      d2c4ada1
    • Gal Pressman's avatar
      RDMA/efa: Add driver to Kconfig/Makefile · f23afd75
      Gal Pressman authored
      Add EFA Makefile and Kconfig.
      Signed-off-by: default avatarGal Pressman <galpress@amazon.com>
      Reviewed-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      f23afd75
    • Gal Pressman's avatar
      RDMA/efa: Add the efa module · b7f5e880
      Gal Pressman authored
      Add the main EFA module file which takes care of device
      probe/initialization/registration/etc.
      Signed-off-by: default avatarGal Pressman <galpress@amazon.com>
      Reviewed-by: default avatarShiraz Saleem <shiraz.saleem@intel.com>
      Reviewed-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      b7f5e880
    • Gal Pressman's avatar
      RDMA/efa: Add EFA verbs implementation · 40909f66
      Gal Pressman authored
      Add a file that implements the EFA verbs.
      Signed-off-by: default avatarGal Pressman <galpress@amazon.com>
      Reviewed-by: default avatarShiraz Saleem <shiraz.saleem@intel.com>
      Reviewed-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      40909f66
  2. 06 May, 2019 20 commits
  3. 03 May, 2019 12 commits
  4. 02 May, 2019 2 commits
    • Gal Pressman's avatar
      RDMA/uverbs: Initialize udata struct on destroy flows · f89adeda
      Gal Pressman authored
      Cited commit introduced the udata parameter to different destroy flows
      but the uapi method definition does not have udata (i.e has_udata flag
      is not set). As a result, an uninitialized udata struct is being passed
      down to the driver callbacks.
      
      Fix that by clearing the driver udata even in cases where has_udata flag
      is not set.
      
      Fixes: c4367a26 ("IB: Pass uverbs_attr_bundle down ib_x destroy path")
      Cc: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
      Co-developed-by: default avatarJason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: default avatarJason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: default avatarGal Pressman <galpress@amazon.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      f89adeda
    • Shiraz Saleem's avatar
      RDMA/umem: Handle page combining avoidance correctly in ib_umem_add_sg_table() · 7872168a
      Shiraz Saleem authored
      The flag update_cur_sg tracks whether contiguous pages from a new set of
      page_list pages can be merged into the SGE passed into
      ib_umem_add_sg_table(). If this flag is true, but the total segment length
      exceeds the max_seg_size supported by HW, we avoid combining to this SGE
      and move to a new SGE (x) and merge 'len' pages to it. However, if i <
      npages, the next iteration can incorrectly merge 'len' contiguous pages
      into x instead of into a new SGE since update_cur_sg is still true.
      
      Reset update_cur_sg to false always after the check to merge pages into
      the first SGE passed in to ib_umem_add_sg_table().  Also, prevent a new
      SGE's segment length from ever exceeding HW max_seg_sz.
      
      There is a crash on hfi1 as result of this where-in max_seg_sz is
      defaulting to 64K. Due to above bug, unfolding SGE's in __ib_umem_release
      points to a bad page ptr.
      
       TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093
       BUG: Bad page state in process ib_write_bw  pfn:7ebca0
       page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1
       flags: 0x17ffffc0000000()
       raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000
       raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
       page dumped because: nonzero mapcount
       CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G    B             5.1.0-rc4 #1
       Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
       Call Trace:
        dump_stack+0x5a/0x73
        bad_page+0xf5/0x10f
        free_pcppages_bulk+0x62c/0x680
        free_unref_page+0x54/0x70
        __ib_umem_release+0x148/0x1a0 [ib_uverbs]
        ib_umem_release+0x22/0x80 [ib_uverbs]
        rvt_dereg_mr+0x67/0xb0 [rdmavt]
        ib_dereg_mr_user+0x37/0x60 [ib_core]
        destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
        uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
        uobj_destroy+0x4d/0x60 [ib_uverbs]
        __uobj_get_destroy+0x33/0x50 [ib_uverbs]
        __uobj_perform_destroy+0xa/0x30 [ib_uverbs]
        ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs]
        ib_uverbs_write+0x3e1/0x500 [ib_uverbs]
        vfs_write+0xad/0x1b0
        ksys_write+0x5a/0xd0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: d10bcf94 ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
      Tested-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: default avatarMichael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: default avatarShiraz Saleem <shiraz.saleem@intel.com>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      7872168a