1. 21 Apr, 2017 10 commits
    • IB/core: Fix sysfs registration error flow · b312be3d
      Jack Morgenstein authored
      The kernel commit cited below restructured ib device management
      so that the device kobject is initialized in ib_alloc_device.

      The kobject is later added to the device hierarchy in the
      ib_register_device call stack, in procedure
      ib_device_register_sysfs (which calls device_add).
      
      However, in the ib_device_register_sysfs error flow, if an error
      occurs following the call to device_add, the cleanup procedure
      device_unregister is called. This call results in the device object
      being deleted -- which results in various use-after-free crashes.
      
      The correct cleanup call is device_del -- which undoes device_add
      without deleting the device object.
      
      The device object will then (correctly) be deleted in the
      ib_register_device caller's error cleanup flow, when the caller invokes
      ib_dealloc_device.
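
      The ownership rule described above can be illustrated with a small
      userspace model. This is only a sketch: toy_device and the toy_*
      helpers are hypothetical stand-ins that mimic driver-core reference
      counting (device_unregister behaving as device_del followed by
      put_device). It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a refcounted device object. */
struct toy_device {
    int refcount;   /* references held on the object */
    bool added;     /* visible in the device hierarchy */
    bool freed;     /* release callback has run */
};

/* Models ib_alloc_device: the kobject starts with one reference. */
static void toy_alloc(struct toy_device *d)
{
    d->refcount = 1;
    d->added = false;
    d->freed = false;
}

/* Models put_device: dropping the last reference releases the object. */
static void toy_put(struct toy_device *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        d->freed = true;
}

/* Models device_add / device_del: only hierarchy membership changes. */
static void toy_add(struct toy_device *d) { d->added = true; }
static void toy_del(struct toy_device *d) { d->added = false; }

/* Models device_unregister: device_del followed by put_device. */
static void toy_unregister(struct toy_device *d)
{
    toy_del(d);
    toy_put(d);
}

/*
 * Models the sysfs registration error flow after a successful add.
 * With use_unregister set, the only reference is dropped here, so the
 * caller's later cleanup would operate on an already released object.
 */
static void toy_register_sysfs_fail(struct toy_device *d, bool use_unregister)
{
    toy_add(d);
    if (use_unregister)
        toy_unregister(d);  /* cleanup described as buggy */
    else
        toy_del(d);         /* cleanup described as correct */
}
```

      With the device_del-style cleanup the object survives until the
      caller drops its own reference, which matches the flow in which
      ib_dealloc_device performs the final release.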
      
      Fixes: 55aeed06 ("IB/core: Make ib_alloc_device init the kobject")
      Cc: <stable@vger.kernel.org> # v4.2+
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: Fix kernel crash during fail to initialize device · 4be3a4fa
      Parav Pandit authored
      This patch fixes a kernel crash in ib_dealloc_device() when it is
      called because the provider driver fails with an error after
      ib_alloc_device() but before it can register using ib_register_device().

      The crash shown below was seen in the lab and can occur with any IB
      device that fails to perform its device initialization before invoking
      ib_register_device().

      This patch avoids touching the cache and port immutable structures if
      the device is not yet initialized.
      It also releases the related memory when initialization of the cache or
      port immutable data structures fails during the register_device() stage.
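
      The guard described above can be sketched in userspace as follows.
      The toy_ibdev structure, its state enum and the helper names are
      assumptions made for illustration; the real patch operates on
      struct ib_device and may be structured differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical lifecycle states for the model. */
enum toy_state { TOY_UNINITIALIZED, TOY_INITIALIZED };

struct toy_ibdev {
    enum toy_state state;
    int *cache;           /* stands in for the per-port cache */
    int *port_immutable;  /* stands in for the port immutable table */
    int tables_freed;     /* how many tables the release path freed */
};

/* Models allocation: no tables exist yet. */
static void toy_ibdev_alloc(struct toy_ibdev *dev)
{
    dev->state = TOY_UNINITIALIZED;
    dev->cache = NULL;
    dev->port_immutable = NULL;
    dev->tables_freed = 0;
}

/*
 * Models table setup during registration. On failure, whatever was
 * allocated is released again and the device stays uninitialized.
 */
static int toy_ibdev_setup(struct toy_ibdev *dev, size_t nports)
{
    assert(nports > 0);
    dev->cache = calloc(nports, sizeof(*dev->cache));
    dev->port_immutable = calloc(nports, sizeof(*dev->port_immutable));
    if (!dev->cache || !dev->port_immutable) {
        free(dev->cache);
        free(dev->port_immutable);
        dev->cache = NULL;
        dev->port_immutable = NULL;
        return -1;
    }
    dev->state = TOY_INITIALIZED;
    return 0;
}

/* Models the release path: tables are touched only if they were set up. */
static void toy_ibdev_release(struct toy_ibdev *dev)
{
    if (dev->state != TOY_INITIALIZED)
        return;  /* nothing was created, so nothing to walk or free */
    free(dev->cache);
    free(dev->port_immutable);
    dev->cache = NULL;
    dev->port_immutable = NULL;
    dev->tables_freed = 2;
    dev->state = TOY_UNINITIALIZED;
}
```

      In userspace free(NULL) is harmless, so the model cannot reproduce
      the NULL dereference itself; it only shows the state check that
      keeps the release path away from structures that were never created.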
      
      [81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
      [81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
      [81416.576222] PGD 78da66067
      [81416.576223] PUD 7f2d7c067
      [81416.579484] PMD 0
      [81416.582720]
      [81416.587242] Oops: 0000 [#1] SMP
      [81416.722395] task: ffff8807887515c0 task.stack: ffffc900062c0000
      [81416.729148] RIP: 0010:ib_cache_release_one+0x29/0x80 [ib_core]
      [81416.735793] RSP: 0018:ffffc900062c3a90 EFLAGS: 00010202
      [81416.741823] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      [81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff880859fec000
      [81416.757757] RBP: ffffc900062c3aa0 R08: ffff8808536e5ac0 R09: ffff880859fec5b0
      [81416.765708] R10: 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
      [81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15: ffff88084ebc0060
      [81416.781621] FS:  00007fd879fab740(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
      [81416.790522] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [81416.797094] CR2: 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
      [81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [81416.812997] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [81416.820950] Call Trace:
      [81416.824226]  ib_device_release+0x1e/0x40 [ib_core]
      [81416.829858]  device_release+0x32/0xa0
      [81416.834370]  kobject_cleanup+0x63/0x170
      [81416.839058]  kobject_put+0x25/0x50
      [81416.843319]  ib_dealloc_device+0x25/0x40 [ib_core]
      [81416.848986]  mlx5_ib_add+0x163/0x1990 [mlx5_ib]
      [81416.854414]  mlx5_add_device+0x5a/0x160 [mlx5_core]
      [81416.860191]  mlx5_register_interface+0x8d/0xc0 [mlx5_core]
      [81416.866587]  ? 0xffffffffa09e9000
      [81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
      [81416.876094]  do_one_initcall+0x51/0x1b0
      [81416.880861]  ? __vunmap+0x85/0xd0
      [81416.885113]  ? kmem_cache_alloc_trace+0x14b/0x1b0
      [81416.890768]  ? vfree+0x2e/0x70
      [81416.894762]  do_init_module+0x60/0x1fa
      [81416.899441]  load_module+0x15f6/0x1af0
      [81416.904114]  ? __symbol_put+0x60/0x60
      [81416.908709]  ? ima_post_read_file+0x3d/0x80
      [81416.913828]  ? security_kernel_post_read_file+0x6b/0x80
      [81416.920006]  SYSC_finit_module+0xa6/0xf0
      [81416.924888]  SyS_finit_module+0xe/0x10
      [81416.929568]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [81416.935089] RIP: 0033:0x7fd879494949
      [81416.939543] RSP: 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
      [81416.947982] RAX: ffffffffffffffda RBX: 0000000001b66f00 RCX: 00007fd879494949
      [81416.955965] RDX: 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
      [81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000001b652a0
      [81416.971861] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffdbc1b3e70
      [81416.979763] R13: 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
      [81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP: ffffc900062c3a90
      [81417.016045] CR2: 0000000000000000
      
      Fixes: 55aeed06 ("IB/core: Make ib_alloc_device init the kobject")
      Fixes: 7738613e ("IB/core: Add per port immutable struct to ib_device")
      Cc: <stable@vger.kernel.org> # v4.2+
      Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow · 3e31a490
      Feras Daoud authored
      ipoib_stop is called with rtnl_lock held. The flow then clears
      the IPOIB_FLAG_ADMIN_UP and IPOIB_FLAG_OPER_UP flags, and waits
      for mcast completion if IPOIB_MCAST_FLAG_BUSY is set.

      On the other hand, the multicast join task initializes a mcast
      completion, sets IPOIB_MCAST_FLAG_BUSY and calls
      ipoib_mcast_join. If the IPOIB_FLAG_OPER_UP flag is not set, this
      call returns -EINVAL without signalling the mcast completion,
      which leads to a deadlock.
      
          ipoib_stop                          |
              |                               |
          clear_bit(IPOIB_FLAG_ADMIN_UP)      |
              |                               |
          Context Switch                      |
              |                       ipoib_mcast_join_task
              |                               |
              |                       spin_lock_irq(lock)
              |                               |
              |                       init_completion(mcast)
              |                               |
              |                       set_bit(IPOIB_MCAST_FLAG_BUSY)
              |                               |
              |                       Context Switch
              |                               |
          clear_bit(IPOIB_FLAG_OPER_UP)       |
              |                               |
          spin_lock_irqsave(lock)             |
              |                               |
          Context Switch                      |
              |                       ipoib_mcast_join
              |                       return (-EINVAL)
              |                               |
              |                       spin_unlock_irq(lock)
              |                               |
              |                       Context Switch
              |                               |
          ipoib_mcast_dev_flush               |
          wait_for_completion(mcast)          |
      
      ipoib_stop will wait for the mcast completion forever, and will
      not release the rtnl_lock. As a result, a panic occurs with the
      following trace:
      
          [13441.639268] Call Trace:
          [13441.640150]  [<ffffffff8168b579>] schedule+0x29/0x70
          [13441.641038]  [<ffffffff81688fc9>] schedule_timeout+0x239/0x2d0
          [13441.641914]  [<ffffffff810bc017>] ? complete+0x47/0x50
          [13441.642765]  [<ffffffff810a690d>] ? flush_workqueue_prep_pwqs+0x16d/0x200
          [13441.643580]  [<ffffffff8168b956>] wait_for_completion+0x116/0x170
          [13441.644434]  [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
          [13441.645293]  [<ffffffffa05af170>] ipoib_mcast_dev_flush+0x150/0x190 [ib_ipoib]
          [13441.646159]  [<ffffffffa05ac967>] ipoib_ib_dev_down+0x37/0x60 [ib_ipoib]
          [13441.647013]  [<ffffffffa05a4805>] ipoib_stop+0x75/0x150 [ib_ipoib]
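
      The missed completion described above can be modeled in a few lines
      of single-threaded C. The toy_mcast structure and helpers are
      hypothetical, and the complete_on_error branch shows one possible
      remedy under that assumption; the actual patch may resolve the
      race differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mcast {
    bool busy;  /* models IPOIB_MCAST_FLAG_BUSY */
    bool done;  /* models the mcast completion having been signalled */
};

/* Models ipoib_mcast_join: fails when the interface is not oper-up. */
static int toy_mcast_join(bool oper_up)
{
    return oper_up ? 0 : -EINVAL;
}

/*
 * Models the join task. With complete_on_error false, a failed join
 * leaves busy set and the completion unsignalled (the deadlock case).
 * With it true, the failure path clears busy and signals completion.
 * A successful join is left pending here, since in that case the
 * completion would be signalled later by the join callback.
 */
static int toy_join_task(struct toy_mcast *m, bool oper_up,
                         bool complete_on_error)
{
    int ret;

    assert(m != NULL);
    m->done = false;  /* init_completion */
    m->busy = true;   /* set_bit(IPOIB_MCAST_FLAG_BUSY) */

    ret = toy_mcast_join(oper_up);
    if (ret && complete_on_error) {
        m->busy = false;
        m->done = true;  /* complete */
    }
    return ret;
}

/* Models the flush in the stop flow: it blocks iff busy and not done. */
static bool toy_flush_would_block(const struct toy_mcast *m)
{
    return m->busy && !m->done;
}
```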
      
      Fixes: 08bc3276 ("IB/ipoib: fix for rare multicast join race condition")
      Signed-off-by: Feras Daoud <ferasda@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Update broadcast object if PKey value was changed in index 0 · 9a9b8112
      Feras Daoud authored
      Update the broadcast address in the priv->broadcast object when the
      PKey value changes in index 0. Otherwise the multicast GID keeps
      the previous PKey value and is never updated, and the interface
      state goes down because the interface still uses the old PKey.
      
      For example, in an SR-IOV environment, if the PF changes the value of
      PKey index 0 for one of the VFs, the VF receives a PKey change event
      that triggers a heavy flush. This flush calls update_parent_pkey, which
      updates the broadcast object and its relevant members. If the multicast
      GID is not updated in this case, the interface state will be down.
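
      The relationship between the PKey and the broadcast GID can be
      sketched as below. The byte offsets follow the IPoIB broadcast GID
      layout (RFC 4391), where the P_Key occupies bytes 4-5 of the GID
      and therefore bytes 8-9 of the 20-byte link-layer address. The
      toy_iface structure and toy_update_pkey helper are hypothetical
      and only illustrate keeping both copies in step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HWADDR_LEN 20  /* 4-byte QPN field followed by a 16-byte GID */
#define TOY_GID_LEN    16

struct toy_iface {
    uint8_t broadcast[TOY_HWADDR_LEN];  /* link-layer broadcast address */
    uint8_t mgid[TOY_GID_LEN];          /* GID kept in the broadcast object */
};

/*
 * Write the new PKey into the broadcast address, then refresh the GID
 * stored in the broadcast object so the two cannot drift apart.
 */
static void toy_update_pkey(struct toy_iface *ifc, uint16_t pkey)
{
    assert(ifc != NULL);
    ifc->broadcast[8] = (uint8_t)(pkey >> 8);
    ifc->broadcast[9] = (uint8_t)(pkey & 0xff);
    memcpy(ifc->mgid, ifc->broadcast + 4, TOY_GID_LEN);
}
```

      Skipping the final copy in this model leaves mgid holding the old
      PKey bytes, which corresponds to the stale multicast GID described
      in the commit message.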
      
      Fixes: c2904141 ("IPoIB: Fix pkey change flow for virtualization environments")
      Signed-off-by: Feras Daoud <ferasda@mellanox.com>
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Reviewed-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/rxe: Cache dst in QP instead of getting it for each send · 4ed6ad1e
      yonatanc authored
      For an RC QP there is no need to resolve the outgoing interface
      for each packet, as it does not change during the QP life cycle.

      Instead, cache the interface on the socket and use that one.
      This improves average bandwidth by about 12% (551.21 to 615.54
      MB/sec in the run below) by sparing redundant calls to
      rxe_find_route.
      
      ib_send_bw -d rxe0  -x 1 -n 9000 -e  -s $((1024 * 1024 )) -l 100
      
      ----------------------------------------------------------------------------------------
      |        | bytes   | iterations | BW peak[MB/sec] | BW average[MB/sec] | MsgRate[Mpps] |
      ----------------------------------------------------------------------------------------
      | before | 1048576 | 9000       | inf             | 551.21             | 0.000551      |
      | after  | 1048576 | 9000       | inf             | 615.54             | 0.000616      |
      ----------------------------------------------------------------------------------------
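
      The caching idea can be sketched in userspace as follows. All names
      here (toy_route, toy_qp, toy_find_route) are hypothetical; the
      lookup counter exists only to make the saving visible and has no
      counterpart in the driver.

```c
#include <assert.h>
#include <stddef.h>

struct toy_route {
    int ifindex;  /* outgoing interface chosen by the lookup */
};

struct toy_qp {
    struct toy_route route;  /* cached lookup result */
    int route_valid;         /* nonzero once the cache is populated */
};

static int toy_lookups;  /* counts expensive lookups, for illustration */

/* Stand-in for the per-packet route resolution that is being avoided. */
static struct toy_route toy_find_route(void)
{
    struct toy_route r = { .ifindex = 1 };

    toy_lookups++;
    return r;
}

/* Resolve the route on first use, then serve it from the QP. */
static const struct toy_route *toy_qp_route(struct toy_qp *qp)
{
    assert(qp != NULL);
    if (!qp->route_valid) {
        qp->route = toy_find_route();
        qp->route_valid = 1;
    }
    return &qp->route;
}
```

      A real implementation would also need to drop the cached entry
      when it goes stale or when the QP is destroyed, which this sketch
      leaves out.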
      
      Fixes: 8700e3e7 ("Soft RoCE driver")
      Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/rxe: Offload CRC calculation when possible · cee2688e
      yonatanc authored
      Use the CPU's ability to perform CRC calculations by replacing
      direct calls to crc32_le() with crypto_shash_update().

      The overall performance gain measured with the ib_send_bw tool is
      about 10%, tested on an "Intel CPU E5-2660 v2 @ 2.20GHz" CPU.
      
      ib_send_bw -d rxe0  -x 1 -n 9000 -e  -s $((1024 * 1024 )) -l 100
      
      ---------------------------------------------------------------------------------------------
      |             | bytes   | iterations | BW peak[MB/sec] | BW average[MB/sec] | MsgRate[Mpps] |
      ---------------------------------------------------------------------------------------------
      | crc32_le    | 1048576 | 9000       | inf             | 497.60             | 0.000498      |
      | CRC offload | 1048576 | 9000       | inf             | 546.70             | 0.000547      |
      ---------------------------------------------------------------------------------------------
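
      Both crc32_le() and a shash update call follow the same
      running-checksum pattern: a CRC state is carried from one buffer
      to the next. The sketch below shows that pattern with a plain
      bitwise CRC-32 in userspace. toy_crc32_update is a hypothetical
      helper, not a kernel function, and it says nothing about how the
      hardware-assisted path is selected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise reflected CRC-32 update step (polynomial 0xEDB88320).
 * The caller supplies the running CRC and gets the new one back, so
 * a message can be fed in several chunks.
 */
static uint32_t toy_crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    assert(buf != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc;
}
```

      Feeding "1234" and then "56789" yields the same state as feeding
      "123456789" in one call, which is the property that allows the
      computation to be moved behind an update-style interface.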
      
      Fixes: 8700e3e7 ("Soft RoCE driver")
      Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/rxe: Do not export module's private function · 0d38ac8a
      Parav Pandit authored
      Function rxe_rcv is used internally in RXE and doesn't need to be
      exported. This patch removes the export declaration.
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/rxe: Avoid accessing timers for non RC QPs · 99fc12f6
      Parav Pandit authored
      This patch avoids initializing and cleaning up the RNR NAK timer and
      the retransmit timer for non-RC QPs (such as UD and GSI QPs).
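
      A minimal userspace sketch of the type check, assuming hypothetical
      names (toy_timed_qp, toy_qp_init, toy_qp_cleanup) and booleans in
      place of real kernel timers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_qp_type { TOY_QPT_RC, TOY_QPT_UD, TOY_QPT_GSI };

struct toy_timed_qp {
    enum toy_qp_type type;
    bool rnr_nak_timer_ready;  /* stands in for the RNR NAK timer */
    bool retrans_timer_ready;  /* stands in for the retransmit timer */
};

/* Set up the timers only for RC QPs; other types never touch them. */
static void toy_qp_init(struct toy_timed_qp *qp, enum toy_qp_type type)
{
    assert(qp != NULL);
    qp->type = type;
    qp->rnr_nak_timer_ready = false;
    qp->retrans_timer_ready = false;
    if (type == TOY_QPT_RC) {
        qp->rnr_nak_timer_ready = true;
        qp->retrans_timer_ready = true;
    }
}

/* Mirror the same check on teardown so cleanup matches setup. */
static void toy_qp_cleanup(struct toy_timed_qp *qp)
{
    assert(qp != NULL);
    if (qp->type != TOY_QPT_RC)
        return;
    qp->rnr_nak_timer_ready = false;
    qp->retrans_timer_ready = false;
}
```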
      Reviewed-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/rxe: Add port protocol stats · 0b1e5b99
      Yonatan Cohen authored
      Expose new counters using the get_hw_stats callback.
      We expose the following counters:
      
      +---------------------+----------------------------------------+
      |      Name           |           Description                  |
      |---------------------+----------------------------------------|
      |sent_pkts            | number of sent packets                 |
      |---------------------+----------------------------------------|
      |rcvd_pkts            | number of received packets             |
      |---------------------+----------------------------------------|
      |out_of_sequence      | number of errors due to packet         |
      |                     | transport sequence number              |
      |---------------------+----------------------------------------|
      |duplicate_request    | number of received duplicated packets. |
      |                     | A request that previously executed is  |
      |                     | named duplicated.                      |
      |---------------------+----------------------------------------|
      |rcvd_rnr_err         | number of received RNR by completer    |
      |---------------------+----------------------------------------|
      |send_rnr_err         | number of sent RNR by responder        |
      |---------------------+----------------------------------------|
      |rcvd_seq_err         | number of out of sequence packets      |
      |                     | received                               |
      |---------------------+----------------------------------------|
      |ack_deffered         | number of deferred handling of ack     |
      |                     | packets.                               |
      |---------------------+----------------------------------------|
      |retry_exceeded_err   | number of times retry exceeded         |
      |---------------------+----------------------------------------|
      |completer_retry_err  | number of times completer decided to   |
      |                     | retry                                  |
      |---------------------+----------------------------------------|
      |send_err             | number of failed packet sends          |
      +---------------------+----------------------------------------+
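
      One common way to organize such counters is an enum index paired
      with a name table, sketched below in userspace. The counter names
      are taken from the table above; the enum, the toy_stats structure
      and the helpers are hypothetical and do not reproduce the
      get_hw_stats callback signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_counter {
    TOY_CNT_SENT_PKTS,
    TOY_CNT_RCVD_PKTS,
    TOY_CNT_OUT_OF_SEQUENCE,
    TOY_CNT_DUPLICATE_REQUEST,
    TOY_CNT_RCVD_RNR_ERR,
    TOY_CNT_SEND_RNR_ERR,
    TOY_CNT_RCVD_SEQ_ERR,
    TOY_CNT_ACK_DEFFERED,
    TOY_CNT_RETRY_EXCEEDED_ERR,
    TOY_CNT_COMPLETER_RETRY_ERR,
    TOY_CNT_SEND_ERR,
    TOY_NUM_COUNTERS
};

/* Names as listed in the table, indexed by the enum above. */
static const char *const toy_counter_names[TOY_NUM_COUNTERS] = {
    [TOY_CNT_SENT_PKTS]           = "sent_pkts",
    [TOY_CNT_RCVD_PKTS]           = "rcvd_pkts",
    [TOY_CNT_OUT_OF_SEQUENCE]     = "out_of_sequence",
    [TOY_CNT_DUPLICATE_REQUEST]   = "duplicate_request",
    [TOY_CNT_RCVD_RNR_ERR]        = "rcvd_rnr_err",
    [TOY_CNT_SEND_RNR_ERR]        = "send_rnr_err",
    [TOY_CNT_RCVD_SEQ_ERR]        = "rcvd_seq_err",
    [TOY_CNT_ACK_DEFFERED]        = "ack_deffered",
    [TOY_CNT_RETRY_EXCEEDED_ERR]  = "retry_exceeded_err",
    [TOY_CNT_COMPLETER_RETRY_ERR] = "completer_retry_err",
    [TOY_CNT_SEND_ERR]            = "send_err",
};

struct toy_stats {
    uint64_t value[TOY_NUM_COUNTERS];
};

/* Bump one counter from the data path. */
static void toy_counter_inc(struct toy_stats *s, enum toy_counter c)
{
    assert(s != NULL);
    assert((int)c >= 0 && (int)c < TOY_NUM_COUNTERS);
    s->value[c]++;
}

/* Copy a snapshot of all counters out; returns how many were written. */
static int toy_get_stats(const struct toy_stats *s, uint64_t *out, int n)
{
    assert(s != NULL && out != NULL);
    if (n < TOY_NUM_COUNTERS)
        return -1;
    memcpy(out, s->value, sizeof(s->value));
    return TOY_NUM_COUNTERS;
}
```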
      Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
      Reviewed-by: Moni Shoua <monis@mellanox.com>
      Reviewed-by: Andrew Boyer <andrew.boyer@dell.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • cxgb4: Convert PDBG to pr_debug the second · 339e7575
      Doug Ledford authored
      A couple of spots were missed in the original patch that implemented
      this change.  Convert those spots as well.
      
      Fixes: a9a42886 (cxgb4: Convert PDBG to pr_debug)
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  2. 20 Apr, 2017 30 commits