• Mike Marciniszyn's avatar
    IB/hfi1: Close window for pq and request coliding · be863834
    Mike Marciniszyn authored
    Cleaning up a pq can result in the following warning and panic:
    
      WARNING: CPU: 52 PID: 77418 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
      list_del corruption, ffff88cb2c6ac068->next is LIST_POISON1 (dead000000000100)
      Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) 8021q garp mrp ib_isert iscsi_target_mod target_core_mod crc_t10dif crct10dif_generic opa_vnic rpcrdma ib_iser libiscsi scsi_transport_iscsi ib_ipoib(OE) bridge stp llc iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel ast aesni_intel ttm lrw gf128mul glue_helper ablk_helper drm_kms_helper cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops drm pcspkr joydev lpc_ich mei_me drm_panel_orientation_quirks i2c_i801 mei wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_pad hfi1(OE) rdmavt(OE) rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core binfmt_misc numatools(OE) xpmem(OE) ip_tables
       nfsv3 nfs_acl nfs lockd grace sunrpc fscache igb ahci i2c_algo_bit libahci dca ptp libata pps_core crc32c_intel [last unloaded: i2c_algo_bit]
      CPU: 52 PID: 77418 Comm: pvbatch Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.38.3.el7.x86_64 #1
      Hardware name: HPE.COM HPE SGI 8600-XA730i Gen10/X11DPT-SB-SG007, BIOS SBED1229 01/22/2019
      Call Trace:
       [<ffffffff90365ac0>] dump_stack+0x19/0x1b
       [<ffffffff8fc98b78>] __warn+0xd8/0x100
       [<ffffffff8fc98bff>] warn_slowpath_fmt+0x5f/0x80
       [<ffffffff8ff970c3>] __list_del_entry+0x63/0xd0
       [<ffffffff8ff9713d>] list_del+0xd/0x30
       [<ffffffff8fddda70>] kmem_cache_destroy+0x50/0x110
       [<ffffffffc0328130>] hfi1_user_sdma_free_queues+0xf0/0x200 [hfi1]
       [<ffffffffc02e2350>] hfi1_file_close+0x70/0x1e0 [hfi1]
       [<ffffffff8fe4519c>] __fput+0xec/0x260
       [<ffffffff8fe453fe>] ____fput+0xe/0x10
       [<ffffffff8fcbfd1b>] task_work_run+0xbb/0xe0
       [<ffffffff8fc2bc65>] do_notify_resume+0xa5/0xc0
       [<ffffffff90379134>] int_signal+0x12/0x17
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      IP: [<ffffffff8fe1f93e>] kmem_cache_close+0x7e/0x300
      PGD 2cdab19067 PUD 2f7bfdb067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) 8021q garp mrp ib_isert iscsi_target_mod target_core_mod crc_t10dif crct10dif_generic opa_vnic rpcrdma ib_iser libiscsi scsi_transport_iscsi ib_ipoib(OE) bridge stp llc iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel ast aesni_intel ttm lrw gf128mul glue_helper ablk_helper drm_kms_helper cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops drm pcspkr joydev lpc_ich mei_me drm_panel_orientation_quirks i2c_i801 mei wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_pad hfi1(OE) rdmavt(OE) rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core binfmt_misc numatools(OE) xpmem(OE) ip_tables
       nfsv3 nfs_acl nfs lockd grace sunrpc fscache igb ahci i2c_algo_bit libahci dca ptp libata pps_core crc32c_intel [last unloaded: i2c_algo_bit]
      CPU: 52 PID: 77418 Comm: pvbatch Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-957.38.3.el7.x86_64 #1
      Hardware name: HPE.COM HPE SGI 8600-XA730i Gen10/X11DPT-SB-SG007, BIOS SBED1229 01/22/2019
      task: ffff88cc26db9040 ti: ffff88b5393a8000 task.ti: ffff88b5393a8000
      RIP: 0010:[<ffffffff8fe1f93e>]  [<ffffffff8fe1f93e>] kmem_cache_close+0x7e/0x300
      RSP: 0018:ffff88b5393abd60  EFLAGS: 00010287
      RAX: 0000000000000000 RBX: ffff88cb2c6ac000 RCX: 0000000000000003
      RDX: 0000000000000400 RSI: 0000000000000400 RDI: ffffffff9095b800
      RBP: ffff88b5393abdb0 R08: ffffffff9095b808 R09: ffffffff8ff77c19
      R10: ffff88b73ce1f160 R11: ffffddecddde9800 R12: ffff88cb2c6ac000
      R13: 000000000000000c R14: ffff88cf3fdca780 R15: 0000000000000000
      FS:  00002aaaaab52500(0000) GS:ffff88b73ce00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000002d27664000 CR4: 00000000007607e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       [<ffffffff8fe20d44>] __kmem_cache_shutdown+0x14/0x80
       [<ffffffff8fddda78>] kmem_cache_destroy+0x58/0x110
       [<ffffffffc0328130>] hfi1_user_sdma_free_queues+0xf0/0x200 [hfi1]
       [<ffffffffc02e2350>] hfi1_file_close+0x70/0x1e0 [hfi1]
       [<ffffffff8fe4519c>] __fput+0xec/0x260
       [<ffffffff8fe453fe>] ____fput+0xe/0x10
       [<ffffffff8fcbfd1b>] task_work_run+0xbb/0xe0
       [<ffffffff8fc2bc65>] do_notify_resume+0xa5/0xc0
       [<ffffffff90379134>] int_signal+0x12/0x17
      Code: 00 00 ba 00 04 00 00 0f 4f c2 3d 00 04 00 00 89 45 bc 0f 84 e7 01 00 00 48 63 45 bc 49 8d 04 c4 48 89 45 b0 48 8b 80 c8 00 00 00 <48> 8b 78 10 48 89 45 c0 48 83 c0 10 48 89 45 d0 48 8b 17 48 39
      RIP  [<ffffffff8fe1f93e>] kmem_cache_close+0x7e/0x300
       RSP <ffff88b5393abd60>
      CR2: 0000000000000010
    
    The panic is the result of slab entries being freed during the destruction
    of the pq slab.
    
    The code attempts to quiesce the pq, but looking for n_req == 0 doesn't
    account for new requests.
    
    Fix the issue by using SRCU to get a pq pointer and adjust the pq free
    logic to NULL the fd pq pointer prior to the quiesce.
    
    Fixes: e87473bc ("IB/hfi1: Only set fd pointer when base context is completely initialized")
    Link: https://lore.kernel.org/r/20200210131033.87408.81174.stgit@awfm-01.aw.intel.comReviewed-by: default avatarKaike Wan <kaike.wan@intel.com>
    Signed-off-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
    Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
    Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
    be863834
user_sdma.c 42.6 KB