• Tony Camuso's avatar
    ipmi: use rcu lock around call to intf->handlers->sender() · 65acfd38
    Tony Camuso authored
    commit cdea4656 upstream.
    
    A vendor with a system having more than 128 CPUs occasionally encounters
    the following crash during shutdown. This is not an easily reproduceable
    event, but the vendor was able to provide the following analysis of the
    crash, which exhibits the same footprint each time.
    
    crash> bt
    PID: 0      TASK: ffff88017c70ce70  CPU: 5   COMMAND: "swapper/5"
     #0 [ffff88085c143ac8] machine_kexec at ffffffff81059c8b
     #1 [ffff88085c143b28] __crash_kexec at ffffffff811052e2
     #2 [ffff88085c143bf8] crash_kexec at ffffffff811053d0
     #3 [ffff88085c143c10] oops_end at ffffffff8168ef88
     #4 [ffff88085c143c38] no_context at ffffffff8167ebb3
     #5 [ffff88085c143c88] __bad_area_nosemaphore at ffffffff8167ec49
     #6 [ffff88085c143cd0] bad_area_nosemaphore at ffffffff8167edb3
     #7 [ffff88085c143ce0] __do_page_fault at ffffffff81691d1e
     #8 [ffff88085c143d40] do_page_fault at ffffffff81691ec5
     #9 [ffff88085c143d70] page_fault at ffffffff8168e188
        [exception RIP: unknown or invalid address]
        RIP: ffffffffa053c800  RSP: ffff88085c143e28  RFLAGS: 00010206
        RAX: ffff88017c72bfd8  RBX: ffff88017a8dc000  RCX: ffff8810588b5ac8
        RDX: ffff8810588b5a00  RSI: ffffffffa053c800  RDI: ffff8810588b5a00
        RBP: ffff88085c143e58   R8: ffff88017c70d408   R9: ffff88017a8dc000
        R10: 0000000000000002  R11: ffff88085c143da0  R12: ffff8810588b5ac8
        R13: 0000000000000100  R14: ffffffffa053c800  R15: ffff8810588b5a00
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        <IRQ stack>
        [exception RIP: cpuidle_enter_state+82]
        RIP: ffffffff81514192  RSP: ffff88017c72be50  RFLAGS: 00000202
        RAX: 0000001e4c3c6f16  RBX: 000000000000f8a0  RCX: 0000000000000018
        RDX: 0000000225c17d03  RSI: ffff88017c72bfd8  RDI: 0000001e4c3c6f16
        RBP: ffff88017c72be78   R8: 000000000000237e   R9: 0000000000000018
        R10: 0000000000002494  R11: 0000000000000001  R12: ffff88017c72be20
        R13: ffff88085c14f8e0  R14: 0000000000000082  R15: 0000001e4c3bb400
        ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
    
    This is the corresponding stack trace
    
    It has crashed because the area pointed with RIP extracted from timer
    element is already removed during a shutdown process.
    
    The function is smi_timeout().
    
    And we think ffff8810588b5a00 in RDX is a parameter struct smi_info
    
    crash> rd ffff8810588b5a00 20
    ffff8810588b5a00:  ffff8810588b6000 0000000000000000   .`.X............
    ffff8810588b5a10:  ffff880853264400 ffffffffa05417e0   .D&S......T.....
    ffff8810588b5a20:  24a024a000000000 0000000000000000   .....$.$........
    ffff8810588b5a30:  0000000000000000 0000000000000000   ................
    ffff8810588b5a30:  0000000000000000 0000000000000000   ................
    ffff8810588b5a40:  ffffffffa053a040 ffffffffa053a060   @.S.....`.S.....
    ffff8810588b5a50:  0000000000000000 0000000100000001   ................
    ffff8810588b5a60:  0000000000000000 0000000000000e00   ................
    ffff8810588b5a70:  ffffffffa053a580 ffffffffa053a6e0   ..S.......S.....
    ffff8810588b5a80:  ffffffffa053a4a0 ffffffffa053a250   ..S.....P.S.....
    ffff8810588b5a90:  0000000500000002 0000000000000000   ................
    
    Unfortunately the top of this area is already detroyed by someone.
    But because of two reasonns we think this is struct smi_info
     1) The address included in between  ffff8810588b5a70 and ffff8810588b5a80:
      are inside of ipmi_si_intf.c  see crash> module ffff88085779d2c0
    
     2) We've found the area which point this.
      It is offset 0x68 of  ffff880859df4000
    
    crash> rd  ffff880859df4000 100
    ffff880859df4000:  0000000000000000 0000000000000001   ................
    ffff880859df4010:  ffffffffa0535290 dead000000000200   .RS.............
    ffff880859df4020:  ffff880859df4020 ffff880859df4020    @.Y.... @.Y....
    ffff880859df4030:  0000000000000002 0000000000100010   ................
    ffff880859df4040:  ffff880859df4040 ffff880859df4040   @@.Y....@@.Y....
    ffff880859df4050:  0000000000000000 0000000000000000   ................
    ffff880859df4060:  0000000000000000 ffff8810588b5a00   .........Z.X....
    ffff880859df4070:  0000000000000001 ffff880859df4078   ........x@.Y....
    
     If we regards it as struct ipmi_smi in shutdown process
     it looks consistent.
    
    The remedy for this apparent race is affixed below.
    Signed-off-by: default avatarTony Camuso <tcamuso@redhat.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    
    This was first introduced in 7ea0ed2b ipmi: Make the
    message handler easier to use for SMI interfaces
    where some code was moved outside of the rcu_read_lock()
    and the lock was not added.
    Signed-off-by: default avatarCorey Minyard <cminyard@mvista.com>
    65acfd38
ipmi_msghandler.c 118 KB