    ipmi: fix msg stack when IPMI is disconnected · c608966f
    Zhang Yuchen authored
    If messages keep being sent at a high frequency (more often than once
    every 55s) while the IPMI is disconnected, they accumulate in
    intf->[hp_]xmit_msg. If this lasts long enough, they take up a lot of
    memory.
    
    The reason is that while the IPMI is disconnected, the state machine is
    reset to IDLE after each message reaches HOSED via
    IDLE->ERROR0->HOSED. The next message then goes through the same
    sequence, which takes IBF_TIMEOUT * (MAX_ERROR_RETRIES + 1) = 55s.

    Each message therefore takes 55s to be freed, so memory usage keeps
    growing.
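
    As a quick check of the arithmetic above, here is a minimal sketch.
    The two constant values are assumptions inferred from the figures in
    this message (a 5 second timeout and a 55s total); the real
    definitions live in the driver and may differ. hosed_delay_sec() is a
    hypothetical helper, not a driver function.

```c
#include <assert.h>

/* Assumed values: 5 s per attempt and 10 retries reproduce the 55 s
 * quoted in the text. They are not copied from the driver source. */
#define IBF_TIMEOUT_SEC   5
#define MAX_ERROR_RETRIES 10

/* Seconds one message spends going IDLE->ERROR0->HOSED with no BMC:
 * the first attempt plus MAX_ERROR_RETRIES retries, each timing out. */
int hosed_delay_sec(void)
{
	return IBF_TIMEOUT_SEC * (MAX_ERROR_RETRIES + 1);
}

/* Compile-time check that the assumed constants match the text. */
static_assert(IBF_TIMEOUT_SEC * (MAX_ERROR_RETRIES + 1) == 55,
	      "timeout arithmetic should give the 55s from the text");
```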
    
    I found that 5 seconds after the first message fails, smi_timeout()
    changes the state to ERROR0. The next message then returns the error
    code IPMI_NOT_IN_MY_STATE_ERR directly, without waiting.
    
    This is more in line with our needs.
    
    So instead of resetting the state to IDLE after a message reaches
    HOSED, set the state to ERROR0.

    After testing, the problem is solved: no matter how many messages are
    sent consecutively, memory no longer grows continuously. The interface
    also returns to normal immediately after the IPMI is restored.
    
    In addition, the HOSED state should also count as invalid for starting
    a transaction. So HOSED is no longer exempted from the state check in
    start_kcs_transaction().
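
    The intended behaviour can be sketched with a toy model. This is not
    the driver code: every toy_* name and the bmc_present flag are
    hypothetical, and only the three states named above are modelled. It
    only illustrates that parking in ERROR0 makes later requests fail at
    once, and that the state returns to IDLE when the BMC answers again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: three of the states named above, no real I/O. */
enum toy_state { TOY_IDLE, TOY_ERROR0, TOY_HOSED };
enum toy_result { TOY_OK, TOY_NOT_IN_MY_STATE, TOY_SM_HOSED };

struct toy_kcs {
	enum toy_state state;
	bool bmc_present;	/* hypothetical stand-in for the hardware */
};

/* Start a transaction: only IDLE is accepted, so a request arriving in
 * ERROR0 or HOSED is rejected immediately instead of being queued. */
enum toy_result toy_start_transaction(struct toy_kcs *kcs)
{
	assert(kcs != NULL);
	if (kcs->state != TOY_IDLE)
		return TOY_NOT_IN_MY_STATE;
	/* Pretend the transaction ran; with no BMC it ends up HOSED. */
	if (!kcs->bmc_present)
		kcs->state = TOY_HOSED;
	return TOY_OK;
}

/* Event step: once HOSED is reported, park in ERROR0 rather than IDLE. */
enum toy_result toy_event(struct toy_kcs *kcs)
{
	assert(kcs != NULL);
	if (kcs->state == TOY_HOSED) {
		kcs->state = TOY_ERROR0;
		return TOY_SM_HOSED;
	}
	/* From ERROR0, recover to IDLE only once the BMC responds. */
	if (kcs->state == TOY_ERROR0 && kcs->bmc_present)
		kcs->state = TOY_IDLE;
	return TOY_OK;
}
```

    With bmc_present false, the first toy_start_transaction() is accepted
    and fails into HOSED, toy_event() reports it and leaves the model in
    ERROR0, and every later start returns TOY_NOT_IN_MY_STATE right away.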
    
    The verification operations are as follows:
    
    1. Use BPF to record calls to ipmi_alloc_recv_msg() and free_recv_msg().
    
      $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc %p\n",retval);}
          kprobe:free_recv_msg {printf("free  %p\n",arg0)}'
    
    2. Run `date; time for x in $(seq 1 2); do ipmitool mc info; done`.
    3. Record the output of `time` and the time at which all messages have
       been freed.
    
    Before:
    
    `time` reports about 2 minutes (1m55s). This is because each
    `ipmitool mc info` sends 4 messages and waits only 15 seconds for each
    one. The last message is freed after 440s.
    
      $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc %p\n",retval);}
          kprobe:free_recv_msg {printf("free  %p\n",arg0)}'
      Oct 05 11:40:55 Attaching 2 probes...
      Oct 05 11:41:12 alloc 0xffff9558a05f0c00
      Oct 05 11:41:27 alloc 0xffff9558a05f1a00
      Oct 05 11:41:42 alloc 0xffff9558a05f0000
      Oct 05 11:41:57 alloc 0xffff9558a05f1400
      Oct 05 11:42:07 free  0xffff9558a05f0c00
      Oct 05 11:42:07 alloc 0xffff9558a05f7000
      Oct 05 11:42:22 alloc 0xffff9558a05f2a00
      Oct 05 11:42:37 alloc 0xffff9558a05f5a00
      Oct 05 11:42:52 alloc 0xffff9558a05f3a00
      Oct 05 11:43:02 free  0xffff9558a05f1a00
      Oct 05 11:43:57 free  0xffff9558a05f0000
      Oct 05 11:44:52 free  0xffff9558a05f1400
      Oct 05 11:45:47 free  0xffff9558a05f7000
      Oct 05 11:46:42 free  0xffff9558a05f2a00
      Oct 05 11:47:37 free  0xffff9558a05f5a00
      Oct 05 11:48:32 free  0xffff9558a05f3a00
    
      root@dc00-pb003-t106-n078:~# date;time for x in $(seq 1 2); do
      ipmitool mc info; done
    
      Wed Oct  5 11:41:12 CST 2022
      No data available
      Get Device ID command failed
      No data available
      No data available
      No valid response received
      Get Device ID command failed: Unspecified error
      No data available
      Get Device ID command failed
      No data available
      No data available
      No valid response received
      No data available
      Get Device ID command failed
    
      real        1m55.052s
      user        0m0.001s
      sys        0m0.001s
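
    The free times in the trace above are consistent with a simple
    serial model: messages are handled one at a time and each takes 55s
    to fail and be freed. The sketch below is an illustration only;
    model_free_times() and DRAIN_SEC are hypothetical names, and the 55s
    figure is taken from the text rather than from the driver source.

```c
#include <stddef.h>

/* Assumed from the text: each message takes 55 s to fail and be freed. */
#define DRAIN_SEC 55

/* Given the arrival times (seconds, ascending) of n messages that are
 * handled strictly one after another, store each message's free time
 * in out[] and return the free time of the last one (0 if n is 0). */
long model_free_times(const long *arrive, long *out, size_t n)
{
	long busy_until = 0;

	for (size_t i = 0; i < n; i++) {
		/* A message starts when it arrives or when the previous
		 * one has been freed, whichever is later. */
		long start = arrive[i] > busy_until ? arrive[i] : busy_until;

		busy_until = start + DRAIN_SEC;
		out[i] = busy_until;
	}
	return busy_until;
}
```

    Feeding in the eight alloc times from the trace as offsets from
    11:41:12 (0, 15, 30, 45, 55, 70, 85, 100) gives free times of 55,
    110, ... up to 440 seconds, which matches the last free at 11:48:32.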
    
    After:
    
    `time` takes 55s; all messages are returned and freed after 55s.
    
      $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc %p\n",retval);}
          kprobe:free_recv_msg {printf("free  %p\n",arg0)}'
    
      Oct 07 16:30:35 Attaching 2 probes...
      Oct 07 16:30:45 alloc 0xffff955943aa9800
      Oct 07 16:31:00 alloc 0xffff955943aacc00
      Oct 07 16:31:15 alloc 0xffff955943aa8c00
      Oct 07 16:31:30 alloc 0xffff955943aaf600
      Oct 07 16:31:40 free  0xffff955943aa9800
      Oct 07 16:31:40 free  0xffff955943aacc00
      Oct 07 16:31:40 free  0xffff955943aa8c00
      Oct 07 16:31:40 free  0xffff955943aaf600
      Oct 07 16:31:40 alloc 0xffff9558ec8f7e00
      Oct 07 16:31:40 free  0xffff9558ec8f7e00
      Oct 07 16:31:40 alloc 0xffff9558ec8f7800
      Oct 07 16:31:40 free  0xffff9558ec8f7800
      Oct 07 16:31:40 alloc 0xffff9558ec8f7e00
      Oct 07 16:31:40 free  0xffff9558ec8f7e00
      Oct 07 16:31:40 alloc 0xffff9558ec8f7800
      Oct 07 16:31:40 free  0xffff9558ec8f7800
    
      root@dc00-pb003-t106-n078:~# date;time for x in $(seq 1 2); do
      ipmitool mc info; done
      Fri Oct  7 16:30:45 CST 2022
      No data available
      Get Device ID command failed
      No data available
      No data available
      No valid response received
      Get Device ID command failed: Unspecified error
      Get Device ID command failed: 0xd5 Command not supported in present state
      Get Device ID command failed: Command not supported in present state
    
      real        0m55.038s
      user        0m0.001s
      sys        0m0.001s

    Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>
    Message-Id: <20221009091811.40240-2-zhangyuchen.lcr@bytedance.com>
    Signed-off-by: Corey Minyard <cminyard@mvista.com>
ipmi_kcs_sm.c 12.6 KB