    powerpc/fadump: Fix inaccurate CPU state info in vmcore generated with panic · 06e629c2
    Hari Bathini authored
    In panic path, fadump is triggered via a panic notifier function.
    Before calling panic notifier functions, smp_send_stop() gets called,
    which stops all CPUs except the panic'ing CPU. Commit 8389b37d
    ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
    again commit bab26238 ("powerpc: Offline CPU in stop_this_cpu()")
    started marking CPUs as offline while stopping them. So, on a kernel
    with either of the above commits, register data is not processed for
    any CPU other than the panic'ing CPU when a vmcore is captured with
    fadump via the panic path. Sample output of crash-utility with such
    a vmcore:
    
      # crash vmlinux vmcore
      ...
            KERNEL: vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 1
              DATE: Wed Nov 10 09:56:34 EST 2021
            UPTIME: 00:00:42
      LOAD AVERAGE: 2.27, 0.69, 0.24
             TASKS: 183
          NODENAME: XXXXXXXXX
           RELEASE: 5.15.0+
           VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
           MACHINE: ppc64le  (2500 Mhz)
            MEMORY: 8 GB
             PANIC: "Kernel panic - not syncing: sysrq triggered crash"
               PID: 3394
           COMMAND: "bash"
              TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
               CPU: 1
             STATE: TASK_RUNNING (PANIC)
    
      crash> p -x __cpu_online_mask
      __cpu_online_mask = $1 = {
        bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
      }
      crash>
      crash>
      crash> p -x __cpu_active_mask
      __cpu_active_mask = $2 = {
        bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
      }
      crash>
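
    The two masks above can be compared by counting their set bits. The
    following is a minimal user-space sketch, not kernel code: the helper
    name mask_weight() and the array names are hypothetical, and the mask
    values are copied from the crash output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the set bits across all words of a CPU mask as printed by
 * crash's "p -x". Each set bit stands for one CPU in that mask.
 * mask_weight() is a hypothetical helper, not a kernel or crash API.
 */
static unsigned int mask_weight(const uint64_t *bits, size_t nwords)
{
	unsigned int count = 0;

	assert(bits != NULL);
	for (size_t i = 0; i < nwords; i++) {
		uint64_t w = bits[i];

		while (w) {
			w &= w - 1;	/* clear the lowest set bit */
			count++;
		}
	}
	return count;
}

/* Values copied from the sample output (9 words each, rest zero). */
static const uint64_t online_mask[9] = { 0x2 };
static const uint64_t active_mask[9] = { 0xff };
```

    Here mask_weight(online_mask, 9) yields 1, which is consistent with
    the "CPUS: 1" line, while mask_weight(active_mask, 9) yields 8, the
    number of CPUs that were actually active before the panic.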
    
    While this has been the case since fadump was introduced, the issue
    was not identified for two probable reasons:
    
      - In general, the bulk of the vmcores analyzed were from crashes
        caused by exceptions.
    
      - The above did change since commit 8341f2f2 ("sysrq: Use
        panic() to force a crash") started using panic() instead of
        dereferencing a NULL pointer to force a kernel crash. But then
        commit de6e5d38 ("powerpc: smp_send_stop do not offline
        stopped CPUs") stopped marking CPUs as offline until kernel
        commit bab26238 ("powerpc: Offline CPU in stop_this_cpu()")
        reverted that change.
    
    To ensure that register data of all other CPUs is post-processed
    as intended, let the panic() function take the crash-friendly path
    (read crash_smp_send_stop()) with the help of the
    crash_kexec_post_notifiers option. Also, as register data for all
    CPUs is captured by f/w, skip IPI callbacks here for fadump, to avoid
    any complications in finding the right backtraces.
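
    The effect of choosing one stop path over the other can be pictured
    with a toy user-space model. This is only a sketch under stated
    assumptions, not the patch itself: every toy_* name is hypothetical,
    and the model assumes only what is described above, namely that the
    regular stop path marks the other CPUs offline while the crash path
    leaves the online state untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 8	/* matches the 8 active CPUs in the sample output */

struct toy_cpu {
	bool online;
};

/* Toy regular stop path: all CPUs except the panicking one go offline. */
static void toy_smp_send_stop(struct toy_cpu *cpus, size_t n, size_t panic_cpu)
{
	assert(cpus != NULL && panic_cpu < n);
	for (size_t i = 0; i < n; i++)
		if (i != panic_cpu)
			cpus[i].online = false;
}

/* Toy crash path: CPUs are stopped, but their online state is kept. */
static void toy_crash_smp_send_stop(struct toy_cpu *cpus, size_t n,
				    size_t panic_cpu)
{
	assert(cpus != NULL && panic_cpu < n);
	(void)cpus;	/* nothing is marked offline in this model */
}

/*
 * Toy panic(): pick the stop path based on a flag that stands in for
 * crash_kexec_post_notifiers, then return how many CPUs still appear
 * online, i.e. how many would have their register data processed.
 */
static size_t toy_panic(bool post_notifiers, size_t panic_cpu)
{
	struct toy_cpu cpus[TOY_NR_CPUS];
	size_t online = 0;

	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		cpus[i].online = true;

	if (post_notifiers)
		toy_crash_smp_send_stop(cpus, TOY_NR_CPUS, panic_cpu);
	else
		toy_smp_send_stop(cpus, TOY_NR_CPUS, panic_cpu);

	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		if (cpus[i].online)
			online++;
	return online;
}
```

    In this model toy_panic(false, 1) leaves a single CPU online, as in
    the sample vmcore, whereas toy_panic(true, 1) keeps all eight.
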
    Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20211207103719.91117-2-hbathini@linux.ibm.com