    mm/memory-failure: disable unpoison once hw error happens · 67f22ba7
    zhenwei pi authored
    Currently unpoison_memory(unsigned long pfn) is designed for soft
    poison (hwpoison-inject) only.  Since 17fae129, the KPTE gets cleared
    on an x86 platform once a hardware memory error occurs.
    
    Unpoisoning a hardware-corrupted page only puts the page back into
    the buddy allocator, so the kernel may later access the page through
    a *NOT PRESENT* KPTE.  This triggers a BUG when the page is accessed
    through the cleared KPTE.
    
    As suggested by David and Naoya, disable the unpoison mechanism once
    a real HW error happens, to avoid a BUG like this:
    
     Unpoison: Software-unpoisoned page 0x61234
     BUG: unable to handle page fault for address: ffff888061234000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     PGD 2c01067 P4D 2c01067 PUD 107267063 PMD 10382b063 PTE 800fffff9edcb062
     Oops: 0002 [#1] PREEMPT SMP NOPTI
     CPU: 4 PID: 26551 Comm: stress Kdump: loaded Tainted: G   M       OE     5.18.0.bm.1-amd64 #7
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...
     RIP: 0010:clear_page_erms+0x7/0x10
     Code: ...
     RSP: 0000:ffffc90001107bc8 EFLAGS: 00010246
     RAX: 0000000000000000 RBX: 0000000000000901 RCX: 0000000000001000
     RDX: ffffea0001848d00 RSI: ffffea0001848d40 RDI: ffff888061234000
     RBP: ffffea0001848d00 R08: 0000000000000901 R09: 0000000000001276
     R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
     R13: 0000000000000000 R14: 0000000000140dca R15: 0000000000000001
     FS:  00007fd8b2333740(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: ffff888061234000 CR3: 00000001023d2005 CR4: 0000000000770ee0
     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     PKRU: 55555554
     Call Trace:
      <TASK>
      prep_new_page+0x151/0x170
      get_page_from_freelist+0xca0/0xe20
      ? sysvec_apic_timer_interrupt+0xab/0xc0
      ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
      __alloc_pages+0x17e/0x340
      __folio_alloc+0x17/0x40
      vma_alloc_folio+0x84/0x280
      __handle_mm_fault+0x8d4/0xeb0
      handle_mm_fault+0xd5/0x2a0
      do_user_addr_fault+0x1d0/0x680
      ? kvm_read_and_reset_apf_flags+0x3b/0x50
      exc_page_fault+0x78/0x170
      asm_exc_page_fault+0x27/0x30
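
    The behaviour described above can be modelled in user space as a
    one-way latch.  This is only a minimal sketch, not the code of this
    patch: every name prefixed with sketch_/SKETCH_ is hypothetical, and
    the real kernel functions do far more work than shown here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bit: the failure was injected by software
 * (e.g. via hwpoison-inject) rather than reported by hardware. */
#define SKETCH_MF_SW_SIMULATED (1 << 0)

/* One-way latch: set on the first real hardware error, never cleared. */
static bool sketch_hw_failure_seen;

/* Stand-in for the memory-failure entry point: only records whether
 * the error was real; no page handling is modelled. */
static int sketch_memory_failure(unsigned long pfn, int flags)
{
	(void)pfn;
	if (!(flags & SKETCH_MF_SW_SIMULATED))
		sketch_hw_failure_seen = true;
	return 0;
}

/* Stand-in for the unpoison path: refuses every request once a real
 * hardware error has been seen, whichever pfn is asked for. */
static int sketch_unpoison_memory(unsigned long pfn)
{
	(void)pfn;
	if (sketch_hw_failure_seen)
		return -EOPNOTSUPP;
	return 0;
}
```

    In this model, unpoisoning keeps working while only simulated
    failures have been reported, and is refused for all pages after the
    first failure reported without SKETCH_MF_SW_SIMULATED.  The error
    code is an assumption of the sketch.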
    
    Link: https://lkml.kernel.org/r/20220615093209.259374-2-pizhenwei@bytedance.com
    Fixes: 847ce401 ("HWPOISON: Add unpoisoning support")
    Fixes: 17fae129 ("x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned")
    Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: <stable@vger.kernel.org>	[5.8+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>