• Yang Shi's avatar
    mm: hwpoison: don't drop slab caches for offlining non-LRU page · d0505e9f
    Yang Shi authored
    In the current implementation of soft offline, if non-LRU page is met,
    all the slab caches will be dropped to free the page then offline.  But
    if the page is not slab page all the effort is wasted in vain.  Even
    though it is a slab page, it is not guaranteed the page could be freed
    at all.
    
    However the side effect and cost is quite high.  It does not only drop
    the slab caches, but also may drop a significant amount of page caches
    which are associated with inode caches.  It could make the most
    workingset gone in order to just offline a page.  And the offline is not
    guaranteed to succeed at all, actually I really doubt the success rate
    for real life workload.
    
    Furthermore the worse consequence is the system may be locked up and
    unusable since the page cache release may incur huge amount of works
    queued for memcg release.
    
    Actually we ran into such unpleasant case in our production environment.
    Firstly, the workqueue of memory_failure_work_func is locked up as
    below:
    
        BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
        Showing busy workqueues and worker pools:
        workqueue events: flags=0x0
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
          in-flight: 409271:memory_failure_work_func
          pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
        workqueue mm_percpu_wq: flags=0x8
         pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
          pending: vmstat_update
        workqueue cgroup_destroy: flags=0x0
          pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
            pending: css_release_work_fn
    
    There were over 12K css_release_work_fn queued, and this caused a few
    lockups due to the contention of worker pool lock with IRQ disabled, for
    example:
    
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
        Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
        CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
        Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
        Workqueue: events memory_failure_work_func
        RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
        Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
        RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
        RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
        RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
        R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
        R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
        FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
        Call Trace:
         __queue_work+0xd6/0x3c0
         queue_work_on+0x1c/0x30
         uncharge_batch+0x10e/0x110
         mem_cgroup_uncharge_list+0x6d/0x80
         release_pages+0x37f/0x3f0
         __pagevec_release+0x1c/0x50
         __invalidate_mapping_pages+0x348/0x380
         inode_lru_isolate+0x10a/0x160
         __list_lru_walk_one+0x7b/0x170
         list_lru_walk_one+0x4a/0x60
         prune_icache_sb+0x37/0x50
         super_cache_scan+0x123/0x1a0
         do_shrink_slab+0x10c/0x2c0
         shrink_slab+0x1f1/0x290
         drop_slab_node+0x4d/0x70
         soft_offline_page+0x1ac/0x5b0
         memory_failure_work_func+0x6a/0x90
         process_one_work+0x19e/0x340
         worker_thread+0x30/0x360
         kthread+0x116/0x130
    
    The lockup made the machine is quite unusable.  And it also made the
    most workingset gone, the reclaimabled slab caches were reduced from 12G
    to 300MB, the page caches were decreased from 17G to 4G.
    
    But the most disappointing thing is all the effort doesn't make the page
    offline, it just returns:
    
        soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
    
    It seems the aggressive behavior for non-LRU page didn't pay back, so it
    doesn't make too much sense to keep it considering the terrible side
    effect.
    
    Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.comSigned-off-by: default avatarYang Shi <shy828301@gmail.com>
    Reported-by: default avatarDavid Mackey <tdmackey@twitter.com>
    Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
    Acked-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    d0505e9f
memory-failure.c 58.9 KB