    mm: add swappiness= arg to memory.reclaim · 68cd9050
    Dan Schatzberg authored
    Allow proactive reclaimers to submit an additional swappiness=<val>
    argument to memory.reclaim.  This overrides the global or per-memcg
    swappiness setting for that reclaim attempt.
    
    For example:
    
    echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
    
    will perform reclaim on the rootcg with a swappiness setting of 0 (no
    swap) regardless of the vm.swappiness sysctl setting.
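
    A minimal sketch of how a userspace reclaimer might wrap this
    interface.  The helper names build_reclaim_request and
    proactive_reclaim are hypothetical and used only for illustration;
    the request format is the "<size> swappiness=<val>" form shown above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any existing tool.

# Build a memory.reclaim request string.
# $1: amount to reclaim (e.g. 2M), $2: optional swappiness override.
build_reclaim_request() {
    size=${1:-}
    swappiness=${2:-}
    if [ -n "$swappiness" ]; then
        printf '%s swappiness=%s\n' "$size" "$swappiness"
    else
        # No override: the global or per-memcg swappiness setting applies.
        printf '%s\n' "$size"
    fi
}

# Write the request to the memory.reclaim file of a cgroup directory.
# $1: cgroup directory, $2: amount, $3: optional swappiness override.
proactive_reclaim() {
    build_reclaim_request "${2:-}" "${3:-}" > "$1/memory.reclaim"
}
```

    With these helpers, "proactive_reclaim /sys/fs/cgroup 2M 0" issues
    the same request as the echo example above.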
    
    Userspace proactive reclaimers use the memory.reclaim interface to trigger
    reclaim.  The memory.reclaim interface does not provide any way to
    affect the balance of file vs anon during proactive reclaim.  The only
    approach is to adjust the vm.swappiness setting.  However, there are a few
    reasons we look to control the balance of file vs anon during proactive
    reclaim, separately from reactive reclaim:
    
    * Swapout should be limited to manage SSD write endurance.  In near-OOM
      situations we are fine with lots of swap-out to avoid OOMs.  As these
      are typically rare events, they have relatively little impact on write
      endurance.  However, proactive reclaim runs continuously and so its
      impact on SSD write endurance is more significant.  Therefore it is
      desirable to control swap-out for proactive reclaim separately from
      reactive reclaim.
    
    * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
      swap exhaustion.  This makes sense if the swap exhaustion is triggered
      due to reactive reclaim but less so if it is triggered due to proactive
      reclaim (e.g.  one could see OOMs when free memory is ample but anon is
      just particularly cold).  Therefore, it's desirable to have proactive
      reclaim reduce or stop swap-out before the threshold at which OOM
      killing occurs.
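
    As an illustration of the second point, a reclaimer could pick the
    swappiness value per request from current swap usage.  This is only
    a sketch: choose_swappiness is a hypothetical helper, and the cutoff
    is an assumed policy value chosen to sit below the OOM killer's swap
    threshold.

```shell
#!/bin/sh
# Hypothetical helper for illustration. All arguments are integers.
# $1: current swap usage, in percent
# $2: usage percent at which proactive swap-out should stop
#     (assumed to be set below the OOM killer's swap threshold)
# $3: swappiness to request while usage is below that cutoff
choose_swappiness() {
    if [ "$1" -ge "$2" ]; then
        # At or past the cutoff: request reclaim with no swap-out.
        echo 0
    else
        echo "$3"
    fi
}
```

    The result can be substituted into the request, for example
    echo "2M swappiness=$(choose_swappiness 85 80 60)" written to
    memory.reclaim would request reclaim with swappiness=0.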
    
    In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness
    before writes to memory.reclaim[2].  This has been in production for
    nearly two years and has addressed our needs to control proactive vs
    reactive reclaim behavior but is still not ideal for a number of reasons:
    
    * vm.swappiness is a global setting; adjusting it can race/interfere
      with other system administration that wishes to control vm.swappiness.
      In our case, we need to disable Senpai before adjusting vm.swappiness.
    
    * vm.swappiness is stateful - so a crash or restart of Senpai can leave
      a misconfigured setting.  This requires some additional management to
      record the "desired" setting and ensure Senpai always adjusts to it.
    
    With this patch, we avoid these downsides of adjusting vm.swappiness
    globally.
    
    [1] https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
    [2] https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
    
    Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
    Suggested-by: Yosry Ahmed <yosryahmed@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Chris Li <chrisl@kernel.org>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Shakeel Butt <shakeel.butt@linux.dev>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Yue Zhao <findns94@gmail.com>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Nhat Pham <nphamcs@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>