1. 22 Mar, 2022 40 commits
    • mm/memory.c: use helper function range_in_vma() · 88a35912
      Miaohe Lin authored
      Use helper function range_in_vma() to check if address, address + size are
      within the vma range.  Minor readability improvement.
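
For illustration, a minimal user-space sketch of such a range check. The struct and the `_sketch` names are simplified stand-ins invented here, not the kernel definitions, and the half-open [start, end) convention is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct vm_area_struct (assumption). */
struct vma_sketch {
	unsigned long vm_start;	/* first address inside the VMA */
	unsigned long vm_end;	/* first address past the VMA */
};

/* True when [start, end) lies entirely inside the VMA. */
static bool range_in_vma_sketch(const struct vma_sketch *vma,
				unsigned long start, unsigned long end)
{
	assert(start <= end);	/* caller passes address and address + size */
	return vma != NULL && vma->vm_start <= start && end <= vma->vm_end;
}
```

A caller would pass `address` and `address + size` instead of open-coding the two comparisons against vm_start and vm_end.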
      
      Link: https://lkml.kernel.org/r/20220219021441.29173-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap: return 1 from stack_guard_gap __setup() handler · e6d09493
      Randy Dunlap authored
      __setup() handlers should return 1 if the command line option is handled
      and 0 if not (or maybe never return 0; it just pollutes init's
      environment).  This prevents:
      
        Unknown kernel command line parameters \
        "BOOT_IMAGE=/boot/bzImage-517rc5 stack_guard_gap=100", will be \
        passed to user space.
      
        Run /sbin/init as init process
         with arguments:
           /sbin/init
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc5
           stack_guard_gap=100
      
      Return 1 to indicate that the boot option has been handled.
      
      Note that there is no warning message if someone enters:
      	stack_guard_gap=anything_invalid
      and 'val' and stack_guard_gap are both set to 0 due to the use of
      simple_strtoul(). This could be improved by using kstrtoxxx() and
      checking for an error.
      
      It appears that having stack_guard_gap == 0 is valid (if unexpected) since
      using "stack_guard_gap=0" on the kernel command line does that.
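
As a rough user-space sketch of the stricter parsing suggested above: the handler below always returns 1 (option handled) but rejects invalid values instead of silently storing 0. It uses standard strtoul() with an end-pointer check as a stand-in for the kernel's kstrto*() helpers; the function name, the variable and its default of 256 pages are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel variable, counted in pages. */
static unsigned long stack_guard_gap_pages = 256;

/*
 * Sketch of a __setup()-style handler: returns 1 because the option was
 * recognized, even when its value is rejected. Unlike simple_strtoul(),
 * invalid input leaves the current value untouched.
 */
static int cmdline_parse_stack_guard_gap_sketch(const char *p)
{
	char *endptr;
	unsigned long val;

	assert(p != NULL);
	errno = 0;
	val = strtoul(p, &endptr, 10);
	if (errno == 0 && endptr != p && *endptr == '\0')
		stack_guard_gap_pages = val;	/* "0" is accepted as valid */

	return 1;	/* handled: do not pass the option on to init */
}
```

With this shape, "stack_guard_gap=anything_invalid" keeps the previous value, while "stack_guard_gap=0" still sets the gap to 0.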
      
      Link: https://lkml.kernel.org/r/20220222005817.11087-1-rdunlap@infradead.org
      Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Fixes: 1be7107f ("mm: larger stack guard gap, between vmas")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rework swap handling of zap_pte_range · 8018db85
      Peter Xu authored
      Clean the code up by merging the device private/exclusive swap entry
      handling with the rest, then we merge the pte clear operation too.
      
      struct page *page is declared in multiple places in the function; move
      the declaration upward.
      
      free_swap_and_cache() is only useful for the !non_swap_entry() case, so
      put it into that condition.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20220216094810.60572-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change zap_details.zap_mapping into even_cows · 2e148f1e
      Peter Xu authored
      Currently we have a zap_mapping pointer maintained in zap_details; when
      it is specified, we only want to zap the pages that have the same
      mapping as the one the caller has specified.
      
      But what we want to do is actually simpler: we want to skip zapping
      private (COW-ed) pages in some cases.  We can refer to
      unmap_mapping_pages() callers where we could have passed in different
      even_cows values.  The other user is unmap_mapping_folio() where we
      always want to skip private pages.
      
      According to Hugh, we used a mapping pointer for historical reasons, as
      explained here:
      
        https://lore.kernel.org/lkml/391aa58d-ce84-9d4-d68d-d98a9c533255@google.com/
      
      Quoting partly from Hugh:
      
        Which raises the question again of why I did not just use a boolean flag
        there originally: aah, I think I've found why.  In those days there was a
        horrible "optimization", for better performance on some benchmark I guess,
        which when you read from /dev/zero into a private mapping, would map the zero
        page there (look up read_zero_pagealigned() and zeromap_page_range() if you
        dare).  So there was another category of page to be skipped along with the
        anon COWs, and I didn't want multiple tests in the zap loop, so checking
        check_mapping against page->mapping did both.  I think nowadays you could do
        it by checking for PageAnon page (or genuine swap entry) instead.
      
      This patch replaces the zap_details.zap_mapping pointer with the even_cows
      boolean, and then we check it against PageAnon.
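
A minimal sketch of the resulting decision for a mapped page, assuming simplified stand-in types: the `anon` flag models PageAnon(page), and all `_sketch` names are invented for illustration rather than taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types (assumptions for illustration). */
struct page_sketch { bool anon; };		/* models PageAnon(page) */
struct zap_details_sketch { bool even_cows; };	/* replaces zap_mapping */

/* Decide whether a mapped page should be zapped. */
static bool should_zap_page_sketch(const struct zap_details_sketch *details,
				   const struct page_sketch *page)
{
	if (details == NULL || details->even_cows)
		return true;	/* no restriction: zap everything */
	if (page == NULL)
		return true;	/* nothing to test, so do not skip */
	return !page->anon;	/* skip private (COW-ed) pages */
}
```

The single boolean replaces the pointer comparison against page->mapping, so the zap loop only needs one anonymous-page test.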
      
      Link: https://lkml.kernel.org/r/20220216094810.60572-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename zap_skip_check_mapping() to should_zap_page() · 254ab940
      Peter Xu authored
      The previous name is against the natural way people think.  Invert the
      meaning and also the return value.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20220216094810.60572-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't skip swap entry even if zap_details specified · 5abfd71d
      Peter Xu authored
      Patch series "mm: Rework zap ptes on swap entries", v5.
      
      Patch 1 should fix a long-standing bug in zap_pte_range() on
      zap_details usage.  The risk is that some swap entries could be skipped
      when we should have zapped them.
      
      Migration entries are not the major concern, because file-backed memory
      is always zapped in the pattern "first time without page lock, then
      re-zap with page lock", hence the second zap will always make sure all
      migration entries are already recovered.
      
      However, there can be issues with real swap entries being skipped
      erroneously.  There's a reproducer provided in the commit message of
      patch 1 for that.
      
      Patches 2-4 are cleanups that are based on patch 1.  After the whole
      patchset is applied, we should have a very clean view of zap_pte_range().
      
      Only patch 1 needs to be backported to stable if necessary.
      
      This patch (of 4):
      
      The "details" pointer shouldn't be the token to decide whether we should
      skip swap entries.
      
      For example, when the caller specifies details->zap_mapping==NULL, it
      means the user wants to zap all the pages (including COWed pages), so
      we need to look into swap entries because there can be private COWed
      pages that were swapped out.
      
      Skipping some swap entries when details is non-NULL may wrongly leave
      some of the swap entries in place when we should have zapped them.
      
      A reproducer of the problem:
      
      ===8<===
              #define _GNU_SOURCE         /* See feature_test_macros(7) */
              #include <stdio.h>
              #include <assert.h>
              #include <unistd.h>
              #include <sys/mman.h>
              #include <sys/types.h>
      
              int page_size;
              int shmem_fd;
              char *buffer;
      
              int main(void)
              {
                      int ret;
                      char val;
      
                      page_size = getpagesize();
                      shmem_fd = memfd_create("test", 0);
                      assert(shmem_fd >= 0);
      
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE, shmem_fd, 0);
                      assert(buffer != MAP_FAILED);
      
                      /* Write private page, swap it out */
                      buffer[page_size] = 1;
                      madvise(buffer, page_size * 2, MADV_PAGEOUT);
      
                      /* This should drop private buffer[page_size] already */
                      ret = ftruncate(shmem_fd, page_size);
                      assert(ret == 0);
                      /* Recover the size */
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      /* Re-read the data, it should be all zero */
                      val = buffer[page_size];
                      if (val == 0)
                              printf("Good\n");
                      else
                              printf("BUG\n");
              }
      ===8<===
      
      We don't need to touch the pmd path, because pmd never had an issue with
      swap entries.  For example, shmem pmd migration will always be split to
      pte level, and the same applies to swapping of anonymous memory.
      
      Add another helper should_zap_cows() so that we can also check whether we
      should zap private mappings when there's no page pointer specified.
      
      This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
      we should do the same check upon migration entry, hwpoison entry and
      genuine swap entries too.
      
      To be explicit, we should still remember to keep the private entries if
      even_cows==false, and always zap them when even_cows==true.
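
The rule can be sketched in user space as follows. This is an illustration only: the entry classification, the `_sketch` names and the treatment of hwpoison entries as private are assumptions, and the struct follows the even_cows wording used above rather than any particular kernel version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct zap_details (assumption). */
struct zap_details_sketch {
	bool even_cows;	/* also zap private (COW-ed) pages */
};

/* Hypothetical classification of a non-present pte. */
enum swap_pte_kind_sketch {
	SWAP_PTE_GENUINE,		/* real swap entry: always a private page */
	SWAP_PTE_MIGRATION_ANON,	/* migration entry for an anonymous page */
	SWAP_PTE_MIGRATION_FILE,	/* migration entry for a file-backed page */
	SWAP_PTE_HWPOISON,		/* hwpoison marker, treated as private here */
};

/* No details at all means "zap everything", private pages included. */
static bool should_zap_cows_sketch(const struct zap_details_sketch *details)
{
	return details == NULL || details->even_cows;
}

/* Decide whether a non-present pte should be zapped. */
static bool should_zap_swap_pte_sketch(const struct zap_details_sketch *details,
				       enum swap_pte_kind_sketch kind)
{
	if (should_zap_cows_sketch(details))
		return true;
	/* even_cows == false: keep private entries, zap shared ones only. */
	return kind == SWAP_PTE_MIGRATION_FILE;
}
```

The point is that the decision depends on even_cows and on the kind of entry, never on whether a details pointer happens to be present.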
      
      The issue seems to exist starting from the initial commit of git.
      
      [peterx@redhat.com: comment tweaks]
        Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace multiple dcache flush with flush_dcache_folio() · 3150be8f
      Muchun Song authored
      Simplify the code by using flush_dcache_folio().
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic() · 7c25a0b8
      Muchun Song authored
      userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic(), which do not
      flush the cache for the target page.  The target page is then mapped
      into user space at a different (user) address, which might alias with
      the kernel address that was used to copy the data from user space.
      Fix this by inserting flush_dcache_page() after copy_from_user()
      succeeds.
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com
      Fixes: b6ebaedb ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic")
      Fixes: c1a4de99 ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte() · 19b482c2
      Muchun Song authored
      userfaultfd calls shmem_mfill_atomic_pte(), which does not flush the
      cache for the target page.  The target page is then mapped into user
      space at a different (user) address, which might alias with the kernel
      address that was used to copy the data from user space.  Insert
      flush_dcache_page() in the non-zero-page case, and replace
      clear_highpage() with clear_user_highpage(), which already handles the
      cache maintenance.
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com
      Fixes: 8d103963 ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support")
      Fixes: 4c27fe4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte() · 34892366
      Muchun Song authored
      folio_copy() copies the data from one page to the target page, and the
      target page is then mapped at a user space address, which might alias
      with the kernel address that was used as the copy destination.  There
      are two ways to fix this issue.
      
       1) insert flush_dcache_page() after folio_copy().
      
       2) replace folio_copy() with copy_user_huge_page() which already
          considers the cache maintenance.
      
      We chose option 2) to fix the issue since architectures can optimize
      this situation.  It also makes backports easier.
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-5-songmuchun@bytedance.com
      Fixes: 8cc5fcbb ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: fix missing cache flush in copy_huge_page_from_user() · e763243c
      Muchun Song authored
      userfaultfd calls copy_huge_page_from_user(), which does not flush the
      cache for the target page.  The target page is then mapped into user
      space at a different (user) address, which might alias with the kernel
      address that was used to copy the data from user space.
      
      Fix this issue by flushing dcache in copy_huge_page_from_user().
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com
      Fixes: fa4d75c1 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix missing cache flush for all tail pages of compound page · 2771739a
      Muchun Song authored
      The D-cache maintenance inside move_to_new_page() only considers one
      page, so there is still a D-cache maintenance issue for the tail pages
      of a compound page (e.g. THP or HugeTLB).
      
      THP migration is only enabled on x86_64, arm64 and powerpc.  Of these,
      powerpc and arm64 need to keep the I-cache and D-cache consistent, and
      they depend on flush_dcache_page() to do so.
      
      But there are no issues on arm64 and powerpc, since they already handle
      compound page cache flushing in their icache flush functions.
      HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
      riscv, s390 and sh; arm handles the compound page cache flush in
      flush_dcache_page(), but most of the others do not.
      
      In theory, the issue exists on many architectures.  Fix this without
      using flush_dcache_folio(), since it is not backportable.
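
The message does not spell out the replacement. One backportable shape, assumed here purely for illustration, is to flush every subpage of the compound page in a loop. The sketch below models that in user space with a mock flush function that only counts calls; none of the `_sketch` names are kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many pages the mock flush has been called on. */
static unsigned long flushed_pages_sketch;

/* Opaque stand-in for the kernel's struct page (assumption). */
struct page_sketch { int unused; };

/* Mock for an architecture's flush_dcache_page(): only records the call. */
static void flush_dcache_page_sketch(struct page_sketch *page)
{
	(void)page;
	flushed_pages_sketch++;
}

/* Flush the head page and every tail page of an nr-page compound page. */
static void flush_compound_page_sketch(struct page_sketch *head, size_t nr)
{
	size_t i;

	assert(head != NULL && nr > 0);
	for (i = 0; i < nr; i++)
		flush_dcache_page_sketch(head + i);
}
```

For a 2 MiB huge page made of 4 KiB base pages this would issue 512 per-page flushes, where a single flush of the head page would have covered only the first one.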
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix wrong cache flush in remove_migration_pmd() · 5cbcf225
      Muchun Song authored
      Patch series "Fix some cache flush bugs", v5.
      
      This series focuses on fixing cache maintenance.
      
      This patch (of 7):
      
      flush_cache_range() is only justified when the page is already present
      in the process page table, but here the page is only placed there right
      after the flush_cache_range() call.  So using this interface is wrong.
      There is also no need to invalidate the cache, since the entry was
      non-present beforehand in remove_migration_pmd().  So just remove it.
      
      Link: https://lkml.kernel.org/r/20220210123058.79206-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220210123058.79206-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Lars Persson <lars.persson@axis.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove mmu_gathers storage from remaining architectures · d6d22442
      Stafford Horne authored
      Originally the mmu_gathers were removed in commit 1c395176 ("mm: now
      that all old mmu_gather code is gone, remove the storage").  However,
      the openrisc and hexagon architectures were merged around the same time
      and their mmu_gathers were not removed.
      
      This patch removes them from openrisc, hexagon and nds32.
      
      Noticed while cleaning this warning:
      
          arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static?
      
      Link: https://lkml.kernel.org/r/20220205141956.3315419-1-shorne@gmail.com
      Signed-off-by: Stafford Horne <shorne@gmail.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: merge pte_mkhuge() call into arch_make_huge_pte() · 16785bd7
      Anshuman Khandual authored
      Each call into pte_mkhuge() is invariably followed by
      arch_make_huge_pte().  Instead, arch_make_huge_pte() can call
      pte_mkhuge() itself at the beginning.  This updates the generic
      fallback stub for arch_make_huge_pte() and the available platform
      definitions, and makes huge pte creation much cleaner and easier to
      follow.
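
A toy model of the call-site change, with a bitmask standing in for pte_t. The huge bit and all `_sketch` names are hypothetical and chosen only to show the shape of the refactoring.

```c
#include <assert.h>

/* Toy pte: a bitmask instead of the kernel's pte_t (assumption). */
typedef unsigned long pte_sketch_t;

#define PTE_SKETCH_HUGE	(1UL << 0)	/* hypothetical "huge" bit */

static pte_sketch_t pte_mkhuge_sketch(pte_sketch_t pte)
{
	return pte | PTE_SKETCH_HUGE;
}

/* Generic fallback after the change: marks the entry huge by itself. */
static pte_sketch_t arch_make_huge_pte_sketch(pte_sketch_t entry)
{
	return pte_mkhuge_sketch(entry);
}

/* Caller after the change: one call instead of pte_mkhuge() plus the hook. */
static pte_sketch_t make_huge_pte_sketch(pte_sketch_t base)
{
	return arch_make_huge_pte_sketch(base);
}
```

An architecture-specific version would start with the same pte_mkhuge step and then apply its own adjustments, so callers no longer need to remember the pairing.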
      
      Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • selftests, x86: fix how check_cc.sh is being invoked · ef696f93
      Guillaume Tucker authored
      The $(CC) variable used in Makefiles could contain several arguments
      such as "ccache gcc".  These need to be passed as a single string to
      check_cc.sh, otherwise only the first argument will be used as the
      compiler command.  Without quotes, the $(CC) variable is passed as
      distinct arguments which causes the script to fail to build trivial
      programs.
      
      Fix this by adding quotes around $(CC) when calling check_cc.sh to pass
      the whole string as a single argument to the script even if it has
      several words such as "ccache gcc".
      
      Link: https://lkml.kernel.org/r/d0d460d7be0107a69e3c52477761a6fe694c1840.1646991629.git.guillaume.tucker@collabora.com
      Fixes: e9886ace ("selftests, x86: Rework x86 target architecture detection")
      Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
      Tested-by: "kernelci.org bot" <bot@kernelci.org>
      Reviewed-by: Guenter Roeck <groeck@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for tty-related objects · c72d8592
      Vasily Averin authored
      At each login the user forces the kernel to create a new terminal and
      allocate up to ~1Kb of memory for the tty-related structures.
      
      By default, up to 4096 ptys can be created, with 1024 reserved for the
      initial mount namespace only, and the settings are controlled by the
      host admin.
      
      However, this default is not enough for hosters with thousands of
      containers per node, so the host admin can be forced to increase it, up
      to NR_UNIX98_PTY_MAX = 1<<20.
      
      By default a container is restricted by pty mount_opt.max = 1024, but
      an admin inside the container can change it via remount.  As a result,
      one container can consume almost all allowed ptys and allocate up to 1Gb
      of unaccounted memory.
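
The 1Gb figure follows from the two numbers quoted above, as this small sketch shows. The constants restate the values from the text; the per-pty size is the rough "~1Kb" estimate, and the `_SKETCH` names are invented for illustration.

```c
#include <assert.h>

#define NR_UNIX98_PTY_MAX_SKETCH	(1UL << 20)	/* limit quoted in the text */
#define TTY_BYTES_PER_PTY_SKETCH	1024UL		/* the "~1Kb" estimate */

/* Upper bound on tty-related memory for a given number of ptys. */
static unsigned long long max_unaccounted_pty_bytes_sketch(unsigned long ptys)
{
	assert(ptys <= NR_UNIX98_PTY_MAX_SKETCH);
	return (unsigned long long)ptys * TTY_BYTES_PER_PTY_SKETCH;
}
```

With the maximum of 1<<20 ptys the bound is 1<<30 bytes, while the default limit of 4096 ptys keeps it at 4 MiB.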
      
      This is not enough per se to trigger OOM on the host; however, it
      allows a container to significantly exceed its assigned memcg limit,
      which leads to trouble on an over-committed node.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/5d4bca06-7d4f-a905-e518-12981ebca1b3@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rename memcg_cache_id to memcg_kmem_id · 7c52f65d
      Muchun Song authored
      memcg_cache_id(), introduced by commit 2633d7a0 ("slab/slub:
      consider a memcg parameter in kmem_create_cache"), was used to index
      into the kmem_cache->memcg_params->memcg_caches array.  Since
      kmem_cache->memcg_params.memcg_caches was removed by commit
      9855609b ("mm: memcg/slab: use a single set of kmem_caches for all
      accounted allocations"), the name no longer needs to refer to caches.
      Rename it to memcg_kmem_id so that it reflects its relation to kmem.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-17-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: list_lru: rename list_lru_per_memcg to list_lru_memcg · d7011070
      Muchun Song authored
      The name list_lru_memcg was previously taken and has been free since
      the last commit.  Rename list_lru_per_memcg to list_lru_memcg since
      the name is shorter.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-16-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7011070
    • Muchun Song's avatar
      mm: memcontrol: fix cannot alloc the maximum memcg ID · be740503
      Muchun Song authored
      idr_alloc() excludes its upper bound from the range of IDs it can
      return.  So in the current implementation, the maximum memcg ID is
      65534 instead of 65535.  This looks like a bug, so fix it.
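
The off-by-one can be illustrated with a small userspace sketch. Everything here (toy_ids, toy_alloc_id(), toy_largest_id(), TOY_ID_SPACE) is a hypothetical stand-in written for illustration, not the kernel's idr; the only property it models is that the upper bound is exclusive.

```c
#include <stdbool.h>

#define TOY_ID_SPACE 65536	/* hypothetical size of the toy ID space */

/* Hypothetical stand-in for an ID allocator; this is not the kernel idr. */
struct toy_ids {
	bool used[TOY_ID_SPACE];
};

/*
 * Return the lowest free ID in [start, end), or -1 if none is free.
 * The upper bound is exclusive, so end - 1 is the largest ID returned.
 */
static int toy_alloc_id(struct toy_ids *ids, int start, int end)
{
	for (int id = start; id < end && id < TOY_ID_SPACE; id++) {
		if (!ids->used[id]) {
			ids->used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Largest ID that can ever be handed out for a given exclusive bound. */
static int toy_largest_id(int end)
{
	return end - 1;
}
```

With an exclusive bound of 65535 the largest ID is 65534; passing the maximum plus one (65536 in this sketch) is what makes 65535 allocatable.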
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-15-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be740503
    • Muchun Song's avatar
      mm: memcontrol: reuse memory cgroup ID for kmem ID · f9c69d63
      Muchun Song authored
      The memory cgroup uses two idrs: one for the kmem ID and one for the
      memory cgroup ID.  Both have a maximum ID of 64Ki, and either can
      limit the total number of memory cgroups.  We can reuse the memory
      cgroup ID for the kmem ID to simplify the code.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-14-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9c69d63
    • Muchun Song's avatar
      mm: list_lru: replace linear array with xarray · bbca91cc
      Muchun Song authored
      If we run 10k containers in the system, the size of the
      list_lru_memcg->lrus can be ~96KB per list_lru.  When we decrease the
      number of containers, the array is not shrunk, so this does not
      scale.  The xarray is a good choice for this case: we can save a lot
      of memory when there are tens of thousands of containers in the
      system.  Using an xarray also lets us remove the array-resizing
      logic, which simplifies the code.
      
      [akpm@linux-foundation.org: remove unused local]
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-13-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbca91cc
    • Muchun Song's avatar
      mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus · 1f391eb2
      Muchun Song authored
      The purpose of memcg_drain_all_list_lrus() is list_lru reparenting.
      It is very similar to memcg_reparent_objcgs().  Rename it to
      memcg_reparent_list_lrus() so that the name is more consistent with
      memcg_reparent_objcgs().
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-12-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f391eb2
    • Muchun Song's avatar
      mm: list_lru: allocate list_lru_one only when needed · 5abc1e37
      Muchun Song authored
      On our server, we found a suspected memory leak.  The kmalloc-32
      cache consumes more than 6GB of memory, while the other kmem_caches
      consume less than 2GB.
      
      After in-depth analysis, we found that the memory consumption of the
      kmalloc-32 slab cache is caused by list_lru_one allocations.
      
        crash> p memcg_nr_cache_ids
        memcg_nr_cache_ids = $2 = 24574
      
      memcg_nr_cache_ids is very large and memory consumption of each list_lru
      can be calculated with the following formula.
      
        num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
      
      There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
      
        crash> list super_blocks | wc -l
        952
      
      Every mount registers 2 list_lrus, one for inodes and one for
      dentries.  There are 952 super_blocks, so the total memory is 952 * 2
      * 3 MB (~5.6GB).  But the number of memory cgroups is less than 500.
      So I guess more than 12286 containers have been deployed on this
      machine (I do not know why there were so many containers; it may be
      a user's bug or the user really wants to do that), and
      memcg_nr_cache_ids has not been reduced to a suitable value.  This
      can waste a lot of memory.
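
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the formula quoted in the message; the function names are made up for illustration.

```c
#include <stdint.h>

/*
 * Bytes consumed by one list_lru under the quoted formula:
 *   num_numa_node * memcg_nr_cache_ids * slab object size (32 for kmalloc-32)
 */
static uint64_t list_lru_bytes(uint64_t numa_nodes, uint64_t nr_cache_ids,
			       uint64_t slab_obj_size)
{
	return numa_nodes * nr_cache_ids * slab_obj_size;
}

/* Total across all superblocks, each registering lrus_per_sb list_lrus. */
static uint64_t total_list_lru_bytes(uint64_t per_lru, uint64_t superblocks,
				     uint64_t lrus_per_sb)
{
	return per_lru * superblocks * lrus_per_sb;
}
```

With 4 nodes and memcg_nr_cache_ids = 24574, one list_lru comes to 3145472 bytes (~3MB); multiplied by 952 superblocks and 2 list_lrus each, the total is 5988978688 bytes, which matches the ~5.6GB figure.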
      
      Now the infrastructure for dynamic list_lru_one allocation is ready, so
      remove statically allocated memory code to save memory.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-11-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5abc1e37
    • Muchun Song's avatar
      mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online() · da0efe30
      Muchun Song authored
      Moving memcg_online_kmem() to mem_cgroup_css_online() simplifies the
      code, and ->kmemcg_id no longer needs to be set to -1 to indicate
      that the memcg is offline.  In the next patch, ->kmemcg_id will be
      used to synchronize list_lru reparenting, which requires that
      ->kmemcg_id not change.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-10-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da0efe30
    • Muchun Song's avatar
      xarray: use kmem_cache_alloc_lru to allocate xa_node · 9bbdc0f3
      Muchun Song authored
      The workingset code adds the xa_node to the shadow_nodes list, so the
      xa_node should be allocated with kmem_cache_alloc_lru().  Use
      xas_set_lru() to pass the list_lru into which the xa_node will be
      inserted, so that the xa_node reclaim context is set up correctly.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-9-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bbdc0f3
    • Muchun Song's avatar
      mm: dcache: use kmem_cache_alloc_lru() to allocate dentry · f53bf711
      Muchun Song authored
      Like inode cache, the dentry will also be added to its memcg list_lru.  So
      replace kmem_cache_alloc() with kmem_cache_alloc_lru() to allocate dentry.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f53bf711
    • Muchun Song's avatar
      f2fs: allocate inode by using alloc_inode_sb() · 65d3af64
      Muchun Song authored
      The inode allocation is supposed to use alloc_inode_sb(), so convert
      kmem_cache_alloc() to alloc_inode_sb().
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65d3af64
    • Muchun Song's avatar
      fs: allocate inode by using alloc_inode_sb() · fd60b288
      Muchun Song authored
      The inode allocation is supposed to use alloc_inode_sb(), so convert
      kmem_cache_alloc() of all filesystems to alloc_inode_sb().
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Theodore Ts'o <tytso@mit.edu>		[ext4]
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd60b288
    • Muchun Song's avatar
      fs: introduce alloc_inode_sb() to allocate filesystems specific inode · 8b9f3ac5
      Muchun Song authored
      An allocated inode is supposed to be added to its memcg list_lru,
      which should be allocated in advance as well.  That can be done by
      kmem_cache_alloc_lru(), which allocates both the object and the
      list_lru.  File systems are its main users.  So introduce
      alloc_inode_sb() to allocate file-system-specific inodes and set up
      the inode reclaim context properly.  File systems are supposed to
      use alloc_inode_sb() to allocate inodes.
      
      In later patches, we will convert all users to the new API.
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b9f3ac5
    • Muchun Song's avatar
      mm: introduce kmem_cache_alloc_lru · 88f2ef73
      Muchun Song authored
      We currently allocate scope for every memcg to be able to be tracked
      on every superblock instantiated in the system, regardless of whether
      that superblock is even accessible to that memcg.
      
      These huge memcg counts come from container hosts where memcgs are
      confined to just a small subset of the total number of superblocks
      that are instantiated at any given point in time.
      
      For these systems with huge container counts, list_lru does not need the
      capability of tracking every memcg on every superblock.  What it comes
      down to is adding the memcg to the list_lru at the first insert.
      So introduce kmem_cache_alloc_lru() to allocate an object together
      with its list_lru.  In later patches, we will convert all inode and
      dentry allocations from kmem_cache_alloc() to kmem_cache_alloc_lru().
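
The "add the memcg at the first insert" idea can be sketched in userspace as lazy allocation of a per-memcg slot. All names below (toy_list_lru, toy_lru_get(), toy_alloc_with_lru(), TOY_MAX_MEMCGS) are hypothetical and only model the allocation pattern; they are not the kernel's data structures or the signature of kmem_cache_alloc_lru().

```c
#include <stdlib.h>

#define TOY_MAX_MEMCGS 1024	/* hypothetical cap for the sketch */

struct toy_lru_one {
	long nr_items;
};

struct toy_list_lru {
	struct toy_lru_one *per_memcg[TOY_MAX_MEMCGS];	/* NULL until first use */
	int nr_allocated;
};

/* Return the per-memcg list, allocating it on first use; NULL on failure. */
static struct toy_lru_one *toy_lru_get(struct toy_list_lru *lru, int memcg_id)
{
	if (memcg_id < 0 || memcg_id >= TOY_MAX_MEMCGS)
		return NULL;
	if (!lru->per_memcg[memcg_id]) {
		struct toy_lru_one *one = calloc(1, sizeof(*one));

		if (!one)
			return NULL;
		lru->per_memcg[memcg_id] = one;
		lru->nr_allocated++;
	}
	return lru->per_memcg[memcg_id];
}

/* Allocate an object and make sure the list it will later join exists. */
static void *toy_alloc_with_lru(struct toy_list_lru *lru, int memcg_id,
				size_t size)
{
	if (!toy_lru_get(lru, memcg_id))
		return NULL;
	return malloc(size);
}

/* Free every per-memcg list that was allocated on demand. */
static void toy_lru_destroy(struct toy_list_lru *lru)
{
	for (int i = 0; i < TOY_MAX_MEMCGS; i++) {
		free(lru->per_memcg[i]);
		lru->per_memcg[i] = NULL;
	}
	lru->nr_allocated = 0;
}
```

In this model, a memcg that never allocates an object on a given list_lru costs only a NULL pointer, instead of a preallocated list per memcg.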
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f2ef73
    • Muchun Song's avatar
      mm: list_lru: transpose the array of per-node per-memcg lru lists · 6a6b7b77
      Muchun Song authored
      Patch series "Optimize list lru memory consumption", v6.
      
      On our server, we found a suspected memory leak.  The kmalloc-32
      cache consumes more than 6GB of memory, while the other kmem_caches
      consume less than 2GB.
      
      After in-depth analysis, we found that the memory consumption of the
      kmalloc-32 slab cache is caused by list_lru_one allocations.
      
        crash> p memcg_nr_cache_ids
        memcg_nr_cache_ids = $2 = 24574
      
      memcg_nr_cache_ids is very large and memory consumption of each list_lru
      can be calculated with the following formula.
      
        num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
      
      There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
      
        crash> list super_blocks | wc -l
        952
      
      Every mount registers 2 list_lrus, one for inodes and one for
      dentries.  There are 952 super_blocks, so the total memory is 952 * 2
      * 3 MB (~5.6GB).  But the number of memory cgroups is now less than
      500.  So I guess more than 12286 memory cgroups have been created on
      this machine (I do not know why there were so many cgroups; it may be
      a user's bug or the user really wants to do that).  Because
      memcg_nr_cache_ids has not been reduced to a suitable value, a lot
      of memory is wasted.  If we want to reduce memcg_nr_cache_ids, we
      have to *reboot* the server.  This is not what we want.
      
      I had posted a patchset [1] to reduce memcg_nr_cache_ids, but it did
      not fundamentally solve the problem.
      
      We currently allocate scope for every memcg to be able to be tracked
      on every superblock instantiated in the system, regardless of whether
      that superblock is even accessible to that memcg.
      
      These huge memcg counts come from container hosts where memcgs are
      confined to just a small subset of the total number of superblocks
      that are instantiated at any given point in time.
      
      For these systems with huge container counts, list_lru does not need the
      capability of tracking every memcg on every superblock.
      
      What it comes down to is that the list_lru is only needed for a given
      memcg if that memcg is instantiating and freeing objects on a given
      list_lru.
      
      As Dave said, "Which makes me think we should be moving more towards 'add
      the memcg to the list_lru at the first insert' model rather than
      'instantiate all at memcg init time just in case'."
      
      This patchset aims to optimize the list lru memory consumption from
      different aspects.
      
      I did a simple test to show the optimization.  I created 10k memory
      cgroups and mounted 10k filesystems in the system.  The free command
      was used to show how much memory the system consumes after this
      operation (there are 2 numa nodes in the system).
      
              +-----------------------+------------------------+
              |      condition        |   memory consumption   |
              +-----------------------+------------------------+
              | without this patchset |        24464 MB        |
              +-----------------------+------------------------+
              |     after patch 1     |        21957 MB        | <--------+
              +-----------------------+------------------------+          |
              |     after patch 10    |         6895 MB        |          |
              +-----------------------+------------------------+          |
              |     after patch 12    |         4367 MB        |          |
              +-----------------------+------------------------+          |
                                                                          |
              The more the number of nodes, the more obvious the effect---+
      
      BTW, there was a recent discussion [2] on the same issue.
      
      [1] https://lore.kernel.org/all/20210428094949.43579-1-songmuchun@bytedance.com/
      [2] https://lore.kernel.org/all/20210405054848.GA1077931@in.ibm.com/
      
      This series not only optimizes the memory usage of list_lru but also
      simplifies the code.
      
      This patch (of 16):
      
      The current scheme of maintaining per-node per-memcg lru lists looks like:
        struct list_lru {
          struct list_lru_node *node;           (for each node)
            struct list_lru_memcg *memcg_lrus;
              struct list_lru_one *lru[];       (for each memcg)
        }
      
      By effectively transposing the two-dimensional array of list_lru_one
      structures (per-node per-memcg => per-memcg per-node) it's possible to
      save some memory and simplify alloc/dealloc paths. The new scheme looks like:
        struct list_lru {
          struct list_lru_memcg *mlrus;
            struct list_lru_per_memcg *mlru[];  (for each memcg)
              struct list_lru_one node[0];      (for each node)
        }
      
      The memory savings come not only from 'struct rcu_head' but also from
      the pointer arrays used to store the pointers to 'struct
      list_lru_one'.  Before this patch there is one array per node, each
      of size 8 (a pointer) * num_memcgs, so the total size of the arrays
      is 8 * num_nodes * memcg_nr_cache_ids.  After this patch, the size
      becomes 8 * memcg_nr_cache_ids.
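
As a rough check of the pointer-array sizes stated above, the two layouts can be compared numerically. The helper names are invented for this sketch, and the pointer size is passed in explicitly (8 bytes in the message).

```c
#include <stddef.h>

/* Old layout: one pointer array per node, each with one slot per memcg ID. */
static size_t ptr_bytes_per_node_layout(size_t nodes, size_t nr_cache_ids,
					size_t ptr_size)
{
	return ptr_size * nodes * nr_cache_ids;
}

/* New layout: a single pointer array with one slot per memcg ID. */
static size_t ptr_bytes_per_memcg_layout(size_t nr_cache_ids, size_t ptr_size)
{
	return ptr_size * nr_cache_ids;
}
```

For the 4-node machine described earlier (memcg_nr_cache_ids = 24574), this gives 786368 bytes of pointers per list_lru before the transposition and 196592 bytes after it, so the saving grows with the node count.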
      
      Link: https://lkml.kernel.org/r/20220228122126.37293-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220228122126.37293-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Kari Argillander <kari.argillander@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a6b7b77
    • Sebastian Andrzej Siewior's avatar
      mm/memcg: disable migration instead of preemption in drain_all_stock(). · 0790ed62
      Sebastian Andrzej Siewior authored
      Before the for-each-CPU loop, preemption is disabled so that
      drain_local_stock() can be invoked directly instead of scheduling a
      worker.  Ensuring that drain_local_stock() completed on the local CPU is
      not a correctness problem.  It _could_ be that the charging path will be
      forced to reclaim memory because cached charges are still waiting for
      their draining.
      
      Disabling preemption before invoking drain_local_stock() is problematic
      on PREEMPT_RT due to the sleeping locks involved.  To ensure that no CPU
      migration happens across for_each_online_cpu() it is enough to use
      migrate_disable(), which disables migration and keeps the context
      preemptible so a sleeping lock can be acquired.  A race with CPU
      hotplug is not a problem because pcp data is not going away.  In the
      worst case we just schedule draining of an empty stock.
      
      Use migrate_disable() instead of get_cpu() around the
      for_each_online_cpu() loop.
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-7-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0790ed62
    • Sebastian Andrzej Siewior's avatar
      mm/memcg: protect memcg_stock with a local_lock_t · 56751146
      Sebastian Andrzej Siewior authored
      The members of the per-CPU structure memcg_stock_pcp are protected by
      disabling interrupts.  This does not work on PREEMPT_RT because it
      creates an atomic context in which actions are performed which require
      a preemptible context.  One example is obj_cgroup_release().
      
      The IRQ-disable sections can be replaced with local_lock_t which
      preserves the explicit disabling of interrupts while keeping the code
      preemptible on PREEMPT_RT.
      
      drain_obj_stock() drops a reference on obj_cgroup which leads to an
      invocation of obj_cgroup_release() if it is the last object.  This in
      turn leads to recursive locking of the local_lock_t.  To avoid this,
      obj_cgroup_release() is invoked outside of the locked section.
      
      obj_cgroup_uncharge_pages() can be invoked with the local_lock_t
      acquired and without it.  This will lead later to a recursion in
      refill_stock().  To avoid the locking recursion provide
      obj_cgroup_uncharge_pages_locked() which uses the locked version of
      refill_stock().
      
       - Replace disabling interrupts for memcg_stock with a local_lock_t.
      
       - Let drain_obj_stock() return the old struct obj_cgroup which is
         passed to obj_cgroup_put() outside of the locked section.
      
       - Provide obj_cgroup_uncharge_pages_locked() which uses the locked
         version of refill_stock() to avoid recursive locking in
         drain_obj_stock().
      
      Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020
      Link: https://lkml.kernel.org/r/20220226204144.1008339-6-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56751146
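      The locking rule in the commit above (detach the cached object while
      holding the lock, drop the reference only after unlocking) can be
      sketched in userspace.  This is a minimal model, not the kernel code:
      stock_lock()/stock_unlock(), struct objcg_model, struct stock_model and
      drain_and_put() are hypothetical names, and a plain flag with assert()
      stands in for a non-recursive local_lock_t.  As a simplification, the
      release path takes the lock directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; the names only echo the commit text. */

/* Stands in for a non-recursive per-CPU lock: taking it twice asserts. */
bool stock_locked;

void stock_lock(void)   { assert(!stock_locked); stock_locked = true; }
void stock_unlock(void) { assert(stock_locked);  stock_locked = false; }

struct objcg_model {
	int refcount;
	int released;
};

struct stock_model {
	struct objcg_model *cached;
};

/* Release path that itself needs the stock lock (the recursion hazard). */
void objcg_release(struct objcg_model *o)
{
	stock_lock();
	o->released = 1;
	stock_unlock();
}

/* Drop one reference; the last put triggers the release path. */
void objcg_put(struct objcg_model *o)
{
	if (o && --o->refcount == 0)
		objcg_release(o);
}

/*
 * Caller must hold the lock.  Detach the cached object and hand it back
 * instead of dropping the reference here.
 */
struct objcg_model *drain_stock(struct stock_model *s)
{
	struct objcg_model *old = s->cached;

	assert(stock_locked);
	s->cached = NULL;
	return old;
}

void drain_and_put(struct stock_model *s)
{
	struct objcg_model *old;

	stock_lock();
	old = drain_stock(s);
	stock_unlock();

	/* The last reference may be dropped here; the lock is not held. */
	objcg_put(old);
}
```

      Calling objcg_put() inside the locked section instead would trip the
      assert in stock_lock() as soon as the last reference is dropped, which
      is the modelled equivalent of the recursive locking described above.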
    • Johannes Weiner's avatar
      mm/memcg: opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock() · af9a3b69
      Johannes Weiner authored
      Provide the inner part of refill_stock() as __refill_stock() without
      disabling interrupts.  This eases the integration of local_lock_t where
      recursive locking must be avoided.
      
      Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
      __refill_stock().  The caller of drain_obj_stock() already disables
      interrupts.
      
      [bigeasy@linutronix.de: patch body around Johannes' diff]
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-5-bigeasy@linutronix.de
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af9a3b69
    • Sebastian Andrzej Siewior's avatar
      mm/memcg: protect per-CPU counter by disabling preemption on PREEMPT_RT where needed. · be3e67b5
      Sebastian Andrzej Siewior authored
      The per-CPU counters are modified with the non-atomic modifier.  The
      consistency is ensured by disabling interrupts for the update.  On
      non-PREEMPT_RT configurations this works because acquiring a spinlock_t
      typed lock with the _irq() suffix disables interrupts.  On PREEMPT_RT
      configurations the RMW operation can be interrupted.
      
      Another problem is that mem_cgroup_swapout() expects to be invoked with
      disabled interrupts because the caller has to acquire a spinlock_t which
      is acquired with disabled interrupts.  Since spinlock_t never disables
      interrupts on PREEMPT_RT the interrupts are never disabled at this
      point.
      
      The code is never called from in_irq() context on PREEMPT_RT therefore
      disabling preemption during the update is sufficient on PREEMPT_RT.  The
      sections which explicitly disable interrupts can remain on PREEMPT_RT
      because the sections remain short and they don't involve sleeping locks
      (memcg_check_events() is doing nothing on PREEMPT_RT).
      
      Disable preemption during update of the per-CPU variables which do not
      explicitly disable interrupts.
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-4-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be3e67b5
    • Sebastian Andrzej Siewior's avatar
      mm/memcg: disable threshold event handlers on PREEMPT_RT · 2343e88d
      Sebastian Andrzej Siewior authored
      During the integration of PREEMPT_RT support, the code flow around
      memcg_check_events() resulted in `twisted code'.  Moving the code around
      to avoid that would then lead to an additional local-irq-save
      section within memcg_check_events().  While looking better, it adds a
      local-irq-save section to a code flow which is usually within a
      local-irq-off block on non-PREEMPT_RT configurations.
      
      The threshold event handler is a deprecated memcg v1 feature.  Instead
      of trying to get it to work under PREEMPT_RT just disable it.  There
      should be no users on PREEMPT_RT.  From that perspective it makes even
      less sense to get it to work under PREEMPT_RT while having zero users.
      
      Make memory.soft_limit_in_bytes and cgroup.event_control return
      -EOPNOTSUPP on PREEMPT_RT.  Make an empty memcg_check_events() and
      memcg_write_event_control() which return only -EOPNOTSUPP on PREEMPT_RT.
      Document that the two knobs are disabled on PREEMPT_RT.
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-3-bigeasy@linutronix.de
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Suggested-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2343e88d
    • Michal Hocko's avatar
      mm/memcg: revert ("mm/memcg: optimize user context object stock access") · fead2b86
      Michal Hocko authored
      Patch series "mm/memcg: Address PREEMPT_RT problems instead of disabling it", v5.
      
      This series aims to address the memcg related problem on PREEMPT_RT.
      
      I tested them on CONFIG_PREEMPT and CONFIG_PREEMPT_RT with the
      tools/testing/selftests/cgroup/* tests and I haven't observed any
      regressions (other than the lockdep report that is already there).
      
      This patch (of 6):
      
      The optimisation is based on a micro benchmark where local_irq_save() is
      more expensive than a preempt_disable().  There is no evidence that it
      is visible in a real-world workload and there are CPUs where the
      opposite is true (local_irq_save() is cheaper than preempt_disable()).
      
      Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
      where preempt_disable() is optimized away.  There is no improvement with
      PREEMPT_DYNAMIC since the preemption counter is always available.
      
      The optimization also makes the PREEMPT_RT integration more complicated
      since most of the assumptions are not true on PREEMPT_RT.
      
      Revert the optimisation since it complicates the PREEMPT_RT integration
      and the improvement is hardly visible.
      
      [bigeasy@linutronix.de: patch body around Michal's diff]
      
      Link: https://lkml.kernel.org/r/20220226204144.1008339-1-bigeasy@linutronix.de
      Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4@dhcp22.suse.cz
      Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S@linutronix.de
      Link: https://lkml.kernel.org/r/20220226204144.1008339-2-bigeasy@linutronix.de
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: kernel test robot <oliver.sang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fead2b86
    • Randy Dunlap's avatar
      mm/memcontrol: return 1 from cgroup.memory __setup() handler · 460a79e1
      Randy Dunlap authored
      __setup() handlers should return 1 if the command line option is handled
      and 0 if not (or maybe never return 0; it just pollutes init's
      environment).
      
      The only reason that this particular __setup handler does not pollute
      init's environment is that the setup string contains a '.', as in
      "cgroup.memory".  This causes init/main.c::unknown_bootoption() to
      consider it to be an "Unused module parameter" and ignore it.  (This is
      for parsing of loadable module parameters any time after kernel init.)
      Otherwise the string "cgroup.memory=whatever" would be added to init's
      environment strings.
      
      Instead of relying on this '.' quirk, just return 1 to indicate that the
      boot option has been handled.
      
      Note that there is no warning message if someone enters:
      	cgroup.memory=anything_invalid
      
      Link: https://lkml.kernel.org/r/20220222005811.10672-1-rdunlap@infradead.org
      Fixes: f7e1cb6e ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
      Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      460a79e1
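      The return-value convention for __setup() handlers described above, and
      the '.' quirk that hid the problem for "cgroup.memory", can be sketched
      as a small userspace model.  This is an illustration under stated
      assumptions, not the code in init/main.c: struct setup_entry,
      parse_one(), the outcome enum and both sample handlers are hypothetical
      names, and the module-parameter check is reduced to "the key contains
      a '.'".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

#define MAX_ENV 8

typedef int (*setup_fn)(const char *val);

struct setup_entry {
	const char *prefix;	/* e.g. "cgroup.memory=" */
	setup_fn fn;
};

enum outcome { HANDLED, DROPPED_AS_MODULE_PARAM, PASSED_TO_INIT };

/* Strings that would end up in init's environment. */
const char *init_env[MAX_ENV];
size_t init_env_count;

/* Old style: the handler consumes its option but still returns 0. */
int returns_zero(const char *val) { (void)val; return 0; }

/* Fixed style: the handler returns 1 to report "handled". */
int returns_one(const char *val) { (void)val; return 1; }

/* Offer one "key=value" parameter to the handler table. */
enum outcome parse_one(const struct setup_entry *tbl, size_t n,
		       const char *param)
{
	const char *eq = strchr(param, '=');
	size_t key_len = eq ? (size_t)(eq - param) : strlen(param);

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(tbl[i].prefix);

		if (strncmp(param, tbl[i].prefix, len) == 0 &&
		    tbl[i].fn(param + len))
			return HANDLED;
	}

	/* The '.' quirk: treated as an unused module parameter, ignored. */
	if (memchr(param, '.', key_len))
		return DROPPED_AS_MODULE_PARAM;

	/* Nobody claimed it: it is passed on to init's environment. */
	assert(init_env_count < MAX_ENV);
	init_env[init_env_count++] = param;
	return PASSED_TO_INIT;
}
```

      In this model a handler that returns 0 lets "stack_guard_gap=100" leak
      into init_env, while "cgroup.memory=..." is only saved by the '.' check.
      With handlers that return 1, both are reported as HANDLED and neither
      outcome depends on the quirk.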
    • Shakeel Butt's avatar
      memcg: synchronously enforce memory.high for large overcharges · c9afe31e
      Shakeel Butt authored
      The high limit is used to throttle the workload without invoking the
      oom-killer.  Recently we tried to use the high limit to right-size our
      internal workloads, more specifically to dynamically adjust the limits
      of the workload without letting the workload get oom-killed.  However,
      due to a limitation in the implementation of high limit enforcement,
      we observed that the mechanism fails for some real workloads.
      
      The high limit is enforced on return-to-userspace, i.e. the kernel lets
      the usage go over the limit and, when the execution returns to
      userspace, the high reclaim is triggered and the process can get
      throttled as well.  However, this mechanism fails for workloads which do
      large allocations in a single kernel entry, e.g. applications that
      mlock() a large chunk of memory in a single syscall.  Such applications
      bypass the high limit and can trigger the oom-killer.
      
      To make high limit enforcement more robust, this patch makes the limit
      enforcement synchronous only if the accumulated overcharge becomes
      larger than MEMCG_CHARGE_BATCH.  So, most of the allocations would still
      be throttled on the return-to-userspace path and only the extreme
      allocations which accumulate a large amount of overcharge without
      returning to userspace will be throttled synchronously.  The value
      MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the
      memcg codebase use this constant, so for now use the same one.
      
      Link: https://lkml.kernel.org/r/20220211064917.2028469-5-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9afe31e
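      The enforcement policy described in the memory.high commit above can be
      sketched as a userspace model: overcharge accumulates per task and is
      normally handled on return to userspace, but is handled immediately once
      it exceeds MEMCG_CHARGE_BATCH.  This is a rough sketch, not the patch
      itself: struct task_model, struct memcg_model, charge() and
      return_to_userspace() are hypothetical names, the value 32 for
      MEMCG_CHARGE_BATCH is an assumption made for illustration, and reclaim
      is only counted, not simulated.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration; the commit text only names the constant. */
#define MEMCG_CHARGE_BATCH 32U

/* Hypothetical per-task state, loosely following the commit description. */
struct task_model {
	unsigned int nr_pages_over_high;	/* overcharge since last handling */
	unsigned int sync_reclaims;		/* handled inside the kernel entry */
	unsigned int deferred_reclaims;		/* handled on return to userspace */
};

struct memcg_model {
	unsigned long usage;	/* in pages */
	unsigned long high;	/* memory.high, in pages */
};

/* Stand-in for reclaim and throttling: count the event, reset the tally. */
void handle_over_high(struct task_model *t, bool synchronous)
{
	if (t->nr_pages_over_high == 0)
		return;
	if (synchronous)
		t->sync_reclaims++;
	else
		t->deferred_reclaims++;
	t->nr_pages_over_high = 0;
}

/* Charge nr_pages; enforce synchronously only for large overcharges. */
void charge(struct memcg_model *cg, struct task_model *t,
	    unsigned int nr_pages)
{
	assert(nr_pages > 0);

	cg->usage += nr_pages;
	if (cg->usage <= cg->high)
		return;

	t->nr_pages_over_high += nr_pages;
	if (t->nr_pages_over_high > MEMCG_CHARGE_BATCH)
		handle_over_high(t, true);
}

/* Models the existing enforcement point at the end of a kernel entry. */
void return_to_userspace(struct task_model *t)
{
	handle_over_high(t, false);
}
```

      In this model a task that charges a few pages per kernel entry is still
      handled only in return_to_userspace(), while a single entry that keeps
      charging (the mlock() case from the commit text) crosses the batch
      threshold and is handled inside charge().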