  1. 26 Apr, 2024 1 commit
  2. 12 Dec, 2023 2 commits
  3. 11 Dec, 2023 3 commits
  4. 25 Oct, 2023 7 commits
    • hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions · c5ad3233
      Usama Arif authored
      Most function calls in hugetlb.c are made with folio arguments.  This
      brings hugetlb_vmemmap calls in line with them by using folio instead of
      head struct page.  Head struct page is still needed within these
      functions.
      
      The set/clear/test functions for hugepages are also changed to folio
      versions.
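
      As a rough sketch of the interface shape described above (signatures
      paraphrased, not copied from the tree):

        /* Before: callers passed the head struct page. */
        void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);

        /* After: callers pass the folio; the head struct page is derived
         * inside the helper, where it is still needed. */
        void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);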
      
      Link: https://lkml.kernel.org/r/20231011144557.1720481-2-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: batch TLB flushes when restoring vmemmap · c24f188b
      Mike Kravetz authored
      Update the internal hugetlb restore vmemmap code path such that TLB
      flushing can be batched.  Use the existing mechanism of passing the
      VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
      performed for individual pages.  The routine
      hugetlb_vmemmap_restore_folios is the only user of this new mechanism, and
      it will perform a global flush after all vmemmap is restored.
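
      A minimal sketch of that batching, assuming a per-folio restore helper
      that honours the no-flush flag (the helper name and loop below are
      illustrative, not the exact mm/hugetlb_vmemmap.c code):

        static void restore_vmemmap_batched(const struct hstate *h,
                                            struct list_head *folio_list)
        {
                struct folio *folio;

                /* Restore each folio's vmemmap without flushing per page. */
                list_for_each_entry(folio, folio_list, lru)
                        __hugetlb_vmemmap_restore(h, folio,
                                                  VMEMMAP_REMAP_NO_TLB_FLUSH);

                /* One global flush once the whole batch has been restored. */
                flush_tlb_all();
        }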
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-9-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: batch TLB flushes when freeing vmemmap · f13b83fd
      Joao Martins authored
      Now that a list of pages is deduplicated at once, the TLB flush can be
      batched for all vmemmap pages that got remapped.
      
      Expand the flags field value to pass whether to skip the TLB flush on
      remap of the PTE.
      
      The TLB flush is global because we have no guarantee from the caller that
      the set of folios is contiguous, and we do not want to add the complexity
      of composing a list of kVAs to flush.
      
      Modified by Mike Kravetz to perform TLB flush on single folio if an
      error is encountered.
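
      The flags plumbing can be pictured roughly as below (fields paraphrased
      from the vmemmap remap walker; treat it as a sketch of the layout rather
      than the exact structure definition):

        struct vmemmap_remap_walk {
                void             (*remap_pte)(pte_t *pte, unsigned long addr,
                                              struct vmemmap_remap_walk *walk);
                unsigned long    nr_walked;
                struct page      *reuse_page;
                unsigned long    reuse_addr;
                struct list_head *vmemmap_pages;
                unsigned long    flags;   /* e.g. VMEMMAP_REMAP_NO_TLB_FLUSH */
        };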
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-8-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: batch PMD split for bulk vmemmap dedup · f4b7e3ef
      Joao Martins authored
      In an effort to minimize the number of TLB flushes, batch all PMD splits
      belonging to a range of pages in order to perform only one (global) TLB
      flush.

      Add a flags field to the walker and pass whether it's a bulk allocation or
      just a single page, to decide whether to remap.  The first value
      (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request not to do the TLB
      flush when we split the PMD.
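
      Pictured as definitions, the split flag introduced here and the remap flag
      used by the TLB-flush batching patch look roughly like this (bit
      assignments are illustrative; the real ones live in mm/hugetlb_vmemmap.c):

        /* Skip the TLB flush when a vmemmap PMD is split. */
        #define VMEMMAP_SPLIT_NO_TLB_FLUSH      BIT(0)
        /* Skip the TLB flush when vmemmap PTEs are remapped. */
        #define VMEMMAP_REMAP_NO_TLB_FLUSH      BIT(1)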
      
      Rebased and updated by Mike Kravetz
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-7-mike.kravetz@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: batch freeing of vmemmap pages · 91f386bf
      Mike Kravetz authored
      Now that batching of hugetlb vmemmap optimization processing is possible,
      batch the freeing of vmemmap pages.  When freeing vmemmap pages for a
      hugetlb page, we add them to a list that is freed after the entire batch
      has been processed.
      
      This enhances the ability to return contiguous ranges of memory to the low
      level allocators.
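
      The caller-side flow of this batching looks roughly as follows (helper
      names are illustrative; the real routines live in mm/hugetlb_vmemmap.c):

        static void optimize_batch_and_free_vmemmap(const struct hstate *h,
                                                    struct list_head *folio_list)
        {
                LIST_HEAD(vmemmap_pages);  /* pages freed by the whole batch */
                struct folio *folio;

                /* Each per-folio pass appends its freed vmemmap pages here. */
                list_for_each_entry(folio, folio_list, lru)
                        __hugetlb_vmemmap_optimize(h, folio, &vmemmap_pages);

                /* Return the whole batch to the allocator in one go. */
                free_vmemmap_page_list(&vmemmap_pages);
        }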
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-6-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: perform vmemmap restoration on a list of pages · cfb8c750
      Mike Kravetz authored
      The routine update_and_free_pages_bulk already performs vmemmap
      restoration on the list of hugetlb pages in a separate step.  In
      preparation for more functionality to be added in this step, create a new
      routine hugetlb_vmemmap_restore_folios() that will restore vmemmap for a
      list of folios.
      
      This new routine must provide sufficient feedback about errors and actual
      restoration performed so that update_and_free_pages_bulk can perform
      optimally.
      
      Special care must be taken when encountering an error from
      hugetlb_vmemmap_restore_folios.  We want to continue making as much
      forward progress as possible.  A new routine bulk_vmemmap_restore_error
      handles this specific situation.
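
      The contract described above can be sketched as a prototype (paraphrased;
      the comment summarizes the feedback requirements from the text, not a
      verbatim header):

        /*
         * Restore vmemmap for every folio on @folio_list, reporting how much
         * restoration was actually performed and any error encountered so
         * that update_and_free_pages_bulk() can react accordingly.
         */
        long hugetlb_vmemmap_restore_folios(const struct hstate *h,
                                            struct list_head *folio_list,
                                            struct list_head *non_hvo_folios);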
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: perform vmemmap optimization on a list of pages · 79359d6d
      Mike Kravetz authored
      When adding hugetlb pages to the pool, we first create a list of the
      allocated pages before adding to the pool.  Pass this list of pages to a
      new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.
      
      Due to significant differences in vmemmap initialization for bootmem
      allocated hugetlb pages, a new routine prep_and_add_bootmem_folios is
      created.
      
      We also modify the routine vmemmap_should_optimize() to check for pages
      that are already optimized.  There are code paths that might request
      vmemmap optimization twice and we want to make sure this is not attempted.
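
      A sketch of the idempotence check added to vmemmap_should_optimize() (the
      predicate shown is the existing hugetlb page flag test; placement and the
      surrounding checks are paraphrased):

        static bool vmemmap_should_optimize(const struct hstate *h,
                                            struct page *head)
        {
                /* Don't optimize a page whose vmemmap is already optimized. */
                if (HPageVmemmapOptimized(head))
                        return false;

                /* ... existing eligibility checks ... */
                return true;
        }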
      
      Link: https://lkml.kernel.org/r/20231019023113.345257-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Konrad Dybcio <konradybcio@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. 04 Oct, 2023 5 commits
    • mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO · fde1c4ec
      Usama Arif authored
      The new boot flow when it comes to initialization of gigantic pages is as
      follows:
      
      - At boot time, for a gigantic page during __alloc_bootmem_hugepage, the
        region after the first struct page is marked as noinit.
      
      - This results in only the first struct page being initialized in
        reserve_bootmem_region.  As the tail struct pages are not initialized at
        this point, there can be a significant saving in boot time if HVO
        succeeds later on.
      
      - Later on in the boot, the head page is prepped and the first
        HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
        are initialized.
      
      - HVO is attempted.  If it is not successful, then the rest of the tail
        struct pages are initialized.  If it is successful, no more tail struct
        pages need to be initialized saving significant boot time.
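
      The first step above essentially comes down to one memblock call when the
      gigantic page is allocated from bootmem; a hedged sketch (the helper is
      the memblock noinit API this series relies on, 'm' stands for the
      bootmem-allocated gigantic page, and the exact hunk may differ):

        static void skip_tail_struct_page_init(struct hstate *h, void *m)
        {
                /*
                 * Mark everything past the first 4K of the gigantic page as
                 * noinit, so reserve_bootmem_region() only initializes the
                 * struct page describing that first 4K (the head page).
                 */
                memblock_reserved_mark_noinit(virt_to_phys(m + PAGE_SIZE),
                                              huge_page_size(h) - PAGE_SIZE);
        }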
      
      The WARN_ON for increased ref count in gather_bootmem_prealloc was changed
      to a VM_BUG_ON.  This is OK as there should be no speculative references
      this early in the boot process.  The VM_BUG_ONs are there just in case such
      code is introduced.
      
      [akpm@linux-foundation.org: make it nicer for 80 cols]
      Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: use nid of the head page to reallocate it · a9e34ea1
      Usama Arif authored
      Patch series "mm: hugetlb: Skip initialization of gigantic tail struct
      pages if freed by HVO", v5.
      
      This series moves the boot time initialization of tail struct pages of a
      gigantic page to later on in the boot.  Only the
      HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
      are initialized at the start.  If HVO is successful, then no more tail
      struct pages need to be initialized.  For a 1G hugepage, this series avoids
      initialization of 262144 - 63 = 262081 struct pages per hugepage.
      
      When tested on a 512G system (allocating 500 1G hugepages), the kexec-boot
      times with DEFERRED_STRUCT_PAGE_INIT enabled are:
      
      - with patches, HVO enabled: 1.32 seconds
      - with patches, HVO disabled: 2.15 seconds
      - without patches, HVO enabled: 3.90  seconds
      - without patches, HVO disabled: 3.58 seconds
      
      This represents an approximately 70% reduction in boot time and will
      significantly reduce server downtime when using a large number of gigantic
      pages.
      
      
      This patch (of 4):
      
      If tail page prep and initialization is skipped, then the "start" page
      will not contain the correct nid.  Use the nid from the first vmemmap page.
      
      Link: https://lkml.kernel.org/r/20230913105401.519709-1-usama.arif@bytedance.com
      Link: https://lkml.kernel.org/r/20230913105401.519709-2-usama.arif@bytedance.com
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Punit Agrawal <punit.agrawal@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: allow alloc vmemmap pages fallback to other nodes · 6a898c27
      Yuan Can authored
      In vmemmap_remap_free(), a new head vmemmap page is allocated to avoid
      breaking a contiguous block of struct page memory; however, the allocation
      can always fail when the given node is a movable node.  Remove
      __GFP_THISNODE so the allocation can fall back to other nodes and still
      help avoid fragmentation.
      
      Link: https://lkml.kernel.org/r/20230906093157.9737-1-yuancan@huawei.com
      Signed-off-by: Yuan Can <yuancan@huawei.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: fix hugetlb page number decrease failed on movable nodes · 2eaa6c2a
      Yuan Can authored
      Decreasing the number of hugetlb pages failed with the following message:
      
       sh: page allocation failure: order:0, mode:0x204cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_THISNODE)
       CPU: 1 PID: 112 Comm: sh Not tainted 6.5.0-rc7-... #45
       Hardware name: linux,dummy-virt (DT)
       Call trace:
        dump_backtrace.part.6+0x84/0xe4
        show_stack+0x18/0x24
        dump_stack_lvl+0x48/0x60
        dump_stack+0x18/0x24
        warn_alloc+0x100/0x1bc
        __alloc_pages_slowpath.constprop.107+0xa40/0xad8
        __alloc_pages+0x244/0x2d0
        hugetlb_vmemmap_restore+0x104/0x1e4
        __update_and_free_hugetlb_folio+0x44/0x1f4
        update_and_free_hugetlb_folio+0x20/0x68
        update_and_free_pages_bulk+0x4c/0xac
        set_max_huge_pages+0x198/0x334
        nr_hugepages_store_common+0x118/0x178
        nr_hugepages_store+0x18/0x24
        kobj_attr_store+0x18/0x2c
        sysfs_kf_write+0x40/0x54
        kernfs_fop_write_iter+0x164/0x1dc
        vfs_write+0x3a8/0x460
        ksys_write+0x6c/0x100
        __arm64_sys_write+0x1c/0x28
        invoke_syscall+0x44/0x100
        el0_svc_common.constprop.1+0x6c/0xe4
        do_el0_svc+0x38/0x94
        el0_svc+0x28/0x74
        el0t_64_sync_handler+0xa0/0xc4
        el0t_64_sync+0x174/0x178
       Mem-Info:
        ...
      
      The reason is that the hugetlb pages being released are allocated from
      movable nodes, and with hugetlb_optimize_vmemmap enabled, vmemmap pages
      need to be allocated from the same node while the hugetlb pages are being
      released.  With GFP_KERNEL and __GFP_THISNODE set, allocating from a
      movable node always fails.  Fix this problem by removing __GFP_THISNODE.
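
      A sketch of the allocation after the fix (the gfp flags match the failure
      above minus __GFP_THISNODE; the wrapper function is illustrative and the
      real call site in hugetlb_vmemmap is paraphrased):

        static struct page *alloc_vmemmap_page(int nid)
        {
                /* __GFP_THISNODE dropped so other nodes are acceptable. */
                gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL;

                /* nid stays the preferred node; fallback is now allowed. */
                return alloc_pages_node(nid, gfp_mask, 0);
        }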
      
      Link: https://lkml.kernel.org/r/20230905124503.24899-1-yuancan@huawei.com
      Fixes: ad2fa371 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
      Signed-off-by: Yuan Can <yuancan@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: set hugetlb page flag before optimizing vmemmap · d8f5f7e4
      Mike Kravetz authored
      Currently, vmemmap optimization of hugetlb pages is performed before the
      hugetlb flag (previously hugetlb destructor) is set identifying it as a
      hugetlb folio.  This means there is a window of time where an ordinary
      folio does not have all associated vmemmap present.  The core mm only
      expects vmemmap to be potentially optimized for hugetlb and device dax. 
      This can cause problems in code such as memory error handling that may
      want to write to tail struct pages.
      
      There is only one call to perform hugetlb vmemmap optimization today.  To
      fix this issue, simply set the hugetlb flag before that call.
      
      There was a similar issue in the free hugetlb path that was previously
      addressed.  The two routines that optimize or restore hugetlb vmemmap
      should only be passed hugetlb folios/pages.  To catch any callers not
      following this rule, add VM_WARN_ON calls to the routines.  In the hugetlb
      free code paths, some calls could be made to restore vmemmap after
      clearing the hugetlb flag.  This was 'safe' as in these cases vmemmap was
      already present and the call was a NOOP.  However, for consistency these
      calls were eliminated so that we can add the VM_WARN_ON checks.
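
      A sketch of the guard described above; the predicate is the real hugetlb
      folio test, while its exact placement inside the optimize/restore
      routines is paraphrased:

        static void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
        {
                /* These routines must only ever see hugetlb folios/pages. */
                VM_WARN_ON_ONCE(!folio_test_hugetlb(page_folio(head)));

                /* ... existing vmemmap optimization ... */
        }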
      
      Link: https://lkml.kernel.org/r/20230829213734.69673-1-mike.kravetz@oracle.com
      Fixes: f41f2ed4 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. 18 Aug, 2023 1 commit
  7. 19 Jun, 2023 1 commit
    • mm: ptep_get() conversion · c33c7948
      Ryan Roberts authored
      Convert all instances of direct pte_t* dereferencing to instead use
      ptep_get() helper.  This means that by default, the accesses change from a
      C dereference to a READ_ONCE().  This is technically the correct thing to
      do since where pgtables are modified by HW (for access/dirty) they are
      volatile and therefore we should always ensure READ_ONCE() semantics.
      
      But more importantly, by always using the helper, it can be overridden by
      the architecture to fully encapsulate the contents of the pte.  Arch code
      is deliberately not converted, as the arch code knows best.  It is
      intended that arch code (arm64) will override the default with its own
      implementation that can (e.g.) hide certain bits from the core code, or
      determine young/dirty status by mixing in state from another source.
      
      Conversion was done using Coccinelle:
      
      ----
      
      // $ make coccicheck \
      //          COCCI=ptepget.cocci \
      //          SPFLAGS="--include-headers" \
      //          MODE=patch
      
      virtual patch
      
      @ depends on patch @
      pte_t *v;
      @@
      
      - *v
      + ptep_get(v)
      
      ----
      
      Then reviewed and hand-edited to avoid multiple unnecessary calls to
      ptep_get(), instead opting to store the result of a single call in a
      variable, where it is correct to do so.  This aims to negate any cost of
      READ_ONCE() and will benefit arch-overrides that may be more complex.
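
      A generic illustration of the conversion and of the single-read pattern
      described above (this is not a specific call site from the patch):

        static inline bool pte_is_young_example(pte_t *ptep)
        {
                /* Before the series this would have read:  pte_t pte = *ptep;  */
                pte_t pte = ptep_get(ptep);     /* READ_ONCE() semantics by default */

                /* Read once into a local, then test the local as often as needed. */
                return pte_present(pte) && pte_young(pte);
        }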
      
      Included is a fix for an issue in an earlier version of this patch that
      was pointed out by kernel test robot.  The issue arose because config
      MMU=n elides definition of the ptep helper functions, including
      ptep_get().  HUGETLB_PAGE=n configs still define a simple
      huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
      So when both configs are disabled, this caused a build error because
      ptep_get() is not defined.  Fix by continuing to do a direct dereference
      when MMU=n.  This is safe because for this config the arch code cannot be
      trying to virtualize the ptes because none of the ptep helpers are
      defined.
      
      Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Airlie <airlied@gmail.com>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. 09 Jun, 2023 1 commit
  9. 18 Apr, 2023 1 commit
  10. 28 Mar, 2023 2 commits
  11. 21 Feb, 2023 1 commit
    • sysctl: fix proc_dobool() usability · f1aa2eb5
      Ondrej Mosnacek authored
      Currently proc_dobool expects a (bool *) in table->data, but sizeof(int)
      in table->maxlen, because it uses do_proc_dointvec() directly.
      
      This is unsafe for at least two reasons:
      1. A sysctl table definition may use { .data = &variable, .maxlen =
         sizeof(variable) }, not realizing that this makes the sysctl unusable
         (see the Fixes: tag) and that they need to use the completely
         counterintuitive sizeof(int) instead.
      2. proc_dobool() will currently try to parse an array of values if given
         .maxlen >= 2*sizeof(int), but will try to write values of type bool
         by offsets of sizeof(int), so it will not work correctly with either
         an (int *) or a (bool *).  There is no .maxlen validation to prevent
         this.
      
      Fix this by:
      1. Constraining proc_dobool() to allow only one value and .maxlen ==
         sizeof(bool).
      2. Wrapping the original struct ctl_table in a temporary one with .data
         pointing to a local int variable and .maxlen set to sizeof(int) and
         passing this one to proc_dointvec(), converting the value to/from
         bool as needed (using proc_dou8vec_minmax() as an example).
      3. Extending sysctl_check_table() to enforce proc_dobool() expectations.
      4. Fixing the proc_dobool() docstring (it was just copy-pasted from
         proc_douintvec, apparently...).
      5. Converting all existing proc_dobool() users to set .maxlen to
         sizeof(bool) instead of sizeof(int), as sketched below.
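
      A correct user after this change looks roughly like the following (the
      table entry and names below are made up for illustration; the
      .maxlen/.proc_handler pairing is the point):

        static bool my_feature_enabled;

        static struct ctl_table my_feature_table[] = {
                {
                        .procname     = "my_feature_enabled",
                        .data         = &my_feature_enabled,
                        .maxlen       = sizeof(bool),   /* no longer sizeof(int) */
                        .mode         = 0644,
                        .proc_handler = proc_dobool,
                },
                { }
        };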
      
      Fixes: 83efeeeb ("tty: Allow TIOCSTI to be disabled")
      Fixes: a2071573 ("sysctl: introduce new proc handler proc_dobool")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  12. 30 Nov, 2022 2 commits
    • mm/hugetlb_vmemmap: remap head page to newly allocated page · 11aad263
      Joao Martins authored
      Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
      back to the page allocator is as follows: for a 2M hugetlb page it will
      reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages,
      and for a 1G hugetlb page it will remap the remaining 4095 vmemmap pages.
      Essentially, that means it breaks the first 4K of a potentially contiguous
      chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb
      pages).  For this reason the memory that is freed back to the page
      allocator cannot be used by hugetlb to allocate huge pages of the same
      size, but only of a smaller huge page size:
      
      Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
      having 64G):
      
      * Before allocation:
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      ...
      Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
      
      $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
       31987
      
      * After:
      
      Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
      
      Notice how the memory freed back is put into the 4K / 8K / 16K page
      pools, and how hugetlb allocates a total of only 31987 pages (63974M).
      
      To fix this behaviour, rather than remapping the second vmemmap page (thus
      breaking the contiguous block of memory backing the struct pages),
      repopulate the first vmemmap page with a new one.  We allocate and copy
      from the currently mapped vmemmap page, and then remap it later on.
      The same algorithm works if there's a pre-initialized walk::reuse_page
      and the head page doesn't need to be skipped; instead we remap it
      when the @addr being changed is the @reuse_addr.

      The new head page is allocated in vmemmap_remap_free() given that on
      restore there's no need for a functional change.  Note that, because right
      now one hugepage is remapped at a time, only one free 4K page at a
      time is needed to remap the head page.  Should it fail to allocate said
      new page, it reuses the one that's already mapped, just like before.  As a
      result, for every 64G of contiguous hugepages it can give back 1G more
      of contiguous memory per 64G, while needing in total 128M of new 4K pages
      (for 2M hugetlb) or 256K (for 1G hugetlb).
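
      A sketch of the repopulation step (illustrative; the real logic is split
      across vmemmap_remap_free() and the remap callbacks, and the function
      name below is made up):

        static struct page *copy_head_vmemmap_page(unsigned long reuse_addr, int nid)
        {
                struct page *new_head = alloc_pages_node(nid, GFP_KERNEL, 0);

                if (!new_head)
                        return NULL;    /* fall back to reusing the mapped page */

                /* Copy the currently mapped head vmemmap page into the new one. */
                memcpy(page_to_virt(new_head), (void *)reuse_addr, PAGE_SIZE);
                /* Make the copied contents visible before the PTE is rewritten. */
                smp_wmb();
                return new_head;
        }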
      
      After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
      guest, each node with 64G):
      
      * Before allocation
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      ...
      Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
      
      $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
      32394
      
      * After:
      
      Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0
      
      In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M
      more out of the 32394 (64788M) now allocated.  So the memory freed back is
      indeed being reused by hugetlb, and there is no massive accumulation of
      unused order-0..order-2 pages.
      
      [joao.m.martins@oracle.com: v3]
        Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
      [joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
        Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
      Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: remove redundant list_del() · 1cc53a04
      Muchun Song authored
      The ->lru field will be assigned a new value in __free_page(), so it is
      unnecessary to delete the page from the @list first.  Just remove the
      redundant list_del() to simplify the code.
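
      A paraphrase of the simplified routine (body approximated, not a verbatim
      copy of mm/hugetlb_vmemmap.c):

        static void free_vmemmap_page_list(struct list_head *list)
        {
                struct page *page, *next;

                /* No list_del(): __free_page() overwrites ->lru anyway and the
                 * local list is discarded after the loop. */
                list_for_each_entry_safe(page, next, list, lru)
                        __free_page(page);
        }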
      
      Link: https://lkml.kernel.org/r/20221027033641.66709-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. 08 Nov, 2022 1 commit
  14. 12 Sep, 2022 2 commits
  15. 09 Aug, 2022 6 commits
  16. 04 Jul, 2022 1 commit
  17. 27 Jun, 2022 1 commit
  18. 01 Jun, 2022 1 commit
  19. 13 May, 2022 1 commit
    • mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl · 78f39084
      Muchun Song authored
      We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
      reboot the server to enable or disable the feature of optimizing vmemmap
      pages associated with HugeTLB pages.  However, rebooting usually takes a
      long time.  So add a sysctl to enable or disable the feature at runtime
      without rebooting.  Why do we need this?  There are 3 use cases.
      
      1) The feature of minimizing overhead of struct page associated with
         each HugeTLB is disabled by default without passing
         "hugetlb_free_vmemmap=on" to the boot cmdline.  When we (ByteDance)
         deliver the servers to the users who want to enable this feature, they
         have to configure the grub (change boot cmdline) and reboot the
         servers, whereas rebooting usually takes a long time (we have thousands
         of servers).  It's a very bad experience for the users.  So we need an
         approach to enable this feature at runtime.  This is a use case in
         our practical environment.
      
      2) Some use cases are that HugeTLB pages are allocated 'on the fly'
         instead of being pulled from the HugeTLB pool, those workloads would be
         affected with this feature enabled.  Those workloads could be
         identified by the characteristic that they never explicitly allocate
         huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages'
         and then let the pages be allocated from the buddy allocator at fault
         time.  We can confirm it is a real use case from the commit
         099730d6.  For those workloads, the page fault time could be ~2x
         slower than before.  We suspect those users want to disable this
         feature if the system has enabled this before and they don't think the
         memory savings benefit is enough to make up for the performance drop.
      
      3) The workload which wants vmemmap pages to be optimized may be deployed
         on the same server as the workload which wants to set
         'nr_overcommit_hugepages' and does not want the extra overhead at fault
         time when the overcommitted pages are allocated from the buddy
         allocator.  The user could enable this feature and set 'nr_hugepages'
         and 'nr_overcommit_hugepages', then disable the feature.  In this case,
         the overcommitted HugeTLB pages will not encounter the extra overhead
         at fault time.
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>