1. 17 Sep, 2024 18 commits
    • mm/arm64: support large pfn mappings · 3e509c9b
      Peter Xu authored
      Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
      pmds/puds.  Provide the pmd/pud helpers to set/get special bit.
      
      There's one more thing missing for arm64 which is the pxx_pgprot() for
      pmd/pud.  Add them too, which is mostly the same as the pte version by
      dropping the pfn field.  These helpers are essential to be used in the new
      follow_pfnmap*() API to report valid pgprot_t results.
      
      Note that arm64 doesn't support huge PUDs yet, but it's still
      straightforward to provide the pud helpers that we need altogether.  Only
      the PMD helpers will be of immediate benefit until arm64 supports huge
      PUDs in general (e.g.  in THPs).
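
      As a rough sketch of the shape these helpers take (illustrative only, not
      the literal hunk), assuming bit 56 (PTE_SPECIAL) is reused at the pmd
      level and the pgprot is recovered by masking out the pfn bits; the pud
      variants follow the same pattern:

        #define pmd_special(pmd)	(!!(pmd_val(pmd) & PTE_SPECIAL))
        static inline pmd_t pmd_mkspecial(pmd_t pmd)
        {
        	return set_pmd_bit(pmd, __pgprot(PTE_SPECIAL));
        }

        /* Same trick as pte_pgprot(): drop the pfn field, keep the attributes. */
        #define pmd_pgprot pmd_pgprot
        static inline pgprot_t pmd_pgprot(pmd_t pmd)
        {
        	unsigned long pfn = pmd_pfn(pmd);

        	return __pgprot(pmd_val(pfn_pmd(pfn, __pgprot(0))) ^ pmd_val(pmd));
        }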
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e509c9b
    • mm/x86: support large pfn mappings · 75182022
      Peter Xu authored
      Helpers to install and detect special pmd/pud entries.  In short, bit 9 on
      x86 is not used for pmd/pud, so we can directly define them the same as
      the pte level.  One note is that bit 9 is also used by _PAGE_BIT_CPA_TEST,
      but that is only used in a debug test and shouldn't conflict in this case.
      
      One note is that pxx_set|clear_flags() for pmd/pud will need to be moved
      earlier so that they can be referenced by the new special bit helpers.
      There's no change in the code that was moved.
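
      A sketch of the idea (illustrative only), assuming _PAGE_SPECIAL (bit 9)
      is reused above the pte level in the same way:

        static inline int pmd_special(pmd_t pmd)
        {
        	return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
        }

        static inline pmd_t pmd_mkspecial(pmd_t pmd)
        {
        	return pmd_set_flags(pmd, _PAGE_SPECIAL);
        }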
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-18-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      75182022
    • mm: remove follow_pte() · b0a1c0d0
      Peter Xu authored
      follow_pte() users have been converted to follow_pfnmap*().  Remove the
      API.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-17-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b0a1c0d0
    • mm/access_process_vm: use the new follow_pfnmap API · b17269a5
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-16-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b17269a5
    • acrn: use the new follow_pfnmap API · e6bc784c
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-15-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6bc784c
    • vfio: use the new follow_pfnmap API · a77f9489
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a77f9489
    • mm/x86/pat: use the new follow_pfnmap API · cbea8536
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cbea8536
    • s390/pci_mmio: use follow_pfnmap API · bd8c2d18
      Peter Xu authored
      Use the new API that can understand huge pfn mappings.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd8c2d18
    • KVM: use follow_pfnmap API · 5731aacd
      Peter Xu authored
      Use the new pfnmap API to allow huge MMIO mappings for VMs.  The rest of
      the work is already handled on the other side (host_pfn_mapping_level()).
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5731aacd
    • mm: new follow_pfnmap API · 6da8e963
      Peter Xu authored
      Introduce a pair of APIs to follow pfn mappings and retrieve entry
      information.  It's very similar to what follow_pte() did before, but
      differs in that it recognizes huge pfn mappings.
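
      A minimal usage sketch of how a caller is expected to look, with the field
      names as described in this series (treat the details as illustrative):

        struct follow_pfnmap_args args = { .vma = vma, .address = addr };

        if (follow_pfnmap_start(&args))
        	return -EINVAL;			/* not a valid pfnmap entry */
        /* args.pfn, args.pgprot, args.writable, args.special are now valid. */
        do_something_with(args.pfn);		/* hypothetical caller-side work */
        follow_pfnmap_end(&args);		/* drops the pgtable lock taken above */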
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6da8e963
    • mm: always define pxx_pgprot() · 0515e022
      Peter Xu authored
      There're:
      
        - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
        support pte_pgprot().
      
        - 2 archs (x86, sparc) that support pmd_pgprot().
      
        - 1 arch (x86) that supports pud_pgprot().
      
      Always define them to be used in generic code, and then we don't need to
      fiddle with "#ifdef"s when doing so.
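
      A sketch of the generic fallbacks (illustrative; the exact form may
      differ), so callers no longer need the "#ifdef" dance:

        #ifndef pmd_pgprot
        #define pmd_pgprot(x)	((pgprot_t) {0})
        #endif

        #ifndef pud_pgprot
        #define pud_pgprot(x)	((pgprot_t) {0})
        #endif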
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0515e022
    • mm/fork: accept huge pfnmap entries · bc02afbd
      Peter Xu authored
      Teach the fork code to properly copy pfnmaps at the pmd/pud levels.  The
      pud case is much easier; the write bit does need to be persisted, though,
      for writable and shared pud mappings like PFNMAP ones, otherwise a
      follow-up write in either the parent or the child process will trigger a
      write fault.

      Do the same for the pmd level.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc02afbd
    • mm/pagewalk: check pfnmap for folio_walk_start() · 10d83d77
      Peter Xu authored
      Teach folio_walk_start() to recognize special pmd/pud mappings, and fail
      them properly as it means there's no folio backing them.
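
      Conceptually the check is as small as the following (a sketch; locking and
      cleanup details omitted):

        /* In the pmd branch of folio_walk_start(): special means no folio. */
        if (pmd_special(pmd))
        	return NULL;	/* no struct page / folio backs a special entry */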
      
      [peterx@redhat.com: remove some stale comments, per David]
        Link: https://lkml.kernel.org/r/20240829202237.2640288-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      10d83d77
    • mm/gup: detect huge pfnmap entries in gup-fast · ae3c99e6
      Peter Xu authored
      Since gup-fast doesn't have the vma reference, teach it to detect such huge
      pfnmaps by checking the special bit for pmd/pud too, just like ptes.
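
      A sketch of the check (illustrative), mirroring what the pte path already
      does:

        /* In the gup-fast pmd leaf path: special means pfnmap/MMIO, bail out. */
        if (unlikely(pmd_special(orig)))
        	return 0;	/* fall back to the slow path, which has the vma */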
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ae3c99e6
    • mm: allow THP orders for PFNMAPs · 5dd40721
      Peter Xu authored
      This enables PFNMAPs to be mapped at either pmd/pud layers.  Generalize the
      dax case into vma_is_special_huge() so as to cover both.  Meanwhile, rename
      the macro to THP_ORDERS_ALL_SPECIAL.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5dd40721
    • mm: mark special bits for huge pfn mappings when inject · 3c8e44c9
      Peter Xu authored
      We need these special bits to be around on pfnmaps.  Mark them properly
      for the !devmap case, reflecting that there's no page struct backing the
      entry.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3c8e44c9
    • mm: drop is_huge_zero_pud() · ef713ec3
      Peter Xu authored
      It has constantly returned false since 2017.  One assertion was added in
      2019 but it should never have triggered; IOW, what is checked should be
      asserted instead.

      If the case hasn't existed for 7 years, it's a good idea to remove the
      helper and only add it back when it is actually needed.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef713ec3
    • mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud · 6857be5f
      Peter Xu authored
      Patch series "mm: Support huge pfnmaps", v2.
      
      Overview
      ========
      
      This series implements huge pfnmaps support for mm in general.  Huge
      pfnmap allows e.g.  VM_PFNMAP vmas to map at either the PMD or PUD level,
      similar to what we do with dax / thp / hugetlb so far to benefit from TLB
      hits.  Now we extend that idea to PFN mappings, e.g.  PCI MMIO bars, which
      can grow as large as 8GB or even bigger.

      Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
      patch (from Alex Williamson) will be the first user of huge pfnmap, so as
      to enable the vfio-pci driver to fault in huge pfn mappings.
      
      Implementation
      ==============
      
      In reality, it's relatively simple to add such support compared to many
      other types of mappings, because of PFNMAP's specialty of having no
      vmemmap backing it, so most of the kernel routines on huge mappings should
      simply already fail for them, like GUPs or the old-school follow_page()
      (which was recently rewritten into the folio_walk* APIs by David).

      One trick here is that we're still immature on PUDs in generic paths here
      and there, as DAX is so far the only user.  This patchset will add the 2nd
      user of it.  Hugetlb can be a 3rd user if the hugetlb unification work
      goes on smoothly, but that is to be discussed later.
      
      The other trick is how to allow gup-fast to work for such huge mappings
      even if there's no direct way of knowing whether it's a normal page or an
      MMIO mapping.  This series chose to keep the pte_special solution: it
      reuses the same idea of setting a special bit on pfnmap PMDs/PUDs so that
      gup-fast will be able to identify them and fail properly.
      
      Along the way, we'll also notice that the major pgtable pfn walker, aka
      follow_pte(), will need to retire soon because it only works with ptes.  A
      new set of simple APIs is introduced (the follow_pfnmap* API) that can do
      whatever follow_pte() already does, plus process huge pfnmaps.  Half of
      this series is about that, and about converting all existing pfnmap
      walkers to use the new API properly.  Hopefully the new API also looks
      better by avoiding the exposure of e.g.  pgtable lock details to the
      callers, so that it can be used in an even more straightforward way.
      
      Here, three more options will be introduced and involved in huge pfnmap:
      
        - ARCH_SUPPORTS_HUGE_PFNMAP
      
          Arch developers will need to select this option in the arch's Kconfig
          when huge pfnmap is supported.  After this patchset is applied, both
          x86_64 and arm64 will enable it by default.
      
        - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
      
          These options are for driver developers to identify whether the
          current arch / config supports huge pfnmaps, and to decide whether the
          driver can use the huge pfnmap APIs to inject them (a sketch follows
          this list).  One can refer to the last vfio-pci patch from Alex for
          how to use them properly in a device driver.
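
      For instance, a driver's huge_fault handler can gate its choice of mapping
      order on these options along the lines of the sketch below (the my_*
      helpers are hypothetical, and PMD_ORDER/PUD_ORDER stand for the respective
      mapping orders; the real usage is in the vfio-pci patch):

        static vm_fault_t my_mmap_huge_fault(struct vm_fault *vmf, unsigned int order)
        {
        	switch (order) {
        	case 0:
        		return my_insert_pte(vmf);	/* hypothetical pte path */
        #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
        	case PMD_ORDER:
        		return my_insert_pmd(vmf);	/* hypothetical pmd path */
        #endif
        #ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
        	case PUD_ORDER:
        		return my_insert_pud(vmf);	/* hypothetical pud path */
        #endif
        	default:
        		return VM_FAULT_FALLBACK;
        	}
        }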
      
      So after the whole set is applied, and if one enables some dynamic debug
      lines in the vfio-pci core files, we should observe things like:
      
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
      
      In this specific case, it says that vfio-pci faults in PMDs properly for a
      few BAR0 offsets.
      
      Patch Layout
      ============
      
      Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
      Patch 2:         A tiny cleanup
      Patch 3-8:       Preparation patches for huge pfnmap (include introduce
                       special bit for pmd/pud)
      Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                       then drop follow_pte() API
      Patch 17:        Add huge pfnmap support for x86_64
      Patch 18:        Add huge pfnmap support for arm64
      Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
      
      TODO
      ====
      
      More architectures / More page sizes
      ------------------------------------
      
      Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
      to be a plan to support 1G on arm64 later on top of this series [2].

      Any arch will need to first support THP / THP_1G, then provide a special
      bit in pmds/puds to support huge pfnmaps.
      
      remap_pfn_range() support
      -------------------------
      
      Currently, remap_pfn_range() still only maps PTEs.  With the new options,
      remap_pfn_range() can logically start to inject either PMDs or PUDs when
      the alignment requirements match on the VAs.

      When the support is there, it should be able to silently benefit all
      drivers that use remap_pfn_range() in their mmap() handlers, with a better
      TLB hit rate and overall faster MMIO accesses, similar to what the
      processor gains from hugepages.
      
      More driver support
      -------------------
      
      VFIO is so far the only consumer of the huge pfnmaps after this series is
      applied.  Besides the generic remap_pfn_range() optimization above, a
      device driver can also try to optimize its mmap() for better VA alignment
      at either PMD/PUD sizes.  This may, IIUC, normally require userspace
      changes, as the driver doesn't normally decide the VA at which to map a
      BAR.  But I don't claim to know all the drivers well enough to see the
      full picture.
      
      Credits all go to Alex for help testing the GPU/NIC use cases above.
      
      [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
      [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
      [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
      
      
      This patch (of 19):
      
      This patch introduces the option to bring the special pte bit into
      pmds/puds.  Archs can start to define pmd_special / pud_special when
      supported, by selecting the new option.  Per-arch support will be added
      later.

      Before that, create fallbacks for these helpers so that they are always
      available.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6857be5f
  2. 09 Sep, 2024 22 commits
    • mm/codetag: add pgalloc_tag_copy() · e0a955bf
      Yu Zhao authored
      Add pgalloc_tag_copy() to transfer the codetag from the old folio to the
      new one during migration.  This makes the original allocation sites
      persist across migration rather than being lumped into the get_new_folio
      callbacks passed into migrate_pages(), e.g., compaction_alloc():
      
        # echo 1 >/proc/sys/vm/compact_memory
        # grep compaction_alloc /proc/allocinfo
      
      Before this patch:
        132968448  32463  mm/compaction.c:1880 func:compaction_alloc
      
      After this patch:
                0      0  mm/compaction.c:1880 func:compaction_alloc
      
      Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com
      Fixes: dcfe378c ("lib: introduce support for page allocation tagging")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e0a955bf
    • mm/codetag: fix pgalloc_tag_split() · 95599ef6
      Yu Zhao authored
      The current assumption is that a large folio can only be split into
      order-0 folios.  That is not the case for hugeTLB demotion, nor for THP
      split: see commit c010d47f ("mm: thp: split huge page to any lower
      order pages").
      
      When a large folio is split into ones of a lower non-zero order, only the
      new head pages should be tagged.  Tagging tail pages can cause imbalanced
      "calls" counters, since only head pages are untagged by pgalloc_tag_sub()
      and the "calls" counts on tail pages are leaked, e.g.,
      
        # echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
        # echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
        # echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
        # grep alloc_gigantic_folio /proc/allocinfo
      
      Before this patch:
        0  549427200  mm/hugetlb.c:1549 func:alloc_gigantic_folio
      
        real  0m2.057s
        user  0m0.000s
        sys   0m2.051s
      
      After this patch:
        0          0  mm/hugetlb.c:1549 func:alloc_gigantic_folio
      
        real  0m1.711s
        user  0m0.000s
        sys   0m1.704s
      
      Not tagging tail pages also improves the splitting time, e.g., by about
      15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above.
      
      Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com
      Fixes: be25d1d4 ("mm: create new codetag references during page splitting")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      95599ef6
    • mm/codetag: fix a typo · eebc0f48
      Yu Zhao authored
      Link: https://lkml.kernel.org/r/20240906042108.1150526-1-yuzhao@google.com
      Fixes: 22d407b1 ("lib: add allocation tagging support for memory allocation profiling")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eebc0f48
    • mm/vmalloc.c: use "high-order" in description non 0-order pages · 6004fe00
      Uladzislau Rezki (Sony) authored
      In many places in the comments, we use both "higher-order" and
      "high-order" to describe non-0-order pages.  That is confusing, because a
      "higher-order" statement does not say what it is being compared with.
      
      Link: https://lkml.kernel.org/r/20240906095049.3486-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Suggested-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6004fe00
    • mm/vmalloc.c: use helper function va_size() · b44f71e3
      ZhangPeng authored
      Use helper function va_size() to improve code readability. No functional
      modification involved.
      
      Link: https://lkml.kernel.org/r/20240906102539.3537207-1-zhangpeng362@huawei.com
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b44f71e3
    • mm: replace xa_get_order with xas_get_order where appropriate · 354a595a
      Shakeel Butt authored
      The tracing of invalidation and truncation operations on large files
      showed that xa_get_order() is among the top functions where the kernel
      spends a lot of CPU time.  xa_get_order() needs to traverse the tree to
      reach the right node for a given index and then extract the order of the
      entry.  However, in many places it is being called from within an already
      ongoing tree traversal, where there is no need to do another traversal.
      Just use xas_get_order() at those places.
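
      As a sketch of the pattern (illustrative; mapping and index are
      placeholders, and the caller is assumed to already be inside an xas-based
      walk of the mapping):

        XA_STATE(xas, &mapping->i_pages, index);
        void *entry;
        int order;

        rcu_read_lock();
        xas_for_each(&xas, entry, ULONG_MAX) {
        	/* Already positioned on the right node: no second lookup. */
        	order = xas_get_order(&xas);
        	/* ...instead of xa_get_order(&mapping->i_pages, xas.xa_index) */
        }
        rcu_read_unlock();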
      
      Link: https://lkml.kernel.org/r/20240906230512.124643-1-shakeel.butt@linux.dev
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      354a595a
    • maple_tree: mark three functions as __maybe_unused · 1930c6ad
      Liam R. Howlett authored
      People keep trying to remove three functions that are going to be used in
      a feature that is being developed.  Dropping the functions entirely may
      end up with people trying to use the bit for other purposes, as people
      have tried in the past.

      Adding __maybe_unused stops compilers from complaining about the unused
      functions, so they can be silently optimised out of the compiled code and
      people won't try to claim the bit for another use.
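
      For reference, the annotation is just a function attribute, e.g.  (a
      hypothetical helper, not one of the three functions in question):

        /* Kept for an upcoming feature; silences -Wunused-function until then. */
        static void __maybe_unused mt_future_helper(void)
        {
        	/* intentionally unreferenced for now */
        }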
      
      Link: https://lore.kernel.org/all/20230726080916.17454-2-zhangpeng.00@bytedance.com/
      Link: https://lore.kernel.org/all/202408310728.S7EE59BN-lkp@intel.com/
      Link: https://lkml.kernel.org/r/20240907021506.4018676-1-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1930c6ad
    • mm: clean up mem_cgroup_iter() · aa50b501
      Kinsey Ho authored
      A clean up to make variable names more clear and to improve code
      readability.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/20240905003058.1859929-6-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Reviewed-by: T.J. Mercier <tjmercier@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa50b501
    • mm: restart if multiple traversals raced · ec0db74b
      Kinsey Ho authored
      Currently, if multiple reclaimers race on the same position, the
      reclaimers which detect the race will still reclaim from the same memcg.
      Instead, the reclaimers which detect the race should move on to the next
      memcg in the hierarchy.
      
      So, in the case where multiple traversals race, jump back to the start of
      the mem_cgroup_iter() function to find the next memcg in the hierarchy to
      reclaim from.
      
      Link: https://lkml.kernel.org/r/20240905003058.1859929-5-kinseyho@google.com
      Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Reviewed-by: T.J. Mercier <tjmercier@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec0db74b
    • mm: increment gen # before restarting traversal · 3d150e31
      Kinsey Ho authored
      The generation number in struct mem_cgroup_reclaim_iter should be
      incremented on every round-trip.  Currently, it is possible for a
      concurrent reclaimer to jump in at the end of the hierarchy, causing a
      traversal restart (resetting the iteration position) without incrementing
      the generation number.
      
      By resetting the position without incrementing the generation, it's
      possible for another ongoing mem_cgroup_iter() thread to walk the tree
      twice.
      
      Move the traversal restart such that the generation number is
      incremented before the restart.
      
      Link: https://lkml.kernel.org/r/20240905003058.1859929-4-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Reviewed-by: T.J. Mercier <tjmercier@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3d150e31
    • mm: don't hold css->refcnt during traversal · 4a2698b0
      Kinsey Ho authored
      To obtain the pointer to the next memcg position, mem_cgroup_iter()
      currently holds css->refcnt during memcg traversal only to put css->refcnt
      at the end of the routine.  This isn't necessary as an rcu_read_lock is
      already held throughout the function.  The use of the RCU read lock with
      css_next_descendant_pre() guarantees that sibling linkage is safe without
      holding a ref on the passed-in @css.
      
      Remove css->refcnt usage during traversal by leveraging RCU.
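
      A sketch of the resulting pattern (simplified; pos_css and root_memcg are
      placeholder names):

        rcu_read_lock();
        /*
         * Sibling/descendant linkage is stable under RCU, so stepping to the
         * next css does not require holding a reference on the current one.
         */
        next_css = css_next_descendant_pre(pos_css, &root_memcg->css);
        rcu_read_unlock();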
      
      Link: https://lkml.kernel.org/r/20240905003058.1859929-3-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Reviewed-by: T.J. Mercier <tjmercier@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4a2698b0
    • cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU · 0e40cf2a
      Kinsey Ho authored
      Patch series "Improve mem_cgroup_iter()", v4.
      
      Incremental cgroup iteration is being used again [1]. This patchset
      improves the reliability of mem_cgroup_iter(). It also improves
      simplicity and code readability.
      
      [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
      
      
      This patch (of 5):
      
      Explicitly document that css sibling/descendant linkage is protected by
      cgroup_mutex or RCU.  Also, document in css_next_descendant_pre() and
      similar functions that it isn't necessary to hold a ref on @pos.
      
      The following changes in this patchset rely on this clarification for
      simplification in memcg iteration code.
      
      Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
      Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
      Suggested-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: T.J. Mercier <tjmercier@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0e40cf2a
    • mm/page_alloc: fix build with CONFIG_UNACCEPTED_MEMORY=n · 6e94da94
      Andrew Morton authored
      When has_unaccepted_memory() is unused, it prevents kernel builds
      with clang, `make W=1` and CONFIG_WERROR=y:
      
      mm/page_alloc.c:7036:20: error: unused function 'has_unaccepted_memory' [-Werror,-Wunused-function]
      7036 | static inline bool has_unaccepted_memory(void)
      |                    ^~~~~~~~~~~~~~~~~~~~~
      
      Fix it by removing the CONFIG_UNACCEPTED_MEMORY=n stub.
      
      Link: https://lkml.kernel.org/r/20240905142220.49d93337a0abce5690e515d9@linux-foundation.org
      Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Closes: https://lkml.kernel.org/r/20240905171553.275054-1-andriy.shevchenko@linux.intel.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6e94da94
    • mm: migrate: remove unused includes · cfc81938
      Kefeng Wang authored
      random.h is not needed since commit 6c542ab7 ("mm/demotion: build
      demotion targets based on explicit memory tiers"), all functions moved
      into memory-tiers.
      
      nsproxy.h is not needed since commit 228ebcbe ("Uninline
      find_task_by_xxx set of functions"), no nsproxy, we only call
      find_task_by_vpid() now.
      
      hugetlb_cgroup.h is not needed since commit ab5ac90a ("mm, hugetlb: do
      not rely on overcommit limit during migration"), move_hugetlb_state() is
      called and it belongs to hugetlb.h, which is already included.
      
      balloon_compaction.h: we have the more general movable_operations for
      non-lru movable page migration now, so it can be dropped.

      memremap.h, userfaultfd_k.h and oom.h were introduced for zone device page
      migration, but all the functions have moved into migrate_device.c, so they
      are not needed anymore either.
      
      Link: https://lkml.kernel.org/r/20240905152432.626877-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cfc81938
    • mm: thp: simplify split_huge_pages_pid() · e4bfc678
      Nanyong Sun authored
      The helper find_get_task_by_vpid() can totally replace the task_struct
      lookup logic in split_huge_pages_pid(), so use it to simplify the code.
      Also delete the needless comments, as the helper function name already
      explains what it's doing.
      
      Link: https://lkml.kernel.org/r/20240905153028.1205128-1-sunnanyong@huawei.com
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e4bfc678
    • mm: migrate: simplify find_mm_struct() · 46dcc7c9
      Nanyong Sun authored
      Use find_get_task_by_vpid() to replace the task_struct lookup logic in
      find_mm_struct().  Note that this patch moves the ptrace_may_access() call
      out of the rcu_read_lock() scope; this is OK because it does not actually
      need it: find_get_task_by_vpid() already gets the pid and task safely, and
      ptrace_may_access() can then use the task safely, similar to what
      sched_core_share_pid() does.
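
      The before/after shape is roughly the following (illustrative, not the
      literal diff):

        /* Before: open-coded lookup under RCU. */
        rcu_read_lock();
        task = find_task_by_vpid(pid);
        if (task)
        	get_task_struct(task);
        rcu_read_unlock();

        /* After: the helper returns the task with a reference already held. */
        task = find_get_task_by_vpid(pid);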
      
      Link: https://lkml.kernel.org/r/20240905153118.1205173-1-sunnanyong@huawei.com
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      46dcc7c9
    • mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero · 25e8acbc
      SeongJae Park authored
      The aggregation interval of the test-purpose damon_attrs for
      damon_test_nr_accesses_to_accesses_bp() becomes zero on 32-bit
      architectures, since the sizes of the int and long types are the same.  As
      a result, the damon_nr_accesses_to_accesses_bp() call with the test data
      triggers a divide-by-zero exception.  damon_nr_accesses_to_accesses_bp()
      shouldn't be called with such data, and the non-test code avoids that by
      checking for the case in damon_update_monitoring_results().  Skip the test
      code in that case, and add an explicit caution about the case to the
      comment for the test target function.
      
      Link: https://lkml.kernel.org/r/20240905162423.74053-1-sj@kernel.org
      Fixes: 5e06ad59 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Closes: https://lore.kernel.org/c771b962-a58f-435b-89e4-1211a9323181@roeck-us.net
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      25e8acbc
    • uprobes: use vm_special_mapping close() functionality · 08e28de1
      Sven Schnelle authored
      The following KASAN splat was shown:
      
      [   44.505448] ==================================================================                                                                      20:37:27 [3421/145075]
      [   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
      [   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
      [   44.505479]
      [   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
      [   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
      [   44.505508] Call Trace:
      [   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
      [   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
      [   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
      [   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
      [   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
      [   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
      [   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
      [   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
      [   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
      [   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
      [   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
      [   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
      [   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
      [   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
      [   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
      [   44.505614]
      [   44.505616] Allocated by task 1384:
      [   44.505621]  kasan_save_stack+0x40/0x70
      [   44.505630]  kasan_save_track+0x28/0x40
      [   44.505636]  __kasan_kmalloc+0xa0/0xc0
      [   44.505642]  __create_xol_area+0xfa/0x410
      [   44.505648]  get_xol_area+0xb0/0xf0
      [   44.505652]  uprobe_notify_resume+0x27a/0x470
      [   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
      [   44.505664]  pgm_check_handler+0x122/0x170
      [   44.505670]
      [   44.505672] Freed by task 1384:
      [   44.505676]  kasan_save_stack+0x40/0x70
      [   44.505682]  kasan_save_track+0x28/0x40
      [   44.505687]  kasan_save_free_info+0x4a/0x70
      [   44.505693]  __kasan_slab_free+0x5a/0x70
      [   44.505698]  kfree+0xe8/0x3f0
      [   44.505704]  __mmput+0x20/0x370
      [   44.505709]  exit_mm+0x240/0x340
      [   44.505713]  do_exit+0x548/0xd70
      [   44.505718]  do_group_exit+0x132/0x390
      [   44.505722]  __s390x_sys_exit_group+0x56/0x60
      [   44.505727]  do_syscall+0x2f6/0x430
      [   44.505732]  __do_syscall+0xa4/0x170
      [   44.505738]  system_call+0x74/0x98
      
      The problem is that uprobe_clear_state() kfrees struct xol_area, which
      contains struct vm_special_mapping *xol_mapping.  This mapping is passed
      to _install_special_mapping() in xol_add_vma().
      __mmput() reads:
      
      static inline void __mmput(struct mm_struct *mm)
      {
              VM_BUG_ON(atomic_read(&mm->mm_users));
      
              uprobe_clear_state(mm);
              exit_aio(mm);
              ksm_exit(mm);
              khugepaged_exit(mm); /* must run before exit_mmap */
              exit_mmap(mm);
              ...
      }
      
      So uprobe_clear_state(), called at the beginning, frees the memory area
      containing the vm_special_mapping data, but exit_mmap() later uses this
      address via vma->vm_private_data (which was set in
      _install_special_mapping()).
      
      Fix this by moving uprobe_clear_state() to uprobes.c and using it as the
      close() callback.
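
      A minimal sketch of the resulting shape, assuming a small adapter so the
      mm-based uprobe_clear_state() can serve as the vma-based close() callback
      (the adapter name below is hypothetical and the flags are illustrative):

      /* Hypothetical adapter: close() receives the vma, not the mm. */
      static void uprobe_xol_close(const struct vm_special_mapping *sm,
                                   struct vm_area_struct *vma)
      {
              uprobe_clear_state(vma->vm_mm);   /* frees the xol_area here ... */
      }

      /*
       * ... instead of from __mmput(), which freed it before exit_mmap() ran.
       * Wiring the callback into the special mapping ties the lifetime of the
       * xol_area to the vma that points at it via vm_private_data:
       */
      area->xol_mapping.name = "[uprobes]";
      area->xol_mapping.close = uprobe_xol_close;
      vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
                                     VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
                                     &area->xol_mapping);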
      
      [usama.anjum@collabora.com: remove unneeded condition]
        Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com
      Fixes: 223febc6 ("mm: add optional close() to struct vm_special_mapping")
      Signed-off-by: default avatarSven Schnelle <svens@linux.ibm.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      08e28de1
    • Yosry Ahmed's avatar
      mm: page_alloc: fix missed updates of PGFREE in free_unref_{page/folios} · ec867977
      Yosry Ahmed authored
      PGFREE is currently updated in two code paths:
      
      - __free_pages_ok(): for pages freed to the buddy allocator.
      - free_unref_page_commit(): for pages freed to the pcplists.
      
      Before commit df1acc85 ("mm/page_alloc: avoid conflating IRQs disabled
      with zone->lock"), free_unref_page_commit() used to fallback to freeing
      isolated pages directly to the buddy allocator through free_one_page(). 
      This was done _after_ updating PGFREE, so the counter was correctly
      updated.
      
      However, that commit moved the fallback logic to its callers (now called
      free_unref_page() and free_unref_folios()), so PGFREE was no longer
      updated in this fallback case.
      
      Now that the code has evolved, there are more cases in free_unref_page()
      and free_unref_folios() where we fall back to calling free_one_page()
      (e.g. when !pcp_allowed_order() or pcp_spin_trylock() fails).  These cases
      also miss
      updating PGFREE.
      
      To make sure PGFREE is updated in all cases where pages are freed to the
      buddy allocator, move the update down the stack to free_one_page().
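
      A sketch of where the counter update ends up (the signature and locking
      shown reflect the usual shape of free_one_page(); the buddy-freeing body
      itself is elided, so treat the details as illustrative):

      static void free_one_page(struct zone *zone, struct page *page,
                                unsigned long pfn, unsigned int order,
                                fpi_t fpi_flags)
      {
              unsigned long flags;

              spin_lock_irqsave(&zone->lock, flags);
              /* ... hand the page back to the buddy allocator ... */
              spin_unlock_irqrestore(&zone->lock, flags);

              /*
               * Count PGFREE at the single point where pages reach the buddy
               * allocator: this covers __free_pages_ok() as well as the
               * fallback paths in free_unref_page()/free_unref_folios().
               */
              __count_vm_events(PGFREE, 1 << order);
      }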
      
      This was noticed through code inspection, although it should be noticeable
      at runtime (at least with some workloads).
      
      Link: https://lkml.kernel.org/r/20240904205419.821776-1-yosryahmed@google.com
      Fixes: df1acc85 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
      Signed-off-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Cc: Brendan Jackman <jackmanb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ec867977
    • Mark Brown's avatar
      mm: care about shadow stack guard gap when getting an unmapped area · df7e1286
      Mark Brown authored
      As covered in the commit log for c44357c2 ("x86/mm: care about shadow
      stack guard gap during placement") our current mmap() implementation does
      not take care to ensure that a new mapping isn't placed with existing
      mappings inside its own guard gaps.  This is particularly important for
      shadow stacks since if two shadow stacks end up getting placed adjacent to
      each other then they can overflow into each other which weakens the
      protection offered by the feature.
      
      On x86 there is a custom arch_get_unmapped_area() which was updated by the
      above commit to cover this case by specifying a start_gap for allocations
      with VM_SHADOW_STACK.  Both arm64 and RISC-V have equivalent features and
      use the generic implementation of arch_get_unmapped_area() so let's make
      the equivalent change there so they also don't get shadow stack pages
      placed without guard pages.  x86 uses a single-page guard; this is also
      sufficient for arm64, where we either do single-word pops and pushes or
      unconstrained writes.
      
      Architectures which do not have this feature will define VM_SHADOW_STACK
      to VM_NONE and hence be unaffected.
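
      A sketch of the generic-code side, assuming the start_gap member added to
      struct vm_unmapped_area_info by c44357c2 and a helper mirroring the x86
      stack_guard_placement(); names and elisions here are illustrative:

      /* Keep one guard page in front of a shadow stack mapping. */
      static inline unsigned long stack_guard_placement(vm_flags_t vm_flags)
      {
              if (vm_flags & VM_SHADOW_STACK)
                      return PAGE_SIZE;
              return 0;
      }

      unsigned long
      generic_get_unmapped_area(struct file *filp, unsigned long addr,
                                unsigned long len, unsigned long pgoff,
                                unsigned long flags, vm_flags_t vm_flags)
      {
              struct vm_unmapped_area_info info = {};

              /* ... MAP_FIXED and hint-address handling elided ... */

              info.length = len;
              info.low_limit = current->mm->mmap_base;
              info.high_limit = TASK_SIZE;    /* simplified; the real code uses the arch mmap end */
              info.start_gap = stack_guard_placement(vm_flags);
              return vm_unmapped_area(&info);
      }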
      
      Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-3-a46b8b6dc0ed@kernel.org
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Suggested-by: default avatarRick Edgecombe <rick.p.edgecombe@intel.com>
      Acked-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N Rao <naveen@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      df7e1286
    • Mark Brown's avatar
      mm: pass vm_flags to generic_get_unmapped_area() · 540e00a7
      Mark Brown authored
      In preparation for using vm_flags to ensure guard pages for shadow
      stacks, supply them as an argument to generic_get_unmapped_area().  The
      only user outside of the core code is the PowerPC book3s64
      implementation, which trivially wraps the generic implementation in the
      radix_enabled() case.
      
      No functional changes.
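
      A sketch of the plumbing this amounts to in the default arch hooks (the
      PowerPC book3s64 wrapper forwards the argument the same way in its
      radix_enabled() branch); the exact file placement is an assumption:

      /* mm/mmap.c (sketch): the default arch hooks just pass vm_flags along. */
      unsigned long
      arch_get_unmapped_area(struct file *filp, unsigned long addr,
                             unsigned long len, unsigned long pgoff,
                             unsigned long flags, vm_flags_t vm_flags)
      {
              return generic_get_unmapped_area(filp, addr, len, pgoff, flags,
                                               vm_flags);
      }

      unsigned long
      arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
                                     unsigned long len, unsigned long pgoff,
                                     unsigned long flags, vm_flags_t vm_flags)
      {
              return generic_get_unmapped_area_topdown(filp, addr, len, pgoff,
                                                       flags, vm_flags);
      }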
      
      Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Acked-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Naveen N Rao <naveen@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      540e00a7
    • Mark Brown's avatar
      mm: make arch_get_unmapped_area() take vm_flags by default · 25d4054c
      Mark Brown authored
      Patch series "mm: Care about shadow stack guard gap when getting an
      unmapped area", v2.
      
      As covered in the commit log for c44357c2 ("x86/mm: care about shadow
      stack guard gap during placement") our current mmap() implementation does
      not take care to ensure that a new mapping isn't placed with existing
      mappings inside its own guard gaps.  This is particularly important for
      shadow stacks since if two shadow stacks end up getting placed adjacent to
      each other then they can overflow into each other which weakens the
      protection offered by the feature.
      
      On x86 there is a custom arch_get_unmapped_area() which was updated by the
      above commit to cover this case by specifying a start_gap for allocations
      with VM_SHADOW_STACK.  Both arm64 and RISC-V have equivalent features and
      use the generic implementation of arch_get_unmapped_area() so let's make
      the equivalent change there so they also don't get shadow stack pages
      placed without guard pages.  The arm64 and RISC-V shadow stack
      implementations are currently on the list:
      
         https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
         https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
      
      Given the addition of the use of vm_flags in the generic implementation,
      we also simplify the set of possibilities the core code has to deal with
      by making arch_get_unmapped_area() take vm_flags as standard.  This is a
      bit invasive since the prototype change touches quite a few
      architectures, but since the parameter is ignored the change is
      straightforward, and the simplification of the generic code seems worth it.
      
      
      This patch (of 3):
      
      When we introduced arch_get_unmapped_area_vmflags() in 96114870 ("mm:
      introduce arch_get_unmapped_area_vmflags()") we did so as part of properly
      supporting guard pages for shadow stacks on x86_64, which uses a custom
      arch_get_unmapped_area().  Equivalent features are also present on both
      arm64 and RISC-V, both of which use the generic implementation of
      arch_get_unmapped_area() and will require equivalent modification there. 
      Rather than continue to deal with having two versions of the functions,
      let's bite the bullet and have all implementations of
      arch_get_unmapped_area() take vm_flags as a parameter.
      
      The new parameter is currently ignored by all implementations other than
      x86.  The only caller that doesn't have vm_flags available is
      mm_get_unmapped_area(); for both the x86 implementation and the wrapper
      used on other architectures, it is modified to supply no flags.
      
      No functional changes.
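
      A sketch of the caller side described above, under the assumption that
      mm_get_unmapped_area() keeps its MMF_TOPDOWN-based dispatch; the literal
      0 stands in for "no vm_flags available":

      unsigned long
      mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
                           unsigned long addr, unsigned long len,
                           unsigned long pgoff, unsigned long flags)
      {
              if (test_bit(MMF_TOPDOWN, &mm->flags))
                      return arch_get_unmapped_area_topdown(file, addr, len,
                                                            pgoff, flags,
                                                            0 /* no vm_flags */);
              return arch_get_unmapped_area(file, addr, len, pgoff, flags,
                                            0 /* no vm_flags */);
      }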
      
      Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org
      Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Acked-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Helge Deller <deller@gmx.de>	[parisc]
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N Rao <naveen@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      25d4054c