    perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE · 8d97e718
    Kan Liang authored
    
    
    Current perf can report both virtual addresses and physical addresses,
    but not the MMU page size. Without the MMU page size information of the
    utilized page, users cannot decide whether to promote/demote large pages
    to optimize memory usage.
    
    Add a new sample type for the data MMU page size.
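
    From user space, requesting the new sample type amounts to setting one
    more bit in perf_event_attr.sample_type. The sketch below mirrors two
    bit values from the uapi header under local SKETCH_ names so that it
    builds without recent kernel headers. Those names, struct sketch_attr
    and both helpers are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local mirrors of uapi sample_type bits. The SKETCH_ names are
 * hypothetical; real code would include <linux/perf_event.h> and use
 * PERF_SAMPLE_ADDR and PERF_SAMPLE_DATA_PAGE_SIZE directly.
 */
#define SKETCH_SAMPLE_ADDR           (1ULL << 3)
#define SKETCH_SAMPLE_DATA_PAGE_SIZE (1ULL << 22)

/* Hypothetical stand-in for the one field of struct perf_event_attr used here. */
struct sketch_attr {
	uint64_t sample_type;
};

/* Ask for the data address together with its MMU page size. */
static void request_data_page_size(struct sketch_attr *attr)
{
	assert(attr != NULL);
	attr->sample_type |= SKETCH_SAMPLE_ADDR | SKETCH_SAMPLE_DATA_PAGE_SIZE;
}

/* True if a sample_type mask requests the data MMU page size. */
static bool wants_data_page_size(uint64_t sample_type)
{
	return (sample_type & SKETCH_SAMPLE_DATA_PAGE_SIZE) != 0;
}
```

    Requesting the address in the same mask is a choice made in this
    sketch, since the page size is derived from the sampled data address.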
    
    Current perf already has a facility to collect data virtual addresses.
    A page walker is required to walk the page tables and calculate the
    MMU page size from a given virtual address.
    
    On some platforms, e.g., X86, the page walker is invoked in an NMI
    handler, so it must be NMI-safe and have low overhead. It should also
    work for both user and kernel virtual addresses. The existing generic
    page walkers, e.g., walk_page_range_novma(), are rather complex and
    are not guaranteed to be NMI-safe, and follow_page() only handles
    user virtual addresses.
    
    Add a new function perf_get_page_size() to walk the page tables and
    calculate the MMU page size. In the function:
    - Interrupts have to be disabled to prevent any teardown of the page
      tables.
    - For user space threads, current->mm is used for the page walker.
      For kernel threads and the like, current->mm is NULL, so init_mm
      is used instead. active_mm is not used here, because it can be
      NULL.
      Quote from Peter Zijlstra,
      "context_switch() can set prev->active_mm to NULL when it transfers it
       to @next. It does this before @current is updated. So an NMI that
       comes in between this active_mm swizzling and updating @current will
       see !active_mm."
    - The MMU page size is calculated from the page table level.
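
    The mm selection and the level-to-size step in the list above can be
    modelled in user space as follows. This is only a sketch and not the
    kernel's perf_get_page_size(): all struct, enum and function names are
    hypothetical, and the shifts are the x86-64 4-level paging values
    (4 KiB, 2 MiB, 1 GiB). Other architectures use different values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct model_mm {
	int id;
};

struct model_task {
	struct model_mm *mm;	/* NULL for kernel threads */
};

/* Models init_mm, the kernel's own address space. */
static struct model_mm model_init_mm = { 0 };

/* User threads walk their own mm; tasks without one fall back to init_mm. */
static struct model_mm *pick_mm(const struct model_task *task)
{
	assert(task != NULL);
	return task->mm ? task->mm : &model_init_mm;
}

/* Page table level at which the walk found a leaf entry. */
enum leaf_level { LEAF_NONE, LEAF_PTE, LEAF_PMD, LEAF_PUD };

/* Map the leaf level to a page size in bytes; 0 means the walk failed. */
static uint64_t leaf_level_to_size(enum leaf_level level)
{
	switch (level) {
	case LEAF_PTE:
		return 1ULL << 12;	/* 4 KiB */
	case LEAF_PMD:
		return 1ULL << 21;	/* 2 MiB */
	case LEAF_PUD:
		return 1ULL << 30;	/* 1 GiB */
	default:
		return 0;
	}
}
```

    Returning 0 for a failed walk matches the page-size '0' case that this
    message describes as non-fatal.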
    
    The method should work for all architectures, but it has only been
    verified on X86. If it turns out not to work on some architecture
    that supports perf, that can be fixed separately later. Reporting a
    wrong page size would not be fatal for the architecture.
    
    Some features under discussion may impact the method in the future.
    Quote from Dave Hansen,
      "There are lots of weird things folks are trying to do with the page
       tables, like Address Space Isolation.  For instance, if you get a
       perf NMI when running userspace, current->mm->pgd is *different* than
       the PGD that was in use when userspace was running. It's close enough
       today, but it might not stay that way."
    If that happens, many consecutive page walk errors will occur. In
    the worst case, many samples report a page size of '0', which would
    not be fatal.
    The perf tool implements a check to detect this case. Once it is
    detected, a kernel patch can be implemented accordingly.
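
    One way such a check could look is a counter of consecutive zero page
    sizes. The sketch below is hypothetical and is not the check that the
    perf tool actually implements. The struct, the function and the
    threshold are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical detector state: current run of zero sizes and the alarm limit. */
struct zero_run_detector {
	size_t run;		/* consecutive samples with page size 0 */
	size_t threshold;	/* run length that counts as suspicious */
};

/*
 * Feed one sampled page size. Returns true once the number of
 * consecutive zero page sizes reaches the threshold. Any non-zero
 * size resets the run.
 */
static bool detector_feed(struct zero_run_detector *d, uint64_t page_size)
{
	assert(d != NULL);
	if (page_size == 0)
		d->run++;
	else
		d->run = 0;
	return d->run >= d->threshold;
}
```

    Counting only consecutive zeros keeps an occasional failed walk from
    raising a warning.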
    
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com