  1. 24 Jun, 2017 1 commit
     • mm: larger stack guard gap, between vmas · cfc0eb40
      Hugh Dickins authored
      commit 1be7107f upstream.
      
       The stack guard page is a useful feature that reduces the risk of the
       stack smashing into a different mapping.  We have been using a single
       page gap, which is sufficient to prevent the stack from becoming
       adjacent to a different mapping.  But this seems to be insufficient in
       light of the stack usage in userspace: e.g. glibc uses alloca()
       allocations as large as 64kB in many commonly used functions, and
       others use constructs like gid_t buffer[NGROUPS_MAX] (256kB) or stack
       strings sized by MAX_ARG_STRLEN.
      
       This becomes especially dangerous for suid binaries with the default
       unlimited stack size limit, because those applications can be tricked
       into consuming a large portion of the stack, and a single glibc call
       can then jump over the guard page.  These attacks are, unfortunately,
       not theoretical.
      
       Make those attacks less probable by increasing the stack guard gap
       to 1MB (on systems with 4k pages; make it depend on the page size
       because systems with larger base pages might cap stack allocations in
       PAGE_SIZE units), which should cover larger alloca() and VLA stack
       allocations.  It is obviously not a full fix because the problem is
       somewhat inherent, but it should shrink the attack surface a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
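       
       For illustration (value arbitrary), booting with:
       
              stack_guard_gap=1
       
       restores the old single-page gap, while the new default corresponds to
       256 pages on systems with 4k pages.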
      
       Implementation-wise, first delete all the old stack guard page code:
       although we could get away with accounting one extra page in a
       stack vma, accounting a larger gap can break userspace - case in point,
       a program run with "ulimit -S -v 20000" failed when the 1MB gap was
       counted towards RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
       and strict non-overcommit mode.
      
       Instead of keeping the gap inside the stack vma, maintain the stack
       guard gap as a gap between vmas: use vm_start_gap() in place of vm_start
       (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
       places which need to respect the gap - mainly arch_get_unmapped_area()
       and the vma tree's subtree_gap support for that.
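       
       A sketch of the new helper in the spirit of that description (assuming
       the gap size is kept, in bytes, in a global stack_guard_gap):
       
              static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
              {
                      unsigned long vm_start = vma->vm_start;
       
                      /* only stacks growing down get the gap below them */
                      if (vma->vm_flags & VM_GROWSDOWN) {
                              vm_start -= stack_guard_gap;
                              if (vm_start > vma->vm_start)   /* guard against underflow */
                                      vm_start = 0;
                      }
                      return vm_start;
              }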
       Original-patch-by: Oleg Nesterov <oleg@redhat.com>
       Original-patch-by: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Hugh Dickins <hughd@google.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Tested-by: Helge Deller <deller@gmx.de> # parisc
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       [wt: backport to 4.11: adjust context]
       [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
       Signed-off-by: Willy Tarreau <w@1wt.eu>
       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cfc0eb40
  2. 21 Apr, 2017 1 commit
  3. 20 Oct, 2016 1 commit
     • fs/proc: Stop trying to report thread stacks · b18cb64e
      Andy Lutomirski authored
      This reverts more of:
      
        b7643757 ("procfs: mark thread stack correctly in proc/<pid>/maps")
      
      ... which was partially reverted by:
      
        65376df5 ("proc: revert /proc/<pid>/maps [stack:TID] annotation")
      
      Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps.
      
      In current kernels, /proc/PID/maps (or /proc/TID/maps even for
      threads) shows "[stack]" for VMAs in the mm's stack address range.
      
      In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the
      target thread's stack's VMA.  This is racy, probably returns garbage
       and, on arches with CONFIG_THREAD_INFO_IN_TASK=y, is also crash-prone:
      KSTK_ESP is not safe to use on tasks that aren't known to be running
      ordinary process-context kernel code.
      
      This patch removes the difference and just shows "[stack]" for VMAs
      in the mm's stack range.  This is IMO much more sensible -- the
      actual "stack" address really is treated specially by the VM code,
      and the current thread stack isn't even well-defined for programs
      that frequently switch stacks on their own.
       Reported-by: Jann Horn <jann@thejh.net>
       Signed-off-by: Andy Lutomirski <luto@kernel.org>
       Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux API <linux-api@vger.kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tycho Andersen <tycho.andersen@canonical.com>
       Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b18cb64e
  4. 08 Oct, 2016 2 commits
     • mm, proc: fix region lost in /proc/self/smaps · 855af072
      Robert Ho authored
       Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM;
       for more detailed info please refer to:
      
         https://bugzilla.redhat.com/show_bug.cgi?id=1365721
      
       Actually, this bug is not specific to NVDIMM/DAX; it affects any other
       file system as well.  This simple test case, abstracted from nvml, can
       easily reproduce the bug in a common environment:
      
      -------------------------- testcase.c -----------------------------
      
       #include <fcntl.h>
       #include <pthread.h>
       #include <stdio.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <sys/stat.h>
       #include <unistd.h>
       
       /* The buffer size and thread count were not part of the reported
        * snippet; the values below are illustrative so it compiles stand-alone. */
       #define PROCMAXLEN 2048
       #define NTHREAD 16
       
       int
      is_pmem_proc(const void *addr, size_t len)
      {
              const char *caddr = addr;
      
              FILE *fp;
              if ((fp = fopen("/proc/self/smaps", "r")) == NULL) {
                      printf("!/proc/self/smaps");
                      return 0;
              }
      
              int retval = 0;         /* assume false until proven otherwise */
              char line[PROCMAXLEN];  /* for fgets() */
              char *lo = NULL;        /* beginning of current range in smaps file */
              char *hi = NULL;        /* end of current range in smaps file */
              int needmm = 0;         /* looking for mm flag for current range */
              while (fgets(line, PROCMAXLEN, fp) != NULL) {
                      static const char vmflags[] = "VmFlags:";
                      static const char mm[] = " wr";
      
                      /* check for range line */
                      if (sscanf(line, "%p-%p", &lo, &hi) == 2) {
                              if (needmm) {
                                      /* last range matched, but no mm flag found */
                                      printf("never found mm flag.\n");
                                      break;
                              } else if (caddr < lo) {
                                      /* never found the range for caddr */
                                      printf("#######no match for addr %p.\n", caddr);
                                      break;
                              } else if (caddr < hi) {
                                      /* start address is in this range */
                                      size_t rangelen = (size_t)(hi - caddr);
      
                                      /* remember that matching has started */
                                      needmm = 1;
      
                                      /* calculate remaining range to search for */
                                      if (len > rangelen) {
                                              len -= rangelen;
                                              caddr += rangelen;
                                              printf("matched %zu bytes in range "
                                                      "%p-%p, %zu left over.\n",
                                                              rangelen, lo, hi, len);
                                      } else {
                                              len = 0;
                                              printf("matched all bytes in range "
                                                              "%p-%p.\n", lo, hi);
                                      }
                              }
                      } else if (needmm && strncmp(line, vmflags,
                                              sizeof(vmflags) - 1) == 0) {
                              if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) {
                                      printf("mm flag found.\n");
                                      if (len == 0) {
                                              /* entire range matched */
                                              retval = 1;
                                              break;
                                      }
                                      needmm = 0;     /* saw what was needed */
                              } else {
                                      /* mm flag not set for some or all of range */
                                      printf("range has no mm flag.\n");
                                      break;
                              }
                      }
              }
      
              fclose(fp);
      
              printf("returning %d.\n", retval);
              return retval;
      }
      
      void *Addr;
      size_t Size;
      
      /*
       * worker -- the work each thread performs
       */
      static void *
      worker(void *arg)
      {
              int *ret = (int *)arg;
              *ret =  is_pmem_proc(Addr, Size);
              return NULL;
      }
      
      int main(int argc, char *argv[])
      {
              if (argc <  2 || argc > 3) {
                      printf("usage: %s file [env].\n", argv[0]);
                      return -1;
              }
      
              int fd = open(argv[1], O_RDWR);
      
              struct stat stbuf;
              fstat(fd, &stbuf);
      
              Size = stbuf.st_size;
              Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
      
              close(fd);
      
              pthread_t threads[NTHREAD];
              int ret[NTHREAD];
      
              /* kick off NTHREAD threads */
              for (int i = 0; i < NTHREAD; i++)
                      pthread_create(&threads[i], NULL, worker, &ret[i]);
      
              /* wait for all the threads to complete */
              for (int i = 0; i < NTHREAD; i++)
                      pthread_join(threads[i], NULL);
      
              /* verify that all the threads return the same value */
              for (int i = 1; i < NTHREAD; i++) {
                      if (ret[0] != ret[i]) {
                              printf("Error i %d ret[0] = %d ret[i] = %d.\n", i,
                                      ret[0], ret[i]);
                      }
              }
      
              printf("%d", ret[0]);
              return 0;
      }
      
       It failed because some threads could not find, in "/proc/self/smaps",
       the memory region that was allocated in the main process.
      
       It is caused by procfs, which uses 'file->version' to record the last
       VMA that was handled by the previous read() system call.  When the next
       read() is issued, it uses 'version' to find that VMA, and the VMA after
       it is the one to handle next; the related code is as follows:
      
              if (last_addr) {
                      vma = find_vma(mm, last_addr);
                      if (vma && (vma = m_next_vma(priv, vma)))
                              return vma;
              }
      
       However, a VMA will be lost if the last handled VMA is gone, e.g.:
      
      The process VMA list is A->B->C->D
      
      CPU 0                                  CPU 1
      read() system call
         handle VMA B
         version = B
      return to userspace
      
                                         unmap VMA B
      
      issue read() again to continue to get
      the region info
         find_vma(version) will get VMA C
         m_next_vma(C) will get VMA D
         handle D
         !!! VMA C is lost !!!
      
       In order to fix this bug, we make 'file->version' indicate the end address
       of the current VMA.  m_start will then look up a vma with vma_start
       < last_vm_end and move on to the next vma if it finds the same or an
       overlapping vma.  This guarantees that we will not miss an exclusive
       vma, but we can still miss one if the previous vma was shrunk.  This is
       acceptable because guaranteeing "never miss a vma" is simply not feasible.
       The user has to cope with some inconsistencies if the file is not read in
       one go.
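       
       A sketch of what the fixed m_start() lookup might look like, following
       that description (helper names mirror the snippet quoted above):
       
              if (last_addr) {
                      vma = find_vma(mm, last_addr - 1);
                      /* same or overlapping vma: skip to the next one */
                      if (vma && vma->vm_start <= last_addr)
                              vma = m_next_vma(priv, vma);
                      if (vma)
                              return vma;
              }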
      
      [mhocko@suse.com: changelog fixes]
       Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com
       Acked-by: Dave Hansen <dave.hansen@intel.com>
       Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
       Signed-off-by: Robert Hu <robert.hu@intel.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      855af072
     • fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious · 0f30206b
      James Morse authored
      Trying to walk all of virtual memory requires architecture specific
      knowledge.  On x86_64, addresses must be sign extended from bit 48,
      whereas on arm64 the top VA_BITS of address space have their own set of
      page tables.
      
       clear_refs_write() calls walk_page_range() on the range 0 to ~0UL and
       provides a test_walk() callback that only expects to be walking over
      VMAs.  Currently walk_pmd_range() will skip memory regions that don't
      have a VMA, reporting them as a hole.
      
      As this call only expects to walk user address space, make it walk 0 to
      'highest_vm_end'.
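       
       A minimal sketch of the intended call (the mm_walk setup is elided and
       the walk descriptor name is assumed):
       
              walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);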
      
       Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
       Signed-off-by: James Morse <james.morse@arm.com>
       Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f30206b
  5. 10 Sep, 2016 1 commit
     • mm: fix show_smap() for zone_device-pmd ranges · ca120cf6
      Dan Williams authored
      Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
      currently results in the following VM_BUG_ONs:
      
       kernel BUG at mm/huge_memory.c:1105!
       task: ffff88045f16b140 task.stack: ffff88045be14000
       RIP: 0010:[<ffffffff81268f9b>]  [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340
       [..]
       Call Trace:
        [<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0
        [<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0
        [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
        [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
        [<ffffffff81307656>] show_smap+0xa6/0x2b0
      
       kernel BUG at fs/proc/task_mmu.c:585!
       RIP: 0010:[<ffffffff81306469>]  [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0
       Call Trace:
        [<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0
        [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
        [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
        [<ffffffff81307696>] show_smap+0xa6/0x2b0
      
      These locations are sanity checking page flags that must be set for an
      anonymous transparent huge page, but are not set for the zone_device
      pages associated with dax mappings.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      ca120cf6
  6. 26 Jul, 2016 1 commit
  7. 24 May, 2016 1 commit
  8. 29 Apr, 2016 1 commit
     • numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Gerald Schaefer authored
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
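       
       A rough, simplified sketch of the idea behind such a "_pmd" variant (the
       real helper also repeats the VM_PFNMAP/VM_MIXEDMAP special-casing of
       vm_normal_page(); this only shows that pmd helpers, not pte helpers,
       decode the entry):
       
              static struct page *vm_normal_page_pmd(struct vm_area_struct *vma,
                                                     unsigned long addr, pmd_t pmd)
              {
                      unsigned long pfn = pmd_pfn(pmd);       /* pmd-aware decode */
       
                      if (unlikely(pfn > highest_memmap_pfn)) /* no struct page behind it */
                              return NULL;
                      return pfn_to_page(pfn);
              }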
       Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28093f9f
  9. 04 Apr, 2016 1 commit
     • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
       The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
       time ago with the promise that one day it would be possible to implement
       the page cache with chunks bigger than PAGE_SIZE.
       
       This promise never materialized, and it is unlikely it ever will.
      
       We have many places where PAGE_CACHE_SIZE is assumed to be equal to
       PAGE_SIZE, and it is a constant source of confusion whether a
       PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
       especially on the border between fs and mm.
       
       Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
       breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
       The only manual adjustment after coccinelle is reverting the change to
       the PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
       There are a few places in the code that coccinelle didn't reach.  I'll
       fix them manually in a separate patch.  Comments and documentation will
       also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
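       
       A hedged usage sketch for applying the semantic patch above (assuming
       it is saved as pagecache.cocci):
       
              spatch --sp-file pagecache.cocci --in-place --dir mm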
       Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  10. 18 Feb, 2016 1 commit
     • x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps · c1192f84
      Dave Hansen authored
      The protection key can now be just as important as read/write
      permissions on a VMA.  We need some debug mechanism to help
      figure out if it is in play.  smaps seems like a logical
      place to expose it.
      
      arch/x86/kernel/setup.c is a bit of a weirdo place to put
      this code, but it already had seq_file.h and there was not
      a much better existing place to put it.
      
       We also use no #ifdef.  If protection keys are .config'd out, we
       effectively end up with the same behavior as the weak generic
       function.
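       
       A sketch of the weak-hook pattern the changelog describes (the hook and
       field names are assumed, not verified against this exact patch):
       
              /* generic fallback in fs/proc/task_mmu.c: prints nothing */
              void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
              {
              }
       
              /* x86 override: dump the VMA's protection key */
              void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
              {
                      seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
              }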
       Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
       Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Mark Williamson <mwilliamson@undo-software.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
       Link: http://lkml.kernel.org/r/20160212210227.4F8EB3F8@viggo.jf.intel.com
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1192f84
  11. 03 Feb, 2016 2 commits
     • proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
      Johannes Weiner authored
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
       Siddhesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their requirement.  However, I did do this on my
        own time because I thought it was an interesting project for me and nobody
        really gave any feedback then as to its utility, so as far as I am
        concerned you could roll back the main thread maps information since the
        information is available in the thread-specific files"
       Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Shaohua Li <shli@fb.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65376df5
     • numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390 · 5c2ff95e
      Michael Holzheu authored
       When working with hugetlbfs ptes (which are actually pmds) it is not
       valid to use pte functions like pte_present() directly, because the
       hardware bit layout of pmds and ptes can differ.  This is the case on
       s390.  Therefore we first have to convert the hugetlbfs ptes into a
       valid pte encoding with huge_ptep_get().
      
      Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
      huge_ptep_get().  On s390 this leads to the following two problems:
      
      1) The pte_present() function returns false (instead of true) for
         PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
         completely in the "numa_maps" output.
      
      2) The pte_dirty() function always returns false for all hugetlb ptes.
         Therefore these pages are reported as "mapped=xxx" instead of
         "dirty=xxx".
      
      Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
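       
       A sketch of the corrected numa_maps hugetlb walker (function and helper
       names approximate):
       
              static int gather_hugetlb_stats(pte_t *ptep, unsigned long addr,
                                              struct numa_maps *md)
              {
                      pte_t huge_pte = huge_ptep_get(ptep);   /* decode hw layout */
                      struct page *page;
       
                      if (!pte_present(huge_pte))     /* now correct for PROT_NONE */
                              return 0;
                      page = pte_page(huge_pte);
                      gather_stats(page, md, pte_dirty(huge_pte), 1);
                      return 0;
              }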
       Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
       Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c2ff95e
  12. 22 Jan, 2016 1 commit
  13. 21 Jan, 2016 1 commit
  14. 16 Jan, 2016 2 commits
  15. 15 Jan, 2016 7 commits
     • mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov authored
       While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
       (which tests the RLIMIT_DATA value to figure out whether we're allowed
       to assign new @start_brk, @brk, @start_data and @end_data in mm_struct),
       it became clear that RLIMIT_DATA, in the form it is implemented now,
       doesn't do anything useful, because most user-space libraries use the
       mmap() syscall for dynamic memory allocations.
      
       Linus suggested converting the RLIMIT_DATA rlimit into something
       suitable for anonymous memory accounting.  But in this patch we go
       further, and the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
       The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
       "shared", but it might also be a strange beast like a readonly-private
       or VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
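       
       A sketch of the three classification predicates implied by the rules
       above (helper names assumed):
       
              static inline bool is_exec_mapping(vm_flags_t flags)
              {
                      return (flags & (VM_EXEC | VM_WRITE)) == VM_EXEC;       /* code */
              }
       
              static inline bool is_stack_mapping(vm_flags_t flags)
              {
                      return (flags & (VM_GROWSUP | VM_GROWSDOWN)) != 0;      /* stack */
              }
       
              static inline bool is_data_mapping(vm_flags_t flags)
              {
                      /* writable, private, and not a stack */
                      return (flags & (VM_WRITE | VM_SHARED |
                                       VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE;
              }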
       Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
       Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
       Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84638335
     • mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd() · 0e41e277
      Oleg Nesterov authored
       clear_soft_dirty_pmd() is called from clear_refs_write(CLEAR_REFS_SOFT_DIRTY),
       and VM_SOFTDIRTY has already been cleared before walk_page_range() is invoked.
       Signed-off-by: Oleg Nesterov <oleg@redhat.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e41e277
     • mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status · 8cee852e
      Jerome Marchand authored
       There are several shortcomings with the accounting of shared memory
       (SysV shm, shared anonymous mappings, mappings of a tmpfs file).  The
       values in /proc/<pid>/status and <...>/statm don't allow one to
       distinguish between shmem memory and a shared mapping of a regular
       file, even though their implications for memory usage are quite
       different: during reclaim, a file mapping can be dropped or written
       back to disk, while shmem needs a place in swap.
      
       Also, to distinguish the memory occupied by anonymous and file mappings,
       one has to read the /proc/pid/statm file, which has a field for the file
       mappings (again, including shmem) and the total memory occupied by these
       mappings (i.e.  equivalent to VmRSS in the <...>/status file).  Getting
       the value for anonymous mappings only is thus not exactly user-friendly
       (the statm file is intended to be efficiently machine-readable).
      
      To address both of these shortcomings, this patch adds a breakdown of
      VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
      RssShmem, making use of the previous preparatory patch.  These fields
      tell the user the memory occupied by private anonymous pages, mapped
      regular files and shmem, respectively.  Other existing fields in /status
      and /statm files are left without change.  The /statm file can be
      extended in the future, if there's a need for that.
      
      Example (part of) /proc/pid/status output including the new Rss* fields:
      
      VmPeak:  2001008 kB
      VmSize:  2001004 kB
      VmLck:         0 kB
      VmPin:         0 kB
      VmHWM:      5108 kB
      VmRSS:      5108 kB
      RssAnon:              92 kB
      RssFile:            1324 kB
      RssShmem:           3692 kB
      VmData:      192 kB
      VmStk:       136 kB
      VmExe:         4 kB
      VmLib:      1784 kB
      VmPTE:      3928 kB
      VmPMD:        20 kB
      VmSwap:        0 kB
      HugetlbPages:          0 kB
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
       Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cee852e
     • mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand authored
       Currently, looking at /proc/<pid>/status or statm, there is no way to
       distinguish shmem pages from pages mapped to a regular file (shmem pages
       are mapped to /dev/zero), even though their implications for actual
       memory use are quite different.
      
       The internal accounting currently counts shmem pages together with
       regular files.  As preparation for extending the userspace interfaces,
       this patch adds an MM_SHMEMPAGES counter to mm_rss_stat to account for
       shmem pages separately from MM_FILEPAGES.  The next patch will expose it
       to userspace; this patch doesn't change the exported values yet, because
       it adds MM_SHMEMPAGES back into MM_FILEPAGES at the places where
       MM_FILEPAGES was used before.  The only user-visible change after this
       patch is the OOM killer message, which separates the reported
       "shmem-rss" from "file-rss".
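       
       A sketch of that "keep the exported values unchanged" approach (the
       helper name is assumed, for illustration only):
       
              static inline unsigned long mm_file_plus_shmem(struct mm_struct *mm)
              {
                      /* userspace still sees a file rss that includes shmem */
                      return get_mm_counter(mm, MM_FILEPAGES) +
                             get_mm_counter(mm, MM_SHMEMPAGES);
              }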
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
       Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
     • mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03
      Vlastimil Babka authored
       Following the previous patch, a further reduction of the /proc/pid/smaps
       cost is possible for private writable shmem mappings with unpopulated
       areas, where the page walk invokes the .pte_hole function.  We can use a
       radix tree iterator for each such area instead of calling find_get_entry()
       in a loop.  This comes at the extra maintenance cost of introducing
       another shmem function, shmem_partial_swap_usage().
       
       To demonstrate the difference, I have measured this on a process that
       creates a private writable 2GB mapping of a partially swapped out
       /dev/shm/file (which cannot employ the optimizations from the previous
       patch) and doesn't populate it at all.  I time how long it takes to
       cat /proc/pid/smaps of this process 100 times.
      
      Before this patch:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      After this patch:
      
      real    0m1.176s
      user    0m0.180s
      sys     0m0.684s
      
      The time is similar to the case where a radix tree iterator is employed
      on the whole mapping.
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48131e03
     • mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370
      Vlastimil Babka authored
       The previous patch improved swap accounting for shmem mappings, which
       however made /proc/pid/smaps more expensive for them, as we consult the
       radix tree for each pte_none entry, so the overall complexity is
       O(n*log(n)).
      
       We can reduce this significantly for mappings that cannot contain COWed
       pages, because then we can either use the statistics the shmem object
       itself tracks (if the mapping contains the whole object, or the swap
       usage of the whole object is zero), or use the radix tree iterator,
       which is much more effective than repeated find_get_entry() calls.
       
       This patch therefore introduces a function shmem_swap_usage(vma) and
       makes /proc/pid/smaps use it when possible.  Only for writable private
       mappings of shmem objects (i.e.  tmpfs files) with the shmem object
       itself (partially) swapped out do we have to resort to the
       find_get_entry() approach.
      
      Hopefully such mappings are relatively uncommon.
      
       To demonstrate the difference, I have measured this on a process that
       creates a 2GB mapping and dirties single pages with a stride of 2MB, and
       timed how long it takes to cat /proc/pid/smaps of this process 100
       times.
      
      Private writable mapping of a /dev/shm/file (the most complex case):
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
      (which needs to employ the radix tree iterator).
      
      real    0m1.351s
      user    0m0.096s
      sys     0m0.768s
      
      Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
      
      real    0m0.935s
      user    0m0.128s
      sys     0m0.344s
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      The cost is now much closer to the private anonymous mapping case, unless
      the shmem mapping is private and writable.
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a15a370
     • mm, proc: account for shmem swap in /proc/pid/smaps · c261e7d9
      Vlastimil Babka authored
       Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
       shmem-backed mappings, even if the mapped portion does contain pages
       that were swapped out.  This is because, unlike private anonymous
       mappings, shmem does not change the pte to a swap entry but leaves it
       pte_none when swapping the page out.  In the smaps page walk, such a
       page thus looks like it was never faulted in.
      
       This patch changes smaps_pte_entry() to determine the swap status for
       such pte_none entries in shmem mappings, similarly to how
       mincore_page() does it.  Swapped out shmem pages are thus accounted for.
       For private mappings of tmpfs files that COWed some of the pages, the
       swapped-out status of the original shmem pages is naturally ignored.  If
       some of the private copies were also swapped out, they are accounted via
       their page table swap entries, so the resulting reported swap usage is a
       sum of both the swapped out private copies and the swapped out shmem
       pages that were not COWed.  No double accounting can thus happen.
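       
       A hedged sketch of the idea (helper naming is illustrative and
       reference counting of a real page is omitted): for a shmem-backed
       mapping, a pte_none slot may still correspond to a page the shmem
       object has swapped out, so look it up in the object's radix tree
       instead of assuming "never faulted in":
       
              if (pte_none(*pte) && shmem_backed_vma) {       /* assumed flag */
                      pgoff_t pgoff = linear_page_index(vma, addr);
                      void *entry = find_get_entry(vma->vm_file->f_mapping, pgoff);
       
                      if (radix_tree_exceptional_entry(entry))  /* swap entry stored */
                              mss->swap += PAGE_SIZE;
              }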
      
       The accounting is arguably still not as precise as for private anonymous
       mappings, since we will now also count pages that the process in
       question never accessed but that another process populated and then let
       become swapped out.  I believe this is still less confusing and subtle
       than not showing any swap usage by shmem mappings at all.  The
       swapped-out counter might be of interest to users who would like to
       avoid future swap-ins during a performance-critical operation and
       pre-fault the pages at their convenience.  Especially for larger swapped
       out regions the cost of a swap-in is much higher than a fresh page
       allocation.  So a differentiation between pte_none and swapped out is
       important for those use cases.
      
       One downside of this patch is that it makes /proc/pid/smaps more
       expensive for shmem mappings, as we consult the radix tree for each
       pte_none entry, so the overall complexity is O(n*log(n)).  I have
       measured this on a process that creates a 2GB mapping and dirties single
       pages with a stride of 2MB, and timed how long it takes to cat
       /proc/pid/smaps of this process 100 times.
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      Mapping of a /dev/shm/file:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      The difference is rather substantial, so the next patch will reduce the
      cost for shared or read-only mappings.
      
      In a less controlled experiment, I've gathered pids of processes on my
      desktop that have either '/dev/shm/*' or 'SYSV*' in smaps.  This
      included the Chrome browser and some KDE processes.  Again, I've run cat
      /proc/pid/smaps on each 100 times.
      
      Before this patch:
      
      real    0m9.050s
      user    0m0.518s
      sys     0m8.066s
      
      After this patch:
      
      real    0m9.221s
      user    0m0.541s
      sys     0m8.187s
      
      This suggests low impact on average systems.
      
      Note that this patch doesn't attempt to adjust the SwapPss field for
      shmem mappings, which would need extra work to determine who else could
      have the pages mapped.  Thus the value stays zero except for COWed
      swapped out pages in a shmem mapping, which are accounted as usual.
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
       Acked-by: Jerome Marchand <jmarchan@redhat.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c261e7d9
  16. 06 Nov, 2015 4 commits
  17. 14 Oct, 2015 1 commit
  18. 10 Sep, 2015 1 commit
     • mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov authored
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
       Currently, the only means the kernel provides for estimating the amount
       of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
       To overcome these drawbacks, this patch introduces two new page flags,
       Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
       A page's Idle flag can only be set from userspace, by setting the bit in
       /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
       and it is cleared whenever the page is accessed either through page tables
       (it is cleared in page_referenced() in this case) or using the read(2)
       system call (mark_page_accessed()).  Thus, by setting the Idle flag for
       the pages of a particular workload, which can be found e.g.  by reading
       /proc/PID/pagemap, waiting for some time to let the workload access its
       working set, and then reading the bitmap file, one can estimate the number
       of pages that are not used by the workload.
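       
       A hedged userspace sketch of that workflow for a single page (assumes
       root and CONFIG_IDLE_PAGE_TRACKING, omits error handling; it relies on
       the documented formats: pagemap bit 63 = present, bits 0-54 = PFN, and
       the idle bitmap accessed in 8-byte chunks, bit i of chunk k covering
       PFN k*64+i):
       
              #include <fcntl.h>
              #include <stdint.h>
              #include <sys/types.h>
              #include <unistd.h>
       
              /* translate a virtual address of this process to a page frame number */
              static uint64_t vaddr_to_pfn(uintptr_t vaddr)
              {
                      uint64_t entry = 0;
                      int fd = open("/proc/self/pagemap", O_RDONLY);
       
                      pread(fd, &entry, sizeof(entry),
                            (vaddr / sysconf(_SC_PAGESIZE)) * sizeof(entry));
                      close(fd);
                      return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
              }
       
              /* mark one PFN idle (set=1), or query whether it is still idle (set=0) */
              static int page_idle_op(uint64_t pfn, int set)
              {
                      uint64_t mask = 1ULL << (pfn % 64), word = set ? mask : 0;
                      off_t off = (pfn / 64) * sizeof(word);
                      int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
       
                      if (set)
                              pwrite(fd, &word, sizeof(word), off);
                      else
                              pread(fd, &word, sizeof(word), off);
                      close(fd);
                      return !!(word & mask);
              }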
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
       Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
       Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  19. 08 Sep, 2015 6 commits
  20. 04 Sep, 2015 1 commit
     • userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 16ba6f81
      Andrea Arcangeli authored
       These two flags get set in vma->vm_flags to tell the VM common code
       whether a userfaultfd is armed and in which mode (only tracking missing
       faults, only tracking wrprotect faults, or both).  If neither flag is
       set, the userfaultfd is not armed on the vma.
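       
       A minimal sketch of the kind of check the common code can now make:
       
              /* is any userfaultfd tracking armed on this vma? */
              bool uffd_armed = vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
       
              /* specifically armed for write-protect tracking? */
              bool uffd_wp = vma->vm_flags & VM_UFFD_WP;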
       Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
       Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16ba6f81
  21. 23 Jun, 2015 1 commit
  22. 17 Mar, 2015 1 commit
  23. 13 Feb, 2015 1 commit
     • fs: proc: task_mmu: show page size in /proc/<pid>/numa_maps · 198d1597
      Rafael Aquini authored
      The output of /proc/$pid/numa_maps is in terms of number of pages like
      anon=22 or dirty=54.  Here's some output:
      
        7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
        7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
        7fff8d425000 default stack anon=50 dirty=50 N0=50
      
       It looks like we have a stack and a couple of anonymous hugetlbfs
       areas which all use the same amount of memory.  They don't.
      
      The 'bigfile' uses 1GB pages and takes up ~50GB of space.  The
      anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
      uses normal 4k pages.  You can go over to smaps to figure out what the
      page size _really_ is with KernelPageSize or MMUPageSize.  But, I think
      this is a pretty nasty and counterintuitive interface as it stands.
      
       This patch introduces a 'kernelpagesize_kB' line element in the
       /proc/<pid>/numa_maps report in order to help identify the size of the
       pages backing the memory areas mapped by a given task.  This is
       especially useful for differentiating between HUGE and GIGANTIC page
       backed VMAs.
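       
       With the new field, the example above might look like this (values
       illustrative):
       
         7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50 kernelpagesize_kB=1048576
         7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50 kernelpagesize_kB=2048
         7fff8d425000 default stack anon=50 dirty=50 N0=50 kernelpagesize_kB=4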
      
       This patch is based on Dave Hansen's proposal and reviewers' follow-ups
       taken from the following discussion threads:
        * https://lkml.org/lkml/2011/9/21/454
        * https://lkml.org/lkml/2014/12/20/66
       Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
       Acked-by: David Rientjes <rientjes@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      198d1597