1. 26 Aug, 2017 2 commits
  2. 18 Jul, 2017 1 commit
    • mm: Tighten x86 /dev/mem with zeroing reads · 3cbd86d2
      Kees Cook authored
      commit a4866aa8 upstream.
      
      Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is
      disallowed. However, on x86, the first 1MB was always allowed for BIOS
      and similar things, regardless of it actually being System RAM. It was
      possible for heap to end up getting allocated in low 1MB RAM, and then
      read by things like x86info or dd, which would trip hardened usercopy:
      
      usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes)
      
      This changes the x86 exception for the low 1MB by reading back zeros for
      System RAM areas instead of blindly allowing them. More work is needed to
      extend this to mmap, but currently mmap doesn't go through usercopy, so
      hardened usercopy won't Oops the kernel.
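
      As a rough sketch of the idea (not the exact hunk from the patch; the
      helper names below are existing kernel ones), the /dev/mem read path
      now treats low-1MB System RAM as readable but zero-filled:

        /* sketch: zero-fill instead of exposing low-1MB System RAM contents */
        if (p < 1024 * 1024 && page_is_ram(p >> PAGE_SHIFT)) {
                if (clear_user(buf, sz))        /* reads back zeros */
                        return -EFAULT;
        } else {
                ptr = xlate_dev_mem_ptr(p);     /* normal /dev/mem read */
                if (!ptr || copy_to_user(buf, ptr, sz))
                        return -EFAULT;
        }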
      Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      [bwh: Backported to 3.16: adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      3cbd86d2
  3. 02 Jul, 2017 1 commit
    • mm: larger stack guard gap, between vmas · 978b8aa1
      Hugh Dickins authored
      commit 1be7107f upstream.
      
      The stack guard page is a useful feature that reduces the risk of the
      stack smashing into a different mapping. We have been using a single
      page gap, which is sufficient to prevent having the stack adjacent to a
      different mapping. But this seems to be insufficient in light of the
      stack usage in userspace. E.g. glibc uses alloca() allocations as large
      as 64kB in many commonly used functions. Others use constructs like
      gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with
      MAX_ARG_STRLEN.
      
      This becomes especially dangerous for suid binaries with the default
      unlimited stack size, because those applications can be tricked into
      consuming a large portion of the stack, and a single glibc call can
      then jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      to some degree inherent, but it should reduce the attack surface a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
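
      The helpers look roughly like this (a sketch of the approach: the guard
      gap is kept outside the vma and only applied where callers compare
      against the stack vma's edge; stack_guard_gap defaults to 256 pages,
      i.e. 1MB with 4k pages):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow check */
                                vm_start = 0;
                }
                return vm_start;
        }

        static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_end = vma->vm_end;

                if (vma->vm_flags & VM_GROWSUP) {
                        vm_end += stack_guard_gap;
                        if (vm_end < vma->vm_end)       /* overflow check */
                                vm_end = -PAGE_SIZE;
                }
                return vm_end;
        }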
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [Hugh Dickins: Backported to 3.16]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      978b8aa1
  4. 30 Apr, 2016 2 commits
  5. 17 Feb, 2016 1 commit
    • x86/mm/pat: Avoid truncation when converting cpa->numpages to address · 0972cb11
      Matt Fleming authored
      commit 74256377 upstream.
      
      There are a couple of nasty truncation bugs lurking in the pageattr
      code that can be triggered when mapping EFI regions, e.g. when we pass
      a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
      left by PAGE_SHIFT will truncate the resultant address to 32-bits.
      
      Viorel-Cătălin managed to trigger this bug on his Dell machine that
      provides a ~5GB EFI region which requires 1236992 pages to be mapped.
      When calling populate_pud() the end of the region gets calculated
      incorrectly in the following buggy expression,
      
        end = start + (cpa->numpages << PAGE_SHIFT);
      
      And only 188416 pages are mapped. Next, populate_pud() gets invoked
      for a second time because of the loop in __change_page_attr_set_clr(),
      only this time no pages get mapped because shifting the remaining
      number of pages (1048576) by PAGE_SHIFT is zero. At that point the
      loop in __change_page_attr_set_clr() spins forever because we fail to
      make any forward progress.
      
      Hitting this bug depends very much on the virtual address we pick to
      map the large region at and how many pages we map on the initial run
      through the loop. This explains why this issue was only recently hit
      with the introduction of commit a5caa209 ("x86/efi: Fix boot crash by
      mapping EFI memmap entries bottom-up at runtime, instead of top-down").
      
      It's interesting to note that safe uses of cpa->numpages do exist in
      the pageattr code. If instead of shifting ->numpages we multiply by
      PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
      so the result is unsigned long.
      
      To avoid surprises when users try to convert very large cpa->numpages
      values to addresses, change the data type from 'int' to 'unsigned
      long', thereby making it suitable for shifting by PAGE_SHIFT without
      any type casting.
      
      The alternative would be to make liberal use of casting, but that is
      far more likely to cause problems in the future when someone adds more
      code and fails to cast properly; this bug was difficult enough to
      track down in the first place.
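
      The truncation is easy to reproduce in user space (a standalone demo
      assuming 4k pages; the figures match the report above):

        #include <stdio.h>

        #define PAGE_SHIFT 12
        #define PAGE_SIZE  4096UL

        int main(void)
        {
                /* cpa->numpages was a 32-bit 'int'; unsigned here only to
                 * keep the overflow in the demo well-defined */
                unsigned int numpages = 1236992;   /* pages for the ~5GB region */

                /* 32-bit shift wraps: only 188416 pages' worth survives */
                unsigned long bad  = numpages << PAGE_SHIFT;
                /* multiplying by the UL constant promotes first: no wrap */
                unsigned long good = numpages * PAGE_SIZE;

                printf("shifted:    %lu bytes (%lu pages)\n", bad,  bad  >> PAGE_SHIFT);
                printf("multiplied: %lu bytes (%lu pages)\n", good, good >> PAGE_SHIFT);
                /* the remaining 1048576 pages shift to exactly 2^32 -> 0 */
                printf("remainder:  %lu bytes\n", (unsigned long)(1048576u << PAGE_SHIFT));
                return 0;
        }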
      Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131
      Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      0972cb11
  6. 02 Feb, 2016 1 commit
    • x86/mm: Add barriers and document switch_mm()-vs-flush synchronization · bab48cc4
      Andy Lutomirski authored
      commit 71b3c126 upstream.
      
      When switch_mm() activates a new PGD, it also sets a bit that
      tells other CPUs that the PGD is in use so that TLB flush IPIs
      will be sent.  In order for that to work correctly, the bit
      needs to be visible prior to loading the PGD and therefore
      starting to fill the local TLB.
      
      Document all the barriers that make this work correctly and add
      a couple that were missing.
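
      Very roughly, the ordering constraint being documented is (a conceptual
      sketch, not the exact diff):

        /* The mm_cpumask bit must be visible before we start filling the TLB
         * from the new page tables, so that a remote CPU changing a PTE
         * either sees the bit (and sends a flush IPI) or we observe its PTE
         * update when we walk the new page tables. */
        cpumask_set_cpu(cpu, mm_cpumask(next));
        /* on x86 the locked bit-set above acts as a full memory barrier */
        load_cr3(next->pgd);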
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [ luis: backported to 3.16:
        - dropped N/A comment in flush_tlb_mm_range()
        - adjusted context ]
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      bab48cc4
  7. 28 Oct, 2015 1 commit
    • x86/mm: Set NX on gap between __ex_table and rodata · 45771796
      Stephen Smalley authored
      commit ab76f7b4 upstream.
      
      Unused space between the end of __ex_table and the start of
      rodata can be left W+x in the kernel page tables.  Extend the
      setting of the NX bit to cover this gap by starting from
      text_end rather than rodata_start.
      
        Before:
        ---[ High Kernel Mapping ]---
        0xffffffff80000000-0xffffffff81000000          16M                               pmd
        0xffffffff81000000-0xffffffff81600000           6M     ro         PSE     GLB x  pmd
        0xffffffff81600000-0xffffffff81754000        1360K     ro                 GLB x  pte
        0xffffffff81754000-0xffffffff81800000         688K     RW                 GLB x  pte
        0xffffffff81800000-0xffffffff81a00000           2M     ro         PSE     GLB NX pmd
        0xffffffff81a00000-0xffffffff81b3b000        1260K     ro                 GLB NX pte
        0xffffffff81b3b000-0xffffffff82000000        4884K     RW                 GLB NX pte
        0xffffffff82000000-0xffffffff82200000           2M     RW         PSE     GLB NX pmd
        0xffffffff82200000-0xffffffffa0000000         478M                               pmd
      
        After:
        ---[ High Kernel Mapping ]---
        0xffffffff80000000-0xffffffff81000000          16M                               pmd
        0xffffffff81000000-0xffffffff81600000           6M     ro         PSE     GLB x  pmd
        0xffffffff81600000-0xffffffff81754000        1360K     ro                 GLB x  pte
        0xffffffff81754000-0xffffffff81800000         688K     RW                 GLB NX pte
        0xffffffff81800000-0xffffffff81a00000           2M     ro         PSE     GLB NX pmd
        0xffffffff81a00000-0xffffffff81b3b000        1260K     ro                 GLB NX pte
        0xffffffff81b3b000-0xffffffff82000000        4884K     RW                 GLB NX pte
        0xffffffff82000000-0xffffffff82200000           2M     RW         PSE     GLB NX pmd
        0xffffffff82200000-0xffffffffa0000000         478M                               pmd
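
      In mark_rodata_ro() the change boils down to starting the NX range at
      text_end instead of rodata_start, roughly (variable names approximate):

        /* before: the __ex_table..rodata gap stayed executable */
        set_memory_nx(rodata_start, (end - rodata_start) >> PAGE_SHIFT);
        /* after: cover the gap by starting at the end of text */
        set_memory_nx(text_end, (end - text_end) >> PAGE_SHIFT);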
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1443704662-3138-1-git-send-email-sds@tycho.nsa.gov
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      45771796
  8. 30 Sep, 2015 1 commit
  9. 02 Mar, 2015 2 commits
    • x86, mm/ASLR: Fix stack randomization on 64-bit systems · b515b1b0
      Hector Marco-Gisbert authored
      commit 4e7c22d4 upstream.
      
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
         #ifdef CONFIG_STACK_GROWSUP
                  return PAGE_ALIGN(stack_top) + random_variable;
         #else
                  return PAGE_ALIGN(stack_top) - random_variable;
         #endif
        }
      
      Note that it declares the "random_variable" variable as "unsigned int".
      Since STACK_RND_MASK is 0x3fffff on x86_64 (22 bits) and PAGE_SHIFT is
      12 on x86_64, the shifting operation

              random_variable <<= PAGE_SHIFT;

      produces a 34-bit (22+12) result, so the two leftmost bits are dropped
      when the result is stored in the 32-bit "random_variable". The variable
      would need to be at least 34 bits wide to hold the full result.
      
      These two dropped bits have an impact on the entropy of the process
      stack. Concretely, the total stack entropy is reduced by a factor of
      four: from 2^30 to 2^28 (one fourth of the expected entropy).

      This patch restores the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
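
      The lost bits can also be seen directly with a small user-space demo
      (assuming the x86_64 values of STACK_RND_MASK and PAGE_SHIFT):

        #include <stdio.h>

        #define PAGE_SHIFT     12
        #define STACK_RND_MASK 0x3fffffUL       /* 22 bits on x86_64 */

        int main(void)
        {
                unsigned int  narrow = STACK_RND_MASK;  /* old type */
                unsigned long wide   = STACK_RND_MASK;  /* new type */

                narrow <<= PAGE_SHIFT;  /* 34-bit value truncated to 32 bits */
                wide   <<= PAGE_SHIFT;

                printf("unsigned int : max offset 0x%x\n",  narrow); /* 0xfffff000  */
                printf("unsigned long: max offset 0x%lx\n", wide);   /* 0x3fffff000 */
                return 0;
        }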
      Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: Ismael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
      
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      b515b1b0
    • mm/hugetlb: pmd_huge() returns true for non-present hugepage · 24bb9d18
      Naoya Horiguchi authored
      commit cbef8478 upstream.
      
      Migrating hugepages and hwpoisoned hugepages are considered as non-present
      hugepages, and they are referenced via migration entries and hwpoison
      entries in their page table slots.
      
      This behavior causes a race condition, because pmd_huge() cannot tell
      such migrating/hwpoisoned hugepages apart from non-huge pages.
      follow_page_mask() is one example where the kernel would call
      follow_page_pte() for such a hugepage although that function is
      supposed to handle only normal pages.
      
      To avoid this, this patch makes pmd_huge() also return true when
      pmd_present() is false but the pmd is not none, i.e. for migration and
      hwpoison entries.  We don't have to worry about mixing up a non-present
      pmd entry with a normal pmd (one pointing to a leaf-level pte page),
      because pmd_present() is true for a normal pmd.

      The same race condition could happen in the (x86-specific)
      gup_pmd_range(), where this patch simply adds a pmd_present() check
      instead of using pmd_huge(), because gup_pmd_range() is a fast path.
      If we hit a non-present hugepage in this function, we go into
      gup_huge_pmd(), return 0 at the flag mask check, and finally fall back
      to the slow path.
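
      On x86 the fixed check is roughly of this shape (a sketch: a huge pmd is
      either a present PSE entry or a non-empty, non-present entry such as a
      migration or hwpoison entry):

        int pmd_huge(pmd_t pmd)
        {
                /* not empty, and either non-present or a PSE mapping */
                return !pmd_none(pmd) &&
                        (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
        }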
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      24bb9d18
  10. 16 Feb, 2015 1 commit
  11. 04 Feb, 2015 1 commit
    • vm: add VM_FAULT_SIGSEGV handling support · 903575f1
      Linus Torvalds authored
      commit 33692f27 upstream.
      
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
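
      The per-architecture hunk is generally of this shape (a sketch; the
      label names vary from one fault handler to another):

        fault = handle_mm_fault(mm, vma, address, flags);

        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGSEGV)
                        goto bad_area;          /* deliver SIGSEGV */
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;         /* deliver SIGBUS  */
                BUG();
        }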
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [ luis: backported to 3.16:
        - file renamed: arch/powerpc/mm/copro_fault.c ->
          arch/powerpc/platforms/cell/spu_fault.c
        - dropped changes to arch/nios2/mm/fault.c ]
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      903575f1
  12. 01 Dec, 2014 1 commit
    • x86, mm: Set NX across entire PMD at boot · de32229b
      Kees Cook authored
      commit 45e2a9d4 upstream.
      
      When setting up permissions on kernel memory at boot, the end of the
      PMD that was split from bss remained executable. It should be NX like
      the rest. This performs a PMD alignment instead of a PAGE alignment to
      get the correct span of memory.
      
      Before:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82c00000    10M     RW   PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82df5000  2004K     RW       GLB NX pte
      0xffffffff82df5000-0xffffffff82e00000    44K     RW       GLB x  pte
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      After:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82e00000    12M     RW   PSE GLB NX pmd
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment.
              We really should unmap the remainder along with the holes
              caused by init, initdata etc., but that's a different issue ]
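
      The fix amounts to rounding the end of the NX range in mark_rodata_ro()
      up to a PMD boundary, roughly (a sketch, not the literal diff):

        /* make the whole PMD that brk was split out of non-executable */
        all_end = roundup((unsigned long)_brk_end, PMD_SIZE);
        set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);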
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.net
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      de32229b
  13. 13 Nov, 2014 1 commit
  14. 04 Jun, 2014 4 commits
    • arch/x86/mm/numa.c: use for_each_memblock() · af4459d3
      Emil Medve authored
      Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af4459d3
    • x86, mm: probe memory block size for generic x86 64bit · 982792c7
      Yinghai Lu authored
      On a system with 2TiB of RAM, x86_64 currently has a 128M section size,
      and one memory_block includes only one section, so there end up being
      16400 entries under /sys/devices/system/memory/.

      The current code tries to use the block id to find the block pointer in
      /sys for any section, and reuse that block pointer.  That lookup takes
      some time even after commit 7c243c71 ("mm: speedup in
      __early_pfn_to_nid"), which skips the search in that case during boot.

      So the solution could be to increase the block size, just like the SGI
      UV system did (hardcoded to 2g).

      This patch probes the block size to make it match the MMIO remap size.
      For example, Intel Nehalem and later systems will have the memory
      ranges [0, TOML) and [4g, TOMH].  If the memory hole is 2g and the
      total is 128g, TOM will be 2g and TOM2 will be 130g.

      We could use 2g as the block size instead of the default 128M.  That
      will reduce the number of entries in /sys/devices/system/memory/.

      On a 6TiB system this will reduce boot time by 35 seconds.
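
      Back-of-the-envelope, the reduction in sysfs entries looks like this
      (assuming the block size evenly divides the populated range):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long ram   = 2ULL << 40;   /* 2 TiB of RAM        */
                unsigned long long small = 128ULL << 20; /* 128M section blocks */
                unsigned long long large = 2ULL << 30;   /* probed 2G blocks    */

                printf("128M blocks: %llu entries\n", ram / small); /* 16384 */
                printf("2G blocks:   %llu entries\n", ram / large); /* 1024  */
                return 0;
        }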
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      982792c7
    • x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman authored
      
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific, and conceptually
      it is awkward because it requires special casing to distinguish between
      PROTNONE and NUMA ptes based on context.

      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults; as long as the PTE is not present, a fault
      will be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
    • hugetlb: restrict hugepage_migration_support() to x86_64 · c177c81e
      Naoya Horiguchi authored
      
      Currently hugepage migration is available for all archs which support
      pmd-level hugepages, but it is tested only on x86_64 and there are
      bugs on other archs.  So to avoid breaking such archs, this patch
      limits the availability strictly to x86_64 until developers of other
      archs get interested in enabling this feature.
      
      Simply disabling hugepage migration on non-x86_64 archs is not enough to
      fix the reported problem where sys_move_pages() hits the BUG_ON() in
      follow_page(FOLL_GET), so let's fix this by checking if hugepage
      migration is supported in vma_migratable().
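
      The vma_migratable() part of the fix is roughly (a sketch; the existing
      checks are elided, and the new config symbol is selected only by x86_64
      for now):

        static inline int vma_migratable(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_IO | VM_PFNMAP))
                        return 0;
        #ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
                /* hugepage migration is not wired up on this arch */
                if (vma->vm_flags & VM_HUGETLB)
                        return 0;
        #endif
                /* ... remaining existing checks ... */
                return 1;
        }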
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c177c81e
  15. 20 May, 2014 2 commits
  16. 05 May, 2014 3 commits
  17. 02 May, 2014 1 commit
    • x86, ioremap: Speed up check for RAM pages · c81c8a1e
      Roland Dreier authored
      
      In __ioremap_caller() (the guts of ioremap), we loop over the range of
      pfns being remapped and check each one individually with page_is_ram().
      For large ioremaps, this can be very slow.  For example, we have a
      device with a 256 GiB PCI BAR, and ioremapping this BAR can take 20+
      seconds -- sometimes long enough to trigger the soft lockup detector!
      
      Internally, page_is_ram() calls walk_system_ram_range() on a single
      page.  Instead, we can make a single call to walk_system_ram_range()
      from __ioremap_caller(), and do our further checks only for any RAM
      pages that we find.  For the common case of MMIO, this saves an enormous
      amount of work, since the range being ioremapped doesn't intersect
      system RAM at all.
      
      With this change, ioremap on our 256 GiB BAR takes less than 1 second.
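
      A sketch of the new shape of the check (one walk over the range with a
      callback that flags any RAM pfn, instead of page_is_ram() per page):

        static int __ioremap_check_ram(unsigned long start_pfn,
                                       unsigned long nr_pages, void *arg)
        {
                unsigned long i;

                for (i = 0; i < nr_pages; i++)
                        if (pfn_valid(start_pfn + i) &&
                            !PageReserved(pfn_to_page(start_pfn + i)))
                                return 1;       /* System RAM in the range */
                return 0;
        }

        /* in __ioremap_caller(): */
        if (walk_system_ram_range(start_pfn, nr_pages, NULL, __ioremap_check_ram))
                return NULL;    /* refuse to ioremap RAM */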
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Link: http://lkml.kernel.org/r/1399054721-1331-1-git-send-email-roland@kernel.org
      
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c81c8a1e
  18. 30 Apr, 2014 1 commit
    • x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack · 3891a04a
      H. Peter Anvin authored
      The IRET instruction, when returning to a 16-bit segment, only
      restores the bottom 16 bits of the user space stack pointer.  This
      causes some 16-bit software to break, but it also leaks kernel state
      to user space.  We have a software workaround for that ("espfix") for
      the 32-bit kernel, but it relies on a nonzero stack segment base which
      is not available in 64-bit mode.
      
      In checkin:
      
          b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
      
      we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
      the logic that 16-bit support is crippled on 64-bit kernels anyway (no
      V86 support), but it turns out that people are doing stuff like
      running old Win16 binaries under Wine and expect it to work.
      
      This patch works around the problem by creating percpu "ministacks", each of which
      is mapped 2^16 times 64K apart.  When we detect that the return SS is
      on the LDT, we copy the IRET frame to the ministack and use the
      relevant alias to return to userspace.  The ministacks are mapped
      readonly, so if IRET faults we promote #GP to #DF which is an IST
      vector and thus has its own stack; we then do the fixup in the #DF
      handler.
      
      (Making #GP an IST exception would make the msr_safe functions unsafe
      in NMI/MC context, and quite possibly have other effects.)
      
      Special thanks to:
      
      - Andy Lutomirski, for the suggestion of using very small stack slots
        and copy (as opposed to map) the IRET frame there, and for the
        suggestion to mark them readonly and let the fault promote to #DF.
      - Konrad Wilk for paravirt fixup and testing.
      - Borislav Petkov for testing help and useful comments.
      Reported-by: Brian Gerst <brgerst@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Lutomriski <amluto@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dirk Hohndel <dirk@hohndel.org>
      Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
      Cc: comex <comexk@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: <stable@vger.kernel.org> # consider after upstream merge
      3891a04a
  19. 24 Apr, 2014 1 commit
    • kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation · 9326638c
      Masami Hiramatsu authored
      
      Use the NOKPROBE_SYMBOL() macro for protecting functions
      from kprobes instead of the __kprobes annotation under
      arch/x86.

      This applies the nokprobe_inline annotation in some cases,
      because NOKPROBE_SYMBOL() inhibits inlining by
      referring to the symbol's address.

      This just folds a bunch of previous NOKPROBE_SYMBOL()
      cleanup patches for x86 into one patch.
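
      The conversion pattern is simply (illustrative foo() only):

        /* before: the annotation moves the function into .kprobes.text */
        static void __kprobes foo(void)
        {
                /* ... */
        }

        /* after: the symbol is blacklisted by address instead */
        static void foo(void)
        {
                /* ... */
        }
        NOKPROBE_SYMBOL(foo);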
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp
      
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Lebon <jlebon@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9326638c
  20. 16 Apr, 2014 1 commit
  21. 08 Apr, 2014 1 commit
  22. 07 Apr, 2014 2 commits
    • x86: use generic early_ioremap · 5b7c73e0
      Mark Salter authored
      
      Move x86 over to the generic early ioremap implementation.
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b7c73e0
    • x86/mm: sparse warning fix for early_memremap · 6b550f6f
      Dave Young authored
      
      This patch series takes the common bits from the x86 early ioremap
      implementation and creates a generic implementation which may be used by
      other architectures.  The early ioremap interfaces are intended for
      situations where boot code needs to make temporary virtual mappings
      before the normal ioremap interfaces are available.  Typically, this
      means before paging_init() has run.
      
      This patch (of 6):
      
      There are a lot of sparse warnings for code like the following:

        void *a = early_memremap(phys_addr, size);

      early_memremap() is intended to map kernel memory with the ioremap
      facility; the returned pointer should be a kernel RAM pointer rather
      than an iomem one.

      To make the function clearer and suppress the sparse warnings, this
      patch does two things:
      1. cast the return value of early_memremap to (__force void *)
      2. add an early_memunmap function and pass (__force void __iomem *)
         to iounmap
      
      From Boris:
        "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
         FWIW, all callers of early_memremap use the memory they get remapped
         as normal memory so we should be safe"
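
      A sketch of the resulting x86 helpers (the protection argument is shown
      as PAGE_KERNEL, matching the existing x86 early_memremap()):

        void __init *early_memremap(resource_size_t phys_addr, unsigned long size)
        {
                /* RAM mapping: strip the __iomem address space for sparse */
                return (__force void *)__early_ioremap(phys_addr, size,
                                                       PAGE_KERNEL);
        }

        void __init early_memunmap(void *addr, unsigned long size)
        {
                /* hand the pointer back to the __iomem-typed unmap helper */
                early_iounmap((__force void __iomem *)addr, size);
        }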
      Signed-off-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b550f6f
  23. 13 Mar, 2014 1 commit
  24. 06 Mar, 2014 1 commit
  25. 05 Mar, 2014 1 commit
  26. 04 Mar, 2014 3 commits
  27. 27 Feb, 2014 2 commits