1. 05 Jan, 2018 30 commits
    • kaiser: asm/tlbflush.h handle noPGE at lower level · 0651b3ad
      Hugh Dickins authored
      
      I found asm/tlbflush.h too twisty, and think it safer not to avoid
      __native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
      but instead let it handle kaiser_enabled along with cr3: it can just
      use __native_flush_tlb() for that, no harm in re-disabling preemption.
      
      (This is not the same change as Kirill and Dave have suggested for
      upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
      check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
      
      Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
      preference from __native_flush_tlb(): unlike the invpcid_flush_all()
      preference in __native_flush_tlb_global(), it's not seen in upstream
      4.14, and was recently reported to be surprisingly slow.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: drop is_atomic arg to kaiser_pagetable_walk() · 28c6de54
      Hugh Dickins authored
      
      I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
      use of kaiser_add_mapping() in our tree (why not?), but like upstream
      we have not provided a way for that to pass is_atomic true down to
      kaiser_pagetable_walk(), and at startup it's far from a likely source
      of trouble: so just delete the walk's is_atomic arg and might_sleep().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush · 2dff99eb
      Hugh Dickins authored
      
      Now that we're playing the ALTERNATIVE game, use that more efficient
      method: instead of user-mapping an extra page, and reading an extra
      cacheline each time for x86_cr3_pcid_noflush.
      
      Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
      is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
      but the one line with $63 in looks clearer, so let's stick with that.
      
      Worried about what happens with an ALTERNATIVE between the jump and
      jump label in another ALTERNATIVE?  I was, but have checked the
      combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
      and it does a good job.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kaiser: Check boottime cmdline params · e405a064
      Borislav Petkov authored
      
      AMD (and possibly other vendors) are not affected by the leak
      KAISER is protecting against.
      
      Keep the "nopti" for traditional reasons and add pti=<on|off|auto>
      like upstream.
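      
      A minimal userspace sketch of how such boot parameters could be
      interpreted. The helper names has_param() and pti_enabled() are
      hypothetical, and the precedence shown (pti=on wins, then pti=off or
      nopti, then the vendor default for pti=auto or no option) is an
      assumption for illustration, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `param` appears as a whole
 * space-separated word in `cmdline`. */
static bool has_param(const char *cmdline, const char *param)
{
	size_t len = strlen(param);
	const char *p = cmdline;

	assert(len > 0);
	while ((p = strstr(p, param)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Hypothetical decision function; the precedence is an assumption. */
static bool pti_enabled(const char *cmdline, bool cpu_is_amd)
{
	if (has_param(cmdline, "pti=on"))
		return true;
	if (has_param(cmdline, "pti=off") || has_param(cmdline, "nopti"))
		return false;
	/* "pti=auto" or nothing given: decide from the CPU vendor. */
	return !cpu_is_amd;
}
```

      For example, pti_enabled("ro quiet", true) is false under these
      assumptions, while pti_enabled("ro pti=on", true) is true.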
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling · dea9aa9f
      Borislav Petkov authored
      
      Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: add "nokaiser" boot option, using ALTERNATIVE · e345dcc9
      Hugh Dickins authored
      
      Added "nokaiser" boot option: an early param like "noinvpcid".
      Most places now check int kaiser_enabled (#defined 0 when not
      CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
      and entry_64_compat.S are using the ALTERNATIVE technique, which
      patches in the preferred instructions at runtime.  That technique
      is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
      
      Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
      but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
      nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
      neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
      won't get set in some obscure corner, or something adds PGE into CR4.
      By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
      all page table setup which uses pfn_pte() masks it out of the ptes.
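      
      A userspace sketch of that masking idea. The bit positions follow the
      x86 page-table entry layout, but supported_pte_mask, setup_pte_mask()
      and make_pte() are hypothetical stand-ins for illustration, not the
      kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as in the x86 page-table entry layout. */
#define _PAGE_PRESENT (UINT64_C(1) << 0)
#define _PAGE_RW      (UINT64_C(1) << 1)
#define _PAGE_GLOBAL  (UINT64_C(1) << 8)

/* Hypothetical stand-in for the kernel's __supported_pte_mask. */
static uint64_t supported_pte_mask = ~UINT64_C(0);

/* Hypothetical helper: drop _PAGE_GLOBAL from the mask when kaiser is on. */
static void setup_pte_mask(int kaiser_enabled)
{
	supported_pte_mask = ~UINT64_C(0);
	if (kaiser_enabled)
		supported_pte_mask &= ~_PAGE_GLOBAL;
}

/* Build a pte value; protection bits pass through the mask, so a
 * caller asking for _PAGE_GLOBAL silently loses it under kaiser. */
static uint64_t make_pte(uint64_t pfn, uint64_t prot)
{
	assert(pfn < (UINT64_C(1) << 40)); /* stay within 52-bit physical */
	return (pfn << 12) | (prot & supported_pte_mask);
}
```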
      
      It's slightly shameful that the same declaration versus definition of
      kaiser_enabled appears in not one, not two, but in three header files
      (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
      than with #including any of those in any of the others; and did not
      feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
      them all, so we shall hear about it if they get out of synch.
      
      Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
      from kaiser.c; removed the unused native_get_normal_pgd(); removed
      the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
      comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
      the manual is clear that it does not matter whether it's 0 or 1 when
      4-level-pts are enabled, but I was distracted to find cr4 different on
      BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: fix unlikely error in alloc_ldt_struct() · 500943e5
      Hugh Dickins authored
      
      An error from kaiser_add_mapping() here is not at all likely, but
      Eric Biggers rightly points out that __free_ldt_struct() relies on
      new_ldt->size being initialized: move that up.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls · d41f46f7
      Hugh Dickins authored
      
      Synthetic filesystem mempressure testing has shown softlockups, with
      hour-long page allocation stalls, and pgd_alloc() trying for order:1
      with __GFP_REPEAT in one of the backtraces each time.
      
      That's _pgd_alloc() going for a Kaiser double-pgd, using the __GFP_REPEAT
      common to all page table allocations, but actually having no effect on
      order:0 (see should_alloc_oom() and should_continue_reclaim() in this
      tree, but beware that ports to another tree might behave differently).
      
      Order:1 stack allocation has been working satisfactorily without
      __GFP_REPEAT forever, and page table allocation only asks __GFP_REPEAT
      for awkward occasions in a long-running process: it's not appropriate
      at fork or exec time, and seems to be doing much more harm than good:
      getting those contiguous pages under very heavy mempressure can be
      hard (though even without it, Kaiser does generate more mempressure).
      
      Mask out that __GFP_REPEAT inside _pgd_alloc().  Why not take it out
      of the PGALLOC_GFP altogether, as v4.7 commit a3a9a59d ("x86: get
      rid of superfluous __GFP_REPEAT") did?  Because I think that might
      make a difference to our page table memcg charging, which I'd prefer
      not to interfere with at this time.
      
      hughd adds: __alloc_pages_slowpath() in the 4.4.89-stable tree handles
      __GFP_REPEAT a little differently than in prod kernel or 3.18.72-stable,
      so it may not always be exactly a no-op on order:0 pages, as said above;
      but I think still appropriate to omit it from Kaiser or non-Kaiser pgd.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: paranoid_entry pass cr3 need to paranoid_exit · fc8334e6
      Hugh Dickins authored
      
      Neel Natu points out that paranoid_entry() was wrong to assume that
      an entry that did not need swapgs would not need SWITCH_KERNEL_CR3:
      paranoid_entry (used for debug breakpoint, int3, double fault or MCE;
      though I think it's only the MCE case that is cause for concern here)
      can break in at an awkward time, between cr3 switch and swapgs, but
      its handling always needs kernel gs and kernel cr3.
      
      Easy to fix in itself, but paranoid_entry() also needs to convey to
      paranoid_exit() (and my reading of macro idtentry says paranoid_entry
      and paranoid_exit are always paired) how to restore the prior state.
      The swapgs state is already conveyed by %ebx (0 or 1), so extend that
      also to convey when SWITCH_USER_CR3 will be needed (2 or 3).
      
      (Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other
      way round: and a convention shared with error_entry() and error_exit(),
      which I don't want to touch.  Perhaps I should have inverted the bit
      for switch cr3 too, but did not.)
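      
      The %ebx convention described above can be modelled in plain C. This
      is only a sketch of the encoding (0..3); the macro and function names
      are hypothetical, and the real code is assembly in entry_64.S.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit 0 set: no swapgs needed on exit (0 means swapgs IS needed). */
#define PARANOID_NO_SWAPGS 1u
/* Bit 1 set: SWITCH_USER_CR3 is needed on exit. */
#define PARANOID_USER_CR3  2u

/* What paranoid_entry would leave in %ebx, given the state it found. */
static unsigned int paranoid_entry_flags(bool had_kernel_gs, bool had_user_cr3)
{
	unsigned int ebx = 0;

	if (had_kernel_gs)
		ebx |= PARANOID_NO_SWAPGS;
	if (had_user_cr3)
		ebx |= PARANOID_USER_CR3;
	return ebx;
}

/* How paranoid_exit would decode it. */
static bool exit_needs_swapgs(unsigned int ebx)
{
	assert(ebx <= 3);
	return (ebx & PARANOID_NO_SWAPGS) == 0;
}

static bool exit_needs_user_cr3(unsigned int ebx)
{
	assert(ebx <= 3);
	return (ebx & PARANOID_USER_CR3) != 0;
}
```

      The awkward case from the first paragraph (interrupted between the
      cr3 switch and swapgs) is the mixed value: kernel gs not yet loaded
      on entry but cr3 already switched, or the reverse.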
      
      paranoid_exit() would be straightforward, except for TRACE_IRQS: it
      did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG
      when not: which is it supposed to use when SWITCH_USER_CR3 is split
      apart from that?  As best as I can determine, commit 5963e317
      ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")
      missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG
      there too (the discrepancy has nothing to do with the liberal use
      of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has
      just been used in all cases); discrepancy lovingly preserved across
      several paranoid_exit() cleanups, but I'm now removing it.
      
      Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in
      paranoid_exit() is now not only unnecessary but unsafe: might corrupt
      syscall entry's unsafe_stack_register_backup of %rax.  Just use
      SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether,
      before we make the mistake of using it again.
      
      hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs
      part of the series, and ought to be moved earlier, if you decided
      to make a release of Kaiser-without-PCIDs.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user · 20268a10
      Hugh Dickins authored
      
      Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and
      X86_CR3_PCID_USER_VAR: we usually name variables in lower-case.
      
      But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)?
      Ah, it's a leftover from when kaiser_add_user_map() once complained
      about mapping the same page twice.  Make it __read_mostly instead.
      (I'm a little uneasy about all the unrelated data which shares its
      page getting user-mapped too, but that was so before, and not a big
      deal: though we call it user-mapped, it's not mapped with _PAGE_USER.)
      
      And there is a little change around the two calls to do_nmi().
      Previously they set the NOFLUSH bit (if PCID supported) when
      forcing to kernel context before do_nmi(); now they also have the
      NOFLUSH bit set (if PCID supported) when restoring context after:
      nothing done in do_nmi() should require a TLB to be flushed here.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: PCID 0 for kernel and 128 for user · 3b4ce0e1
      Hugh Dickins authored
      
      Why was 4 chosen for kernel PCID and 6 for user PCID?
      No good reason in a backport where PCIDs are only used for Kaiser.
      
      If we continue with those, then we shall need to add Andy Lutomirski's
      4.13 commit 6c690ee1 ("x86/mm: Split read_cr3() into read_cr3_pa()
      and __read_cr3()"), which deals with the problem of read_cr3() callers
      finding stray bits in the cr3 that they expected to be page-aligned;
      and for hibernation, his 4.14 commit f34902c5 ("x86/hibernate/64:
      Mask off CR3's PCID bits in the saved CR3").
      
      But if 0 is used for kernel PCID, then there's no need to add in those
      commits - whenever the kernel looks, it sees 0 in the lower bits; and
      0 for kernel seems an obvious choice.
      
      And I naughtily propose 128 for user PCID.  Because there's a place
      in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH,
      but needs to reset that to NOFLUSH for the next occasion.  Currently
      it does so with a "movb $(0x80)" into the high byte of the per-cpu
      quadword, but that will cause a machine without PCID support to crash.
      Now, if %al just happened to have 0x80 in it at that point, on a
      machine with PCID support, but 0 on a machine without PCID support...
      
      (That will go badly wrong once the pgd can be at a physical address
      above 2^56, but even with 5-level paging, physical goes up to 2^52.)
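      
      A sketch of the bit layout behind that 0x80 coincidence: the PCID
      occupies the low 12 bits of cr3 and the NOFLUSH flag is bit 63, so
      user PCID 128 puts 0x80 in the low byte while NOFLUSH is 0x80 in the
      high byte. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR3_PCID_NOFLUSH (UINT64_C(1) << 63)
#define X86_CR3_PCID_MASK    UINT64_C(0xfff)   /* low 12 bits of cr3 */
#define PCID_KERN            UINT64_C(0)
#define PCID_USER            UINT64_C(128)

/* Hypothetical helper: compose a cr3 value from pgd address and PCID. */
static uint64_t make_cr3(uint64_t pgd_pa, uint64_t pcid, int noflush)
{
	assert((pgd_pa & X86_CR3_PCID_MASK) == 0); /* pgd is page-aligned */
	assert(pcid <= X86_CR3_PCID_MASK);
	return pgd_pa | pcid | (noflush ? X86_CR3_PCID_NOFLUSH : 0);
}

/* The byte that %al would hold, and the byte the movb would target. */
static uint8_t cr3_low_byte(uint64_t cr3)  { return (uint8_t)cr3; }
static uint8_t cr3_high_byte(uint64_t cr3) { return (uint8_t)(cr3 >> 56); }
```

      With PCID 0 the kernel's cr3 is just the page-aligned pgd address,
      which is why read_cr3() callers see nothing unexpected in the low bits.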
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user · 0731188f
      Hugh Dickins authored
      
      We have many machines (Westmere, Sandybridge, Ivybridge) supporting
      PCID but not INVPCID: on these load_new_mm_cr3() simply crashed.
      
      Flushing user context inside load_new_mm_cr3() without the use of
      invpcid is difficult: momentarily switch from kernel to user context
      and back to do so?  I'm not sure whether that can be safely done at
      all, and would risk polluting user context with kernel internals,
      and kernel context with stale user externals.
      
      Instead, follow the hint in the comment that was there: change
      X86_CR3_PCID_USER_VAR to be a per-cpu variable, then load_new_mm_cr3()
      can leave a note in it, for SWITCH_USER_CR3 on return to userspace to
      flush user context TLB, instead of default X86_CR3_PCID_USER_NOFLUSH.
      
      Which works well enough that there's no need to do it this way only
      when invpcid is unsupported: it's a good alternative to invpcid here.
      But there's a couple of inlines in asm/tlbflush.h that need to do the
      same trick, so it's best to localize all this per-cpu business in
      mm/kaiser.c: moving that part of the initialization from setup_pcid()
      to kaiser_setup_pcid(); with kaiser_flush_tlb_on_return_to_user() the
      function for noting an X86_CR3_PCID_USER_FLUSH.  And let's keep a
      KAISER_SHADOW_PGD_OFFSET in there, to avoid the extra OR on exit.
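      
      The "leave a note for the exit path" scheme can be sketched in
      userspace C. Everything here is a simplified model: the per-cpu
      variable is a plain array, the sketch_* names are hypothetical, and
      the user PCID value 128 is borrowed from a later commit in the series.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH           4
#define X86_CR3_PCID_NOFLUSH     (UINT64_C(1) << 63)
#define KAISER_SHADOW_PGD_OFFSET UINT64_C(0x1000)
#define PCID_USER_SKETCH         UINT64_C(128)  /* assumed value */

/* The shadow offset is kept inside both values, so the exit path
 * needs a single OR to form the user cr3. */
#define USER_FLUSH   (PCID_USER_SKETCH | KAISER_SHADOW_PGD_OFFSET)
#define USER_NOFLUSH (USER_FLUSH | X86_CR3_PCID_NOFLUSH)

/* Stand-in for the per-cpu variable. */
static uint64_t cr3_pcid_user[NR_CPUS_SKETCH];

static void sketch_setup_pcid(unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	cr3_pcid_user[cpu] = USER_NOFLUSH;
}

/* Called where the user TLB context must be flushed: just leave a note. */
static void sketch_flush_tlb_on_return_to_user(unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	cr3_pcid_user[cpu] = USER_FLUSH;
}

/* Exit path: use the note once, then reset it to the NOFLUSH default. */
static uint64_t sketch_switch_user_cr3(unsigned int cpu, uint64_t kernel_cr3)
{
	uint64_t user_cr3;

	assert(cpu < NR_CPUS_SKETCH);
	user_cr3 = kernel_cr3 | cr3_pcid_user[cpu];
	cr3_pcid_user[cpu] = USER_NOFLUSH;
	return user_cr3;
}
```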
      
      I did try to make the feature tests in asm/tlbflush.h more consistent
      with each other: there seem to be far too many ways of performing such
      tests, and I don't have a good grasp of their differences.  At first
      I converted them all to be static_cpu_has(): but that proved to be a
      mistake, as the comment in __native_flush_tlb_single() hints; so then
      I reversed and made them all this_cpu_has().  Probably all gratuitous
      change, but that's the way it's working at present.
      
      I am slightly bothered by the way non-per-cpu X86_CR3_PCID_KERN_VAR
      gets re-initialized by each cpu (before and after these changes):
      no problem when (as usual) all cpus on a machine have the same
      features, but in principle incorrect.  However, my experiment
      to per-cpu-ify that one did not end well...
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: enhanced by kernel and user PCIDs · eb82151d
      Dave Hansen authored
      
      Merged performance improvements to Kaiser, using distinct kernel
      and user Process Context Identifiers to minimize the TLB flushing.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: vmstat show NR_KAISERTABLE as nr_overhead · 3e3d38fd
      Hugh Dickins authored
      
      The kaiser update made an interesting choice, never to free any shadow
      page tables.  Contention on global spinlock was worrying, particularly
      with it held across page table scans when freeing.  Something had to be
      done: I was going to add refcounting; but simply never to free them is
      an appealing choice, minimizing contention without complicating the code
      (the more a page table is found already, the less the spinlock is used).
      
      But leaking pages in this way is also a worry: can we get away with it?
      At the very least, we need a count to show how bad it actually gets:
      in principle, one might end up wasting about 1/256 of memory that way
      (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
      for when they are user-mapped from the vmalloc area on another occasion;
      but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed).
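      
      The 1/512 figure follows from one 4096-byte page table holding 512
      eight-byte entries, each mapping one 4096-byte page. A small sketch
      of that arithmetic, with hypothetical helper names, ignoring the
      upper-level tables and assuming no table is ever shared or freed:

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096u
#define PTE_SIZE_BYTES  8u

/* Pages mapped by one last-level page table page. */
static unsigned int pages_mapped_per_table(void)
{
	return PAGE_SIZE_BYTES / PTE_SIZE_BYTES;
}

/* Worst-case count of leaked last-level table pages when each of
 * nr_pages pages ends up user-mapped `mappings` separate times
 * (e.g. once direct-mapped, once from the vmalloc area). */
static unsigned long worst_case_table_pages(unsigned long nr_pages,
					    unsigned int mappings)
{
	unsigned long per_table = pages_mapped_per_table();
	unsigned long per_mapping;

	assert(mappings >= 1);
	per_mapping = (nr_pages + per_table - 1) / per_table;
	return per_mapping * mappings;
}
```

      For 4 GiB of memory (1048576 pages) and two mappings, this gives
      4096 table pages, which is 1/256 of the total.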
      
      Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
      shared pgd entries, and 1 for each intermediate page table added
      thereafter for user-mapping - but leave out the 1 per mm, for its
      shadow pgd, because that distracts from the monotonic increase.
      Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).
      
      In practice, it doesn't look so bad so far: more like 1/12000 after
      nine hours of gtests below; and movable pageblock segregation should
      tend to cluster the kaiser tables into a subset of the address space
      (if not, they will be bad for compaction too).  But production may
      tell a different story: keep an eye on this number, and bring back
      lighter freeing if it gets out of control (maybe a shrinker).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: delete KAISER_REAL_SWITCH option · b9d2ccc5
      Hugh Dickins authored
      
      We fail to see what CONFIG_KAISER_REAL_SWITCH is for: it seems to be
      left over from early development, and now just obscures tricky parts
      of the code.  Delete it before adding PCIDs, or nokaiser boot option.
      
      (Or if there is some good reason to keep the option, then it needs
      a help text - and a "depends on KAISER", so that all those without
      KAISER are not asked the question.)
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET · aeda21d7
      Hugh Dickins authored
      
      There's a 0x1000 in various places, which looks better with a name.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: cleanups while trying for gold link · c52e55a2
      Hugh Dickins authored
      
      While trying to get our gold link to work, four cleanups:
      matched the gdt_page declaration to its definition;
      in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes;
      lined up the backslashes according to convention in percpu-defs.h;
      deleted the unused irq_stack_pointer addition to irq_stack_union.
      
      Sad to report that aligning backslashes does not appear to help gold
      align to 8192: but while these did not help, they are worth keeping.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: kaiser_remove_mapping() move along the pgd · f127705d
      Hugh Dickins authored
      
      When removing the bogus comment from kaiser_remove_mapping(),
      I really ought to have checked the extent of its bogosity: as
      Neel points out, there is nothing to stop unmap_pud_range_nofree()
      from continuing beyond the end of a pud (and starting in the wrong
      position on the next).
      
      Fix kaiser_remove_mapping() to constrain the extent and advance pgd
      pointer correctly: use pgd_addr_end() macro as used throughout base
      mm (but don't assume page-rounded start and size in this case).
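      
      A userspace sketch of the pgd_addr_end() pattern for walking a range
      one pgd entry at a time. The boundary computation mirrors the usual
      kernel macro for 4-level x86_64 paging (PGDIR_SHIFT 39);
      count_pgd_chunks() is a hypothetical driver, not the actual
      kaiser_remove_mapping() code.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 39
#define PGDIR_SIZE  (UINT64_C(1) << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* End of the current pgd extent, or `end` if that comes first.
 * The "- 1" comparisons keep this correct if `end` wraps to 0. */
static uint64_t pgd_addr_end(uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Hypothetical driver: count how many pgd entries a range touches.
 * start and size need not be page-rounded. */
static unsigned int count_pgd_chunks(uint64_t start, uint64_t size)
{
	uint64_t addr = start, end = start + size, next;
	unsigned int chunks = 0;

	assert(size > 0);
	do {
		next = pgd_addr_end(addr, end);
		/* A real walker would unmap [addr, next) under one pgd
		 * entry here, then advance its pgd pointer by one. */
		chunks++;
		addr = next;
	} while (addr != end);
	return chunks;
}
```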
      
      But this bug was very unlikely to trigger in this backport: since
      any buddy allocation is contained within a single pud extent, and
      we are not using vmapped stacks (and are only mapping one page of
      stack anyway): the only way to hit this bug here would be when
      freeing a large modified ldt.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: tidied up kaiser_add/remove_mapping slightly · 0c68228f
      Hugh Dickins authored
      
      Yes, unmap_pud_range_nofree()'s declaration ought to be in a
      header file really, but I'm not sure we want to use it anyway:
      so for now just declare it inside kaiser_remove_mapping().
      And there doesn't seem to be such a thing as unmap_p4d_range(),
      even in a 5-level paging tree.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: tidied up asm/kaiser.h somewhat · 5fbd46c4
      Hugh Dickins authored
      
      Mainly deleting a surfeit of blank lines, and reflowing header comment.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: ENOMEM if kaiser_pagetable_walk() NULL · 407c3ff6
      Hugh Dickins authored
      
      kaiser_add_user_map() took no notice when kaiser_pagetable_walk() failed.
      And avoid its might_sleep() when atomic (though atomic at present unused).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: fix perf crashes · 20cbe9a3
      Hugh Dickins authored
      
      Avoid perf crashes: place debug_store in the user-mapped per-cpu area
      instead of allocating, and use page allocator plus kaiser_add_mapping()
      to keep the BTS and PEBS buffers user-mapped (that is, present in the
      user mapping, though visible only to kernel and hardware).  The PEBS
      fixup buffer does not need this treatment.
      
      The need for a user-mapped struct debug_store showed up before doing
      any conscious perf testing: in a couple of kernel paging oopses on
      Westmere, implicating the debug_store offset of the per-cpu area.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER · 487f0b73
      Hugh Dickins authored
      
      pjt has observed that nmi's second (nmi_from_kernel) call to do_nmi()
      adjusted the %rdi regs arg, rightly when CONFIG_KAISER, but wrongly
      when not CONFIG_KAISER.
      
      Although the minimal change is to add an #ifdef CONFIG_KAISER around
      the addq line, that looks cluttered, and I prefer how the first call
      to do_nmi() handled it: prepare args in %rdi and %rsi before getting
      into the CONFIG_KAISER block, since it does not touch them at all.
      
      And while we're here, place the "#ifdef CONFIG_KAISER" that follows
      each, to enclose the "Unconditionally restore CR3" comment: matching
      how the "Unconditionally use kernel CR3" comment above is enclosed.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: KAISER depends on SMP · d94df201
      Hugh Dickins authored
      
      It is absurd that KAISER should depend on SMP, but apparently nobody
      has tried a UP build before: which breaks on implicit declaration of
      function 'per_cpu_offset' in arch/x86/mm/kaiser.c.
      
      Now, you would expect that to be trivially fixed up; but looking at
      the System.map when that block is #ifdef'ed out of kaiser_init(),
      I see that in a UP build __per_cpu_user_mapped_end is precisely at
      __per_cpu_user_mapped_start, and the items carefully gathered into
      that section for user-mapping on SMP, dispersed elsewhere on UP.
      
      So, some other kind of section assignment will be needed on UP,
      but implementing that is not a priority: just make KAISER depend
      on SMP for now.
      
      Also inserted a blank line before the option, tidied up the
      brief Kconfig help message, and added an "If unsure, Y".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: fix build and FIXME in alloc_ldt_struct() · 9b94cf97
      Hugh Dickins authored
      
      Include linux/kaiser.h instead of asm/kaiser.h to build ldt.c without
      CONFIG_KAISER.  kaiser_add_mapping() does already return an error code,
      so fix the FIXME.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE · 003e4767
      Hugh Dickins authored
      
      Kaiser only needs to map one page of the stack; and
      kernel/fork.c did not build on powerpc (no __PAGE_KERNEL).
      It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack()
      and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's
      kaiser_add_mapping() and kaiser_remove_mapping().  And use
      linux/kaiser.h in init/main.c to avoid the #ifdefs there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: do not set _PAGE_NX on pgd_none · edde7320
      Hugh Dickins authored
      
      native_pgd_clear() uses native_set_pgd(), so native_set_pgd() must
      avoid setting the _PAGE_NX bit on an otherwise pgd_none() entry:
      usually that just generated a warning on exit, but sometimes
      more mysterious and damaging failures (our production machines
      could not complete booting).
      
      The original fix to this just avoided adding _PAGE_NX to
      an empty entry; but eventually more problems surfaced with kexec,
      and EFI mapping expected to be a problem too.  So now instead
      change native_set_pgd() to update shadow only if _PAGE_USER:
      
      A few places (kernel/machine_kexec_64.c, platform/efi/efi_64.c for sure)
      use set_pgd() to set up a temporary internal virtual address space, with
      physical pages remapped at what Kaiser regards as userspace addresses:
      Kaiser then assumes a shadow pgd follows, which it will try to corrupt.
      
      This appears to be responsible for the recent kexec and kdump failures;
      though it's unclear how those did not manifest as a problem before.
      Ah, the shadow pgd will only be assumed to "follow" if the requested
      pgd is on an even-numbered page: so I suppose it was going wrong 50%
      of the time all along.
      
      What we need is a flag to set_pgd(), to tell it we're dealing with
      userspace.  Er, isn't that what the pgd's _PAGE_USER bit is saying?
      Add a test for that.  But we cannot do the same for pgd_clear()
      (which may be called to clear corrupted entries - set aside the
      question of "corrupt in which pgd?" until later), so there just
      rely on pgd_clear() not being called in the problematic cases -
      with a WARN_ON_ONCE() which should fire half the time if it is.
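      
      A sketch of the two checks discussed above, assuming the shadow pgd
      entry is located by setting the 0x1000 bit in the address of the
      normal pgd entry (the offset that a later commit names
      KAISER_SHADOW_PGD_OFFSET). The function names are hypothetical and
      the addresses are plain integers rather than kernel pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH         UINT64_C(0x1000)
#define KAISER_SHADOW_PGD_OFFSET PAGE_SIZE_SKETCH
#define _PAGE_USER               (UINT64_C(1) << 2)

/* Address of the shadow entry paired with a normal pgd entry. */
static uint64_t shadow_pgd_addr(uint64_t pgd_entry_addr)
{
	assert((pgd_entry_addr & 7) == 0); /* pgd entries are 8 bytes */
	return pgd_entry_addr | KAISER_SHADOW_PGD_OFFSET;
}

/* Only a pgd on an even-numbered page has a distinct page "following"
 * it at that offset; hence the 50% figure for temporary pgds. */
static bool shadow_would_follow(uint64_t pgd_entry_addr)
{
	return (pgd_entry_addr & PAGE_SIZE_SKETCH) == 0;
}

/* The test added to set_pgd(): touch the shadow only for user entries. */
static bool should_update_shadow(uint64_t pgd_val)
{
	return (pgd_val & _PAGE_USER) != 0;
}
```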
      
      But this is getting too big for an inline function: move it into
      arch/x86/mm/kaiser.c (which then demands a boot/compressed mod);
      and de-void and de-space native_get_shadow/normal_pgd() while here.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: merged update · bed9bb7f
      Dave Hansen authored
      
      Merged fixes and cleanups, rebased to 4.4.89 tree (no 5-level paging).
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bed9bb7f
    • KAISER: Kernel Address Isolation · 8a43ddfb
      Richard Fellner authored
      
      This patch introduces our implementation of KAISER (Kernel Address Isolation to
      have Side-channels Efficiently Removed), a kernel isolation technique to close
      hardware side channels on kernel address information.
      
      More information about the patch can be found on:
      
              https://github.com/IAIK/KAISER
      
      From: Richard Fellner <richard.fellner@student.tugraz.at>
      From: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
      X-Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
      Date: Thu, 4 May 2017 14:26:50 +0200
      Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
      Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5
      
      To: <linux-kernel@vger.kernel.org>
      To: <kernel-hardening@lists.openwall.com>
      Cc: <clementine.maurice@iaik.tugraz.at>
      Cc: <moritz.lipp@iaik.tugraz.at>
      Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
      Cc: Richard Fellner <richard.fellner@student.tugraz.at>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <kirill.shutemov@linux.intel.com>
      Cc: <anders.fogh@gdata-adan.de>
      
      After several recent works [1,2,3] KASLR on x86_64 was basically
      considered dead by many researchers. We have been working on an
      efficient but effective fix for this problem and found that not mapping
      the kernel space when running in user mode is the solution to this
      problem [4] (the corresponding paper [5] will be presented at ESSoS17).
      
      With this RFC patch we allow anybody to configure their kernel with the
      flag CONFIG_KAISER to add our defense mechanism.
      
      If there are any questions we would love to answer them.
      We also appreciate any comments!
      
      Cheers,
      Daniel (+ the KAISER team from Graz University of Technology)
      
      [1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
      [2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
      [3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
      [4] https://github.com/IAIK/KAISER
      [5] https://gruss.cc/files/kaiser.pdf
      
      [patch based also on
      https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch]
      Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
      Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
      Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
      Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a43ddfb
    • x86/boot: Add early cmdline parsing for options with arguments · 0fa147b4
      Tom Lendacky authored
      commit e505371d upstream.
      
      Add a cmdline_find_option() function to look for cmdline options that
      take arguments. The argument is returned in a supplied buffer and the
      argument length (regardless of whether it fits in the supplied buffer)
      is returned, with -1 indicating not found.
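
The return convention described above (full argument length even when the buffer is too small, -1 when the option is absent) can be sketched in user-space C. This is a simplified illustration, not the kernel's cmdline_find_option(): the name find_option_sketch is hypothetical, options are assumed to have the form name=value separated by whitespace, and the first match is assumed to win.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: is this character a word separator? */
static int is_sep(char c) { return c == ' ' || c == '\t' || c == '\n'; }

/*
 * Look for "option=value" in cmdline. Copy as much of value as fits
 * into buf (always NUL-terminated when bufsize > 0) and return the
 * full length of value, or -1 if the option is not present.
 */
static int find_option_sketch(const char *cmdline, const char *option,
                              char *buf, int bufsize)
{
    assert(cmdline != NULL && option != NULL);
    assert(bufsize <= 0 || buf != NULL);

    size_t optlen = strlen(option);
    const char *p = cmdline;

    while (*p) {
        while (is_sep(*p))              /* skip separators */
            p++;
        if (!*p)
            break;
        const char *word = p;           /* start of the current word */
        while (*p && !is_sep(*p))
            p++;
        size_t wlen = (size_t)(p - word);

        /* The word must be exactly "option" followed by '='. */
        if (wlen > optlen && strncmp(word, option, optlen) == 0 &&
            word[optlen] == '=') {
            const char *val = word + optlen + 1;
            size_t vlen = wlen - optlen - 1;
            if (bufsize > 0) {
                size_t n = vlen < (size_t)bufsize - 1 ? vlen
                                                      : (size_t)bufsize - 1;
                memcpy(buf, val, n);
                buf[n] = '\0';
            }
            return (int)vlen;           /* full length, even if truncated */
        }
    }
    return -1;                          /* option not found */
}
```

A caller can compare the return value with its buffer size to detect truncation, which is the point of returning the length regardless of whether the argument fit.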
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kvm@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0fa147b4
  2. 02 Jan, 2018 10 commits