1. 10 Jan, 2018 20 commits
  2. 05 Jan, 2018 20 commits
    • Linux 4.9.75 · 9f747558
      Greg Kroah-Hartman authored
    • kaiser: Set _PAGE_NX only if supported · 92fd81f7
      Guenter Roeck authored
      This resolves a crash if loaded under qemu + haxm under windows.
      See https://www.spinics.net/lists/kernel/msg2689835.html for details.
      Here is a boot log (the log is from chromeos-4.4, but Tao Wu says that
      the same log is also seen with vanilla v4.4.110-rc1).
      
      [    0.712750] Freeing unused kernel memory: 552K
      [    0.721821] init: Corrupted page table at address 57b029b332e0
      [    0.722761] PGD 80000000bb238067 PUD bc36a067 PMD bc369067 PTE 45d2067
      [    0.722761] Bad pagetable: 000b [#1] PREEMPT SMP 
      [    0.722761] Modules linked in:
      [    0.722761] CPU: 1 PID: 1 Comm: init Not tainted 4.4.96 #31
      [    0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
      [    0.722761] task: ffff8800bc290000 ti: ffff8800bc28c000 task.ti: ffff8800bc28c000
      [    0.722761] RIP: 0010:[<ffffffff83f4129e>]  [<ffffffff83f4129e>] __clear_user+0x42/0x67
      [    0.722761] RSP: 0000:ffff8800bc28fcf8  EFLAGS: 00010202
      [    0.722761] RAX: 0000000000000000 RBX: 00000000000001a4 RCX: 00000000000001a4
      [    0.722761] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000057b029b332e0
      [    0.722761] RBP: ffff8800bc28fd08 R08: ffff8800bc290000 R09: ffff8800bb2f4000
      [    0.722761] R10: ffff8800bc290000 R11: ffff8800bb2f4000 R12: 000057b029b332e0
      [    0.722761] R13: 0000000000000000 R14: 000057b029b33340 R15: ffff8800bb1e2a00
      [    0.722761] FS:  0000000000000000(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
      [    0.722761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [    0.722761] CR2: 000057b029b332e0 CR3: 00000000bb2f8000 CR4: 00000000000006e0
      [    0.722761] Stack:
      [    0.722761]  000057b029b332e0 ffff8800bb95fa80 ffff8800bc28fd18 ffffffff83f4120c
      [    0.722761]  ffff8800bc28fe18 ffffffff83e9e7a1 ffff8800bc28fd68 0000000000000000
      [    0.722761]  ffff8800bc290000 ffff8800bc290000 ffff8800bc290000 ffff8800bc290000
      [    0.722761] Call Trace:
      [    0.722761]  [<ffffffff83f4120c>] clear_user+0x2e/0x30
      [    0.722761]  [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
      [    0.722761]  [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
      [    0.722761]  [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
      [    0.722761]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
      [    0.722761]  [<ffffffff83de40be>] do_execve+0x23/0x25
      [    0.722761]  [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
      [    0.722761]  [<ffffffff844fec4d>] kernel_init+0x6d/0xda
      [    0.722761]  [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
      [    0.722761]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
      [    0.722761] Code: 86 84 be 12 00 00 00 e8 87 0d e8 ff 66 66 90 48 89 d8 48 c1
      eb 03 4c 89 e7 83 e0 07 48 89 d9 be 08 00 00 00 31 d2 48 85 c9 74 0a <48> 89 17
      48 01 f7 ff c9 75 f6 48 89 c1 85 c9 74 09 88 17 48 ff 
      [    0.722761] RIP  [<ffffffff83f4129e>] __clear_user+0x42/0x67
      [    0.722761]  RSP <ffff8800bc28fcf8>
      [    0.722761] ---[ end trace def703879b4ff090 ]---
      [    0.722761] BUG: sleeping function called from invalid context at /mnt/host/source/src/third_party/kernel/v4.4/kernel/locking/rwsem.c:21
      [    0.722761] in_atomic(): 0, irqs_disabled(): 1, pid: 1, name: init
      [    0.722761] CPU: 1 PID: 1 Comm: init Tainted: G      D         4.4.96 #31
      [    0.722761] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
      [    0.722761]  0000000000000086 dcb5d76098c89836 ffff8800bc28fa30 ffffffff83f34004
      [    0.722761]  ffffffff84839dc2 0000000000000015 ffff8800bc28fa40 ffffffff83d57dc9
      [    0.722761]  ffff8800bc28fa68 ffffffff83d57e6a ffffffff84a53640 0000000000000000
      [    0.722761] Call Trace:
      [    0.722761]  [<ffffffff83f34004>] dump_stack+0x4d/0x63
      [    0.722761]  [<ffffffff83d57dc9>] ___might_sleep+0x13a/0x13c
      [    0.722761]  [<ffffffff83d57e6a>] __might_sleep+0x9f/0xa6
      [    0.722761]  [<ffffffff84502788>] down_read+0x20/0x31
      [    0.722761]  [<ffffffff83cc5d9b>] __blocking_notifier_call_chain+0x35/0x63
      [    0.722761]  [<ffffffff83cc5ddd>] blocking_notifier_call_chain+0x14/0x16
      [    0.800374] usb 1-1: new full-speed USB device number 2 using uhci_hcd
      [    0.722761]  [<ffffffff83cefe97>] profile_task_exit+0x1a/0x1c
      [    0.802309]  [<ffffffff83cac84e>] do_exit+0x39/0xe7f
      [    0.802309]  [<ffffffff83ce5938>] ? vprintk_default+0x1d/0x1f
      [    0.802309]  [<ffffffff83d7bb95>] ? printk+0x57/0x73
      [    0.802309]  [<ffffffff83c46e25>] oops_end+0x80/0x85
      [    0.802309]  [<ffffffff83c7b747>] pgtable_bad+0x8a/0x95
      [    0.802309]  [<ffffffff83ca7f4a>] __do_page_fault+0x8c/0x352
      [    0.802309]  [<ffffffff83eefba5>] ? file_has_perm+0xc4/0xe5
      [    0.802309]  [<ffffffff83ca821c>] do_page_fault+0xc/0xe
      [    0.802309]  [<ffffffff84507682>] page_fault+0x22/0x30
      [    0.802309]  [<ffffffff83f4129e>] ? __clear_user+0x42/0x67
      [    0.802309]  [<ffffffff83f4127f>] ? __clear_user+0x23/0x67
      [    0.802309]  [<ffffffff83f4120c>] clear_user+0x2e/0x30
      [    0.802309]  [<ffffffff83e9e7a1>] load_elf_binary+0xa7f/0x18f7
      [    0.802309]  [<ffffffff83de2088>] search_binary_handler+0x86/0x19c
      [    0.802309]  [<ffffffff83de389e>] do_execveat_common.isra.26+0x909/0xf98
      [    0.802309]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
      [    0.802309]  [<ffffffff83de40be>] do_execve+0x23/0x25
      [    0.802309]  [<ffffffff83c002e3>] run_init_process+0x2b/0x2d
      [    0.802309]  [<ffffffff844fec4d>] kernel_init+0x6d/0xda
      [    0.802309]  [<ffffffff84505b2f>] ret_from_fork+0x3f/0x70
      [    0.802309]  [<ffffffff844febe0>] ? rest_init+0x87/0x87
      [    0.830559] Kernel panic - not syncing: Attempted to kill init!  exitcode=0x00000009
      [    0.830559] 
      [    0.831305] Kernel Offset: 0x2c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [    0.831305] ---[ end Kernel panic - not syncing: Attempted to kill init!  exitcode=0x00000009
      
      The crash part of this problem may be solved with the following patch
      (thanks to Hugh for the hint). There is still another problem, though -
      with this patch applied, the qemu session aborts with "VCPU Shutdown
      request", whatever that means.
      
      Cc: lepton <ytht.net@gmail.com>
      Signed-off-by: Guenter Roeck <groeck@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KPTI: Report when enabled · ea6cd39d
      Kees Cook authored
      Make sure dmesg reports when KPTI is enabled.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KPTI: Rename to PAGE_TABLE_ISOLATION · e71fac01
      Kees Cook authored
      This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kaiser: Move feature detection up · 59094faf
      Borislav Petkov authored
      
      ... before the first use of kaiser_enabled as otherwise funky
      things happen:
      
        about to get started...
        (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000]
        (XEN) Pagetable walk from ffff88022a449090:
        (XEN)  L4[0x110] = 0000000229e0e067 0000000000001e0e
        (XEN)  L3[0x008] = 0000000000000000 ffffffffffffffff
        (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08
        entry.o#create_bounce_frame+0x135/0x14d
        (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
        (XEN) ----[ Xen-4.9.1_02-3.21  x86_64  debug=n   Not tainted ]----
        (XEN) CPU:    0
        (XEN) RIP:    e033:[<ffffffff81007460>]
        (XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: disabled on Xen PV · 402e63de
      Jiri Kosina authored
      
      Kaiser cannot be used on paravirtualized MMUs (namely reading and writing CR3).
      This does not work with KAISER as the CR3 switch from and to user space PGD
      would require to map the whole XEN_PV machinery into both.
      
      More importantly, enabling KAISER on Xen PV doesn't make too much sense, as PV
      guests use distinct %cr3 values for kernel and user already.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kaiser: Reenable PARAVIRT · 2c272175
      Borislav Petkov authored
      
      Now that the required bits have been addressed, reenable
      PARAVIRT.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/paravirt: Dont patch flush_tlb_single · 1817d2c2
      Thomas Gleixner authored
      
      commit a0357954 upstream
      
      native_flush_tlb_single() will be changed with the upcoming
      PAGE_TABLE_ISOLATION feature. This requires to have more code in
      there than INVLPG.
      
      Remove the paravirt patching for it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Cc: michael.schwarz@iaik.tugraz.at
      Cc: moritz.lipp@iaik.tugraz.at
      Cc: richard.fellner@student.tugraz.at
      Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: kaiser_flush_tlb_on_return_to_user() check PCID · fe5cb75f
      Hugh Dickins authored
      
      Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
      check, instead of each caller doing it inline first: nobody needs
      to optimize for the noPCID case, it's clearer this way, and better
      suits later changes.  Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
      by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: asm/tlbflush.h handle noPGE at lower level · b72c26e9
      Hugh Dickins authored
      
      I found asm/tlbflush.h too twisty, and think it safer not to avoid
      __native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
      but instead let it handle kaiser_enabled along with cr3: it can just
      use __native_flush_tlb() for that, no harm in re-disabling preemption.
      
      (This is not the same change as Kirill and Dave have suggested for
      upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
      check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)
      
      Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
      preference from __native_flush_tlb(): unlike the invpcid_flush_all()
      preference in __native_flush_tlb_global(), it's not seen in upstream
      4.14, and was recently reported to be surprisingly slow.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: drop is_atomic arg to kaiser_pagetable_walk() · 8c2f8a5c
      Hugh Dickins authored
      
      I have not observed a might_sleep() warning from setup_fixmap_gdt()'s
      use of kaiser_add_mapping() in our tree (why not?), but like upstream
      we have not provided a way for that to pass is_atomic true down to
      kaiser_pagetable_walk(), and at startup it's far from a likely source
      of trouble: so just delete the walk's is_atomic arg and might_sleep().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush · 169b369f
      Hugh Dickins authored
      
      Now that we're playing the ALTERNATIVE game, use that more efficient
      method: instead of user-mapping an extra page, and reading an extra
      cacheline each time for x86_cr3_pcid_noflush.
      
      Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
      is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
      but the one line with $63 in looks clearer, so let's stick with that.
      
      Worried about what happens with an ALTERNATIVE between the jump and
      jump label in another ALTERNATIVE?  I was, but have checked the
      combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
      and it does a good job.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kaiser: Check boottime cmdline params · 8018307a
      Borislav Petkov authored
      
      AMD (and possibly other vendors) are not affected by the leak
      KAISER is protecting against.
      
      Keep the "nopti" for traditional reasons and add pti=<on|off|auto>
      like upstream.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
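The boot options named in the commit above ("nopti" and pti=<on|off|auto>) can be illustrated with a small userspace sketch. This is not the kernel's parser: the enum, the function names, and the rule that "auto" means "enable unless the CPU vendor is unaffected" are assumptions made for illustration, loosely following the remark that AMD is not affected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; the kernel's own parsing code differs. */
enum pti_mode { PTI_AUTO, PTI_FORCE_ON, PTI_FORCE_OFF };

/* Scan a space-separated command line for "nopti" or "pti=<on|off|auto>".
 * Simplification (assumption): the last matching token wins. */
enum pti_mode parse_pti_cmdline(const char *cmdline)
{
    enum pti_mode mode = PTI_AUTO;
    const char *p = cmdline;

    assert(cmdline != NULL);
    while (*p) {
        while (*p == ' ')
            p++;                        /* skip separators */
        size_t len = strcspn(p, " ");   /* length of the current token */
        if (len == 0)
            break;
        if (len == 5 && strncmp(p, "nopti", 5) == 0)
            mode = PTI_FORCE_OFF;
        else if (len == 6 && strncmp(p, "pti=on", 6) == 0)
            mode = PTI_FORCE_ON;
        else if (len == 7 && strncmp(p, "pti=off", 7) == 0)
            mode = PTI_FORCE_OFF;
        else if (len == 8 && strncmp(p, "pti=auto", 8) == 0)
            mode = PTI_AUTO;
        p += len;
    }
    return mode;
}

/* Assumed policy: "auto" enables isolation unless the CPU is unaffected. */
bool pti_should_enable(enum pti_mode mode, bool cpu_unaffected)
{
    if (mode == PTI_FORCE_ON)
        return true;
    if (mode == PTI_FORCE_OFF)
        return false;
    return !cpu_unaffected;
}
```

Matching on the exact token length keeps a string such as "nopti2" from being mistaken for "nopti".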
    • x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling · 50624dd1
      Borislav Petkov authored
      
      Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: add "nokaiser" boot option, using ALTERNATIVE · 23e09439
      Hugh Dickins authored
      
      Added "nokaiser" boot option: an early param like "noinvpcid".
      Most places now check int kaiser_enabled (#defined 0 when not
      CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
      and entry_64_compat.S are using the ALTERNATIVE technique, which
      patches in the preferred instructions at runtime.  That technique
      is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated.
      
      Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
      but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
      nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
      neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
      won't get set in some obscure corner, or something add PGE into CR4.
      By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
      all page table setup which uses pte_pfn() masks it out of the ptes.
      
      It's slightly shameful that the same declaration versus definition of
      kaiser_enabled appears in not one, not two, but in three header files
      (asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
      than with #including any of those in any of the others; and did not
      feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
      them all, so we shall hear about it if they get out of synch.
      
      Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
      from kaiser.c; removed the unused native_get_normal_pgd(); removed
      the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
      comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
      the manual is clear that it does not matter whether it's 0 or 1 when
      4-level-pts are enabled, but I was distracted to find cr4 different on
      BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
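The idea of omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled can be sketched in userspace C. The function names and the simplified PTE layout are assumptions for illustration; only the bit position of the global flag (bit 8 on x86) is taken as fact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12
#define PAGE_PRESENT_BIT  (1ULL << 0)
#define PAGE_GLOBAL_BIT   (1ULL << 8)   /* x86 global-page flag */

/* Hypothetical helper: build the mask of PTE bits the kernel may set.
 * With kaiser enabled, the global bit is dropped from the mask. */
uint64_t supported_pte_mask_sketch(bool kaiser_enabled)
{
    uint64_t mask = ~0ULL;

    if (kaiser_enabled)
        mask &= ~PAGE_GLOBAL_BIT;
    return mask;
}

/* Hypothetical helper: compose a PTE from a frame number and protection
 * bits, filtering the result through the supported mask. */
uint64_t make_pte_sketch(uint64_t pfn, uint64_t prot, uint64_t mask)
{
    assert(pfn < (1ULL << 40));         /* keep the shift in range */
    return ((pfn << PAGE_SHIFT_SKETCH) | prot) & mask;
}
```

Because every PTE passes through the mask, a caller that still requests the global bit gets a non-global entry while kaiser is enabled, which is the safety net the commit message describes.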
    • kaiser: fix unlikely error in alloc_ldt_struct() · cb7d8d7e
      Hugh Dickins authored
      
      An error from kaiser_add_mapping() here is not at all likely, but
      Eric Biggers rightly points out that __free_ldt_struct() relies on
      new_ldt->size being initialized: move that up.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: kaiser_remove_mapping() move along the pgd · 3df14617
      Hugh Dickins authored
      
      When removing the bogus comment from kaiser_remove_mapping(),
      I really ought to have checked the extent of its bogosity: as
      Neel points out, there is nothing to stop unmap_pud_range_nofree()
      from continuing beyond the end of a pud (and starting in the wrong
      position on the next).
      
      Fix kaiser_remove_mapping() to constrain the extent and advance pgd
      pointer correctly: use pgd_addr_end() macro as used throughout base
      mm (but don't assume page-rounded start and size in this case).
      
      But this bug was very unlikely to trigger in this backport: since
      any buddy allocation is contained within a single pud extent, and
      we are not using vmapped stacks (and are only mapping one page of
      stack anyway): the only way to hit this bug here would be when
      freeing a large modified ldt.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
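The range walk described above, where each step is clamped to the end of the current pgd entry, can be sketched as follows. The helper mirrors the shape of the kernel's pgd_addr_end() macro, but this is a userspace illustration: the `_sketch` names are hypothetical and a 4-level layout with a 39-bit pgd shift is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 4-level layout: each pgd entry covers 2^39 bytes. */
#define PGDIR_SHIFT_SKETCH 39
#define PGDIR_SIZE_SKETCH  (1ULL << PGDIR_SHIFT_SKETCH)
#define PGDIR_MASK_SKETCH  (~(PGDIR_SIZE_SKETCH - 1))

/* Return the end of the pgd entry containing addr, or end if that comes
 * first. The "- 1" comparisons keep the result right if boundary wraps. */
uint64_t pgd_addr_end_sketch(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + PGDIR_SIZE_SKETCH) & PGDIR_MASK_SKETCH;

    assert(addr < end);
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count how many per-pgd chunks a [start, start + size) range splits into.
 * start and size need not be page-rounded. */
unsigned int count_pgd_chunks(uint64_t start, uint64_t size)
{
    uint64_t addr = start, end = start + size, next;
    unsigned int chunks = 0;

    for (; addr < end; addr = next) {
        next = pgd_addr_end_sketch(addr, end);
        chunks++;                       /* real code would unmap [addr, next) */
    }
    return chunks;
}
```

A range that straddles a pgd boundary is processed in two steps, so the per-entry unmap never runs past the entry it started in.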
    • kaiser: paranoid_entry pass cr3 need to paranoid_exit · 05ddad14
      Hugh Dickins authored
      
      Neel Natu points out that paranoid_entry() was wrong to assume that
      an entry that did not need swapgs would not need SWITCH_KERNEL_CR3:
      paranoid_entry (used for debug breakpoint, int3, double fault or MCE;
      though I think it's only the MCE case that is cause for concern here)
      can break in at an awkward time, between cr3 switch and swapgs, but
      its handling always needs kernel gs and kernel cr3.
      
      Easy to fix in itself, but paranoid_entry() also needs to convey to
      paranoid_exit() (and my reading of macro idtentry says paranoid_entry
      and paranoid_exit are always paired) how to restore the prior state.
      The swapgs state is already conveyed by %ebx (0 or 1), so extend that
      also to convey when SWITCH_USER_CR3 will be needed (2 or 3).
      
      (Yes, I'd much prefer that 0 meant no swapgs, whereas it's the other
      way round: and a convention shared with error_entry() and error_exit(),
      which I don't want to touch.  Perhaps I should have inverted the bit
      for switch cr3 too, but did not.)
      
      paranoid_exit() would be straightforward, except for TRACE_IRQS: it
      did TRACE_IRQS_IRETQ when doing swapgs, but TRACE_IRQS_IRETQ_DEBUG
      when not: which is it supposed to use when SWITCH_USER_CR3 is split
      apart from that?  As best as I can determine, commit 5963e317
      ("ftrace/x86: Do not change stacks in DEBUG when calling lockdep")
      missed the swapgs case, and should have used TRACE_IRQS_IRETQ_DEBUG
      there too (the discrepancy has nothing to do with the liberal use
      of _NO_STACK and _UNSAFE_STACK hereabouts: TRACE_IRQS_OFF_DEBUG has
      just been used in all cases); discrepancy lovingly preserved across
      several paranoid_exit() cleanups, but I'm now removing it.
      
      Neel further indicates that to use SWITCH_USER_CR3_NO_STACK there in
      paranoid_exit() is now not only unnecessary but unsafe: might corrupt
      syscall entry's unsafe_stack_register_backup of %rax.  Just use
      SWITCH_USER_CR3: and delete SWITCH_USER_CR3_NO_STACK altogether,
      before we make the mistake of using it again.
      
      hughd adds: this commit fixes an issue in the Kaiser-without-PCIDs
      part of the series, and ought to be moved earlier, if you decided
      to make a release of Kaiser-without-PCIDs.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
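The %ebx convention described above can be decoded with a few lines of C. This is a reading of the commit message, not the assembly itself: bit 0 set is taken to mean "no swapgs needed" (the inverted sense the author mentions) and bit 1 set to mean "SWITCH_USER_CR3 needed". The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* What paranoid_exit would have to do, per the message's description. */
struct paranoid_exit_actions {
    bool do_swapgs;
    bool do_switch_user_cr3;
};

/* Decode the value paranoid_entry leaves in %ebx (0..3). */
struct paranoid_exit_actions decode_paranoid_ebx(unsigned int ebx)
{
    struct paranoid_exit_actions a;

    assert(ebx <= 3);
    a.do_swapgs = (ebx & 1u) == 0;            /* 0 means swapgs is needed */
    a.do_switch_user_cr3 = (ebx & 2u) != 0;   /* 2 or 3 means switch cr3 */
    return a;
}
```

Keeping the swapgs sense unchanged leaves the convention compatible with error_entry() and error_exit(), which the message says share it.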
    • kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user · d0142ceb
      Hugh Dickins authored
      
      Mostly this commit is just unshouting X86_CR3_PCID_KERN_VAR and
      X86_CR3_PCID_USER_VAR: we usually name variables in lower-case.
      
      But why does x86_cr3_pcid_noflush need to be __aligned(PAGE_SIZE)?
      Ah, it's a leftover from when kaiser_add_user_map() once complained
      about mapping the same page twice.  Make it __read_mostly instead.
      (I'm a little uneasy about all the unrelated data which shares its
      page getting user-mapped too, but that was so before, and not a big
      deal: though we call it user-mapped, it's not mapped with _PAGE_USER.)
      
      And there is a little change around the two calls to do_nmi().
      Previously they set the NOFLUSH bit (if PCID supported) when
      forcing to kernel context before do_nmi(); now they also have the
      NOFLUSH bit set (if PCID supported) when restoring context after:
      nothing done in do_nmi() should require a TLB to be flushed here.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kaiser: PCID 0 for kernel and 128 for user · 6a2b4117
      Hugh Dickins authored
      
      Why was 4 chosen for kernel PCID and 6 for user PCID?
      No good reason in a backport where PCIDs are only used for Kaiser.
      
      If we continue with those, then we shall need to add Andy Lutomirski's
      4.13 commit 6c690ee1 ("x86/mm: Split read_cr3() into read_cr3_pa()
      and __read_cr3()"), which deals with the problem of read_cr3() callers
      finding stray bits in the cr3 that they expected to be page-aligned;
      and for hibernation, his 4.14 commit f34902c5 ("x86/hibernate/64:
      Mask off CR3's PCID bits in the saved CR3").
      
      But if 0 is used for kernel PCID, then there's no need to add in those
      commits - whenever the kernel looks, it sees 0 in the lower bits; and
      0 for kernel seems an obvious choice.
      
      And I naughtily propose 128 for user PCID.  Because there's a place
      in _SWITCH_TO_USER_CR3 where it takes note of the need for TLB FLUSH,
      but needs to reset that to NOFLUSH for the next occasion.  Currently
      it does so with a "movb $(0x80)" into the high byte of the per-cpu
      quadword, but that will cause a machine without PCID support to crash.
      Now, if %al just happened to have 0x80 in it at that point, on a
      machine with PCID support, but 0 on a machine without PCID support...
      
      (That will go badly wrong once the pgd can be at a physical address
      above 2^56, but even with 5-level paging, physical goes up to 2^52.)
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
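The arithmetic behind the choice of PCIDs can be checked with a short sketch. It assumes the usual x86-64 cr3 layout (PCID in bits 0 to 11, NOFLUSH in bit 63); the macro and function names are hypothetical. The point it illustrates is that with user PCID 128 the low byte of the user cr3 is 0x80, the same value as the high byte of a quadword with the NOFLUSH bit set, while kernel PCID 0 leaves the low bits of the kernel cr3 clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_MASK_SKETCH    0xfffULL        /* PCID lives in bits 0..11 */
#define CR3_NOFLUSH_BIT_SKETCH  63
#define CR3_NOFLUSH_SKETCH      (1ULL << CR3_NOFLUSH_BIT_SKETCH)
#define PCID_KERN_SKETCH        0x00ULL         /* 0 for kernel */
#define PCID_USER_SKETCH        0x80ULL         /* 128 for user */

/* Compose a cr3 value from a page-aligned pgd physical address, a PCID,
 * and an optional NOFLUSH request. */
uint64_t make_cr3_sketch(uint64_t pgd_pa, uint64_t pcid, bool noflush)
{
    uint64_t cr3;

    assert((pcid & ~CR3_PCID_MASK_SKETCH) == 0);
    cr3 = (pgd_pa & ~CR3_PCID_MASK_SKETCH) | pcid;
    if (noflush)
        cr3 |= CR3_NOFLUSH_SKETCH;
    return cr3;
}

/* Byte extractors used to compare the two ends of the quadword. */
uint8_t low_byte_sketch(uint64_t v)  { return (uint8_t)(v & 0xff); }
uint8_t high_byte_sketch(uint64_t v) { return (uint8_t)(v >> 56); }
```

The high-byte comparison only holds while the pgd physical address stays below 2^56, which matches the caveat at the end of the commit message.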