1. 10 Mar, 2022 1 commit
  2. 14 Feb, 2022 2 commits
    • stack: Constrain and fix stack offset randomization with Clang builds · efa90c11
      Marco Elver authored
      All supported versions of Clang perform auto-init of __builtin_alloca()
      when stack auto-init is on (CONFIG_INIT_STACK_ALL_{ZERO,PATTERN}).
      
      add_random_kstack_offset() uses __builtin_alloca() to add a stack
      offset. This means, when CONFIG_INIT_STACK_ALL_{ZERO,PATTERN} is
      enabled, add_random_kstack_offset() will auto-init that unused portion
      of the stack used to add an offset.
      
      There are several problems with this:
      
      	1. These offsets can be as large as 1023 bytes. Performing
      	   memset() on them isn't exactly cheap, and this is done on
      	   every syscall entry.
      
      	2. Architectures adding add_random_kstack_offset() to syscall
      	   entry implemented in C require them to be 'noinstr' (e.g. see
      	   x86 and s390). The potential problem here is that a call to
      	   memset may occur, which is not noinstr.
      
      An x86_64 defconfig kernel built with Clang 11 and CONFIG_VMLINUX_VALIDATION shows:
      
       | vmlinux.o: warning: objtool: do_syscall_64()+0x9d: call to memset() leaves .noinstr.text section
       | vmlinux.o: warning: objtool: do_int80_syscall_32()+0xab: call to memset() leaves .noinstr.text section
       | vmlinux.o: warning: objtool: __do_fast_syscall_32()+0xe2: call to memset() leaves .noinstr.text section
       | vmlinux.o: warning: objtool: fixup_bad_iret()+0x2f: call to memset() leaves .noinstr.text section
      
      Clang 14 (unreleased) will introduce a way to skip alloca initialization
      via __builtin_alloca_uninitialized() (https://reviews.llvm.org/D115440).
      
      Constrain RANDOMIZE_KSTACK_OFFSET to only be enabled if no stack
      auto-init is enabled, the compiler is GCC, or Clang is version 14+. Use
      __builtin_alloca_uninitialized() if the compiler provides it, as is done
      by Clang 14.
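      
      A minimal sketch of how that selection could look, assuming the
      compiler (or a fallback definition) provides __has_builtin; this is
      not necessarily the exact in-tree macro:
      
      	#if __has_builtin(__builtin_alloca_uninitialized)
      	/* Clang 14+: skip auto-init of the alloca()'d offset region. */
      	#define __kstack_alloca __builtin_alloca_uninitialized
      	#else
      	#define __kstack_alloca __builtin_alloca
      	#endif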
      
      Link: https://lkml.kernel.org/r/YbHTKUjEejZCLyhX@elver.google.com
      Fixes: 39218ff4 ("stack: Optionally randomize kernel stack offset each syscall")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220131090521.1947110-2-elver@google.com
      efa90c11
    • stack: Introduce CONFIG_RANDOMIZE_KSTACK_OFFSET · 8cb37a59
      Marco Elver authored
      
      The randomize_kstack_offset feature is unconditionally compiled in when
      the architecture supports it.
      
      To add constraints on compiler versions, we require a dedicated Kconfig
      variable. Therefore, introduce RANDOMIZE_KSTACK_OFFSET.
      
      Furthermore, this option is now also configurable in EXPERT kernels:
      while the feature is supposed to have zero performance overhead when
      disabled, due to its use of static branches, there are a few cases where
      giving a distribution the option to disable the feature entirely makes
      sense. For example, very resource-constrained environments would never
      enable the feature to begin with, and for them the additional kernel
      code size increase would be pure overhead.
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220131090521.1947110-1-elver@google.com
      8cb37a59
  3. 20 Jan, 2022 1 commit
  4. 15 Jan, 2022 1 commit
    • mm: page table check · df4e817b
      Pasha Tatashin authored
      Check user page table entries at the time they are added and removed.
      
      This allows synchronously catching memory corruption issues related to
      double mapping.
      
      When a pte for an anonymous page is added into the page table, we
      verify that this pte does not already point to a file-backed page;
      conversely, when a file-backed page is being added, we verify that the
      page does not have an anonymous mapping.
      
      We also enforce that the only sharing allowed for anonymous pages is
      read-only sharing (i.e. CoW after fork). All other sharing must be for
      file pages.
      
      Page table check helps protect against and debug cases where "struct
      page" metadata has become corrupted for some reason, for example when
      the refcount or mapcount becomes invalid.
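      
      A conceptual, kernel-context sketch of the invariant being enforced
      (the helper and predicate names here are hypothetical, not the in-tree
      API):
      
      	/* On PTE install: a page must be consistently anonymous or
      	 * file-backed, never both at once. */
      	static void sketch_page_table_check_pte(struct page *page,
      						bool pte_is_anon)
      	{
      		if (pte_is_anon)
      			WARN_ON(page_backed_by_file(page));	/* hypothetical */
      		else
      			WARN_ON(page_has_anon_mapping(page));	/* hypothetical */
      	}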
      
      Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
      
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df4e817b
  5. 09 Dec, 2021 1 commit
    • x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node · 50468e43
      Jarkko Sakkinen authored
      
      == Problem ==
      
      The amount of SGX memory on a system is determined by the BIOS and it
      varies wildly between systems.  It can be as small as dozens of MB's
      and as large as many GB's on servers.  Just like how applications need
      to know how much regular RAM is available, enclave builders need to
      know how much SGX memory an enclave can consume.
      
      == Solution ==
      
      Introduce a new sysfs file:
      
      	/sys/devices/system/node/nodeX/x86/sgx_total_bytes
      
      to enumerate the amount of SGX memory available in each NUMA node.
      This serves the same function for SGX as /proc/meminfo or
      /sys/devices/system/node/nodeX/meminfo does for normal RAM.
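      
      As an illustration (not part of the patch), a user-space consumer could
      read the per-node value through this ABI like so; error handling is
      trimmed:
      
      	#include <stdio.h>
      
      	static long long sgx_total_bytes(int node)
      	{
      		char path[96];
      		long long bytes = -1;
      		FILE *f;
      
      		snprintf(path, sizeof(path),
      			 "/sys/devices/system/node/node%d/x86/sgx_total_bytes",
      			 node);
      		f = fopen(path, "r");
      		if (!f)
      			return -1;	/* node absent or SGX unsupported */
      		if (fscanf(f, "%lld", &bytes) != 1)
      			bytes = -1;
      		fclose(f);
      		return bytes;
      	}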
      
      'sgx_total_bytes' is needed today to help drive the SGX selftests.
      SGX-specific swap code is exercised by creating overcommitted enclaves
      which are larger than the physical SGX memory on the system.  They
      currently use a CPUID-based approach which can diverge from the actual
      amount of SGX memory available.  'sgx_total_bytes' ensures that the
      selftests can work efficiently and do not attempt stupid things like
      creating a 100,000 MB enclave on a system with 128 MB of SGX memory.
      
      == Implementation Details ==
      
      Introduce a CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
      arch-specific attribute group, and add an attribute for the amount of
      SGX memory in bytes to each NUMA node.
      
      == ABI Design Discussion ==
      
      As opposed to the per-node ABI, a single, global ABI was considered.
      However, this would prevent enclaves from being able to size
      themselves so that they fit on a single NUMA node.  Essentially, a
      single value would rule out NUMA optimizations for enclaves.
      
      Create a new "x86/" directory inside each "nodeX/" sysfs directory.
      'sgx_total_bytes' is expected to be the first of at least a few
      sgx-specific files to be placed in the new directory.  Just scanning
      /proc/meminfo, these are the no-brainers that we have for RAM, but we
      need for SGX:
      
      	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
      	MemFree:        yyyy kB // sgx_free_bytes
      	SwapTotal:      zzzz kB // sgx_swapped_bytes
      
      So, at *least* three.  I think we will eventually end up needing
      something more along the lines of a dozen.  A new directory (as
      opposed to placing the files in the nodeX/ "root" directory) avoids
      cluttering the root with several "sgx_*" files.
      
      Place the new file in a new "nodeX/x86/" directory because SGX is
      highly x86-specific.  It is very unlikely that any other architecture
      (or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
      as opposed to "x86/" was also considered.  But, there is a real chance
      this can get used for other arch-specific purposes.
      
      [ dhansen: rewrite changelog ]
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
      50468e43
  6. 02 Dec, 2021 1 commit
  7. 27 Nov, 2021 1 commit
  8. 26 Oct, 2021 2 commits
    • kprobes: Add a test case for stacktrace from kretprobe handler · 1f6d3a8f
      Masami Hiramatsu authored
      Add a test case for stacktrace from kretprobe handler and
      nested kretprobe handlers.
      
      This test checks both the stack trace taken inside the kretprobe
      handler and the stack trace taken from pt_regs. Those stack traces
      must include the actual function return address instead of the
      kretprobe trampoline.
      The nested kretprobe stacktrace test checks whether the unwinder
      can correctly unwind the call frame on the stack which has been
      modified by the kretprobe.
      
      Since the stacktrace on kretprobe is correctly fixed only on x86, this
      introduces a meta kconfig ARCH_CORRECT_STACKTRACE_ON_KRETPROBE which
      tells the user whether the stacktrace on kretprobe is correct or not.
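      
      For reference, a minimal sketch (not the in-tree KUnit test) of taking
      a stacktrace from a kretprobe return handler; the saved entries should
      show the real caller rather than the kretprobe trampoline:
      
      	#include <linux/kernel.h>
      	#include <linux/kprobes.h>
      	#include <linux/stacktrace.h>
      
      	static int ret_handler(struct kretprobe_instance *ri,
      			       struct pt_regs *regs)
      	{
      		unsigned long entries[8];
      		unsigned int n;
      
      		n = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
      		stack_trace_print(entries, n, 0);	/* real caller expected */
      		return 0;
      	}
      
      	static struct kretprobe rp = {
      		.kp.symbol_name	= "kernel_clone",	/* any probed function */
      		.handler	= ret_handler,
      	};
      	/* register_kretprobe(&rp) in module init, unregister on exit. */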
      
      The test results will be shown like below:
      
       TAP version 14
       1..1
           # Subtest: kprobes_test
           1..6
           ok 1 - test_kprobe
           ok 2 - test_kprobes
           ok 3 - test_kretprobe
           ok 4 - test_kretprobes
           ok 5 - test_stacktrace_on_kretprobe
           ok 6 - test_stacktrace_on_nested_kretprobe
       # kprobes_test: pass:6 fail:0 skip:0 total:6
       # Totals: pass:6 fail:0 skip:0 total:6
       ok 1 - kprobes_test
      
      Link: https://lkml.kernel.org/r/163516211244.604541.18350507860972214415.stgit@devnote2
      
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1f6d3a8f
    • signal: Add an optional check for altstack size · 1bdda24c
      Thomas Gleixner authored
      
      New x86 FPU features will be very large, requiring ~10k of stack in
      signal handlers.  These new features require a new approach called
      "dynamic features".
      
      The kernel currently tries to ensure that altstacks are reasonably
      sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
      However, that 2k is a constant. Simply raising that 2k requirement
      to >10k for the new features would break existing apps which have a
      compiled-in size of 2k.
      
      Instead of universally enforcing a larger stack, prohibit a process from
      using dynamic features without properly-sized altstacks. This must be
      enforced in two places:
      
       * A dynamic feature can not be enabled without a large-enough altstack
         for each process thread.
       * Once a dynamic feature is enabled, any request to install a too-small
         altstack will be rejected.
      
      The dynamic feature enabling code must examine each thread in a
      process to ensure that the altstacks are large enough. Add a new lock
      (sigaltstack_lock()) to ensure that threads can not race and change
      their altstack after being examined.
      
      Add the infrastructure in the form of a config option and provide empty
      stubs for architectures which do not need dynamic altstack size checks.
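      
      One plausible shape for those stubs (the config and size-check symbol
      names here are assumptions for illustration, not quoted from the
      patch; sigaltstack_lock() is named above):
      
      	#ifndef CONFIG_DYNAMIC_SIGFRAME			/* assumed name */
      	static inline bool sigaltstack_size_valid(size_t size)
      	{
      		return true;	/* no dynamic features: any size is fine */
      	}
      	static inline void sigaltstack_lock(void) { }
      	static inline void sigaltstack_unlock(void) { }
      	#endif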
      
      This implementation will be fleshed out for x86 in a future patch called
      
        x86/arch_prctl: Add controls for dynamic XSTATE components
      
        [dhansen: commit message. ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com
      1bdda24c
  9. 04 Oct, 2021 1 commit
  10. 08 Sep, 2021 1 commit
  11. 16 Aug, 2021 1 commit
  12. 28 Jul, 2021 1 commit
  13. 22 Jun, 2021 1 commit
  14. 26 May, 2021 2 commits
  15. 30 Apr, 2021 1 commit
    • mm/vmalloc: hugepage vmalloc mappings · 121e6f32
      Nicholas Piggin authored
      Support huge page vmalloc mappings.  The config option
      HAVE_ARCH_HUGE_VMALLOC enables support on architectures that define
      HAVE_ARCH_HUGE_VMAP and support PMD-sized vmap mappings.
      
      vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
      larger, and fall back to small pages if that was unsuccessful.
      
      Architectures must ensure that any arch specific vmalloc allocations that
      require PAGE_SIZE mappings (e.g., module allocations vs strict module rwx)
      use the VM_NOHUGE flag to inhibit larger mappings.
      
      This can result in more internal fragmentation and memory overhead for
      a given allocation, so a boot option, nohugevmalloc, is added to
      disable it at boot time.
      
      [colin.king@canonical.com: fix read of uninitialized pointer area]
        Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com
      
      Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.com
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ding Tianhong <dingtianhong@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      121e6f32
  16. 24 Apr, 2021 1 commit
    • kbuild: check the minimum assembler version in Kconfig · ba64beb1
      Masahiro Yamada authored
      Documentation/process/changes.rst defines the minimum assembler version
      (binutils version), but it has never been checked at build time.
      
      Kbuild never invokes 'as' directly because all assembly files in the
      kernel tree are *.S, hence must be preprocessed. I do not expect
      raw assembly source files (*.s) would be added to the kernel tree.
      
      Therefore, we always use $(CC) as the assembler driver, and commit
      aa824e0c ("kbuild: remove AS variable") removed 'AS'. However,
      we are still interested in the version of the assembler working behind it.
      
      As usual, the --version option prints the version string.
      
        $ as --version | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
      But, we do not have $(AS). So, we can add the -Wa prefix so that
      $(CC) passes --version down to the backing assembler.
      
        $ gcc -Wa,--version | head -n 1
        gcc: fatal error: no input files
        compilation terminated.
      
      OK, we need to input something to satisfy gcc.
      
        $ gcc -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
      The combination of Clang and GNU assembler works in the same way:
      
        $ clang -no-integrated-as -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        GNU assembler (GNU Binutils for Ubuntu) 2.35.1
      
      Clang with the integrated assembler fails like this:
      
        $ clang -integrated-as -Wa,--version -c -x assembler /dev/null -o /dev/null | head -n 1
        clang: error: unsupported argument '--version' to option 'Wa,'
      
      For the last case, checking the error message is fragile. If the
      proposal for -Wa,--version support [1] is accepted, this may not even
      be an error in the future.
      
      One easy way is to check if -integrated-as is present in the passed
      arguments. We did not pass -integrated-as to CLANG_FLAGS before, but
      we can make it explicit.
      
      Nathan pointed out that -integrated-as is the default for all of the
      architectures/targets that the kernel cares about, but making it
      explicit goes along with the "explicit is better than implicit"
      policy. [2]
      
      With all this in mind, I implemented scripts/as-version.sh to
      check the assembler version at Kconfig time.
      
        $ scripts/as-version.sh gcc
        GNU 23501
        $ scripts/as-version.sh clang -no-integrated-as
        GNU 23501
        $ scripts/as-version.sh clang -integrated-as
        LLVM 0
      
      [1]: https://github.com/ClangBuiltLinux/linux/issues/1320
      [2]: https://lore.kernel.org/linux-kbuild/20210307044253.v3h47ucq6ng25iay@archlinux-ax161/
      
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      ba64beb1
  17. 22 Apr, 2021 1 commit
    • landlock: Support filesystem access-control · cb2c7d1a
      Mickaël Salaün authored
      
      Using Landlock objects and ruleset, it is possible to tag inodes
      according to a process's domain.  To enable an unprivileged process to
      express a file hierarchy, it first needs to open a directory (or a file)
      and pass this file descriptor to the kernel through
      landlock_add_rule(2).  When checking if a file access request is
      allowed, we walk from the requested dentry to the real root, following
      the different mount layers.  The accesses to each "tagged" inode are
      collected according to their rule layer level, and ANDed to determine
      the allowed access to the requested file hierarchy.  This makes it
      possible to identify a lot of files without tagging every inode or
      modifying the filesystem, while still following the view and
      understanding the user has of the filesystem.
      
      Add a new ARCH_EPHEMERAL_INODES option for UML because it currently
      does not keep the same struct inode for a given inode even while that
      inode is in use.
      
      This commit adds a minimal set of supported filesystem access controls,
      which does not yet allow restricting all file-related actions.  This is the
      result of multiple discussions to minimize the code of Landlock to ease
      review.  Thanks to the Landlock design, extending this access-control
      without breaking user space will not be a problem.  Moreover, seccomp
      filters can be used to restrict the use of syscall families which may
      not be currently handled by Landlock.
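      
      For illustration only (not part of this patch): a user-space sketch
      that allows only file reads beneath /usr for the calling process. It
      assumes the libc headers expose the landlock syscall numbers, and
      error handling is omitted:
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <linux/landlock.h>
      	#include <sys/prctl.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      		struct landlock_ruleset_attr ruleset_attr = {
      			.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
      					     LANDLOCK_ACCESS_FS_WRITE_FILE,
      		};
      		struct landlock_path_beneath_attr path_beneath = {
      			.allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
      		};
      		int ruleset_fd;
      
      		ruleset_fd = syscall(__NR_landlock_create_ruleset,
      				     &ruleset_attr, sizeof(ruleset_attr), 0);
      		path_beneath.parent_fd = open("/usr", O_PATH | O_CLOEXEC);
      		syscall(__NR_landlock_add_rule, ruleset_fd,
      			LANDLOCK_RULE_PATH_BENEATH, &path_beneath, 0);
      		prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
      		syscall(__NR_landlock_restrict_self, ruleset_fd, 0);
      		return 0;	/* writes under /usr are now denied */
      	}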
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20210422154123.13086-8-mic@digikod.net
      
      Signed-off-by: James Morris <jamorris@linux.microsoft.com>
      cb2c7d1a
  18. 08 Apr, 2021 2 commits
    • add support for Clang CFI · cf68fffb
      Sami Tolvanen authored
      This change adds support for Clang’s forward-edge Control Flow
      Integrity (CFI) checking. With CONFIG_CFI_CLANG, the compiler
      injects a runtime check before each indirect function call to ensure
      the target is a valid function with the correct static type. This
      restricts possible call targets and makes it more difficult for
      an attacker to exploit bugs that allow the modification of stored
      function pointers. For more details, see:
      
        https://clang.llvm.org/docs/ControlFlowIntegrity.html
      
      
      Clang requires CONFIG_LTO_CLANG to be enabled with CFI to gain
      visibility to possible call targets. Kernel modules are supported
      with Clang’s cross-DSO CFI mode, which allows checking between
      independently compiled components.
      
      With CFI enabled, the compiler injects a __cfi_check() function into
      the kernel and each module for validating local call targets. For
      cross-module calls that cannot be validated locally, the compiler
      calls the global __cfi_slowpath_diag() function, which determines
      the target module and calls the correct __cfi_check() function. This
      patch includes a slowpath implementation that uses __module_address()
      to resolve call targets, and with CONFIG_CFI_CLANG_SHADOW enabled, a
      shadow map that speeds up module look-ups by ~3x.
      
      Clang implements indirect call checking using jump tables and
      offers two methods of generating them. With canonical jump tables,
      the compiler renames each address-taken function to <function>.cfi
      and points the original symbol to a jump table entry, which passes
      __cfi_check() validation. This isn’t compatible with stand-alone
      assembly code, which the compiler doesn’t instrument, and would
      cause indirect calls to assembly code to fail. Therefore, we
      default to using non-canonical jump tables instead, where the compiler
      generates a local jump table entry <function>.cfi_jt for each
      address-taken function, and replaces all references to the function
      with the address of the jump table entry.
      
      Note that because non-canonical jump table addresses are local
      to each component, they break cross-module function address
      equality. Specifically, the address of a global function will be
      different in each module, as it's replaced with the address of a local
      jump table entry. If this address is passed to a different module,
      it won’t match the address of the same function taken there. This
      may break code that relies on comparing addresses passed from other
      components.
      
      CFI checking can be disabled in a function with the __nocfi attribute.
      Additionally, CFI can be disabled for an entire compilation unit by
      filtering out CC_FLAGS_CFI.
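      
      For example, a function that must make an indirect call into
      uninstrumented assembly could opt out like this (illustrative sketch,
      not taken from this patch):
      
      	/* No CFI check is emitted for the indirect call below. */
      	static void __nocfi call_asm_thunk(void (*thunk)(void))
      	{
      		thunk();
      	}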
      
      By default, CFI failures result in a kernel panic to stop a potential
      exploit. CONFIG_CFI_PERMISSIVE enables a permissive mode, where the
      kernel prints out a rate-limited warning instead, and allows execution
      to continue. This option is helpful for locating type mismatches, but
      should only be enabled during development.
      Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210408182843.1754385-2-samitolvanen@google.com
      cf68fffb
    • stack: Optionally randomize kernel stack offset each syscall · 39218ff4
      Kees Cook authored
      This provides the ability for architectures to enable kernel stack base
      address offset randomization. This feature is controlled by the boot
      param "randomize_kstack_offset=on/off", with its default value set by
      CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
      
      This feature is based on the original idea from the last public release
      of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
      All the credit for the original idea goes to the PaX team. Note that
      the design and implementation of this upstream randomize_kstack_offset
      feature differs greatly from the RANDKSTACK feature (see below).
      
      Reasoning for the feature:
      
      This feature aims to make the various stack-based attacks that rely on
      deterministic stack structure harder. We have had many such attacks in
      the past (just to name a few):
      
      https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
      https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
      https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      
      As Linux kernel stack protections have been constantly improving
      (vmap-based stack allocation with guard pages, removal of thread_info,
      STACKLEAK), attackers have had to find new ways for their exploits
      to work. They have done so, continuing to rely on the kernel's stack
      determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
      were not relevant. For example, the following recent attacks would have
      been hampered if the stack offset was non-deterministic between syscalls:
      
      https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
      (page 70: targeting the pt_regs copy with linear stack overflow)
      
      https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
      (leaked stack address from one syscall as a target during next syscall)
      
      The main idea is that since the stack offset is randomized on each system
      call, it is harder for an attack to reliably land in any particular place
      on the thread stack, even with address exposures, as the stack base will
      change on the next syscall. Also, since randomization is performed after
      placing pt_regs, the ptrace-based approach[1] to discover the randomized
      offset during a long-running syscall should not be possible.
      
      Design description:
      
      During most of the kernel's execution, it runs on the "thread stack",
      which is pretty deterministic in its structure: it is fixed in size,
      and on every entry from userspace to kernel on a syscall the thread
      stack starts construction from an address fetched from the per-cpu
      cpu_current_top_of_stack variable. The first element to be pushed to the
      thread stack is the pt_regs struct that stores all required CPU registers
      and syscall parameters. Finally the specific syscall function is called,
      with the stack being used as the kernel executes the resulting request.
      
      The goal of randomize_kstack_offset feature is to add a random offset
      after the pt_regs has been pushed to the stack and before the rest of the
      thread stack is used during the syscall processing, and to change it every
      time a process issues a syscall. The source of randomness is currently
      architecture-defined (but x86 is using the low byte of rdtsc()). Future
      improvements for different entropy sources are possible, but out of scope
      for this patch. Furthermore, to add more unpredictability, new offsets
      are chosen at the end of syscalls (the timing of which should be less
      easy to measure from userspace than at syscall entry time), and stored
      in a per-CPU variable, so that the life of the value does not stay
      explicitly tied to a single task.
      
      As suggested by Andy Lutomirski, the offset is added using alloca()
      and an empty asm() statement with an output constraint, since it avoids
      changes to assembly syscall entry code, to the unwinder, and provides
      correct stack alignment as defined by the compiler.
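      
      Roughly, the helper has this shape (simplified sketch, not a verbatim
      copy of the macro; KSTACK_OFFSET_MAX here stands for the masking of the
      per-cpu value down to the allowed number of bits):
      
      	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
      				&randomize_kstack_offset)) {
      		u32 offset = raw_cpu_read(kstack_offset);
      		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));
      		/* Keep the allocation alive even after "ptr" goes out
      		 * of scope. */
      		asm volatile("" :: "r"(ptr) : "memory");
      	}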
      
      In order to make this available by default with zero performance impact
      for those that don't want it, it is boot-time selectable with static
      branches. This way, if the overhead is not wanted, it can just be
      left turned off with no performance impact.
      
      The generated assembly for x86_64 with GCC looks like this:
      
      ...
      ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
      					    # 12380 <kstack_offset>
      ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
      ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
      ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
      ffffffff8100398c: 48 29 c4              sub %rax,%rsp
      ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
      ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
      ...
      
      As a result of the above stack alignment, this patch introduces about
      5 bits of randomness after pt_regs is spilled to the thread stack on
      x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
      stack alignment). The amount of entropy could be adjusted based on how
      much of the stack space we wish to trade for security.
      
      My measure of syscall performance overhead (on x86_64):
      
      lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
          randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
          randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds
      
      So, roughly 0.9% overhead growth for a no-op syscall, which is very
      manageable. And for people that don't want this, it's off by default.
      
      There are two gotchas with using the alloca() trick. First,
      compilers that have Stack Clash protection (-fstack-clash-protection)
      enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
      any dynamic stack allocations. While the randomization offset is
      always less than a page, the resulting assembly would still contain
      (unreachable!) probing routines, bloating the resulting assembly. To
      avoid this, -fno-stack-clash-protection is unconditionally added to
      the kernel Makefile since this is the only dynamic stack allocation in
      the kernel (now that VLAs have been removed) and it is provably safe
      from Stack Clash style attacks.
      
      The second gotcha with alloca() is a negative interaction with
      -fstack-protector*, in that it sees the alloca() as an array allocation,
      which triggers the unconditional addition of the stack canary function
      pre/post-amble which slows down syscalls regardless of the static
      branch. In order to avoid adding this unneeded check and its associated
      performance impact, architectures need to carefully remove uses of
      -fstack-protector-strong (or -fstack-protector) in the compilation units
      that use the add_random_kstack() macro and to audit the resulting stack
      mitigation coverage (to make sure no desired coverage disappears). No
      change is visible for this on x86 because the stack protector is already
      unconditionally disabled for the compilation unit, but the change is
      required on arm64. There is, unfortunately, no attribute that can be
      used to disable stack protector for specific functions.
      
      Comparison to PaX RANDKSTACK feature:
      
      The RANDKSTACK feature randomizes the location of the stack start
      (cpu_current_top_of_stack), i.e. including the location of pt_regs
      structure itself on the stack. Initially this patch followed the same
      approach, but during the recent discussions[2], it has been determined
      to be of little value since, if ptrace functionality is available to
      an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
      different offsets in the pt_regs struct, observe the cache behavior of
      the pt_regs accesses, and figure out the random stack offset. Another
      difference is that the random offset is stored in a per-cpu variable,
      rather than having it be per-thread. As a result, these implementations
      differ a fair bit in their implementation details and results, though
      obviously the intent is similar.
      
      [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
      [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
      [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html
      
      Co-developed-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
      39218ff4
  19. 11 Mar, 2021 2 commits
  20. 17 Feb, 2021 1 commit
  21. 13 Feb, 2021 1 commit
    • s390,alpha: switch to 64-bit ino_t · 96c0a6a7
      Heiko Carstens authored
      s390 and alpha are the only 64 bit architectures with a 32-bit ino_t.
      Since this is quite unusual this causes bugs from time to time.
      
      See e.g. commit ebce3eb2 ("ceph: fix inode number handling on
      arches with 32-bit ino_t") for an example.
      
      This (obviously) also prevents s390 and alpha from using 64-bit ino_t
      for tmpfs. See commit b85a7a8b ("tmpfs: disallow CONFIG_TMPFS_INODE64
      on s390").
      
      Therefore switch both s390 and alpha to 64-bit ino_t. This should only
      have an effect on the ustat system call. To prevent ABI breakage
      define struct ustat compatible with the old layout and change
      sys_ustat() accordingly.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      96c0a6a7
  22. 10 Feb, 2021 1 commit
  23. 29 Jan, 2021 1 commit
  24. 21 Jan, 2021 1 commit
  25. 14 Jan, 2021 2 commits
  26. 06 Jan, 2021 1 commit
    • [amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly · 7facdc42
      Al Viro authored
      
      To get rid of the hardcoded size/offset in those macros we need to have
      a definition of the i386 variant of struct elf_prstatus.  However, we can't
      do that in asm/compat.h - the types needed for that are not there and
      adding an include of asm/user32.h into asm/compat.h would cause a lot
      of mess.
      
      That could be conveniently done in elfcore-compat.h, but currently there
      is nowhere to put arch-dependent parts of it - no asm/elfcore-compat.h.
      So we introduce a new file (asm/elfcore-compat.h, present on architectures
      that have CONFIG_ARCH_HAS_ELFCORE_COMPAT set, currently only on x86),
      have it pulled by linux/elfcore-compat.h and move the definitions there.
      
      As a side benefit, we don't need to worry about accidental inclusion of
      that file into binfmt_elf.c itself, so we don't need the dance with
      COMPAT_PRSTATUS_SIZE, etc. - only fs/compat_binfmt_elf.c will see
      that header.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7facdc42
  27. 28 Dec, 2020 1 commit
  28. 22 Dec, 2020 1 commit
  29. 15 Dec, 2020 4 commits
    • arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC · 5d6ad668
      Mike Rapoport authored
      The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
      never fail.  With this assumption it wouldn't be safe to allow general
      usage of this function.
      
      Moreover, some architectures that implement __kernel_map_pages() have this
      function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
      pages when page allocation debugging is disabled at runtime.
      
      As all the users of __kernel_map_pages() were converted to use
      debug_pagealloc_map_pages() it is safe to make it available only when
      DEBUG_PAGEALLOC is set.
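      
      For illustration, callers use the wrapper rather than calling
      __kernel_map_pages() directly; it is a no-op when page allocation
      debugging is compiled out or disabled at boot (sketch, not a hunk from
      this patch):
      
      	/* Unmap the pages from the kernel direct map while they sit in
      	 * the allocator, and map them back before handing them out. */
      	debug_pagealloc_unmap_pages(page, 1 << order);
      	...
      	debug_pagealloc_map_pages(page, 1 << order);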
      
      Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d6ad668
    • arm, arm64: move free_unused_memmap() to generic mm · 4f5b0c17
      Mike Rapoport authored
      ARM and ARM64 free unused parts of the memory map just before the
      initialization of the page allocator. To allow holes in the memory map both
      architectures overload pfn_valid() and define HAVE_ARCH_PFN_VALID.
      
      Allowing holes in the memory map for FLATMEM may be useful for small
      machines, such as ARC and m68k, and will enable those architectures to
      cease using DISCONTIGMEM and still support more than one memory bank.
      
      Move the functions that free unused memory map to generic mm and enable
      them in case HAVE_ARCH_PFN_VALID=y.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-10-rppt@kernel.org
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f5b0c17
    • mm: speedup mremap on 1GB or larger regions · c49dd340
      Kalesh Singh authored
      Android needs to move large memory regions for garbage collection.  The GC
      requires moving physical pages of multi-gigabyte heap using mremap.
      During this move, the application threads have to be paused for
      correctness.  It is critical to keep this pause as short as possible to
      avoid jitters during user interaction.
      
      Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if
      the source and destination addresses are PUD-aligned.  For
      CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD
      entries, since the PUD entry is “folded back” onto the PGD entry.  Add
      HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
      supported/tested can turn this off by not selecting the config.
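      
      From user space the operation being optimized looks like an ordinary
      mremap(2) of a large mapping (illustrative sketch; sizes and alignment
      are up to the caller):
      
      	#define _GNU_SOURCE
      	#include <sys/mman.h>
      
      	/* Relocate a large heap region; when source and destination are
      	 * PUD-aligned and HAVE_MOVE_PUD is selected, the kernel can move
      	 * whole PUD entries instead of individual PTEs. */
      	static void *move_region(void *old_addr, size_t len)
      	{
      		return mremap(old_addr, len, len, MREMAP_MAYMOVE);
      	}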
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.com
      
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49dd340
    • arch/Kconfig: fix spelling mistakes · a86ecfa6
      Colin Ian King authored
      There are a few spelling mistakes in the Kconfig comments and help text.
      Fix these.
      
      Link: https://lkml.kernel.org/r/20201207155004.171962-1-colin.king@canonical.com
      
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a86ecfa6
  30. 14 Dec, 2020 1 commit
    • Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS" · adab66b7
      Steven Rostedt (VMware) authored
      It was believed that metag was the only architecture that required the ring
      buffer to keep 8 byte words aligned on 8 byte architectures, and with its
      removal, it was assumed that the ring buffer code did not need to handle
      this case. It appears that sparc64 also requires this.
      
      The following was reported on a sparc64 boot up:
      
         kernel: futex hash table entries: 65536 (order: 9, 4194304 bytes, linear)
         kernel: Running postponed tracer tests:
         kernel: Testing tracer function:
         kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
         kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
         kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
         kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140
         kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140
         kernel: PASSED
      
      Need to put back the 64BIT aligned code for the ring buffer.
      
      Link: https://lore.kernel.org/r/CADxRZqzXQRYgKc=y-KV=S_yHL+Y8Ay2mh5ezeZUnpRvg+syWKw@mail.gmail.com
      
      Cc: stable@vger.kernel.org
      Fixes: 86b3de60 ("ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS")
      Reported-by: Anatoly Pugachev <matorola@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      adab66b7
  31. 02 Dec, 2020 1 commit