  1. 04 Dec, 2018 1 commit
    • arm64: relocatable: fix inconsistencies in linker script and options · 3bbd3db8
      Ard Biesheuvel authored
      readelf complains about the section layout of vmlinux when building
      with CONFIG_RELOCATABLE=y (for KASLR):
      
        readelf: Warning: [21]: Link field (0) should index a symtab section.
        readelf: Warning: [21]: Info field (0) should index a relocatable section.
      
      Also, it seems that our use of '-pie -shared' is contradictory, and
      thus ambiguous. In general, the way KASLR is wired up at the moment
      is highly tailored to how ld.bfd happens to implement (and conflate)
      PIE executables and shared libraries, so given the current effort to
      support other toolchains, let's fix some of these issues as well.
      
      - Drop the -pie linker argument and just leave -shared. In ld.bfd,
        the differences between them are unclear (except for the ELF type
        of the produced image [0]) but lld chokes on seeing both at the
        same time.
      
      - Rename the .rela output section to .rela.dyn, as is customary for
        shared libraries and PIE executables, so that it is not misidentified
        by readelf as a static relocation section (producing the warnings
        above).
      
      - Pass the -z notext and -z norelro options to explicitly instruct the
        linker to permit text relocations, and to omit the RELRO program
        header (which requires a certain section layout that we don't adhere
        to in the kernel). These are the defaults for current versions of
        ld.bfd.
      
      - Discard .eh_frame and .gnu.hash sections to prevent them from being
        emitted between .head.text and .text, screwing up the section layout.
      
      These changes only affect the ELF image, and produce the same binary
      image.
      
      [0] b9dce7f1 ("arm64: kernel: force ET_DYN ELF type for ...")
      
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Peter Smith <peter.smith@linaro.org>
      Tested-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  2. 30 Nov, 2018 8 commits
    • arm64/lib: improve CRC32 performance for deep pipelines · efdb25ef
      Ard Biesheuvel authored
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
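
      The split described above (first the length % 32 leading bytes, then
      whole 32-byte blocks) can be sketched in portable C. This is only an
      illustration of the control flow, not the asm routine: the bitwise
      helper stands in for the CRC32 instructions, the overlapping 16-byte
      loads are not modelled, and the names crc32_le_bytes and
      crc32_le_split are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time reference for the reflected CRC-32 polynomial 0xEDB88320,
 * with no pre- or post-inversion (the caller supplies the seed). */
static uint32_t crc32_le_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
	assert(p != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return crc;
}

/* Same result, but consumed in the order the commit describes: the
 * (len % 32) leading bytes first, then a loop over 32-byte blocks. */
uint32_t crc32_le_split(uint32_t crc, const unsigned char *p, size_t len)
{
	size_t head = len % 32;

	crc = crc32_le_bytes(crc, p, head);
	p += head;
	len -= head;

	for (; len >= 32; len -= 32, p += 32)
		crc = crc32_le_bytes(crc, p, 32);

	return crc;
}
```

      Because a CRC can be computed incrementally, handling the odd-sized
      head first leaves a main loop with no tail handling at all.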
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: ftrace: always pass instrumented pc in x0 · 7dc48bf9
      Mark Rutland authored
      The core ftrace hooks take the instrumented PC in x0, but for some
      reason arm64's prepare_ftrace_return() takes this in x1.
      
      For consistency, let's flip the argument order and always pass the
      instrumented PC in x0.
      
      There should be no functional change as a result of this patch.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: ftrace: remove return_regs macros · 49e258e0
      Mark Rutland authored
      The save_return_regs and restore_return_regs macros are only used by
      return_to_handler, and having them defined out-of-line only serves to
      obscure the logic.
      
      Before we complicate this further, let's clean it up and fold the logic directly
      into return_to_handler, saving a few lines of macro boilerplate in the
      process. At the same time, a missing trailing space is added to the
      comments, fixing a code style violation.
      
      There should be no functional change as a result of this patch.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: ftrace: don't adjust the LR value · 6e803e2e
      Mark Rutland authored
      The core ftrace code requires that when it is handed the PC of an
      instrumented function, this PC is the address of the instrumented
      instruction. This is necessary so that the core ftrace code can identify
      the specific instrumentation site. Since the instrumented instruction will
      be a BL, the address of the instrumented instruction is LR - 4 at entry to
      the ftrace code.
      
      This fixup is applied in the mcount_get_pc and mcount_get_pc0 helpers,
      which acquire the PC of the instrumented function.
      
      The mcount_get_lr helper is used to acquire the LR of the instrumented
      function, whose value does not require this adjustment, and cannot be
      adjusted to anything meaningful. No adjustment of this value is made on
      other architectures, including arm. However, arm64 adjusts this value by
      4.
      
      This patch brings arm64 in line with other architectures and removes the
      adjustment of the LR value.
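
      The two cases above can be written out as a small sketch. The helper
      names are hypothetical (the real helpers are assembly macros); the
      only fact relied on is that every A64 instruction, including BL, is
      4 bytes long.

```c
#include <assert.h>
#include <stdint.h>

#define A64_INSN_SIZE 4u /* every A64 instruction is 4 bytes */

/* BL writes the address of the *next* instruction to LR, so the BL
 * itself (the instrumentation site) sits one instruction earlier. */
uint64_t instrumented_pc(uint64_t lr_at_ftrace_entry)
{
	assert(lr_at_ftrace_entry >= A64_INSN_SIZE);
	return lr_at_ftrace_entry - A64_INSN_SIZE;
}

/* The instrumented function's own LR is a return address into its
 * caller, not the address of an instrumentation site, so it is
 * passed through unchanged. */
uint64_t parent_return_address(uint64_t saved_lr)
{
	return saved_lr;
}
```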

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: ftrace: enable graph FP test · 5c176aff
      Mark Rutland authored
      The core ftrace code has an optional sanity check on the frame pointer
      passed by ftrace_graph_caller and return_to_handler. This is cheap,
      useful, and enabled unconditionally on x86, sparc, and riscv.
      
      Let's do the same on arm64, so that we can catch any problems early.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: ftrace: use GLOBAL() · e4fe1966
      Mark Rutland authored
      The global exports of ftrace_call and ftrace_graph_call are somewhat
      painful to read. Let's use the generic GLOBAL() macro to ameliorate
      matters.
      
      There should be no functional change as a result of this patch.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • linkage: add generic GLOBAL() macro · ad697a1a
      Mark Rutland authored
      Declaring a global symbol in assembly is tedious, error-prone, and
      painful to read. While ENTRY() exists, this is supposed to be used for
      function entry points, and this affects alignment in a potentially
      undesirable manner.
      
      Instead, let's add a generic GLOBAL() macro for this, as x86 added
      locally in commit:
      
        95695547 ("x86: asm linkage - introduce GLOBAL macro")
      
      ... thus allowing us to use this more freely in the kernel.

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Torsten Duwe <duwe@suse.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: drop linker script hack to hide __efistub_ symbols · dd6846d7
      Ard Biesheuvel authored
      Commit 1212f7a1 ("scripts/kallsyms: filter arm64's __efistub_
      symbols") updated the kallsyms code to filter out symbols with
      the __efistub_ prefix explicitly, so we no longer require the
      hack in our linker script to emit them as absolute symbols.
      
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  3. 29 Nov, 2018 1 commit
    • arm64: io: Ensure value passed to __iormb() is held in a 64-bit register · 1b57ec8c
      Will Deacon authored
      As of commit 6460d320 ("arm64: io: Ensure calls to delay routines
      are ordered against prior readX()"), MMIO reads smaller than 64 bits
      fail to compile under clang because we end up mixing 32-bit and 64-bit
      register operands for the same data processing instruction:
      
      ./include/asm-generic/io.h:695:9: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
              return readb(addr);
                     ^
      ./arch/arm64/include/asm/io.h:147:58: note: expanded from macro 'readb'
                                                                             ^
      ./include/asm-generic/io.h:695:9: note: use constraint modifier "w"
      ./arch/arm64/include/asm/io.h:147:50: note: expanded from macro 'readb'
                                                                     ^
      ./arch/arm64/include/asm/io.h:118:24: note: expanded from macro '__iormb'
              asm volatile("eor       %0, %1, %1\n"                           \
                                          ^
      
      Fix the build by casting the macro argument to 'unsigned long' when used
      as an input to the inline asm.

      Reported-by: Nick Desaulniers <nick.desaulniers@gmail.com>
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  4. 27 Nov, 2018 6 commits
    • arm64: tlbi: Set MAX_TLBI_OPS to PTRS_PER_PTE · 3d65b6bb
      Will Deacon authored
      In order to reduce the possibility of soft lock-ups, we bound the
      maximum number of TLBI operations performed by a single call to
      flush_tlb_range() to an arbitrary constant of 1024.
      
      Whilst this does the job of avoiding lock-ups, we can actually be a bit
      smarter by defining this as PTRS_PER_PTE. Due to the structure of our
      page tables, using PTRS_PER_PTE means that an outer loop calling
      flush_tlb_range() for entire table entries will end up performing just a
      single TLBI operation for each entry. As an example, mremap()ing a 1GB
      range mapped using 4k pages now requires only 512 TLBI operations when
      moving the page tables as opposed to 262144 operations (512*512) when
      using the current threshold of 1024.
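
      The arithmetic in the example can be checked with a small model. The
      function below is a hypothetical sketch, not the kernel's
      flush_tlb_range(): it assumes that a range covering at least max_ops
      pages is handled by one coarser flush, and that smaller ranges cost
      one TLBI per page. The exact comparison used by the kernel may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Count TLBI operations when an outer loop flushes `total` bytes one
 * `chunk` at a time. Assumed policy: a chunk of at least max_ops pages
 * costs a single operation, otherwise one operation per page. */
uint64_t count_tlbi_ops(uint64_t total, uint64_t chunk,
			uint64_t page_size, uint64_t max_ops)
{
	uint64_t ops = 0;

	assert(page_size != 0 && chunk != 0 && chunk % page_size == 0);
	for (uint64_t done = 0; done < total; done += chunk) {
		uint64_t pages = chunk / page_size;

		ops += (pages >= max_ops) ? 1 : pages;
	}
	return ops;
}
```

      With 4k pages, one table entry covers 512 pages (2MB), so a 1GB range
      is 512 such chunks: a threshold of 1024 gives 512 * 512 = 262144
      operations in this model, while a threshold of 512 gives 512.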
      
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64/module: switch to ADRP/ADD sequences for PLT entries · bdb85cd1
      Ard Biesheuvel authored
      Now that we have switched to the small code model entirely, and
      reduced the extended KASLR range to 4 GB, we can be sure that the
      targets of relative branches that are out of range are in range
      for an ADRP/ADD pair, which is one instruction shorter than our
      current MOVN/MOVK/MOVK sequence, and is more idiomatic and so it
      is more likely to be implemented efficiently by micro-architectures.
      
      So switch over the ordinary PLT code and the special handling of
      the Cortex-A53 ADRP errata, as well as the ftrace trampoline
      handling.
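
      As a rough illustration of the ADRP/ADD sequence, the sketch below
      builds the three instruction words of such an entry from the A64
      encodings. The struct and function names are hypothetical, and the
      use of x16 as the scratch register followed by a BR is an assumption
      of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

#define SCRATCH_REG 16u /* x16, assumed scratch register for the veneer */

/* ADRP Xd, target: Xd = page(pc) + (imm21 << 12), imm21 is signed. */
static uint32_t encode_adrp(uint64_t pc, uint64_t target, uint32_t rd)
{
	int64_t pages = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
	uint32_t imm;

	assert(pages >= -(1 << 20) && pages < (1 << 20)); /* +/- 4 GB */
	imm = (uint32_t)pages & 0x1fffffu;
	/* immlo goes to bits 30:29, immhi to bits 23:5 */
	return 0x90000000u | ((imm & 3u) << 29) | ((imm >> 2) << 5) | rd;
}

/* ADD Xd, Xn, #imm12 (64-bit, unshifted immediate). */
static uint32_t encode_add_imm(uint32_t rd, uint32_t rn, uint32_t imm12)
{
	return 0x91000000u | ((imm12 & 0xfffu) << 10) | (rn << 5) | rd;
}

/* BR Xn. */
static uint32_t encode_br(uint32_t rn)
{
	return 0xd61f0000u | (rn << 5);
}

struct plt_entry_sketch {
	uint32_t adrp; /* adrp x16, page of target       */
	uint32_t add;  /* add  x16, x16, low 12 bits     */
	uint32_t br;   /* br   x16                       */
};

struct plt_entry_sketch make_plt_entry(uint64_t entry_addr, uint64_t target)
{
	struct plt_entry_sketch e;

	e.adrp = encode_adrp(entry_addr, target, SCRATCH_REG);
	e.add = encode_add_imm(SCRATCH_REG, SCRATCH_REG, (uint32_t)(target & 0xfff));
	e.br = encode_br(SCRATCH_REG);
	return e;
}
```

      ADRP carries a signed 21-bit page offset, which is where the 4 GB
      reach comes from; ADD then supplies the low 12 bits of the address.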

      Reviewed-by: Torsten Duwe <duwe@lst.de>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [will: Added a couple of comments in the plt equality check]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64/insn: add support for emitting ADR/ADRP instructions · 7aaf7b2f
      Ard Biesheuvel authored
      Add support for emitting ADR and ADRP instructions so we can switch
      over our PLT generation code in a subsequent patch.

      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: Use a raw spinlock in __install_bp_hardening_cb() · d8797b12
      James Morse authored
      __install_bp_hardening_cb() is called via stop_machine() as part
      of the cpu_enable callback. To force each CPU to take its turn
      when allocating slots, they take a spinlock.
      
      With the RT patches applied, the spinlock becomes a mutex,
      and we get warnings about sleeping while in stop_machine():
      | [    0.319176] CPU features: detected: RAS Extension Support
      | [    0.319950] BUG: scheduling while atomic: migration/3/36/0x00000002
      | [    0.319955] Modules linked in:
      | [    0.319958] Preemption disabled at:
      | [    0.319969] [<ffff000008181ae4>] cpu_stopper_thread+0x7c/0x108
      | [    0.319973] CPU: 3 PID: 36 Comm: migration/3 Not tainted 4.19.1-rt3-00250-g330fc2c2a880 #2
      | [    0.319975] Hardware name: linux,dummy-virt (DT)
      | [    0.319976] Call trace:
      | [    0.319981]  dump_backtrace+0x0/0x148
      | [    0.319983]  show_stack+0x14/0x20
      | [    0.319987]  dump_stack+0x80/0xa4
      | [    0.319989]  __schedule_bug+0x94/0xb0
      | [    0.319991]  __schedule+0x510/0x560
      | [    0.319992]  schedule+0x38/0xe8
      | [    0.319994]  rt_spin_lock_slowlock_locked+0xf0/0x278
      | [    0.319996]  rt_spin_lock_slowlock+0x5c/0x90
      | [    0.319998]  rt_spin_lock+0x54/0x58
      | [    0.320000]  enable_smccc_arch_workaround_1+0xdc/0x260
      | [    0.320001]  __enable_cpu_capability+0x10/0x20
      | [    0.320003]  multi_cpu_stop+0x84/0x108
      | [    0.320004]  cpu_stopper_thread+0x84/0x108
      | [    0.320008]  smpboot_thread_fn+0x1e8/0x2b0
      | [    0.320009]  kthread+0x124/0x128
      | [    0.320010]  ret_from_fork+0x10/0x18
      
      Switch this to a raw spinlock, as we know this is only called with
      IRQs masked.

      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: acpi: Prepare for longer MADTs · 9eb1c92b
      Jeremy Linton authored
      The BAD_MADT_GICC_ENTRY check is a little too strict because
      it rejects MADT entries that don't match the currently known
      lengths. We should remove this restriction to avoid problems
      if the table length changes. Future code which might depend on
      additional fields should be written to validate those fields
      before using them, rather than trying to globally check
      known MADT version lengths.
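
      One way to validate a field before using it is to compare its offset
      and size against the length the entry reports for itself. The layout
      and names below are hypothetical and only illustrate the pattern;
      they are not the real MADT GICC structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subtable layout, for illustration only. */
struct example_subtable {
	uint8_t type;
	uint8_t length;       /* total length of this entry in bytes */
	uint16_t reserved;
	uint32_t base_field;  /* present in every known revision */
	uint32_t newer_field; /* only present in longer, newer entries */
};

/* True if the bytes [offset, offset + size) lie inside the entry. */
int entry_has_field(const struct example_subtable *e, size_t offset, size_t size)
{
	assert(e != NULL);
	return offset + size <= e->length;
}

/* Read newer_field only if the entry is long enough to contain it. */
uint32_t read_newer_field(const struct example_subtable *e, uint32_t fallback)
{
	if (!entry_has_field(e, offsetof(struct example_subtable, newer_field),
			     sizeof(e->newer_field)))
		return fallback;
	return e->newer_field;
}
```

      An entry that is longer than any currently known revision still
      passes this check, which is the behaviour the commit is after.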
      
      Link: https://lkml.kernel.org/r/20181012192937.3819951-1-jeremy.linton@arm.com
      Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
      [lorenzo.pieralisi@arm.com: added MADT macro comments]
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Al Stone <ahs3@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: io: Ensure calls to delay routines are ordered against prior readX() · 6460d320
      Will Deacon authored
      A relatively standard idiom for ensuring that a pair of MMIO writes to a
      device arrive at that device with a specified minimum delay between them
      is as follows:
      
      	writel_relaxed(42, dev_base + CTL1);
      	readl(dev_base + CTL1);
      	udelay(10);
      	writel_relaxed(42, dev_base + CTL2);
      
      the intention being that the read-back from the device will push the
      prior write to CTL1, and the udelay will hold up the write to CTL2 until
      at least 10us have elapsed.
      
      Unfortunately, on arm64 where the underlying delay loop is implemented
      as a read of the architected counter, the CPU does not guarantee
      ordering from the readl() to the delay loop and therefore the delay loop
      could in theory be speculated and not provide the desired interval
      between the two writes.
      
      Fix this in a similar manner to PowerPC by introducing a dummy control
      dependency on the output of readX() which, combined with the ISB in the
      read of the architected counter, guarantees that a subsequent delay loop
      can not be executed until the readX() has returned its result.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  5. 26 Nov, 2018 1 commit
    • arm64: mm: Don't wait for completion of TLB invalidation when page aging · 3403e56b
      Alex Van Brunt authored
      When transitioning a PTE from young to old as part of page aging, we
      can avoid waiting for the TLB invalidation to complete and therefore
      drop the subsequent DSB instruction. Whilst this opens up a race with
      page reclaim, where a PTE in active use via a stale, young TLB entry
      does not update the underlying descriptor, the worst thing that happens
      is that the page is reclaimed and then immediately faulted back in.
      
      Given that we have a DSB in our context-switch path, the window for a
      spurious reclaim is fairly limited and eliding the barrier is reported to
      boost NVMe/SSD accesses by over 10% on some platforms.
      
      A similar optimisation was made for x86 in commit b13b1d2d ("x86/mm:
      In the PTE swapout page reclaim case clear the accessed bit instead of
      flushing the TLB").

      Signed-off-by: Alex Van Brunt <avanbrunt@nvidia.com>
      Signed-off-by: Ashish Mhetre <amhetre@nvidia.com>
      [will: rewrote patch]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  6. 20 Nov, 2018 3 commits
    • arm64/module: use plt section indices for relocations · c8ebf64e
      Jessica Yu authored
      Instead of saving a pointer to the .plt and .init.plt sections to apply
      plt-based relocations, save and use their section indices instead.
      
      The mod->arch.{core,init}.plt pointers were problematic for livepatch
      because they pointed within temporary section headers (provided by the
      module loader via info->sechdrs) that would be freed after module load.
      Since livepatch modules may need to apply relocations post-module-load
      (for example, to patch a module that is loaded later), using section
      indices to offset into the section headers (instead of accessing them
      through a saved pointer) allows livepatch modules on arm64 to pass in
      their own copy of the section headers to apply_relocate_add() to apply
      delayed relocations.
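
      The difference between saving a pointer and saving an index can be
      sketched as follows. The types and names are hypothetical stand-ins
      for the ELF section headers and the per-module state, chosen only to
      show why an index stays valid when the caller supplies a different
      copy of the headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an ELF section header and module state. */
struct shdr_sketch {
	unsigned long sh_addr;
};

struct plt_sec_sketch {
	int plt_shndx; /* section index, not a pointer into sechdrs */
};

/* Resolve the PLT section against whichever header array the caller
 * passes in, so nothing refers to an array that has been freed. */
const struct shdr_sketch *plt_section(const struct shdr_sketch *sechdrs,
				      size_t nsec,
				      const struct plt_sec_sketch *pltsec)
{
	assert(sechdrs != NULL && pltsec != NULL);
	if (pltsec->plt_shndx < 0 || (size_t)pltsec->plt_shndx >= nsec)
		return NULL;
	return &sechdrs[pltsec->plt_shndx];
}
```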

      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Jessica Yu <jeyu@kernel.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: mm: apply r/o permissions of VM areas to its linear alias as well · c55191e9
      Ard Biesheuvel authored
      On arm64, we use block mappings and contiguous hints to map the linear
      region, to minimize the TLB footprint. However, this means that the
      entire region is mapped using read/write permissions, which we cannot
      modify at page granularity without having to take intrusive measures to
      prevent TLB conflicts.
      
      This means the linear aliases of pages belonging to read-only mappings
      (executable or otherwise) in the vmalloc region are also mapped read/write,
      and could potentially be abused to modify things like module code, bpf JIT
      code or other read-only data.
      
      So let's fix this, by extending the set_memory_ro/rw routines to take
      the linear alias into account. The consequence of enabling this is
      that we can no longer use block mappings or contiguous hints, so in
      cases where the TLB footprint of the linear region is a bottleneck,
      performance may be affected.
      
      Therefore, allow this feature to be runtime en/disabled, by setting
      rodata=full (or 'on' to disable just this enhancement, or 'off' to
      disable read-only mappings for code and r/o data entirely) on the
      kernel command line. Also, allow the default value to be set via a
      Kconfig option.
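
      The three settings described above map onto two flags. The parser
      below is a minimal user-space sketch of that mapping; the struct and
      function names are made up, and the kernel's own parsing may accept
      other spellings of the boolean values.

```c
#include <assert.h>
#include <string.h>

struct rodata_policy {
	int enabled; /* read-only mappings for code and r/o data */
	int full;    /* also apply r/o permissions to the linear alias */
};

/* Returns 0 on success, -1 for an unrecognised value (policy untouched). */
int parse_rodata_sketch(const char *arg, struct rodata_policy *p)
{
	assert(arg != NULL && p != NULL);

	if (strcmp(arg, "full") == 0) {
		p->enabled = 1;
		p->full = 1;
	} else if (strcmp(arg, "on") == 0) {
		p->enabled = 1;
		p->full = 0;
	} else if (strcmp(arg, "off") == 0) {
		p->enabled = 0;
		p->full = 0;
	} else {
		return -1;
	}
	return 0;
}
```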

      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • arm64: mm: purge lazily unmapped vm regions before changing permissions · b34d2ef0
      Ard Biesheuvel authored
      Call vm_unmap_aliases() every time we apply any changes to permission
      attributes of mappings in the vmalloc region. This avoids any potential
      issues resulting from lingering writable or executable aliases of
      mappings that should be read-only or non-executable, respectively.

      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  7. 18 Nov, 2018 20 commits