1. 09 May, 2024 5 commits
  2. 03 May, 2024 6 commits
  3. 28 Apr, 2024 3 commits
    • arm64: defer clearing DAIF.D · 080297be
      Mark Rutland authored
      For historical reasons we unmask debug exceptions in __cpu_setup(), but
      it's not necessary to unmask debug exceptions this early in the
      boot/idle entry paths. It would be better to unmask debug exceptions
      later in C code as this simplifies the current code and will make it
      easier to rework exception masking logic to handle non-DAIF bits in
      future (e.g. PSTATE.{ALLINT,PM}).
      
      We started clearing DAIF.D in __cpu_setup() in commit:
      
        2ce39ad1 ("arm64: debug: unmask PSTATE.D earlier")
      
      At the time, we needed to ensure that DAIF.D was clear on the primary
      CPU before scheduling and preemption were possible, and chose to do this
      in __cpu_setup() so that this occurred in the same place for primary and
      secondary CPUs. As we cannot handle debug exceptions this early, we
      placed an ISB between initializing MDSCR_EL1 and clearing DAIF.D so that
      no exceptions should be triggered.
      
      Subsequently we rewrote the return-from-{idle,suspend} paths to use
      __cpu_setup() in commit:
      
        cabe1c81 ("arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va")
      
      ... which allowed for earlier use of the MMU and had the desirable
      property of using the same code to reset the CPU in the cold and warm
      boot paths. This introduced a bug: DAIF.D was clear while
      cpu_do_resume() restored MDSCR_EL1 and other control registers (e.g.
      breakpoint/watchpoint control/value registers), and so we could
      unexpectedly take debug exceptions.
      
      We fixed that in commit:
      
        744c6c37 ("arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1")
      
      ... by having cpu_do_resume() use the `disable_dbg` macro to set DAIF.D
      before restoring MDSCR_EL1 and other control registers. This relies on
      DAIF.D being subsequently cleared again in cpu_resume().
      
      Subsequently we reworked DAIF masking in commit:
      
        0fbeb318 ("arm64: explicitly mask all exceptions")
      
      ... where we began enforcing a policy that DAIF.D being set implies all
      other DAIF bits are set, and so e.g. we cannot take an IRQ while DAIF.D
      is set. As part of this the use of `disable_dbg` in cpu_resume() was
      replaced with `disable_daif` for consistency with the rest of the
      kernel.
      
      These days, there's no need to clear DAIF.D early within __cpu_setup():
      
      * setup_arch() clears DAIF.DA before scheduling and preemption are
        possible on the primary CPU, avoiding the problem we were
        originally trying to work around.
      
        Note: DAIF.I and DAIF.F are cleared later, when interrupts are
        enabled for the first time.
      
      * secondary_start_kernel() clears all DAIF bits before scheduling and
        preemption are possible on secondary CPUs.
      
        Note: with pseudo-NMI, the PMR is initialized here before any DAIF
        bits are cleared. Something similar will be necessary for the
        architectural NMI.
      
      * cpu_suspend() restores all DAIF bits when returning from idle,
        ensuring that we don't unexpectedly leave DAIF.D clear or set.
      
        Note: with pseudo-NMI, the PMR is initialized here before DAIF is
        cleared. Something similar will be necessary for the architectural
        NMI.
      
      This patch removes the unmasking of debug exceptions from __cpu_setup(),
      relying on the above locations to initialize DAIF. This allows some
      other cleanups:
      
      * It is no longer necessary for cpu_resume() to explicitly mask debug
        (or other) exceptions, as it is always called with all DAIF bits set.
        Thus we drop the use of `disable_daif`.
      
      * The `enable_dbg` macro is no longer used, and so is dropped.
      
      * It is no longer necessary to have an ISB immediately after
        initializing MDSCR_EL1 in __cpu_setup(), and we can revert to relying
        on the context synchronization that occurs when the MMU is enabled
        between __cpu_setup() and the code which clears DAIF.D.
      
      Comments are added to setup_arch() and secondary_start_kernel() to
      explain the initial unmasking of the DAIF bits.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20240422113523.4070414-3-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: assembler: update stale comment for disable_step_tsk · 3a2d2ca4
      Mark Rutland authored
      A comment in the disable_step_tsk macro refers to synchronising with
      enable_dbg, as historically the entry code used enable_dbg to unmask
      debug exceptions after disabling single-stepping.
      
      These days the unmasking happens in entry-common.c via
      local_daif_restore() or local_daif_inherit(), so the comment is stale.
      This logic is likely to change in future, so it would be best to avoid
      referring to those macros specifically.
      
      Update the comment to take this into account, and describe it in terms
      of clearing DAIF.D so that it doesn't matter where this logic lives or
      what it is called.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Reviewed-by: Mark Brown <broonie@kernel.org>
      Link: https://lore.kernel.org/r/20240422113523.4070414-2-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64/sysreg: Update PIE permission encodings · 12d712dc
      Shiqi Liu authored
      Fix a left-shift overflow issue when the parameter idx is greater than
      or equal to 8 in the calculation of perm in the PIRx_ELx_PERM macro.
      
      Fix this by modifying the encoding to use a long integer type.
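
      A minimal sketch of the problem (hedged: the PIE_* names and values
      below are illustrative, not the kernel's exact definitions):

        /* perm is shifted left by idx * 4. For idx >= 8 the shift is 32
         * bits or more, which overflows a plain 32-bit int constant and
         * is undefined behaviour in C. */
        #define PIRx_ELx_PERM(idx, perm)  ((perm) << ((idx) * 4))

        #define PIE_RW            0x5     /* int: breaks for idx >= 8   */
        #define PIE_RW_UL         0x5UL   /* unsigned long: 64-bit safe */

        unsigned long bad  = PIRx_ELx_PERM(8, PIE_RW);    /* UB          */
        unsigned long good = PIRx_ELx_PERM(8, PIE_RW_UL); /* 0x500000000 */
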
      Signed-off-by: Shiqi Liu <shiqiliu@hust.edu.cn>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lore.kernel.org/r/20240421063328.29710-1-shiqiliu@hust.edu.cn
      Signed-off-by: Will Deacon <will@kernel.org>
  4. 19 Apr, 2024 1 commit
  5. 18 Apr, 2024 2 commits
  6. 12 Apr, 2024 5 commits
    • arm64: mm: Don't remap pgtables for allocate vs populate · 0e9df1c9
      Ryan Roberts authored
      During linear map pgtable creation, each pgtable is fixmapped /
      fixunmapped twice; once during allocation to zero the memory, and
      again during population to write the entries. This means each table has
      2 TLB invalidations issued against it. Let's fix this so that each table
      is only fixmapped/fixunmapped once, halving the number of TLBIs, and
      improving performance.
      
      Achieve this by separating allocation and initialization (zeroing) of
      the page. The allocated page is now fixmapped directly by the walker and
      initialized, before being populated and finally fixunmapped.
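
      A rough sketch of the reworked flow (hedged: pte_set_fixmap() and
      pte_clear_fixmap() follow the kernel's naming, but the allocator
      helper here is hypothetical and details are simplified):

        /* Allocate without zeroing; the walker now owns initialization. */
        phys_addr_t pa = pgtable_alloc_noinit();

        pte_t *ptep = pte_set_fixmap(pa);  /* single fixmap per table     */
        memset(ptep, 0, PAGE_SIZE);        /* zero in place               */
        /* ... write the entries ... */
        pte_clear_fixmap();                /* single fixunmap => one TLBI */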
      
      This approach keeps the change small, but has the side effect that late
      allocations (using __get_free_page()) must also go through the generic
      memory clearing routine. So let's tell __get_free_page() not to zero the
      memory to avoid duplication.
      
      Additionally this approach means that fixmap/fixunmap is still used for
      late pgtable modifications. That's not technically needed since the
      memory is all mapped in the linear map by that point. That's left as a
      possible future optimization if found to be needed.
      
      Execution time of map_mem(), which creates the kernel linear map page
      tables, was measured on different machines with different RAM configs:
      
                     | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
                     | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
      ---------------|-------------|-------------|-------------|-------------
                     |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
      ---------------|-------------|-------------|-------------|-------------
      before         |   11   (0%) |  161   (0%) |  656   (0%) |  1654   (0%)
      after          |   10 (-11%) |  104 (-35%) |  438 (-33%) |  1223 (-26%)
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
      Tested-by: Eric Chanudet <echanude@redhat.com>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20240412131908.433043-4-ryan.roberts@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Batch dsb and isb when populating pgtables · 1fcb7cea
      Ryan Roberts authored
      After removing unnecessary TLBIs, the next bottleneck when creating the
      page tables for the linear map is DSB and ISB, which were previously
      issued per-pte in __set_pte(). Since we are writing multiple ptes in a
      given pte table, we can elide these barriers and insert them once we
      have finished writing to the table.
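
      A hedged sketch of the batching (helper names follow the patch's
      style but are illustrative, and the loop writes one fixed pte value
      for brevity):

        /* Write every pte in the table with plain stores, then issue one
         * DSB/ISB pair for the whole table rather than one pair per pte. */
        static void init_ptes(pte_t *ptep, pte_t pte, int nr)
        {
                int i;

                for (i = 0; i < nr; i++)
                        __set_pte_nosync(ptep + i, pte);

                dsb(ishst);     /* publish the new entries         */
                isb();          /* resynchronize the local context */
        }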
      
      Execution time of map_mem(), which creates the kernel linear map page
      tables, was measured on different machines with different RAM configs:
      
                     | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
                     | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
      ---------------|-------------|-------------|-------------|-------------
                     |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
      ---------------|-------------|-------------|-------------|-------------
      before         |   78   (0%) |  435   (0%) | 1723   (0%) |  3779   (0%)
      after          |   11 (-86%) |  161 (-63%) |  656 (-62%) |  1654 (-56%)
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
      Tested-by: Eric Chanudet <echanude@redhat.com>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20240412131908.433043-3-ryan.roberts@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Don't remap pgtables per-cont(pte|pmd) block · 5c63db59
      Ryan Roberts authored
      A large part of the kernel boot time is creating the kernel linear map
      page tables. When rodata=full, all memory is mapped by pte. And when
      there is lots of physical ram, there are lots of pte tables to populate.
      The primary cost associated with this is mapping and unmapping the pte
      table memory in the fixmap; at unmap time, the TLB entry must be
      invalidated and this is expensive.
      
      Previously, each pmd and pte table was fixmapped/fixunmapped for each
      cont(pte|pmd) block of mappings (16 entries with 4K granule). With 512
      entries per table, that is 32 fixmap/fixunmap cycles, so we ended up
      issuing 32 TLBIs per (pmd|pte) table during the population phase.
      
      Let's fix that, and fixmap/fixunmap each page once per population, for a
      saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
      speedup.
      
      Execution time of map_mem(), which creates the kernel linear map page
      tables, was measured on different machines with different RAM configs:
      
                     | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
                     | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
      ---------------|-------------|-------------|-------------|-------------
                     |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
      ---------------|-------------|-------------|-------------|-------------
      before         |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
      after          |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
      Tested-by: Eric Chanudet <echanude@redhat.com>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20240412131908.433043-2-ryan.roberts@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: boot: Support Flat Image Tree · 7a23b027
      Simon Glass authored
      Add a script which produces a Flat Image Tree (FIT), a single file
      containing the built kernel and associated devicetree files.
      Compression defaults to gzip, which gives a good balance of size and
      performance.
      
      The files compress from about 86MB to 24MB using this approach.
      
      The FIT can be used by bootloaders which support it, such as U-Boot
      and Linuxboot. It permits automatic selection of the correct
      devicetree, matching the compatible string of the running board with
      the closest compatible string in the FIT. There is no need for
      filenames or other workarounds.
      
      Add a 'make image.fit' build target for arm64, as well.
      
      The FIT can be examined using 'dumpimage -l'.
      
      This uses the 'dtbs-list' file but processes only .dtb files, ignoring
      the overlay .dtbo files.
      
      This feature requires pylibfdt (use 'pip install libfdt'). It also
      requires compression utilities for the algorithm being used. Supported
      compression options are the same as for the Image.xxx files. Use
      FIT_COMPRESSION to select an algorithm other than gzip.
      
      While FIT supports a ramdisk / initrd, no attempt is made to support
      this here, since it must be built separately from the Linux build.
      Signed-off-by: Simon Glass <sjg@chromium.org>
      Acked-by: Masahiro Yamada <masahiroy@kernel.org>
      Link: https://lore.kernel.org/r/20240329032836.141899-3-sjg@chromium.org
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: Add BOOT_TARGETS variable · 0dc1670b
      Simon Glass authored
      Add a new variable containing a list of possible targets. Mark them as
      phony. This matches the approach taken for arch/arm.
      Signed-off-by: Simon Glass <sjg@chromium.org>
      Reviewed-by: Nicolas Schier <n.schier@avm.de>
      Link: https://lore.kernel.org/r/20240329032836.141899-2-sjg@chromium.org
      Signed-off-by: Will Deacon <will@kernel.org>
  7. 10 Apr, 2024 1 commit
  8. 07 Apr, 2024 4 commits
  9. 06 Apr, 2024 13 commits