1. 20 Oct, 2021 2 commits
    • powerpc/smp: do not decrement idle task preempt count in CPU offline · 787252a1
      Nathan Lynch authored
      With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
      get:
      
      BUG: scheduling while atomic: swapper/1/0/0x00000000
      no locks held by swapper/1/0.
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
      Call Trace:
       dump_stack_lvl+0xac/0x108
       __schedule_bug+0xac/0xe0
       __schedule+0xcf8/0x10d0
       schedule_idle+0x3c/0x70
       do_idle+0x2d8/0x4a0
       cpu_startup_entry+0x38/0x40
       start_secondary+0x2ec/0x3a0
       start_secondary_prolog+0x10/0x14
      
      This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
      preempt count, for reasons explained in commit a7c2bb82 ("powerpc:
      Re-enable preemption before cpu_die()"), specifically "start_secondary()
      expects a preempt_count() of 0."
      
      However, since commit 2c669ef6 ("powerpc/preempt: Don't touch the idle
      task's preempt_count during hotplug") and commit f1a0a376 ("sched/core:
      Initialize the idle task with preemption disabled"), that justification no
      longer holds.
      
      The idle task isn't supposed to re-enable preemption, so remove the
      vestigial preempt_enable() from the CPU offline path.
      
      Tested with pseries and powernv in qemu, and pseries on PowerVM.
      
      Fixes: 2c669ef6 ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
    • powerpc/idle: Don't corrupt back chain when going idle · 496c5fe2
      Michael Ellerman authored
      In isa206_idle_insn_mayloss() we store various registers into the stack
      red zone, which is allowed.
      
      However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
      to 0(r1), which corrupts the stack back chain.
      
      We used to do the same in isa206_idle_insn_mayloss() itself, but we
      fixed that in 73287caa ("powerpc64/idle: Fix SP offsets when saving
      GPRs"). However, we missed that the macro also corrupts the back
      chain.
      
      Corrupting the back chain is bad for debuggability but doesn't
      necessarily cause a bug.
      
      However we recently changed the stack handling in some KVM code,
      and it now relies on the stack back chain being valid when it
      returns. The corruption causes that code to return with r1 pointing
      somewhere in kernel data; at some point LR is restored from the
      stack and we branch to NULL or some other invalid address.
      
      This only affects Power8 hosts running KVM guests, with
      dynamic_mt_modes enabled (which it is by default).
      
      The fixes tag below points to the commit that changed the KVM stack
      handling, exposing this bug. The actual corruption of the back chain has
      always existed since 948cf67c ("powerpc: Add NAP mode support on
      Power7 in HV mode").
      
      Fixes: 9b4416c5 ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
  2. 15 Oct, 2021 2 commits
    • KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest · cdeb5d7d
      Michael Ellerman authored
      We call idle_kvm_start_guest() from power7_offline() if the thread has
      been requested to enter KVM. We pass it the SRR1 value that was returned
      from power7_idle_insn() which tells us what sort of wakeup we're
      processing.
      
      Depending on the SRR1 value we pass in, the KVM code might enter the
      guest, or it might return to us to do some host action if the wakeup
      requires it.
      
      If idle_kvm_start_guest() is able to handle the wakeup and enter
      the guest, it is supposed to indicate that by returning a zero SRR1
      value to us.
      
      That was the behaviour prior to commit 10d91611 ("powerpc/64s:
      Reimplement book3s idle code in C"), however in that commit the
      handling of SRR1 was reworked, and the zeroing behaviour was lost.
      
      Returning from idle_kvm_start_guest() without zeroing the SRR1
      value can confuse the host offline code, causing the guest to
      crash, among other weirdness.
      
      Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211015133929.832061-2-mpe@ellerman.id.au
    • KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest() · 9b4416c5
      Michael Ellerman authored
      In commit 10d91611 ("powerpc/64s: Reimplement book3s idle code in
      C") kvm_start_guest() became idle_kvm_start_guest(). The old code
      allocated a stack frame on the emergency stack, but didn't use the
      frame to store anything, and also didn't store anything in its caller's
      frame.
      
      idle_kvm_start_guest(), on the other hand, is written more like a
      normal C function: it creates a frame on entry, and also stores
      CR/LR into its caller's frame (per the ABI). The problem is that
      there is no caller frame on the emergency stack.
      
      The emergency stack for a given CPU is allocated with:
      
        paca_ptrs[i]->emergency_sp = alloc_stack(limit, i) + THREAD_SIZE;
      
      So emergency_sp actually points to the first address above the
      emergency stack allocation for a given CPU; we must not store at or
      above it without first decrementing it to create a frame. This is
      different from the regular kernel stack, paca->kstack, which is
      initialised to point at an initial frame that is ready to use.
      
      idle_kvm_start_guest() stores the back chain, CR and LR, all of
      which write outside the allocation for the emergency stack. It then
      creates a stack frame and saves the non-volatile registers.
      Unfortunately the frame it creates is not large enough to fit the
      non-volatiles, so saving them also writes outside the emergency
      stack allocation.
      
      The end result is that we corrupt whatever is at 0-24 bytes, and 112-248
      bytes above the emergency stack allocation.
      
      In practice this has gone unnoticed because the memory immediately above
      the emergency stack happens to be used for other stack allocations,
      either another CPU's mc_emergency_sp or an IRQ stack. See the order of
      calls to irqstack_early_init() and emergency_stack_init().
      
      The low addresses of another stack are the top of that stack, and so are
      only used if that stack is under extreme pressure, which essentially
      never happens in practice - and if it did there's a high likelihood we'd
      crash due to that stack overflowing.
      
      Still, we shouldn't be corrupting someone else's stack, and it is purely
      luck that we aren't corrupting something else.
      
      To fix it we save CR/LR into the caller's frame using the existing r1 on
      entry, we then create a SWITCH_FRAME_SIZE frame (which has space for
      pt_regs) on the emergency stack with the backchain pointing to the
      existing stack, and then finally we switch to the new frame on the
      emergency stack.
      
      Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211015133929.832061-1-mpe@ellerman.id.au
  3. 13 Oct, 2021 1 commit
  4. 07 Oct, 2021 18 commits
  5. 30 Sep, 2021 1 commit
  6. 20 Sep, 2021 2 commits
    • Linux 5.15-rc2 · e4e737bb
      Linus Torvalds authored
    • pci_iounmap'2: Electric Boogaloo: try to make sense of it all · 316e8d79
      Linus Torvalds authored
      Nathan Chancellor reports that the recent change to pci_iounmap in
      commit 9caea000 ("parisc: Declare pci_iounmap() parisc version only
      when CONFIG_PCI enabled") causes build errors on arm64.
      
      It took me about two hours to convince myself that I think I know what
      the logic of that mess of #ifdef's in the <asm-generic/io.h> header
      file really aims to do, and rewrite it to be easier to follow.
      
      Famous last words.
      
      Anyway, the code has now been lifted from that grotty header file into
      lib/pci_iomap.c, and has fairly extensive comments about what the logic
      is.  It also avoids indirecting through another confusing (and badly
      named) helper function that has other preprocessor config conditionals.
      
      Let's see what odd architecture did something else strange in this area
      to break things.  But my arm64 cross build is clean.
      
      Fixes: 9caea000 ("parisc: Declare pci_iounmap() parisc version only when CONFIG_PCI enabled")
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Ulrich Teichert <krypton@ulrich-teichert.org>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 19 Sep, 2021 14 commits