1. 01 Jul, 2019 3 commits
  2. 25 Jun, 2019 1 commit
    • Nicholas Piggin's avatar
      powerpc/64s/exception: Fix machine check early corrupting AMR · e13e7cd4
      Nicholas Piggin authored
      The early machine check runs in real mode, so locking is unnecessary.
      Worse, the windup does not restore AMR, so this can result in a false
      KUAP fault after a recoverable machine check hits inside a user copy
      operation.
      
      Fix this similarly to HMI by just avoiding the kuap lock in the
      early machine check handler (it will be set by the late handler that
      runs in virtual mode if that runs). If the virtual mode handler is
      reached, it will lock and restore the AMR.
      
      Fixes: 890274c2 ("powerpc/64s: Implement KUAP for Radix MMU")
      Cc: Russell Currey <ruscur@russell.cc>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e13e7cd4
  3. 20 Jun, 2019 4 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry · 3c25ab35
      Suraj Jitindar Singh authored
      If we enter an L1 guest with a pending decrementer exception then this
      is cleared on guest exit if the guest has writtien a positive value
      into the decrementer (indicating that it handled the decrementer
      exception) since there is no other way to detect that the guest has
      handled the pending exception and that it should be dequeued. In the
      event that the L1 guest tries to run a nested (L2) guest immediately
      after this and the L2 guest decrementer is negative (which is loaded
      by L1 before making the H_ENTER_NESTED hcall), then the pending
      decrementer exception isn't cleared and the L2 entry is blocked since
      L1 has a pending exception, even though L1 may have already handled
      the exception and written a positive value for it's decrementer. This
      results in a loop of L1 trying to enter the L2 guest and L0 blocking
      the entry since L1 has an interrupt pending with the outcome being
      that L2 never gets to run and hangs.
      
      Fix this by clearing any pending decrementer exceptions when L1 makes
      the H_ENTER_NESTED hcall since it won't do this if it's decrementer
      has gone negative, and anyway it's decrementer has been communicated
      to L0 in the hdec_expires field and L0 will return control to L1 when
      this goes negative by delivering an H_DECREMENTER exception.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      3c25ab35
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer · 86953770
      Suraj Jitindar Singh authored
      On POWER9 the decrementer can operate in large decrementer mode where
      the decrementer is 56 bits and signed extended to 64 bits. When not
      operating in this mode the decrementer behaves as a 32 bit decrementer
      which is NOT signed extended (as on POWER8).
      
      Currently when reading a guest decrementer value we don't take into
      account whether the large decrementer is enabled or not, and this
      means the value will be incorrect when the guest is not using the
      large decrementer. Fix this by sign extending the value read when the
      guest isn't using the large decrementer.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      86953770
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries · 50087112
      Suraj Jitindar Singh authored
      When a guest vcpu moves from one physical thread to another it is
      necessary for the host to perform a tlb flush on the previous core if
      another vcpu from the same guest is going to run there. This is because the
      guest may use the local form of the tlb invalidation instruction meaning
      stale tlb entries would persist where it previously ran. This is handled
      on guest entry in kvmppc_check_need_tlb_flush() which calls
      flush_guest_tlb() to perform the tlb flush.
      
      Previously the generic radix__local_flush_tlb_lpid_guest() function was
      used, however the functionality was reimplemented in flush_guest_tlb()
      to avoid the trace_tlbie() call as the flushing may be done in real
      mode. The reimplementation in flush_guest_tlb() was missing an erat
      invalidation after flushing the tlb.
      
      This lead to observable memory corruption in the guest due to the
      caching of stale translations. Fix this by adding the erat invalidation.
      
      Fixes: 70ea13f6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      50087112
    • Alexey Kardashevskiy's avatar
      powerpc/pci/of: Fix OF flags parsing for 64bit BARs · df5be5be
      Alexey Kardashevskiy authored
      When the firmware does PCI BAR resource allocation, it passes the assigned
      addresses and flags (prefetch/64bit/...) via the "reg" property of
      a PCI device device tree node so the kernel does not need to do
      resource allocation.
      
      The flags are stored in resource::flags - the lower byte stores
      PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc.
      Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated,
      such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc.
      When parsing the "reg" property, we copy the prefetch flag but we skip
      on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync.
      
      The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions:
      1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch()
      or by passing "/chosen/linux,pci-probe-only");
      2. we request resource alignment (by passing pci=resource_alignment=
      via the kernel cmd line to request PAGE_SIZE alignment or defining
      ppc_md.pcibios_default_alignment which returns anything but 0). Note that
      the alignment requests are ignored if PCI_PROBE_ONLY is enabled.
      
      With 1) and 2), the generic PCI code in the kernel unconditionally
      decides to:
      - reassign the BARs in pci_specified_resource_alignment() (works fine)
      - write new BARs to the device - this fails for 64bit BARs as the generic
      code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits
      of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping
      in the hypervisor.
      
      This fixes the issue by copying the flag. This is useful if we want to
      enforce certain BAR alignment per platform as handling subpage sized BARs
      is proven to cause problems with hotplug (SLOF already aligns BARs to 64k).
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarSam Bobroff <sbobroff@linux.ibm.com>
      Reviewed-by: default avatarOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: default avatarShawn Anastasio <shawn@anastas.io>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      df5be5be
  4. 19 Jun, 2019 13 commits
    • Christoph Hellwig's avatar
      powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac · 9739ab7e
      Christoph Hellwig authored
      With the strict dma mask checking introduced with the switch to
      the generic DMA direct code common wifi chips on 32-bit powerbooks
      stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
      to allow them to reliably allocate dma coherent memory.
      
      Fixes: 65a21b71 ("powerpc/dma: remove dma_nommu_dma_supported")
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Acked-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9739ab7e
    • Nicholas Piggin's avatar
      powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP · d909f910
      Nicholas Piggin authored
      This sets the HAVE_ARCH_HUGE_VMAP option, and defines the required
      page table functions.
      
      This enables huge (2MB and 1GB) ioremap mappings. I don't have a
      benchmark for this change, but huge vmap will be used by a later core
      kernel change to enable huge vmalloc memory mappings. This improves
      cached `git diff` performance by about 5% on a 2-node POWER9 with 32MB
      size dentry cache hash.
      
        Profiling git diff dTLB misses with a vanilla kernel:
      
        81.75%  git      [kernel.vmlinux]    [k] __d_lookup_rcu
         7.21%  git      [kernel.vmlinux]    [k] strncpy_from_user
         1.77%  git      [kernel.vmlinux]    [k] find_get_entry
         1.59%  git      [kernel.vmlinux]    [k] kmem_cache_free
      
                  40,168      dTLB-miss
             0.100342754 seconds time elapsed
      
        With powerpc huge vmalloc:
      
                   2,987      dTLB-miss
             0.095933138 seconds time elapsed
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d909f910
    • Nicholas Piggin's avatar
      powerpc/64s/radix: ioremap use ioremap_page_range · d38153f9
      Nicholas Piggin authored
      Radix can use ioremap_page_range for ioremap, after slab is available.
      This makes it possible to enable huge ioremap mapping support.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d38153f9
    • Nicholas Piggin's avatar
      powerpc/64: __ioremap_at clean up in the error case · a72808a7
      Nicholas Piggin authored
      __ioremap_at error handling is wonky, it requires caller to clean up
      after it. Implement a helper that does the map and error cleanup and
      remove the requirement from the caller.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a72808a7
    • Anju T Sudhakar's avatar
      powerpc/perf: Use cpumask_last() to determine the designated cpu for nest/core units. · 9c9f8fb7
      Anju T Sudhakar authored
      Nest and core IMC (In-Memory Collection counters) assigns a particular
      cpu as the designated target for counter data collection. During
      system boot, the first online cpu in a chip gets assigned as the
      designated cpu for that chip(for nest-imc) and the first online cpu in
      a core gets assigned as the designated cpu for that core(for
      core-imc).
      
      If the designated cpu goes offline, the next online cpu from the same
      chip(for nest-imc)/core(for core-imc) is assigned as the next target,
      and the event context is migrated to the target cpu. Currently,
      cpumask_any_but() function is used to find the target cpu. Though this
      function is expected to return a `random` cpu, this always returns the
      next online cpu.
      
      If all cpus in a chip/core is offlined in a sequential manner,
      starting from the first cpu, the event migration has to happen for all
      the cpus which goes offline. Since the migration process involves a
      grace period, the total time taken to offline all the cpus will be
      significantly high.
      
      Example:
        In a system which has 2 sockets, with
        NUMA node0 CPU(s):     0-87
        NUMA node8 CPU(s):     88-175
      
        Time taken to offline cpu 88-175:
        real    2m56.099s
        user    0m0.191s
        sys     0m0.000s
      
      Use cpumask_last() to choose the target cpu, when the designated cpu
      goes online, so the migration will happen only when the last_cpu in
      the mask goes offline. This way the time taken to offline all cpus in
      a chip/core can be reduced.
      
      With the patch:
      
        Time taken  to offline cpu 88-175:
        real    0m12.207s
        user    0m0.171s
        sys     0m0.000s
      
      Offlining all cpus in reverse order is also taken care because,
      cpumask_any_but() is used to find the designated cpu if the last cpu
      in the mask goes offline. Since cpumask_any_but() always return the
      first cpu in the mask, that becomes the designated cpu and migration
      will happen only when the first_cpu in the mask goes offline.
      
      Example: With the patch,
      
        Time taken to offline cpu from 175-88:
        real    0m9.330s
        user    0m0.110s
        sys     0m0.000s
      Signed-off-by: default avatarAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: default avatarMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9c9f8fb7
    • Shaokun Zhang's avatar
      powerpc/64s: Fix misleading SPR and timebase information · 87997471
      Shaokun Zhang authored
      pr_info shows SPR and timebase as a decimal value with a '0x'
      prefix, which is somewhat misleading.
      
      Fix it to print hexadecimal, as was intended.
      
      Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C")
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarShaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      87997471
    • Nathan Lynch's avatar
      powerpc/pseries: avoid blocking in irq when queuing hotplug events · 348ea30f
      Nathan Lynch authored
      A couple of bugs in queue_hotplug_event():
      
      1. Unchecked kmalloc result which could lead to an oops.
      2. Use of GFP_KERNEL allocations in interrupt context (this code's
         only caller is ras_hotplug_interrupt()).
      
      Use kmemdup to avoid open-coding the allocation+copy and check for
      failure; use GFP_ATOMIC for both allocations.
      
      Ultimately it probably would be better to avoid or reduce allocations
      in this path if possible.
      Signed-off-by: default avatarNathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      348ea30f
    • Ravi Bangoria's avatar
      powerpc/watchpoint: Restore NV GPRs while returning from exception · f474c28f
      Ravi Bangoria authored
      powerpc hardware triggers watchpoint before executing the instruction.
      To make trigger-after-execute behavior, kernel emulates the
      instruction. If the instruction is 'load something into non-volatile
      register', exception handler should restore emulated register state
      while returning back, otherwise there will be register state
      corruption. eg, adding a watchpoint on a list can corrput the list:
      
        # cat /proc/kallsyms | grep kthread_create_list
        c00000000121c8b8 d kthread_create_list
      
      Add watchpoint on kthread_create_list->prev:
      
        # perf record -e mem:0xc00000000121c8c0
      
      Run some workload such that new kthread gets invoked. eg, I just
      logged out from console:
      
        list_add corruption. next->prev should be prev (c000000001214e00), \
      	but was c00000000121c8b8. (next=c00000000121c8b8).
        WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
        CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
        ...
        NIP __list_add_valid+0xb4/0xc0
        LR __list_add_valid+0xb0/0xc0
        Call Trace:
        __list_add_valid+0xb0/0xc0 (unreliable)
        __kthread_create_on_node+0xe0/0x260
        kthread_create_on_node+0x34/0x50
        create_worker+0xe8/0x260
        worker_thread+0x444/0x560
        kthread+0x160/0x1a0
        ret_from_kernel_thread+0x5c/0x70
      
      List corruption happened because it uses 'load into non-volatile
      register' instruction:
      
      Snippet from __kthread_create_on_node:
      
        c000000000136be8:     addis   r29,r2,-19
        c000000000136bec:     ld      r29,31424(r29)
              if (!__list_add_valid(new, prev, next))
        c000000000136bf0:     mr      r3,r30
        c000000000136bf4:     mr      r5,r28
        c000000000136bf8:     mr      r4,r29
        c000000000136bfc:     bl      c00000000059a2f8 <__list_add_valid+0x8>
      
      Register state from WARN_ON():
      
        GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
        GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
        GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
        GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
        GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
        GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
        GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
        GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0
      
      Watchpoint hit at 0xc000000000136bec.
      
        addis   r29,r2,-19
         => r29 = 0xc000000001344e00 + (-19 << 16)
         => r29 = 0xc000000001214e00
      
        ld      r29,31424(r29)
         => r29 = *(0xc000000001214e00 + 31424)
         => r29 = *(0xc00000000121c8c0)
      
      0xc00000000121c8c0 is where we placed a watchpoint and thus this
      instruction was emulated by emulate_step. But because handle_dabr_fault
      did not restore emulated register state, r29 still contains stale
      value in above register state.
      
      Fixes: 5aae8a53 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
      Signed-off-by: default avatarRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: stable@vger.kernel.org # 2.6.36+
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f474c28f
    • Greg Kroah-Hartman's avatar
      cxl: no need to check return value of debugfs_create functions · 1b7de1df
      Greg Kroah-Hartman authored
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Because there's no need to check, also make the return value of the
      local debugfs_create_io_x64() call void, as no one ever did anything
      with the return value (as they did not need to.)
      
      And make the cxl_debugfs_* calls return void as no one was even checking
      their return value at all.
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarFrederic Barrat <fbarrat@linux.ibm.com>
      Acked-by: default avatarAndrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      1b7de1df
    • Geert Uytterhoeven's avatar
      powerpc/ps3: Use [] to denote a flexible array member · 0b1be03f
      Geert Uytterhoeven authored
      Flexible array members should be denoted using [] instead of [0], else
      gcc will not warn when they are no longer at the end of the structure.
      Signed-off-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0b1be03f
    • Andreas Schwab's avatar
      powerpc/mm/32s: fix condition that is always true · 46c2478a
      Andreas Schwab authored
      Move a misplaced paren that makes the condition always true.
      
      Fixes: 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
      Cc: stable@vger.kernel.org # v5.1+
      Signed-off-by: default avatarAndreas Schwab <schwab@linux-m68k.org>
      Reviewed-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      46c2478a
    • Christophe Leroy's avatar
      powerpc/32s: fix suspend/resume when IBATs 4-7 are used · 6ecb78ef
      Christophe Leroy authored
      Previously, only IBAT1 and IBAT2 were used to map kernel linear mem.
      Since commit 63b2bc61 ("powerpc/mm/32s: Use BATs for
      STRICT_KERNEL_RWX"), we may have all 8 BATs used for mapping
      kernel text. But the suspend/restore functions only save/restore
      BATs 0 to 3, and clears BATs 4 to 7.
      
      Make suspend and restore functions respectively save and reload
      the 8 BATs on CPUs having MMU_FTR_USE_HIGH_BATS feature.
      Reported-by: default avatarAndreas Schwab <schwab@linux-m68k.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      6ecb78ef
    • Gustavo Romero's avatar
      selftests/powerpc: Fix earlyclobber in tm-vmxcopy · 8d0f1e05
      Gustavo Romero authored
      In some cases, compiler can allocate the same register for operand 'res'
      and 'vecoutptr', resulting in segfault at 'stxvd2x 40,0,%[vecoutptr]'
      because base register will contain 1, yielding a false-positive.
      
      This is because output 'res' must be marked as an earlyclobber operand so
      it may not overlap an input operand ('vecoutptr').
      Signed-off-by: default avatarGustavo Romero <gromero@linux.vnet.ibm.com>
      Reviewed-by: default avatarSegher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      8d0f1e05
  5. 18 Jun, 2019 2 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode · 84b02824
      Suraj Jitindar Singh authored
      The hcall H_SET_DAWR is used by a guest to set the data address
      watchpoint register (DAWR). This hcall is handled in the host in
      kvmppc_h_set_dawr() which can be called in either real mode on the
      guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S,
      or in virtual mode when called from kvmppc_pseries_do_hcall() in
      book3s_hv.c.
      
      The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in
      the vcpu struct accordingly and then also writes the respective values
      into the DAWR and DAWRX registers directly. It is necessary to write
      the registers directly here when calling the function in real mode
      since the path to re-enter the guest won't do this. However when in
      virtual mode the host DAWR and DAWRX values have already been
      restored, and so writing the registers would overwrite these.
      Additionally there is no reason to write the guest values here as
      these will be read from the vcpu struct and written to the registers
      appropriately the next time the vcpu is run.
      
      This also avoids the case when handling h_set_dawr for a nested guest
      where the guest hypervisor isn't able to write the DAWR and DAWRX
      registers directly and must rely on the real hypervisor to do this for
      it when it calls H_ENTER_NESTED.
      
      Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      84b02824
    • Michael Neuling's avatar
      KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr() · fabb2efc
      Michael Neuling authored
      Commit c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      screwed up some assembler and corrupted a pointer in r3. This resulted
      in crashes like the below:
      
        BUG: Kernel NULL pointer dereference at 0x000013bf
        Faulting instruction address: 0xc00000000010b044
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3
        NIP:  c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4
        REGS: c00000179b397710 TRAP: 0300   Not tainted  (5.2.0-rc4+)
        MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42244842  XER: 00000000
        CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0
        GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff
        GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0
        ...
        NIP kvmppc_h_set_dabr+0x50/0x68
        LR  kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv]
        Call Trace:
          0xc0000017f11c0000 (unreliable)
          kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
          kvmppc_vcpu_run+0x34/0x48 [kvm]
          kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
          kvm_vcpu_ioctl+0x460/0x850 [kvm]
          do_vfs_ioctl+0xe4/0xb40
          ksys_ioctl+0xc4/0x110
          sys_ioctl+0x28/0x80
          system_call+0x5c/0x70
        Instruction dump:
        4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff
        4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6
      
      Fix the bug by only changing r3 when we are returning immediately.
      
      Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      Signed-off-by: default avatarMichael Neuling <mikey@neuling.org>
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reported-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      fabb2efc
  6. 15 Jun, 2019 7 commits
  7. 14 Jun, 2019 3 commits
    • Nathan Lynch's avatar
      powerpc/pseries: Fix oops in hotplug memory notifier · 0aa82c48
      Nathan Lynch authored
      During post-migration device tree updates, we can oops in
      pseries_update_drconf_memory() if the source device tree has an
      ibm,dynamic-memory-v2 property and the destination has a
      ibm,dynamic_memory (v1) property. The notifier processes an "update"
      for the ibm,dynamic-memory property but it's really an add in this
      scenario. So make sure the old property object is there before
      dereferencing it.
      
      Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: default avatarNathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0aa82c48
    • Daniel Axtens's avatar
      powerpc/pseries/hvconsole: Fix stack overread via udbg · 934bda59
      Daniel Axtens authored
      While developing KASAN for 64-bit book3s, I hit the following stack
      over-read.
      
      It occurs because the hypercall to put characters onto the terminal
      takes 2 longs (128 bits/16 bytes) of characters at a time, and so
      hvc_put_chars() would unconditionally copy 16 bytes from the argument
      buffer, regardless of supplied length. However, udbg_hvc_putc() can
      call hvc_put_chars() with a single-byte buffer, leading to the error.
      
        ==================================================================
        BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110
        Read of size 8 at addr c0000000023e7a90 by task swapper/0
      
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113
        Call Trace:
          dump_stack+0x104/0x154 (unreliable)
          print_address_description+0xa0/0x30c
          __kasan_report+0x20c/0x224
          kasan_report+0x18/0x30
          __asan_report_load8_noabort+0x24/0x40
          hvc_put_chars+0xdc/0x110
          hvterm_raw_put_chars+0x9c/0x110
          udbg_hvc_putc+0x154/0x200
          udbg_write+0xf0/0x240
          console_unlock+0x868/0xd30
          register_console+0x970/0xe90
          register_early_udbg_console+0xf8/0x114
          setup_arch+0x108/0x790
          start_kernel+0x104/0x784
          start_here_common+0x1c/0x534
      
        Memory state around the buggy address:
         c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
        >c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
                                 ^
         c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Document that a 16-byte buffer is requred, and provide it in udbg.
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      934bda59
    • Masahiro Yamada's avatar
      ocxl: do not use C++ style comments in uapi header · 2305ff22
      Masahiro Yamada authored
      Linux kernel tolerates C++ style comments these days. Actually, the
      SPDX License tags for .c files start with //.
      
      On the other hand, uapi headers are written in more strict C, where
      the C++ comment style is forbidden.
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: default avatarFrederic Barrat <fbarrat@linux.ibm.com>
      Acked-by: default avatarAndrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      2305ff22
  8. 13 Jun, 2019 2 commits
  9. 12 Jun, 2019 1 commit
    • Michael Ellerman's avatar
      powerpc/mm/64s/hash: Reallocate context ids on fork · ca72d883
      Michael Ellerman authored
      When using the Hash Page Table (HPT) MMU, userspace memory mappings
      are managed at two levels. Firstly in the Linux page tables, much like
      other architectures, and secondly in the SLB (Segment Lookaside
      Buffer) and HPT. It's the SLB and HPT that are actually used by the
      hardware to do translations.
      
      As part of the series adding support for 4PB user virtual address
      space using the hash MMU, we added support for allocating multiple
      "context ids" per process, one for each 512TB chunk of address space.
      These are tracked in an array called extended_id in the mm_context_t
      of a process that has done a mapping above 512TB.
      
      If such a process forks (ie. clone(2) without CLONE_VM set) it's mm is
      copied, including the mm_context_t, and then init_new_context() is
      called to reinitialise parts of the mm_context_t as appropriate to
      separate the address spaces of the two processes.
      
      The key step in ensuring the two processes have separate address
      spaces is to allocate a new context id for the process, this is done
      at the beginning of hash__init_new_context(). If we didn't allocate a
      new context id then the two processes would share mappings as far as
      the SLB and HPT are concerned, even though their Linux page tables
      would be separate.
      
      For mappings above 512TB, which use the extended_id array, we
      neglected to allocate new context ids on fork, meaning the parent and
      child use the same ids and therefore share those mappings even though
      they're supposed to be separate. This can lead to the parent seeing
      writes done by the child, which is essentially memory corruption.
      
      There is an additional exposure which is that if the child process
      exits, all its context ids are freed, including the context ids that
      are still in use by the parent for mappings above 512TB. One or more
      of those ids can then be reallocated to a third process, that process
      can then read/write to the parent's mappings above 512TB. Additionally
      if the freed id is used for the third process's primary context id,
      then the parent is able to read/write to the third process's mappings
      *below* 512TB.
      
      All of these are fundamental failures to enforce separation between
      processes. The only mitigating factor is that the bug only occurs if a
      process creates mappings above 512TB, and most applications still do
      not create such mappings.
      
      Only machines using the hash page table MMU are affected, eg. PowerPC
      970 (G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines
      (powernv) use the Radix MMU and are not affected, unless the machine
      has been explicitly booted in HPT mode (using disable_radix on the
      kernel command line). KVM guests on Power9 may be affected if the host
      or guest is configured to use the HPT MMU. LPARs under PowerVM on
      Power9 are affected as they always use the HPT MMU. Kernels built with
      PAGE_SIZE=4K are not affected.
      
      The fix is relatively simple, we need to reallocate context ids for
      all extended mappings on fork.
      
      Fixes: f384796c ("powerpc/mm: Add support for handling > 512TB address in SLB miss")
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ca72d883
  10. 07 Jun, 2019 4 commits
    • Christophe Leroy's avatar
      powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX · c21f5a9e
      Christophe Leroy authored
      When booting through OF, setup_disp_bat() does nothing because
      disp_BAT are not set. By change, it used to work because BOOTX
      buffer is mapped 1:1 at address 0x81000000 by the bootloader, and
      btext_setup_display() sets virt addr same as phys addr.
      
      But since commit 215b8237 ("powerpc/32s: set up an early static
      hash table for KASAN."), a temporary page table overrides the
      bootloader mapping.
      
      This 0x81000000 is also problematic with the newly implemented
      Kernel Userspace Access Protection (KUAP) because it is within user
      address space.
      
      This patch fixes those issues by properly setting disp_BAT through
      a call to btext_prepare_BAT(), allowing setup_disp_bat() to
      properly setup BAT3 for early bootx screen buffer access.
      Reported-by: default avatarMathieu Malaterre <malat@debian.org>
      Fixes: 215b8237 ("powerpc/32s: set up an early static hash table for KASAN.")
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: default avatarMathieu Malaterre <malat@debian.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c21f5a9e
    • Nicholas Piggin's avatar
      powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate() · a00196a2
      Nicholas Piggin authored
      The change to pmdp_invalidate() to mark the pmd with _PAGE_INVALID
      broke the synchronisation against lock free lookups,
      __find_linux_pte()'s pmd_none() check no longer returns true for such
      cases.
      
      Fix this by adding a check for this condition as well.
      
      Fixes: da7ad366 ("powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bit")
      Cc: stable@vger.kernel.org # v4.20+
      Suggested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a00196a2
    • Nicholas Piggin's avatar
      powerpc/64s: Fix THP PMD collapse serialisation · 33258a1d
      Nicholas Piggin authored
      Commit 1b2443a5 ("powerpc/book3s64: Avoid multiple endian
      conversion in pte helpers") changed the actual bitwise tests in
      pte_access_permitted by using pte_write() and pte_present() helpers
      rather than raw bitwise testing _PAGE_WRITE and _PAGE_PRESENT bits.
      
      The pte_present() change now returns true for PTEs which are
      !_PAGE_PRESENT and _PAGE_INVALID, which is the combination used by
      pmdp_invalidate() to synchronize access from lock-free lookups.
      pte_access_permitted() is used by pmd_access_permitted(), so allowing
      GUP lock free access to proceed with such PTEs breaks this
      synchronisation.
      
      This bug has been observed on a host using the hash page table MMU,
      with random crashes and corruption in guests, usually together with
      bad PMD messages in the host.
      
      Fix this by adding an explicit check in pmd_access_permitted(), and
      documenting the condition explicitly.
      
      The pte_write() change should be okay, and would prevent GUP from
      falling back to the slow path when encountering savedwrite PTEs, which
      matches what x86 (that does not implement savedwrite) does.
      
      Fixes: 1b2443a5 ("powerpc/book3s64: Avoid multiple endian conversion in pte helpers")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      33258a1d
    • Christophe Leroy's avatar
      powerpc: Fix kexec failure on book3s/32 · 6c284228
      Christophe Leroy authored
      In the old days, _PAGE_EXEC didn't exist on 6xx aka book3s/32.
      Therefore, allthough __mapin_ram_chunk() was already mapping kernel
      text with PAGE_KERNEL_TEXT and the rest with PAGE_KERNEL, the entire
      memory was executable. Part of the memory (first 512kbytes) was
      mapped with BATs instead of page table, but it was also entirely
      mapped as executable.
      
      In commit 385e89d5 ("powerpc/mm: add exec protection on
      powerpc 603"), we started adding exec protection to some 6xx, namely
      the 603, for pages mapped via pagetables.
      
      Then, in commit 63b2bc61 ("powerpc/mm/32s: Use BATs for
      STRICT_KERNEL_RWX"), the exec protection was extended to BAT mapped
      memory, so that really only the kernel text could be executed.
      
      The problem here is that kexec is based on copying some code into
      upper part of memory then executing it from there in order to install
      a fresh new kernel at its definitive location.
      
      However, the code is position independant and first part of it is
      just there to deactivate the MMU and jump to the second part. So it
      is possible to run this first part inplace instead of running the
      copy. Once the MMU is off, there is no protection anymore and the
      second part of the code will just run as before.
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Fixes: 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
      Cc: stable@vger.kernel.org # v5.1+
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      6c284228