1. 02 Jul, 2019 11 commits
  2. 01 Jul, 2019 15 commits
  3. 25 Jun, 2019 1 commit
    • Nicholas Piggin's avatar
      powerpc/64s/exception: Fix machine check early corrupting AMR · e13e7cd4
      Nicholas Piggin authored
      The early machine check runs in real mode, so locking is unnecessary.
      Worse, the windup does not restore AMR, so this can result in a false
      KUAP fault after a recoverable machine check hits inside a user copy
      operation.
      
      Fix this similarly to HMI by just avoiding the kuap lock in the
      early machine check handler (it will be set by the late handler that
      runs in virtual mode if that runs). If the virtual mode handler is
      reached, it will lock and restore the AMR.
      
      Fixes: 890274c2 ("powerpc/64s: Implement KUAP for Radix MMU")
      Cc: Russell Currey <ruscur@russell.cc>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e13e7cd4
  4. 20 Jun, 2019 4 commits
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry · 3c25ab35
      Suraj Jitindar Singh authored
      If we enter an L1 guest with a pending decrementer exception then this
      is cleared on guest exit if the guest has writtien a positive value
      into the decrementer (indicating that it handled the decrementer
      exception) since there is no other way to detect that the guest has
      handled the pending exception and that it should be dequeued. In the
      event that the L1 guest tries to run a nested (L2) guest immediately
      after this and the L2 guest decrementer is negative (which is loaded
      by L1 before making the H_ENTER_NESTED hcall), then the pending
      decrementer exception isn't cleared and the L2 entry is blocked since
      L1 has a pending exception, even though L1 may have already handled
      the exception and written a positive value for it's decrementer. This
      results in a loop of L1 trying to enter the L2 guest and L0 blocking
      the entry since L1 has an interrupt pending with the outcome being
      that L2 never gets to run and hangs.
      
      Fix this by clearing any pending decrementer exceptions when L1 makes
      the H_ENTER_NESTED hcall since it won't do this if it's decrementer
      has gone negative, and anyway it's decrementer has been communicated
      to L0 in the hdec_expires field and L0 will return control to L1 when
      this goes negative by delivering an H_DECREMENTER exception.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      3c25ab35
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer · 86953770
      Suraj Jitindar Singh authored
      On POWER9 the decrementer can operate in large decrementer mode where
      the decrementer is 56 bits and signed extended to 64 bits. When not
      operating in this mode the decrementer behaves as a 32 bit decrementer
      which is NOT signed extended (as on POWER8).
      
      Currently when reading a guest decrementer value we don't take into
      account whether the large decrementer is enabled or not, and this
      means the value will be incorrect when the guest is not using the
      large decrementer. Fix this by sign extending the value read when the
      guest isn't using the large decrementer.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      86953770
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries · 50087112
      Suraj Jitindar Singh authored
      When a guest vcpu moves from one physical thread to another it is
      necessary for the host to perform a tlb flush on the previous core if
      another vcpu from the same guest is going to run there. This is because the
      guest may use the local form of the tlb invalidation instruction meaning
      stale tlb entries would persist where it previously ran. This is handled
      on guest entry in kvmppc_check_need_tlb_flush() which calls
      flush_guest_tlb() to perform the tlb flush.
      
      Previously the generic radix__local_flush_tlb_lpid_guest() function was
      used, however the functionality was reimplemented in flush_guest_tlb()
      to avoid the trace_tlbie() call as the flushing may be done in real
      mode. The reimplementation in flush_guest_tlb() was missing an erat
      invalidation after flushing the tlb.
      
      This lead to observable memory corruption in the guest due to the
      caching of stale translations. Fix this by adding the erat invalidation.
      
      Fixes: 70ea13f6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      50087112
    • Alexey Kardashevskiy's avatar
      powerpc/pci/of: Fix OF flags parsing for 64bit BARs · df5be5be
      Alexey Kardashevskiy authored
      When the firmware does PCI BAR resource allocation, it passes the assigned
      addresses and flags (prefetch/64bit/...) via the "reg" property of
      a PCI device device tree node so the kernel does not need to do
      resource allocation.
      
      The flags are stored in resource::flags - the lower byte stores
      PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc.
      Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated,
      such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc.
      When parsing the "reg" property, we copy the prefetch flag but we skip
      on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync.
      
      The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions:
      1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch()
      or by passing "/chosen/linux,pci-probe-only");
      2. we request resource alignment (by passing pci=resource_alignment=
      via the kernel cmd line to request PAGE_SIZE alignment or defining
      ppc_md.pcibios_default_alignment which returns anything but 0). Note that
      the alignment requests are ignored if PCI_PROBE_ONLY is enabled.
      
      With 1) and 2), the generic PCI code in the kernel unconditionally
      decides to:
      - reassign the BARs in pci_specified_resource_alignment() (works fine)
      - write new BARs to the device - this fails for 64bit BARs as the generic
      code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits
      of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping
      in the hypervisor.
      
      This fixes the issue by copying the flag. This is useful if we want to
      enforce certain BAR alignment per platform as handling subpage sized BARs
      is proven to cause problems with hotplug (SLOF already aligns BARs to 64k).
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarSam Bobroff <sbobroff@linux.ibm.com>
      Reviewed-by: default avatarOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: default avatarShawn Anastasio <shawn@anastas.io>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      df5be5be
  5. 19 Jun, 2019 9 commits
    • Christoph Hellwig's avatar
      powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac · 9739ab7e
      Christoph Hellwig authored
      With the strict dma mask checking introduced with the switch to
      the generic DMA direct code common wifi chips on 32-bit powerbooks
      stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
      to allow them to reliably allocate dma coherent memory.
      
      Fixes: 65a21b71 ("powerpc/dma: remove dma_nommu_dma_supported")
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Acked-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9739ab7e
    • Nicholas Piggin's avatar
      powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP · d909f910
      Nicholas Piggin authored
      This sets the HAVE_ARCH_HUGE_VMAP option, and defines the required
      page table functions.
      
      This enables huge (2MB and 1GB) ioremap mappings. I don't have a
      benchmark for this change, but huge vmap will be used by a later core
      kernel change to enable huge vmalloc memory mappings. This improves
      cached `git diff` performance by about 5% on a 2-node POWER9 with 32MB
      size dentry cache hash.
      
        Profiling git diff dTLB misses with a vanilla kernel:
      
        81.75%  git      [kernel.vmlinux]    [k] __d_lookup_rcu
         7.21%  git      [kernel.vmlinux]    [k] strncpy_from_user
         1.77%  git      [kernel.vmlinux]    [k] find_get_entry
         1.59%  git      [kernel.vmlinux]    [k] kmem_cache_free
      
                  40,168      dTLB-miss
             0.100342754 seconds time elapsed
      
        With powerpc huge vmalloc:
      
                   2,987      dTLB-miss
             0.095933138 seconds time elapsed
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d909f910
    • Nicholas Piggin's avatar
      powerpc/64s/radix: ioremap use ioremap_page_range · d38153f9
      Nicholas Piggin authored
      Radix can use ioremap_page_range for ioremap, after slab is available.
      This makes it possible to enable huge ioremap mapping support.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d38153f9
    • Nicholas Piggin's avatar
      powerpc/64: __ioremap_at clean up in the error case · a72808a7
      Nicholas Piggin authored
      __ioremap_at error handling is wonky, it requires caller to clean up
      after it. Implement a helper that does the map and error cleanup and
      remove the requirement from the caller.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      a72808a7
    • Anju T Sudhakar's avatar
      powerpc/perf: Use cpumask_last() to determine the designated cpu for nest/core units. · 9c9f8fb7
      Anju T Sudhakar authored
      Nest and core IMC (In-Memory Collection counters) assigns a particular
      cpu as the designated target for counter data collection. During
      system boot, the first online cpu in a chip gets assigned as the
      designated cpu for that chip(for nest-imc) and the first online cpu in
      a core gets assigned as the designated cpu for that core(for
      core-imc).
      
      If the designated cpu goes offline, the next online cpu from the same
      chip(for nest-imc)/core(for core-imc) is assigned as the next target,
      and the event context is migrated to the target cpu. Currently,
      cpumask_any_but() function is used to find the target cpu. Though this
      function is expected to return a `random` cpu, this always returns the
      next online cpu.
      
      If all cpus in a chip/core is offlined in a sequential manner,
      starting from the first cpu, the event migration has to happen for all
      the cpus which goes offline. Since the migration process involves a
      grace period, the total time taken to offline all the cpus will be
      significantly high.
      
      Example:
        In a system which has 2 sockets, with
        NUMA node0 CPU(s):     0-87
        NUMA node8 CPU(s):     88-175
      
        Time taken to offline cpu 88-175:
        real    2m56.099s
        user    0m0.191s
        sys     0m0.000s
      
      Use cpumask_last() to choose the target cpu, when the designated cpu
      goes online, so the migration will happen only when the last_cpu in
      the mask goes offline. This way the time taken to offline all cpus in
      a chip/core can be reduced.
      
      With the patch:
      
        Time taken  to offline cpu 88-175:
        real    0m12.207s
        user    0m0.171s
        sys     0m0.000s
      
      Offlining all cpus in reverse order is also taken care because,
      cpumask_any_but() is used to find the designated cpu if the last cpu
      in the mask goes offline. Since cpumask_any_but() always return the
      first cpu in the mask, that becomes the designated cpu and migration
      will happen only when the first_cpu in the mask goes offline.
      
      Example: With the patch,
      
        Time taken to offline cpu from 175-88:
        real    0m9.330s
        user    0m0.110s
        sys     0m0.000s
      Signed-off-by: default avatarAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: default avatarMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9c9f8fb7
    • Shaokun Zhang's avatar
      powerpc/64s: Fix misleading SPR and timebase information · 87997471
      Shaokun Zhang authored
      pr_info shows SPR and timebase as a decimal value with a '0x'
      prefix, which is somewhat misleading.
      
      Fix it to print hexadecimal, as was intended.
      
      Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C")
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarShaokun Zhang <zhangshaokun@hisilicon.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      87997471
    • Nathan Lynch's avatar
      powerpc/pseries: avoid blocking in irq when queuing hotplug events · 348ea30f
      Nathan Lynch authored
      A couple of bugs in queue_hotplug_event():
      
      1. Unchecked kmalloc result which could lead to an oops.
      2. Use of GFP_KERNEL allocations in interrupt context (this code's
         only caller is ras_hotplug_interrupt()).
      
      Use kmemdup to avoid open-coding the allocation+copy and check for
      failure; use GFP_ATOMIC for both allocations.
      
      Ultimately it probably would be better to avoid or reduce allocations
      in this path if possible.
      Signed-off-by: default avatarNathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      348ea30f
    • Ravi Bangoria's avatar
      powerpc/watchpoint: Restore NV GPRs while returning from exception · f474c28f
      Ravi Bangoria authored
      powerpc hardware triggers watchpoint before executing the instruction.
      To make trigger-after-execute behavior, kernel emulates the
      instruction. If the instruction is 'load something into non-volatile
      register', exception handler should restore emulated register state
      while returning back, otherwise there will be register state
      corruption. eg, adding a watchpoint on a list can corrput the list:
      
        # cat /proc/kallsyms | grep kthread_create_list
        c00000000121c8b8 d kthread_create_list
      
      Add watchpoint on kthread_create_list->prev:
      
        # perf record -e mem:0xc00000000121c8c0
      
      Run some workload such that new kthread gets invoked. eg, I just
      logged out from console:
      
        list_add corruption. next->prev should be prev (c000000001214e00), \
      	but was c00000000121c8b8. (next=c00000000121c8b8).
        WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
        CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
        ...
        NIP __list_add_valid+0xb4/0xc0
        LR __list_add_valid+0xb0/0xc0
        Call Trace:
        __list_add_valid+0xb0/0xc0 (unreliable)
        __kthread_create_on_node+0xe0/0x260
        kthread_create_on_node+0x34/0x50
        create_worker+0xe8/0x260
        worker_thread+0x444/0x560
        kthread+0x160/0x1a0
        ret_from_kernel_thread+0x5c/0x70
      
      List corruption happened because it uses 'load into non-volatile
      register' instruction:
      
      Snippet from __kthread_create_on_node:
      
        c000000000136be8:     addis   r29,r2,-19
        c000000000136bec:     ld      r29,31424(r29)
              if (!__list_add_valid(new, prev, next))
        c000000000136bf0:     mr      r3,r30
        c000000000136bf4:     mr      r5,r28
        c000000000136bf8:     mr      r4,r29
        c000000000136bfc:     bl      c00000000059a2f8 <__list_add_valid+0x8>
      
      Register state from WARN_ON():
      
        GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
        GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
        GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
        GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
        GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
        GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
        GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
        GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0
      
      Watchpoint hit at 0xc000000000136bec.
      
        addis   r29,r2,-19
         => r29 = 0xc000000001344e00 + (-19 << 16)
         => r29 = 0xc000000001214e00
      
        ld      r29,31424(r29)
         => r29 = *(0xc000000001214e00 + 31424)
         => r29 = *(0xc00000000121c8c0)
      
      0xc00000000121c8c0 is where we placed a watchpoint and thus this
      instruction was emulated by emulate_step. But because handle_dabr_fault
      did not restore emulated register state, r29 still contains stale
      value in above register state.
      
      Fixes: 5aae8a53 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
      Signed-off-by: default avatarRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: stable@vger.kernel.org # 2.6.36+
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f474c28f
    • Greg Kroah-Hartman's avatar
      cxl: no need to check return value of debugfs_create functions · 1b7de1df
      Greg Kroah-Hartman authored
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Because there's no need to check, also make the return value of the
      local debugfs_create_io_x64() call void, as no one ever did anything
      with the return value (as they did not need to.)
      
      And make the cxl_debugfs_* calls return void as no one was even checking
      their return value at all.
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarFrederic Barrat <fbarrat@linux.ibm.com>
      Acked-by: default avatarAndrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      1b7de1df