1. 19 Oct, 2020 3 commits
    • powerpc/powernv/dump: Fix race while processing OPAL dump · 0a43ae3e
      Vasant Hegde authored
      Every dump reported by OPAL is exported to userspace through a sysfs
      interface and announced using kobject_uevent(). The userspace daemon
      (opal_errd) then reads the dump and acknowledges that it has been
      saved safely to disk. Once the dump is acknowledged, the kernel removes
      the corresponding sysfs file entry, which releases the associated
      resources, including the kobject.
      
      However, the userspace daemon may already be scanning dump entries
      when the kernel creates a new sysfs dump entry. The daemon may read
      this new entry and ack it before the kernel has notified userspace
      through the kobject_uevent() call. If that happens, there is a race
      between dump_ack_store()->kobject_put() and kobject_uevent() that can
      lead to a use-after-free of a kernfs object, resulting in a kernel crash.
      
      This patch fixes the race by holding a reference on the kobject across
      sysfs file creation and notification, until kobject_uevent() has been
      sent safely.

      create_dump_obj() returns the dump object, and any caller that used
      that pointer would run into the same use-after-free. The return value
      is not used today and is not needed, so change the function to return
      void to make this fix complete.
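
The reference-holding idea described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct dump_obj`, `dump_obj_get()`, `dump_obj_put()`, `dump_ack()` and `create_and_notify()` are hypothetical names invented for the illustration, a plain integer stands in for the kobject reference count, and setting a flag stands in for kobject_uevent(). It is not the kernel patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a kobject-backed dump entry (hypothetical, user space only). */
struct dump_obj {
    int refcount;   /* plays the role of the kobject reference count */
    bool notified;  /* set once the simulated uevent has been sent */
};

static int live_objects;    /* objects allocated and not yet freed */

static struct dump_obj *dump_obj_new(void)
{
    struct dump_obj *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;
    d->refcount = 1;        /* reference owned by the sysfs entry */
    live_objects++;
    return d;
}

static void dump_obj_get(struct dump_obj *d)
{
    d->refcount++;
}

/* Drop one reference; free on the last put. Returns true if freed. */
static bool dump_obj_put(struct dump_obj *d)
{
    if (--d->refcount > 0)
        return false;
    free(d);
    live_objects--;
    return true;
}

/* Models the daemon's ack: drops the reference owned by the sysfs entry. */
static bool dump_ack(struct dump_obj *d)
{
    return dump_obj_put(d);
}

/*
 * Models the fixed creation path: take an extra reference, let the entry
 * become visible, notify, then drop the extra reference. ack_early models
 * the daemon acking before the notification has been sent.
 * Returns true if the notification ran on a still-live object.
 */
static bool create_and_notify(bool ack_early)
{
    struct dump_obj *d = dump_obj_new();
    bool freed_by_ack = false;

    if (!d)
        return false;

    dump_obj_get(d);                    /* hold across publish + notify */
    if (ack_early)
        freed_by_ack = dump_ack(d);     /* cannot free: extra ref is held */

    if (!freed_by_ack)
        d->notified = true;             /* stands in for kobject_uevent() */

    dump_obj_put(d);                    /* release the temporary reference */
    if (!ack_early)
        dump_ack(d);                    /* a later ack frees the object */
    return !freed_by_ack;
}
```

In this model, `create_and_notify(true)` still notifies a live object because the early ack only drops the count from 2 to 1; the object is freed by the final put. The function also returns no pointer to the object, mirroring the change of create_dump_obj() to return void.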
      
      Fixes: c7e64b9c ("powerpc/powernv Platform dump interface")
      Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201017164210.264619-1-hegdevasant@linux.vnet.ibm.com
    • powerpc/smp: Use GFP_ATOMIC while allocating tmp mask · 84dbf66c
      Srikar Dronamraju authored
      Qian Cai reported a regression where CPU hotplug fails with the latest
      powerpc/next:
      
      BUG: sleeping function called from invalid context at mm/slab.h:494
      in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
      no locks held by swapper/88/0.
      irq event stamp: 18074448
      hardirqs last  enabled at (18074447): [<c0000000001a2a7c>] tick_nohz_idle_enter+0x9c/0x110
      hardirqs last disabled at (18074448): [<c000000000106798>] do_idle+0x138/0x3b0
      do_idle at kernel/sched/idle.c:253 (discriminator 1)
      softirqs last  enabled at (18074440): [<c0000000000bbec4>] irq_enter_rcu+0x94/0xa0
      softirqs last disabled at (18074439): [<c0000000000bbea0>] irq_enter_rcu+0x70/0xa0
      CPU: 88 PID: 0 Comm: swapper/88 Tainted: G        W         5.9.0-rc8-next-20201007 #1
      Call Trace:
      [c00020000a4bfcf0] [c000000000649e98] dump_stack+0xec/0x144 (unreliable)
      [c00020000a4bfd30] [c0000000000f6c34] ___might_sleep+0x2f4/0x310
      [c00020000a4bfdb0] [c000000000354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
      [c00020000a4bfe00] [c00000000035e9e8] __kmalloc_node+0x88/0x3a0
      slab_alloc_node at mm/slub.c:2817
      (inlined by) __kmalloc_node at mm/slub.c:4013
      [c00020000a4bfe80] [c0000000006494d8] alloc_cpumask_var_node+0x38/0x80
      kmalloc_node at include/linux/slab.h:577
      (inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
      [c00020000a4bfef0] [c00000000003eedc] start_secondary+0x27c/0x800
      update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
      (inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
      (inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
      [c00020000a4bff90] [c00000000000c468] start_secondary_resume+0x10/0x14
      
      Allocating a temporary mask while performing a CPU hotplug operation
      with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable
      function from an atomic context. Fix this by allocating the temporary
      mask with the GFP_ATOMIC flag. Also, instead of allocating twice,
      allocate the mask in the caller so that only one allocation is needed.
      If the allocation fails, assume the mask to be the same as the sibling
      mask, which will make the scheduler drop this domain for this CPU.
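
The allocate-once-in-the-caller pattern with a fallback can be sketched in user space as follows. This is a minimal model, not the kernel code: `mask_t`, `alloc_tmp_mask()`, `candidates()` and `add_cpu_to_masks_sketch()` are hypothetical names, a 64-bit integer stands in for a cpumask, and `calloc()` stands in for a non-sleeping GFP_ATOMIC allocation (the `fail_alloc` switch simulates allocation failure).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t mask_t;    /* toy cpumask: one bit per CPU, up to 64 CPUs */

static int alloc_calls;     /* counts allocation attempts */
static int fail_alloc;      /* when non-zero, simulate allocation failure */

/* Stand-in for a non-sleeping allocation (GFP_ATOMIC in the kernel). */
static mask_t *alloc_tmp_mask(void)
{
    alloc_calls++;
    return fail_alloc ? NULL : calloc(1, sizeof(mask_t));
}

/* Helper that uses caller-provided scratch space instead of allocating. */
static mask_t candidates(mask_t *tmp, mask_t online, mask_t domain)
{
    *tmp = online & domain;
    return *tmp;
}

struct cpu_masks {
    mask_t sibling;
    mask_t l2;
    mask_t coregroup;
};

static void add_cpu_to_masks_sketch(struct cpu_masks *m, mask_t online,
                                    mask_t sibling, mask_t l2_domain,
                                    mask_t cg_domain)
{
    mask_t *tmp = alloc_tmp_mask();     /* one allocation for both helpers */

    m->sibling = sibling;
    if (!tmp) {
        /* Fallback: treat the wider masks as equal to the sibling mask. */
        m->l2 = sibling;
        m->coregroup = sibling;
        return;
    }

    m->l2 = sibling | candidates(tmp, online, l2_domain);
    m->coregroup = m->l2 | candidates(tmp, online, cg_domain);
    free(tmp);
}
```

Passing the scratch mask down to both helpers keeps the number of allocations at one per call, and the failure path leaves all three masks identical, which corresponds to the case the commit message describes as the scheduler dropping the domain.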
      
      Fixes: 70a94089 ("powerpc/smp: Optimize update_coregroup_mask")
      Fixes: 3ab33d6d ("powerpc/smp: Optimize update_mask_by_l2")
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201019042716.106234-3-srikar@linux.vnet.ibm.com
    • powerpc/smp: Remove unnecessary variable · 966730a6
      Srikar Dronamraju authored
      Commit 3ab33d6d ("powerpc/smp: Optimize update_mask_by_l2")
      introduced submask_fn in update_mask_by_l2 to track the right submask.
      However commit f6606cfd ("powerpc/smp: Dont assume l2-cache to be
      superset of sibling") introduced sibling_mask in update_mask_by_l2 to
      track the same submask. Remove sibling_mask in favour of submask_fn.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201019042716.106234-2-srikar@linux.vnet.ibm.com
  2. 16 Oct, 2020 2 commits
  3. 15 Oct, 2020 1 commit
    • Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed" · ffd0b25c
      Qian Cai authored
      This reverts commit 3a3181e1, which causes memory corruption on
      POWER9 powernv, e.g.:
      
        pci_bus 0035:08: busn_res: [bus 08-0c] is released
        =============================================================================
        BUG kmalloc-16 (Tainted: G        W  O     ): Object already free
        -----------------------------------------------------------------------------
        Disabling lock debugging due to kernel taint
        INFO: Allocated in pcibios_scan_phb+0x104/0x3e0 age=1960714 cpu=4 pid=1
        	__slab_alloc+0xa4/0xf0
        	__kmalloc+0x294/0x330
        	pcibios_scan_phb+0x104/0x3e0
        	pcibios_init+0x84/0x124
        	do_one_initcall+0xac/0x528
        	kernel_init_freeable+0x35c/0x3fc
        	kernel_init+0x24/0x148
        	ret_from_kernel_thread+0x5c/0x80
        INFO: Freed in pcibios_remove_bus+0x70/0x90 age=0 cpu=16 pid=1717146
        	kfree+0x49c/0x510
        	pcibios_remove_bus+0x70/0x90
        	pci_remove_bus+0xe4/0x110
        	pci_remove_bus_device+0x74/0x170
        	pci_remove_bus_device+0x4c/0x170
        	pci_stop_and_remove_bus_device_locked+0x34/0x50
        	remove_store+0xc0/0xe0
        	dev_attr_store+0x30/0x50
        	sysfs_kf_write+0x68/0xb0
        	kernfs_fop_write+0x114/0x260
        	vfs_write+0xe4/0x260
        	ksys_write+0x74/0x130
        	system_call_exception+0xf8/0x1d0
        	system_call_common+0xe8/0x218
        INFO: Slab 0x0000000099caaf22 objects=178 used=174 fp=0x00000000006a64b0 flags=0x7fff8000000201
        INFO: Object 0x00000000f360132d @offset=30192 fp=0x0000000000000000
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201014182811.12027-1-cai@lca.pw
  4. 14 Oct, 2020 1 commit
  5. 08 Oct, 2020 25 commits
  6. 07 Oct, 2020 7 commits
  7. 06 Oct, 2020 1 commit
    • pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Scott Cheloha authored
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE)
      or an invalid nid.
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
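
The difference between the two lookup paths can be sketched with a toy user-space model. All names here are hypothetical and chosen for the illustration (`struct lmb`, `lmb_to_nid()`, `sanitize_nid()`, `nid_by_addr()`, `nid_direct()`, `NO_NODE`, `MAX_NODES`, `DEFAULT_NID`); `lmb_to_nid()` merely stands in for of_drconf_to_nid_single(), and the linear scan models the address-based search that the commit message attributes to memory_add_physaddr_to_nid() on pseries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE     (-1)    /* plays the role of NUMA_NO_NODE */
#define MAX_NODES   4       /* assumed node count for this sketch */
#define DEFAULT_NID 0       /* assumed fallback node id */

struct lmb {
    uint64_t base_addr;
    int nid;
};

static size_t lmbs_scanned; /* entries visited by the linear search */

/* Toy stand-in for of_drconf_to_nid_single(): nid recorded for this LMB. */
static int lmb_to_nid(const struct lmb *lmb)
{
    return lmb->nid;
}

/* Fall back to the default nid for NO_NODE or an out-of-range nid. */
static int sanitize_nid(int nid)
{
    return (nid < 0 || nid >= MAX_NODES) ? DEFAULT_NID : nid;
}

/* Old path: find the LMB containing addr again, then ask for its nid. */
static int nid_by_addr(const struct lmb *lmbs, size_t n, uint64_t lmb_size,
                       uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        lmbs_scanned++;
        if (addr >= lmbs[i].base_addr &&
            addr < lmbs[i].base_addr + lmb_size)
            return sanitize_nid(lmb_to_nid(&lmbs[i]));
    }
    return DEFAULT_NID;
}

/* New path: the caller already holds the LMB, so no search is needed. */
static int nid_direct(const struct lmb *lmb)
{
    return sanitize_nid(lmb_to_nid(lmb));
}
```

Both paths return the same nid, but `nid_by_addr()` visits every entry up to the target while `nid_direct()` visits none, so the cost saved per hot-added LMB grows with its position in the array.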
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Note that the speedup becomes more substantial when hot-adding LMBs
      at the end of the drconf range, because the lookup being skipped is
      a linear LMB search.

      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent with skipping a linear search.  An add-by-count
      hot-add operation adds LMBs greedily, so LMBs near the start of the
      drconf range are considered first.  On an otherwise idle LPAR with so
      many LMBs we would expect to find the LMBs we need near the start of
      the drconf range, hence the smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com