• Nishanth Aravamudan's avatar
    powerpc: reorder per-cpu NUMA information's initialization · 2fabf084
    Nishanth Aravamudan authored
    There is an issue currently where NUMA information is used on powerpc
    (and possibly ia64) before it has been read from the device-tree, which
    leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
    
    NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
    after start_secondary(), similar to ia64, which is invoked via
    smp_init().
    
    Commit 6ee0578b ("workqueue: mark init_workqueues() as
    early_initcall()") made init_workqueues() be invoked via
    do_pre_smp_initcalls(), which is obviously before the secondary
    processors are online.
    
    Additionally, the following commits changed init_workqueues() to use
    cpu_to_node to determine the node to use for kthread_create_on_node:
    
    bce90380 ("workqueue: add wq_numa_tbl_len and
    wq_numa_possible_cpumask[]")
    f3f90ad4 ("workqueue: determine NUMA node of workers accourding to
    the allowed cpumask")
    
    Therefore, when init_workqueues() runs, it sees all CPUs as being on
    Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
    a high number of slab deactivations
    (http://www.spinics.net/lists/linux-mm/msg67489.html).
    
    Fix this by initializing the powerpc-specific CPU<->node/local memory
    node mapping as early as possible, which on powerpc is
    do_init_bootmem(). Currently that function initializes the mapping for
    the boot CPU, but we extend it to setup the mapping for all possible
    CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
    per-cpu values for all possible CPUs. That ensures that before the
    early_initcalls run (and really as early as possible), the per-cpu NUMA
    mapping is accurate.
    
    While testing memoryless nodes on PowerKVM guests with a fix to the
    workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
    guest topology of:
    
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
    node 1 size: 16336 MB
    node 1 free: 15329 MB
    node distances:
    node   0   1
      0:  10  40
      1:  40  10
    
    the slab consumption decreases from
    
    Slab:             932416 kB
    SUnreclaim:       902336 kB
    
    to
    
    Slab:             395264 kB
    SUnreclaim:       359424 kB
    
    And we a corresponding increase in the slab efficiency from
    
    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                       337 MB   11.28%  100.00%
    task_struct                         288 MB    9.93%  100.00%
    
    to
    
    slab                                   mem     objs    slabs
                                          used   active   active
    ------------------------------------------------------------
    kmalloc-16384                        37 MB  100.00%  100.00%
    task_struct                          31 MB  100.00%  100.00%
    
    Powerpc didn't support memoryless nodes until recently (64bb80d8
    "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261
    "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
    helped improve memory consumption with these kind of environments.
    Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
    2fabf084
smp.c 18.2 KB