1. 24 Jul, 2018 26 commits
    • Simon Guo's avatar
      selftests/powerpc: Update memcmp_64 selftest for VMX implementation · c827ac45
      Simon Guo authored
      This patch reworks the memcmp_64 selftest so that it covers more
      test cases.

      It adds test cases for:
      - memcmp over 4K bytes in size.
      - s1/s2 with different/random offsets on 16-byte boundaries.
      - enter/exit_vmx_ops pairing.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      [mpe: Add -maltivec to fix build on some toolchains]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c827ac45
    • Simon Guo's avatar
      powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() · c2a4e54e
      Simon Guo authored
      This patch is based on the previous VMX patch for memcmp().

      To optimize ppc64 memcmp() with VMX instructions, we need to consider
      the VMX penalty involved: if the kernel uses VMX instructions, it needs
      to save/restore the current thread's VMX registers. There are 32 x 128-bit
      VMX registers in PPC, which means 32 x 16 = 512 bytes to load and store.
      
      The major concern regarding memcmp() performance in the kernel is KSM,
      which uses memcmp() frequently to merge identical pages. So it makes
      sense to enhance memcmp() with KSM in mind. In the following mail,
      Cyril Bur points out that memcmp() calls from KSM have a high
      probability of failing (mismatching) early, within the first bytes:
      	https://patchwork.ozlabs.org/patch/817322/#1773629
      This patch is a follow-up on that observation.
      
      Per some testing, KSM memcmp() calls usually fail early, within the
      first 32 bytes. More specifically:
          - 76% of cases fail/mismatch before 16 bytes;
          - 83% of cases fail/mismatch before 32 bytes;
          - 84% of cases fail/mismatch before 64 bytes;
      So 32 bytes looks like a better pre-checking size than the alternatives.
      
      The early failure also holds for memcmp() in the non-KSM case. With a
      non-typical call load, ~73% of cases fail before the first 32 bytes.

      This patch adds a 32-byte pre-check before jumping into VMX
      operations, to avoid the unnecessary VMX penalty. It is not limited
      to the KSM case. Testing shows a ~20% improvement in average memcmp()
      execution time with this patch.
      
      Note that the 32-byte pre-check is only performed when the compare
      size is long enough (currently >= 4K) to warrant the VMX operation.
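The idea can be sketched in plain C (the real implementation is powerpc assembly in arch/powerpc/lib/memcmp_64.S; the function names and the scalar stand-in for the VMX loop below are hypothetical):

```c
#include <stddef.h>
#include <string.h>

#define VMX_THRESHOLD 4096	/* per the commit: VMX only for compares >= 4K */

/* Stand-in for the vectorized compare; the kernel uses VMX assembly here,
 * paying a 512-byte register save/restore to enter that path. */
static int vmx_memcmp(const void *s1, const void *s2, size_t n)
{
	return memcmp(s1, s2, n);
}

/* Sketch: a cheap 32-byte scalar pre-check before committing to the VMX
 * path, so the ~76-84% of compares that mismatch early never pay the
 * VMX save/restore penalty. */
int memcmp_precheck(const void *s1, const void *s2, size_t n)
{
	if (n >= VMX_THRESHOLD) {
		int ret = memcmp(s1, s2, 32);	/* scalar pre-check */

		if (ret)
			return ret;		/* early mismatch, no VMX cost */
		return vmx_memcmp(s1, s2, n);
	}
	return memcmp(s1, s2, n);		/* short compares stay scalar */
}
```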
      
      The detailed data and analysis are at:
      https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c2a4e54e
    • Simon Guo's avatar
      powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision · d58badfb
      Simon Guo authored
      This patch adds VMX primitives to do memcmp() when the compare size
      is equal to or greater than 4K bytes. The KSM feature can benefit
      from this.
      
      Test result with the following test program:
      ------
      # cat tools/testing/selftests/powerpc/stringloops/memcmp.c
      #include <malloc.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>
      #include "utils.h"
      #define SIZE (1024 * 1024 * 900)
      #define ITERATIONS 40
      
      int test_memcmp(const void *s1, const void *s2, size_t n);
      
      static int testcase(void)
      {
              char *s1;
              char *s2;
              unsigned long i;
      
              s1 = memalign(128, SIZE);
              if (!s1) {
                      perror("memalign");
                      exit(1);
              }
      
              s2 = memalign(128, SIZE);
              if (!s2) {
                      perror("memalign");
                      exit(1);
              }
      
              for (i = 0; i < SIZE; i++)  {
                      s1[i] = i & 0xff;
                      s2[i] = i & 0xff;
              }
              for (i = 0; i < ITERATIONS; i++) {
      		int ret = test_memcmp(s1, s2, SIZE);
      
      		if (ret) {
      			printf("return %d at[%ld]! should have returned zero\n", ret, i);
      			abort();
      		}
      	}
      
              return 0;
      }
      
      int main(void)
      {
              return test_harness(testcase, "memcmp");
      }
      ------
      Without this patch (but with the first patch in the series,
      "powerpc/64: Align bytes before fall back to .Lshort in powerpc64
      memcmp()."):
      	4.726728762 seconds time elapsed                                          ( +-  3.54%)
      With the VMX patch:
      	4.234335473 seconds time elapsed                                          ( +-  2.63%)
      		-> There is a ~10% improvement.

      Testing with an unaligned and different-offset version (s1 and s2
      shifted by a random offset within 16 bytes) can achieve a higher
      improvement than 10%.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d58badfb
    • Simon Guo's avatar
      powerpc: add vcmpequd/vcmpequb ppc instruction macro · f1ecbaf4
      Simon Guo authored
      Some old toolchains don't know about instructions like vcmpequd.

      This patch adds .long macros for vcmpequd and vcmpequb, in
      preparation for optimizing ppc64 memcmp() with VMX instructions.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f1ecbaf4
    • Simon Guo's avatar
      powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() · 2d9ee327
      Simon Guo authored
      Currently the 64-byte version of memcmp() on powerpc falls back to
      .Lshort (compare-per-byte mode) if either the src or dst address is
      not 8-byte aligned. It can be optimized in 2 situations:

      1) If both addresses have the same offset from an 8-byte boundary:
      memcmp() can compare the unaligned bytes up to the 8-byte boundary
      first and then compare the remaining 8-byte-aligned content in
      .Llong mode.

      2) If the src/dst addresses do not share the same offset from an
      8-byte boundary: memcmp() can align the src address to 8 bytes,
      increment the dst address accordingly, then load src in aligned mode
      and dst in unaligned mode.

      This patch optimizes memcmp() behavior in the above 2 situations.
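Situation 1 can be sketched in C (a hypothetical helper; the real code is assembly in arch/powerpc/lib/memcmp_64.S): peel the unaligned head byte-by-byte, then compare the aligned remainder a doubleword at a time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch for the same-offset case: compare the bytes up to the next
 * 8-byte boundary individually, then walk the aligned body 8 bytes at a
 * time, instead of falling back to per-byte compares for everything. */
int memcmp_aligned_head(const unsigned char *s1, const unsigned char *s2,
			size_t n)
{
	size_t head = (8 - ((uintptr_t)s1 & 7)) & 7;

	if (head > n)
		head = n;
	for (size_t i = 0; i < head; i++)	/* unaligned head */
		if (s1[i] != s2[i])
			return s1[i] < s2[i] ? -1 : 1;
	s1 += head; s2 += head; n -= head;

	while (n >= 8) {			/* aligned body, doublewords */
		uint64_t a, b;

		memcpy(&a, s1, 8);
		memcpy(&b, s2, 8);
		if (a != b)			/* locate the differing byte */
			return memcmp(s1, s2, 8);
		s1 += 8; s2 += 8; n -= 8;
	}
	return memcmp(s1, s2, n);		/* short tail */
}
```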
      
      Tested with both little/big endian. Performance result below is based on
      little endian.
      
      Following is the test result for the case where src/dst have the
      same offset (a similar result was observed when src/dst have
      different offsets):
      (1) 256 bytes
      Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
      - without patch
      	29.773018302 seconds time elapsed                                          ( +- 0.09% )
      - with patch
      	16.485568173 seconds time elapsed                                          ( +-  0.02% )
      		-> There is an ~80% improvement
      
      (2) 32 bytes
      To observe the performance impact on < 32 bytes, modify
      tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
      -------
       #include <string.h>
       #include "utils.h"
      
      -#define SIZE 256
      +#define SIZE 32
       #define ITERATIONS 10000
      
       int test_memcmp(const void *s1, const void *s2, size_t n);
      --------
      
      - Without patch
      	0.244746482 seconds time elapsed                                          ( +-  0.36%)
      - with patch
      	0.215069477 seconds time elapsed                                          ( +-  0.51%)
      		-> There is a ~13% improvement
      
      (3) 0~8 bytes
      To observe the < 8 bytes performance impact, modify
      tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
      -------
       #include <string.h>
       #include "utils.h"
      
      -#define SIZE 256
      -#define ITERATIONS 10000
      +#define SIZE 8
      +#define ITERATIONS 1000000
      
       int test_memcmp(const void *s1, const void *s2, size_t n);
      -------
      - Without patch
             1.845642503 seconds time elapsed                                          ( +- 0.12% )
      - With patch
             1.849767135 seconds time elapsed                                          ( +- 0.26% )
      		-> They are nearly the same (-0.2%).
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2d9ee327
    • Aneesh Kumar K.V's avatar
      powerpc/pseries/mm: Improve error reporting on HCALL failures · ca42d8d2
      Aneesh Kumar K.V authored
      This patch adds error reporting to the H_ENTER and H_READ hcalls. A
      failure of either of these hcalls is mostly fatal, and it would be
      good to log the failure reason.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      [mpe: Split out of larger patch]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ca42d8d2
    • Aneesh Kumar K.V's avatar
      powerpc/pseries: Use pr_xxx() in lpar.c · 65471d76
      Aneesh Kumar K.V authored
      Switch from printk to pr_fmt() / pr_xxx().
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      [mpe: Split out of larger patch]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      65471d76
    • Aneesh Kumar K.V's avatar
      powerpc/mm/hash: Reduce contention on hpte lock · 27d8959d
      Aneesh Kumar K.V authored
      We already do this in some places. This patch makes sure we always
      try to search for the hpte without holding the lock, and redo the
      compare with the lock held once a match is found.
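The pattern described here (lockless scan, then re-verify under the lock) can be sketched generically; the structure below is hypothetical and only illustrates the locking discipline, not the real HPTE layout:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical slot: a compare value plus a per-slot spinlock. */
struct hpte_slot {
	unsigned long key;
	atomic_flag lock;
};

static void slot_lock(struct hpte_slot *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;	/* spin */
}

static void slot_unlock(struct hpte_slot *s)
{
	atomic_flag_clear(&s->lock);
}

/* First pass compares without the lock; only a candidate match takes the
 * lock, and the compare is redone under it in case the slot changed.
 * Returns the matching slot with its lock held, or NULL. */
struct hpte_slot *find_slot(struct hpte_slot *group, int n, unsigned long key)
{
	for (int i = 0; i < n; i++) {
		if (group[i].key != key)	/* lockless search */
			continue;
		slot_lock(&group[i]);
		if (group[i].key == key)	/* redo compare under lock */
			return &group[i];
		slot_unlock(&group[i]);		/* raced; keep searching */
	}
	return NULL;
}
```

This trades a possible second compare for not serializing every probe on the lock, which is what reduces contention.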
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      27d8959d
    • Aneesh Kumar K.V's avatar
      powerpc/mm/hash: Remove the superfluous bitwise operation when find hpte group · 1531cff4
      Aneesh Kumar K.V authored
      When computing the starting slot number for a hash page table group
      we used to do this:
      hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;

      Multiplying by 8 (HPTES_PER_GROUP) implies the last three bits are 0,
      hence we really don't need to clear them separately.
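The identity is easy to check: multiplying by HPTES_PER_GROUP (8) is a left shift by 3, so bits 0-2 are already clear and the `& ~0x7UL` is a no-op. A minimal demonstration (the function name is made up for illustration):

```c
#define HPTES_PER_GROUP 8UL

/* New form: the redundant "& ~0x7UL" is dropped, since (x * 8) always
 * has its low three bits clear. */
unsigned long hpte_group_start(unsigned long hash, unsigned long htab_hash_mask)
{
	return (hash & htab_hash_mask) * HPTES_PER_GROUP;
}
```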
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1531cff4
    • Aneesh Kumar K.V's avatar
      powerpc/mm: Increase MAX_PHYSMEM_BITS to 128TB with SPARSEMEM_VMEMMAP config · 7d4340bb
      Aneesh Kumar K.V authored
      We do this only with the VMEMMAP config, so that our
      page_to_[nid/section] etc. are not impacted.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7d4340bb
    • Aneesh Kumar K.V's avatar
      powerpc/mm: Check memblock_add against MAX_PHYSMEM_BITS range · 6aba0c84
      Aneesh Kumar K.V authored
      With the SPARSEMEM config enabled, we make sure that we don't add
      sections beyond the MAX_PHYSMEM_BITS range. This results in not
      building vmemmap mappings for ranges beyond the maximum. But our
      memblock layer walks the device tree and creates mappings for the
      full memory range. Prevent this by checking against MAX_PHYSMEM_BITS
      when doing memblock_add.

      We don't do a similar check for memblock_reserve. If a reserved
      range is beyond MAX_PHYSMEM_BITS we expect it to be configured with
      'nomap'. Any other reserved range should come from existing memblock
      ranges, which we already filtered while adding.

      This avoids a crash like the one below when running on a system with
      system RAM configured above MAX_PHYSMEM_BITS:
      
       Unable to handle kernel paging request for data at address 0xc00a001000000440
       Faulting instruction address: 0xc000000001034118
       cpu 0x0: Vector: 300 (Data Access) at [c00000000124fb30]
           pc: c000000001034118: __free_pages_bootmem+0xc0/0x1c0
           lr: c00000000103b258: free_all_bootmem+0x19c/0x22c
           sp: c00000000124fdb0
          msr: 9000000002001033
          dar: c00a001000000440
        dsisr: 40000000
         current = 0xc00000000120dd00
       paca    = 0xc000000001f60000	irqmask: 0x03	irq_happened: 0x01
           pid   = 0, comm = swapper
       [c00000000124fe20] c00000000103b258 free_all_bootmem+0x19c/0x22c
       [c00000000124fee0] c000000001010a68 mem_init+0x3c/0x5c
       [c00000000124ff00] c00000000100401c start_kernel+0x298/0x5e4
       [c00000000124ff90] c00000000000b57c start_here_common+0x1c/0x520
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6aba0c84
    • Michael Ellerman's avatar
      powerpc: Add ppc64le and ppc64_book3e allmodconfig targets · 64de5d8d
      Michael Ellerman authored
      Just as we did for 32-bit, add phony targets for generating a little
      endian and a Book3E allmodconfig. These aren't covered by the
      regular allmodconfig, which is big endian and Book3S due to the way
      the Kconfig symbols are structured.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      64de5d8d
    • Michael Ellerman's avatar
      powerpc: Add ppc32_allmodconfig defconfig target · 8db0c9d4
      Michael Ellerman authored
      Because the allmodconfig logic just sets every symbol to M or Y, it
      has the effect of always generating a 64-bit config, because
      CONFIG_PPC64 becomes Y.
      
      So to make it easier for folks to test 32-bit code, provide a phony
      defconfig target that generates a 32-bit allmodconfig.
      
      The 32-bit port has several mutually exclusive CPU types; we choose
      the Book3S variants, as that's what the help text in Kconfig says is
      most common.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8db0c9d4
    • Michael Ellerman's avatar
      powerpc/64s: Show ori31 availability in spectre_v1 sysfs file not v2 · 6d44acae
      Michael Ellerman authored
      When I added the spectre_v2 information in sysfs, I included the
      availability of the ori31 speculation barrier.
      
      Although the ori31 barrier can be used to mitigate v2, it's primarily
      intended as a spectre v1 mitigation. Spectre v2 is mitigated by
      hardware changes.
      
      So rework the sysfs files to show the ori31 information in the
      spectre_v1 file, rather than v2.
      
      Currently we display eg:
      
        $ grep . spectre_v*
        spectre_v1:Mitigation: __user pointer sanitization
        spectre_v2:Mitigation: Indirect branch cache disabled, ori31 speculation barrier enabled
      
      After:
      
        $ grep . spectre_v*
        spectre_v1:Mitigation: __user pointer sanitization, ori31 speculation barrier enabled
        spectre_v2:Mitigation: Indirect branch cache disabled
      
      Fixes: d6fbe1c5 ("powerpc/64s: Wire up cpu_show_spectre_v2()")
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6d44acae
    • Nicholas Piggin's avatar
      powerpc: NMI IPI make NMI IPIs fully sychronous · 5b73151f
      Nicholas Piggin authored
      There is an asynchronous aspect to smp_send_nmi_ipi. The caller waits
      for all CPUs to call in to the handler, but it does not wait for
      completion of the handler. This is a needless complication, so remove
      it and always wait synchronously.
      
      The synchronous wait allows the caller to easily time out and clear
      the wait for completion (zero nmi_ipi_busy_count) in the case of badly
      behaved handlers. This would have prevented the recent smp_send_stop
      NMI IPI bug from causing the system to hang.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5b73151f
    • Nicholas Piggin's avatar
      powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely · 9b81c021
      Nicholas Piggin authored
      When the masked interrupt handler clears MSR[EE] for an interrupt in
      the PACA_IRQ_MUST_HARD_MASK set, it does not set PACA_IRQ_HARD_DIS.
      This makes the two get out of sync.
      
      With that taken into account, it's only low-level irq manipulation
      (and interrupt entry before reconcile) where they can be out of
      sync. This makes the code less surprising.
      
      It also allows the IRQ replay code to rely on the IRQ_HARD_DIS value
      and not have to mtmsrd again in this case (e.g., for an external
      interrupt that has been masked). The bigger benefit might just be
      that there is not such an element of surprise in these two bits of
      state.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9b81c021
    • Ram Pai's avatar
      selftests/powerpc: Fix ptrace-pkey for default execute permission change · 29e8131c
      Ram Pai authored
      The test case assumes execute permissions on unallocated keys are
      enabled by default, which is incorrect.
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      29e8131c
    • Ram Pai's avatar
      selftests/powerpc: Fix core-pkey for default execute permission change · 5db26e89
      Ram Pai authored
      A key's permissions are enabled only when the key is allocated.
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5db26e89
    • Ram Pai's avatar
      powerpc/pkeys: make protection key 0 less special · 07f522d2
      Ram Pai authored
      Applications need the ability to associate an address range with
      some key and later revert to its initial default key. Pkey-0 comes
      close to providing this function but falls short, because the
      current implementation disallows applications from explicitly
      associating pkey-0 with an address range.

      Let's make pkey-0 less special and treat it almost like any other
      key. Thus it can be explicitly associated with any address range,
      and can be freed. This gives the application more flexibility and
      power. The ability to free pkey-0 must be used responsibly, since
      pkey-0 is associated with almost all address ranges by default.
      
      Even with this change, pkey-0 continues to be slightly more special
      than other keys in the following ways:
      (a) it is implicitly allocated.
      (b) it is the default key assigned to any address-range.
      (c) its permissions cannot be modified by userspace.
      
      NOTE: (c) is specific to powerpc only. pkey-0 is associated by default
      with all pages including kernel pages, and pkeys are also active in
      kernel mode. If any permission is denied on pkey-0, the kernel running
      in the context of the application will be unable to operate.
      
      Tested on powerpc.
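From userspace, the new behavior can be exercised with the glibc pkey wrappers (pkey_alloc/pkey_mprotect/pkey_free, glibc >= 2.27). The helper below is a hypothetical sketch: it tags a range with a fresh key and then explicitly hands the range back to pkey-0, which is exactly the association this patch newly permits.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>

/* Hypothetical helper: associate a fresh pkey with [addr, addr+len),
 * then explicitly re-associate the default pkey-0. Returns 0 on
 * success, or errno on failure (e.g. when the kernel or hardware
 * lacks pkey support). */
int tag_then_restore(void *addr, size_t len)
{
	int pkey = pkey_alloc(0, 0);

	if (pkey < 0)
		return errno;
	if (pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey) < 0)
		return errno;
	/* Revert the range to its initial default key, pkey-0: */
	if (pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, 0) < 0)
		return errno;
	return pkey_free(pkey) < 0 ? errno : 0;
}
```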
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      [mpe: Drop #define PKEY_0 0 in favour of plain old 0]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      07f522d2
    • Ram Pai's avatar
      powerpc/pkeys: Preallocate execute-only key · a4fcc877
      Ram Pai authored
      The execute-only key is allocated dynamically. This is a problem:
      when a thread implicitly creates an execute-only key and resets the
      UAMOR for that key, the UAMOR value does not percolate to all the
      other threads. Any other thread may unknowingly change the
      permissions on the key, which can make the key not execute-only for
      that thread.
      
      Preallocate the execute-only key and ensure that no thread can change
      the permission of the key, by resetting the corresponding bit in
      UAMOR.
      
      Fixes: 5586cf61 ("powerpc: introduce execute-only pkey")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a4fcc877
    • Ram Pai's avatar
      powerpc/pkeys: Fix calculation of total pkeys. · fe6a2804
      Ram Pai authored
      The calculation of the total number of pkeys is off by one. Fix it.
      
      Fixes: 4fb158f6 ("powerpc: track allocation status of all pkeys")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fe6a2804
    • Ram Pai's avatar
      powerpc/pkeys: Save the pkey registers before fork · c76662e8
      Ram Pai authored
      When a thread forks, the contents of the AMR, IAMR and UAMOR
      registers in the newly forked thread are not inherited.

      Save the registers before forking, so that their contents are
      automatically copied into the new thread.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c76662e8
    • Ram Pai's avatar
      powerpc/pkeys: key allocation/deallocation must not change pkey registers · 4a4a5e5d
      Ram Pai authored
      Key allocation and deallocation have the side effect of programming
      the UAMOR/AMR/IAMR registers. This is wrong, since it is the
      responsibility of the application, not the kernel, to modify the
      permissions on a key.
      
      Do not modify the pkey registers at key allocation/deallocation.
      
      This patch also fixes a bug where a sys_pkey_free() resets the UAMOR
      bits of the key, thus making its permissions unmodifiable from user
      space. Later if the same key gets reallocated from a different thread
      this thread will no longer be able to change the permissions on the key.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4a4a5e5d
    • Ram Pai's avatar
      powerpc/pkeys: Deny read/write/execute by default · de113256
      Ram Pai authored
      Deny all permissions on all keys, with some exceptions. pkey-0 must
      allow all permissions, or else everything comes to a screeching
      halt. The execute-only key must allow execute permission.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de113256
    • Ram Pai's avatar
      powerpc/pkeys: Give all threads control of their key permissions · a57a04c7
      Ram Pai authored
      Currently in a multithreaded application, a key allocated by one
      thread is not usable by other threads. By "not usable" we mean that
      other threads are unable to change the access permissions for that
      key for themselves.
      
      When a new key is allocated in one thread, the corresponding UAMOR
      bits for that thread get enabled, however the UAMOR bits for that key
      for all other threads remain disabled.
      
      Other threads have no way to set permissions on the key, and the
      current default permissions are that read/write is enabled for all
      keys, which means the key has no effect for other threads. Although
      that may be the desired behaviour in some circumstances, having all
      threads able to control their permissions for the key is more
      flexible.
      
      The current behaviour also differs from the x86 behaviour, which is
      problematic for users.
      
      To fix this, enable the UAMOR bits for all keys, at process
      creation (in start_thread(), ie exec time). Since the contents of
      UAMOR are inherited at fork, all threads are capable of modifying the
      permissions on any key.
      
      This is technically an ABI break on powerpc, but pkey support is fairly
      new on powerpc and not widely used, and this brings us into
      line with x86.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Tested-by: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      [mpe: Reword some of the changelog]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a57a04c7
  2. 20 Jul, 2018 5 commits
  3. 19 Jul, 2018 8 commits
  4. 16 Jul, 2018 1 commit
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126
      Alexey Kardashevskiy authored
      At the moment we allocate the entire TCE table, twice (the hardware
      part and the userspace translation cache). This normally works, as
      we normally have contiguous memory and the guest will map the entire
      RAM for 64-bit DMA.

      However if we have sparse RAM (one example is a memory device), then
      we will allocate TCEs which will never be used, as the guest only
      maps actual memory for DMA. If it is a single-level TCE table, there
      is nothing we can really do, but if it is a multilevel table, we can
      skip allocating TCEs we know we won't need.

      This adds the ability to allocate only the first level, saving memory.
      
      This changes iommu_table::free() to avoid allocating an extra level;
      iommu_table::set() will do this when needed.

      This adds an @alloc parameter to iommu_table::exchange() to tell the
      callback whether it can allocate an extra level; the flag is set to
      "false" for the real-mode KVM handlers of H_PUT_TCE hcalls, and the
      callback returns H_TOO_HARD.

      This still requires the entire table to be counted in mm::locked_vm.

      To be conservative, this only does on-demand allocation when the
      userspace cache table is requested, which is the case for VFIO.
      
      The example math for a system replicating a powernv setup with NVLink2
      in a guest:
      16GB RAM mapped at 0x0
      128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
      
      the table to cover all of that with 64K pages takes:
      (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB

      If we allocate only the necessary TCE levels, we will only need:
      (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for
      indirect levels).
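The table-size arithmetic above (one 8-byte TCE per 64K IOMMU page, result in MB) can be captured in a small helper; the function name is made up for illustration:

```c
/* Bytes of TCE table needed for a DMA window with 64K (2^16) pages and
 * 8-byte TCE entries, expressed in MB: ((window >> 16) * 8) >> 20. */
unsigned long tce_table_mb(unsigned long long window_bytes)
{
	return (unsigned long)(((window_bytes >> 16) * 8) >> 20);
}
```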
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a68bd126