1. 14 Jun, 2017 11 commits
    • Robert Bragg's avatar
      drm/i915/perf: Add more OA configs for BDW, CHV, SKL + BXT · fc599211
      Robert Bragg authored
      These are auto generated from an XML description of metric sets,
      currently maintained in gputop, ref:
      
       https://github.com/rib/gputop
       > gputop-data/oa-*.xml
       > scripts/i915-perf-kernelgen.py
      
       $ make -C gputop-data -f Makefile.xml
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: default avatarMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      fc599211
    • Robert Bragg's avatar
      drm/i915/perf: Add OA unit support for Gen 8+ · 19f81df2
      Robert Bragg authored
      Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all
      share (more-or-less) the same OA unit design.
      
      Of particular note in comparison to Haswell: some OA unit HW config
      state has become per-context state and as a consequence it is somewhat
      more complicated to manage synchronous state changes from the cpu while
      there's no guarantee of what context (if any) is currently actively
      running on the gpu.
      
      The periodic sampling frequency which can be particularly useful for
      system-wide analysis (as opposed to command stream synchronised
      MI_REPORT_PERF_COUNT commands) is perhaps the most surprising state to
      have become per-context save and restored (while the OABUFFER
      destination is still a shared, system-wide resource).
      
      This support for gen8+ takes care to consider a number of timing
      challenges involved in synchronously updating per-context state
      primarily by programming all config state from the cpu and updating all
      current and saved contexts synchronously while the OA unit is still
      disabled.
      
      The driver intentionally avoids depending on command streamer
      programming to update OA state considering the lack of synchronization
      between the automatic loading of OACTXCONTROL state (that includes the
      periodic sampling state and enable state) on context restore and the
      parsing of any general purpose BB the driver can control. I.e. this
      implementation is careful to avoid the possibility of a context restore
      temporarily enabling any out-of-date periodic sampling state. In
      addition to the risk of transiently-out-of-date state being loaded
      automatically; there are also internal HW latencies involved in the
      loading of MUX configurations which would be difficult to account for
      from the command streamer (and we only want to enable the unit when once
      the MUX configuration is complete).
      
      Since the Gen8+ OA unit design no longer supports clock gating the unit
      off for a single given context (which effectively stopped any progress
      of counters while any other context was running) and instead supports
      tagging OA reports with a context ID for filtering on the CPU, it means
      we can no longer hide the system-wide progress of counters from a
      non-privileged application only interested in metrics for its own
      context. Although we could theoretically try and subtract the progress
      of other contexts before forwarding reports via read() we aren't in a
      position to filter reports captured via MI_REPORT_PERF_COUNT commands.
      As a result, for Gen8+, we always require the
      dev.i915.perf_stream_paranoid to be unset for any access to OA metrics
      if not root.
      
      v5: Drain submitted requests when enabling metric set to ensure no
          lite-restore erases the context image we just updated (Lionel)
      
      v6: In addition to drain, switch to kernel context & update all
          context in place (Chris)
      
      v7: Add missing mutex_unlock() if switching to kernel context fails
          (Matthew)
      
      v8: Simplify OA period/flex-eu-counters programming by using the
          batchbuffer instead of modifying ctx-image (Lionel)
      
      v9: Back to updating the context image (due to erroneous testing,
          batchbuffer programming the OA unit doesn't actually work)
          (Lionel)
          Pin context before updating context image (Chris)
          Drop MMIO programming now that we switch to a kernel context with
          right values in initial context image (Chris)
      
      v10: Just pin_map the contexts we want to modify or let the
           configuration happen on first use (Chris)
      
      v11: Update kernel context OA config through the batchbuffer rather
           than on the fly ctx-image update (Lionel)
      
      v12: Rework OA context registers update again by swithing away from
           user contexts and reconfiguring the kernel context through the
           batchbuffer and updating all the other contexts' context image.
           Also take care to lock slice/subslice configuration when OA is
           on. (Lionel)
      
      v13: Request rpcs updates on all engine when updating the OA config
           (Lionel)
      
      v14: Drop any kind of rpcs management now that we monitor sseu
           configuration changes in a later patch (Lionel)
           Remove usleep after programming the NOA configs on Gen8+, this
           doesn't seem to be needed (Lionel)
      
      v15: Respect coding style for block comments (Chris)
      
      v16: Add missing i915_add_request() in case we fail to emit OA
           configuration (Matthew)
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com> \o/
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      19f81df2
    • Robert Bragg's avatar
      drm/i915/perf: Add 'render basic' Gen8+ OA unit configs · 5182f646
      Robert Bragg authored
      Adds a static OA unit, MUX, B Counter + Flex EU configurations for basic
      render metrics on Broadwell, Cherryview, Skylake and Broxton. These are
      auto generated from an XML description of metric sets, currently
      maintained in gputop, ref:
      
       https://github.com/rib/gputop
       > gputop-data/oa-*.xml
       > scripts/i915-perf-kernelgen.py
      
       $ make -C gputop-data -f Makefile.xml WHITELIST=RenderBasic
      
      v2: add newlines to debug messages + fix comment (Matthew Auld)
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: default avatarMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      5182f646
    • Lionel Landwerlin's avatar
      drm/i915/perf: rework mux configurations queries · 3f488d99
      Lionel Landwerlin authored
      Gen8+ might have mux configurations per slices/subslices. Depending on
      whether slices/subslices have been fused off, only part of the
      configuration needs to be applied. This change reworks the mux
      configurations query mechanism to allow more than one set of registers
      to be programmed.
      
      v2: s/n_mux_regs/n_mux_configs/ (Matthew)
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: default avatarMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      3f488d99
    • Robert Bragg's avatar
      drm/i915: expose _SUBSLICE_MASK GETPARM · f5320233
      Robert Bragg authored
      Assuming a uniform mask across all slices, this enables userspace to
      determine the specific sub slices can be enabled. This information is
      required, for example, to be able to analyse some OA counter reports
      where the counter configuration depends on the HW sub slice
      configuration.
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Reviewed-by: default avatarMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      f5320233
    • Robert Bragg's avatar
      drm/i915: expose _SLICE_MASK GETPARM · 7fed555c
      Robert Bragg authored
      Enables userspace to determine the maximum number of slices that can
      be enabled on the device and also know what specific slices can be
      enabled. This information is required, for example, to be able to
      analyse some OA counter reports where the counter configuration
      depends on the HW slice configuration.
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Reviewed-by: default avatarMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: default avatarLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Signed-off-by: default avatarBen Widawsky <ben@bwidawsk.net>
      7fed555c
    • Chris Wilson's avatar
      drm/i915: Reinstate reservation_object zapping for batch_pool objects · 9ee82d78
      Chris Wilson authored
      I removed the zapping of the reservation_object->fence array of shared
      fences prematurely. We don't yet have the code to zap that array when
      retiring the object, and so currently it remains possible to continually
      grow the shared array trapping requests when reusing the batch_pool
      object across many timelines.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Reviewed-by: default avatarTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170518094638.5469-4-chris@chris-wilson.co.uk
      9ee82d78
    • Chris Wilson's avatar
      drm/i915: Spin for struct_mutex inside shrinker · 290271de
      Chris Wilson authored
      Having resolved whether or not we would deadlock upon a call to
      mutex_lock(&dev->struct_mutex), we can then spin for the contended
      struct_mutex if we are not the owner. We cannot afford to simply block
      and wait for the mutex, as the owner may itself be waiting for the
      allocator -- i.e. a cyclic deadlock. This should significantly improve
      the chance of running the shrinker for other processes whilst the GPU is
      busy.
      
      A more balanced approach would be to optimistically spin whilst the
      mutex owner was on the cpu and there was an opportunity to acquire the
      mutex for ourselves quickly. However, that requires support from
      kernel/locking/ and a new mutex_spin_trylock() primitive.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-4-chris@chris-wilson.co.ukReviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      290271de
    • Chris Wilson's avatar
      drm/i915: Only restrict noreclaim in the early shrink passes · 0f6ab55d
      Chris Wilson authored
      In our first pass, we do not want to use reclaim at all as we want to
      solely reap the i915 buffer caches (its purgeable pages). But we don't
      mind it initiates IO or pulls via the FS (but it shouldn't anyway as we
      say no to reclaim!). Just drop the GFP_IO constraint for simplicity.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-3-chris@chris-wilson.co.ukReviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      0f6ab55d
    • Chris Wilson's avatar
      drm/i915: Remove __GFP_NORETRY from our buffer allocator · eaf41801
      Chris Wilson authored
      I tried __GFP_NORETRY in the belief that __GFP_RECLAIM was effective. It
      struggles with handling reclaim of our dirty buffers and relies on
      reclaim via kswapd. As a result, a single pass of direct reclaim is
      unreliable when i915 occupies the majority of available memory, and the
      only means of effectively waiting on kswapd to amke progress is by not
      setting the __GFP_NORETRY flag and lopping. That leaves us with the
      dilemma of invoking the oomkiller instead of propagating the allocation
      failure back to userspace where it can be handled more gracefully (one
      hopes).  In the future we may have __GFP_MAYFAIL to allow repeats up until
      we genuinely run out of memory and the oomkiller would have been invoked.
      Until then, let the oomkiller wreck havoc.
      
      v2: Stop playing with side-effects of gfp flags and await __GFP_MAYFAIL
      v3: Update comments that direct reclaim only appears to be ignoring our
      dirty buffers!
      
      Fixes: 24f8e00a ("drm/i915: Prefer to report ENOMEM rather than incur the oom for gfx allocations")
      Testcase: igt/gem_tiled_swapping
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-2-chris@chris-wilson.co.ukReviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      eaf41801
    • Chris Wilson's avatar
      drm/i915: Encourage our shrinker more when our shmemfs allocations fails · 4846bf0c
      Chris Wilson authored
      Commit 24f8e00a ("drm/i915: Prefer to report ENOMEM rather than
      incur the oom for gfx allocations") made the bold decision to try and
      avoid the oomkiller by reporting -ENOMEM to userspace if our allocation
      failed after attempting to free enough buffer objects. In short, it
      appears we were giving up too easily (even before we start wondering if
      one pass of reclaim is as strong as we would like). Part of the problem
      is that if we only shrink just enough pages for our expected allocation,
      the likelihood of those pages becoming available to us is less than 100%
      To counter-act that we ask for twice the number of pages to be made
      available. Furthermore, we allow the shrinker to pull pages from the
      active list in later passes.
      
      v2: Be a little more cautious in paging out gfx buffers, and leave that
      to a more balanced approach from shrink_slab(). Important when combined
      with "drm/i915: Start writeback from the shrinker" as anything shrunk is
      immediately swapped out and so should be more conservative.
      
      Fixes: 24f8e00a ("drm/i915: Prefer to report ENOMEM rather than incur the oom for gfx allocations")
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170609110350.1767-1-chris@chris-wilson.co.uk
      4846bf0c
  2. 12 Jun, 2017 23 commits
  3. 09 Jun, 2017 6 commits