1. 03 Jun, 2021 2 commits
    • nvme: extend and modify the APST configuration algorithm · ebd8a93a
      Alexey Bogoslavsky authored
      The algorithm that was used until now for building the APST configuration
      table has been found to produce entries with excessively long ITPT
      (idle time prior to transition) for devices declaring relatively long
      entry and exit latencies for non-operational power states. This leads
      to unnecessary waste of power and, as a result, failure to pass
      mandatory power consumption tests on Chromebook platforms.
      
      The new algorithm is based on two predefined ITPT values and two
      predefined latency tolerances. Based on these values, as well as on
      exit and entry latencies reported by the device, the algorithm looks
      for up to 2 suitable non-operational power states to use as primary
      and secondary APST transition targets. The predefined values are
      supplied to the nvme driver as module parameters:
      
       - apst_primary_timeout_ms (default: 100)
       - apst_secondary_timeout_ms (default: 2000)
       - apst_primary_latency_tol_us (default: 15000)
       - apst_secondary_latency_tol_us (default: 100000)
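
      The selection described above can be sketched roughly as follows. This
      is an illustration only, not the driver code: the PowerState type, the
      pick_apst_targets() helper and the example device are hypothetical, and
      the sketch assumes that a state's entry and exit latencies are summed
      before being compared against a tolerance and that a higher state index
      means a deeper state.

```python
from dataclasses import dataclass
from typing import Iterable, NamedTuple, Optional

# Defaults copied from the module parameters listed above.
APST_PRIMARY_TIMEOUT_MS = 100
APST_SECONDARY_TIMEOUT_MS = 2000
APST_PRIMARY_LATENCY_TOL_US = 15_000
APST_SECONDARY_LATENCY_TOL_US = 100_000


@dataclass(frozen=True)
class PowerState:
    """Hypothetical view of one power state reported by the device."""
    index: int             # PS number; assumed: higher index = deeper state
    non_operational: bool
    entry_lat_us: int
    exit_lat_us: int


class Target(NamedTuple):
    state: int     # power state to transition to
    itpt_ms: int   # idle time prior to transition


class ApstTargets(NamedTuple):
    primary: Optional[Target]
    secondary: Optional[Target]


def pick_apst_targets(states: Iterable[PowerState]) -> ApstTargets:
    """Pick up to two non-operational states as APST transition targets."""
    primary = None
    secondary = None
    # Walk from the deepest state to the shallowest one.
    for ps in sorted(states, key=lambda s: s.index, reverse=True):
        if not ps.non_operational:
            continue
        # Assumption: the tolerance is checked against entry + exit latency.
        total_us = ps.entry_lat_us + ps.exit_lat_us
        if total_us <= APST_PRIMARY_LATENCY_TOL_US:
            primary = Target(ps.index, APST_PRIMARY_TIMEOUT_MS)
            break  # deepest state within the primary tolerance found
        if total_us <= APST_SECONDARY_LATENCY_TOL_US and secondary is None:
            secondary = Target(ps.index, APST_SECONDARY_TIMEOUT_MS)
    return ApstTargets(primary, secondary)


# Hypothetical device: PS0-PS2 operational, PS3 and PS4 non-operational.
example = [
    PowerState(0, False, 0, 0),
    PowerState(1, False, 0, 0),
    PowerState(2, False, 0, 0),
    PowerState(3, True, 5_000, 10_000),   # 15 ms total: fits primary tolerance
    PowerState(4, True, 5_000, 45_000),   # 50 ms total: fits secondary only
]
print(pick_apst_targets(example))
```

      For this made-up device the sketch selects PS3 as the primary target
      (ITPT 100 ms) and PS4 as the secondary target (ITPT 2000 ms).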
      
      The algorithm echoes the approach used by Intel's and Microsoft's drivers
      on Windows. The specific default parameter values are also based on those
      drivers. Yet, this patch doesn't introduce the ability to dynamically
      regenerate the APST table in the event of switching the power source from
      AC to battery and back. Adding this functionality may be considered in the
      future. In the meantime, the timeouts and tolerances reflect a compromise
      between values used by Microsoft for AC and battery scenarios.
      
      On most NVMe devices the new algorithm results in a more aggressive
      power saving policy. While beneficial in most cases, this sometimes
      comes at the price of higher IO processing latency in certain
      scenarios, as well as a potential impact on the drive's endurance
      (due to more frequent context saving when entering deep
      non-operational states). To provide a fallback for systems where
      these regressions cannot be tolerated, the patch allows reverting to
      the legacy behavior by setting either the apst_primary_timeout_ms or
      the apst_primary_latency_tol_us parameter to 0. Eventually (and
      possibly after fine tuning the default values of the module
      parameters) the legacy behavior can be removed.
      
      TESTING.
      
      The new algorithm has been extensively tested. Initially, simulations were
      used to compare APST tables generated by old and new algorithms for a wide
      range of devices. After that, power consumption, performance and latencies
      were measured under different workloads on devices from multiple vendors
      (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
      and the findings.
      
      General observations.
      The effect the patch has on the APST table varies depending on the entry and
      exit latencies advertised by the devices. For some devices the effect is
      negligible (e.g. Kioxia KBG40ZNS). For others it is significant: transitions
      to PS3 and PS4 happen much sooner (e.g. WD SN530, Intel 760P), or the device
      sleeps deeper, reaching PS4 rather than PS3 after a similar amount of time
      (e.g. SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is
      mixed: the initial transition happens after a longer idle time, but takes
      the device to a lower power state.
      
      Workflows.
      In order to evaluate the patch's effect on the power consumption and latency,
      7 workflows were used for each device. The workflows were designed to test
      the scenarios where significant differences between the old and new behaviors
      are most likely. Each workflow was tested twice: with the new and with the
      old APST table generation implementation. Power consumption, performance and
      latency were measured in the process. The following workflows were used:
      1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
      2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
         idle time
      3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
         idle time
      4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
         idle time
      5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
         idle time
      6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
         idle time
      7) Repeated pattern of a single random read of a 4K packet followed by 150ms
         idle time
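
      To see why these idle times were chosen, they can be compared with the
      default ITPT values (100 ms and 2000 ms). The sketch below is a
      simplification: it assumes the device has both a primary and a secondary
      target and that a transition starts as soon as the idle time reaches the
      timeout. The deepest_target_reached() helper is hypothetical.

```python
PRIMARY_TIMEOUT_MS = 100     # apst_primary_timeout_ms default
SECONDARY_TIMEOUT_MS = 2000  # apst_secondary_timeout_ms default

# Idle time per workflow in milliseconds; workflow 1 has no pauses.
WORKFLOW_IDLE_MS = {1: 0, 2: 50, 3: 150, 4: 500, 5: 1500, 6: 5000, 7: 150}


def deepest_target_reached(idle_ms: int) -> str:
    """Return which APST target an idle period is long enough to reach.

    Assumes both targets exist and that the transition starts as soon as
    the idle time reaches the corresponding timeout.
    """
    if idle_ms >= SECONDARY_TIMEOUT_MS:
        return "secondary"
    if idle_ms >= PRIMARY_TIMEOUT_MS:
        return "primary"
    return "none"


for workflow, idle_ms in WORKFLOW_IDLE_MS.items():
    print(f"workflow {workflow}: idle {idle_ms} ms -> "
          f"{deepest_target_reached(idle_ms)}")
```

      Under these assumptions only workflow 6 idles long enough to reach the
      secondary target, workflows 3, 4, 5 and 7 reach only the primary one,
      and workflows 1 and 2 never idle long enough for a transition.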
      
      Power consumption.
      Actual power consumption measurements produced predictable results in
      accordance with the APST mechanism's theory of operation.
      Devices with long entry and exit latencies such as WD SN530 showed a large
      improvement of up to 62% on workflows 4, 5 and 6. Devices such as Kioxia
      KBG40ZNS, where the resulting APST table looks virtually identical with
      both legacy and new algorithms, showed little or no change in the average
      power consumption on all workflows. Devices with extra short latencies such
      as Samsung PM991 showed a moderate increase in power consumption of up to
      18% in worst case scenarios.
      In addition, on Intel and Samsung devices a more complex impact was observed
      on workflows 3, 4 and 7. Our understanding is that due to a longer stay in
      deep non-operational states between the writes, the devices start performing
      background operations, leading to an increase in power consumption. With the
      old APST tables, some of these operations are delayed until the workflow is
      over and a longer idle period begins, but eventually this extra power is
      consumed anyway.
      
      Performance.
      In terms of performance measured on sustained write or read scenarios, the
      effect of the patch is minimal as in this case the device doesn't enter low power
      states.
      
      Latency.
      As expected, in devices where the patch causes a more aggressive power saving
      policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
      certain scenarios. Workflow number 7, specifically designed to simulate the
      worst case scenario as far as latency is concerned, indeed shows a sharp
      increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6ms -> 10ms
      on WD SN530). The latency increase on other workloads and other devices is
      much milder or non-existent.
      Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: remove redundant initialization of variable ret · 13ce7e62
      Colin Ian King authored
      The variable ret is initialized with a value that is never read, because
      it is updated later on. The assignment is redundant and can be
      removed.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  2. 24 May, 2021 13 commits
  3. 23 May, 2021 18 commits
  4. 22 May, 2021 4 commits
    • Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block · 4ff2473b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Fix BLKRRPART and deletion race (Gulam, Christoph)
      
       - NVMe pull request (Christoph):
            - nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith
              Busch)
            - nvme-fc teardown fix (James Smart)
            - nvmet/nvme-loop memory leak fixes (Wu Bo)
      
      * tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
        block: fix a race between del_gendisk and BLKRRPART
        block: prevent block device lookups at the beginning of del_gendisk
        nvme-fc: clear q_live at beginning of association teardown
        nvme-tcp: rerun io_work if req_list is not empty
        nvme-tcp: fix possible use-after-completion
        nvme-loop: fix memory leak in nvme_loop_create_ctrl()
        nvmet: fix memory leak in nvmet_alloc_ctrl()
    • Merge tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block · b9231dfb
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "One fix for a regression with poll in this merge window, and another
        just hardens the io-wq exit path a bit"
      
      * tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
        io_uring: fortify tctx/io_wq cleanup
        io_uring: don't modify req->poll for rw
    • Merge tag 'for-linus-5.13b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 23d72926
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - a fix for a boot regression when running as PV guest on hardware
         without NX support
      
       - a small series fixing a bug in the Xen pciback driver when
         configuring a PCI card with multiple virtual functions
      
      * tag 'for-linus-5.13b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-pciback: reconfigure also from backend watch handler
        xen-pciback: redo VF placement in the virtual topology
        x86/Xen: swap NX determination and GDT setup on BSP
    • Merge tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · a3969ef4
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
      
       - Fix some math errors in the realtime allocator when extent size hints
         are applied.
      
       - Fix unnecessary short writes to realtime files when free space is
         fragmented.
      
       - Fix a crash when using scrub tracepoints.
      
       - Restore ioctl uapi definitions that were accidentally removed in
         5.13-rc1.
      
      * tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: restore old ioctl definitions
        xfs: fix deadlock retry tracepoint arguments
        xfs: retry allocations when locality-based search fails
        xfs: adjust rt allocation minlen when extszhint > rtextsize
  5. 21 May, 2021 3 commits