    nvme: extend and modify the APST configuration algorithm
    
    
    The algorithm that was used until now for building the APST configuration
    table has been found to produce entries with excessively long ITPT
    (idle time prior to transition) for devices declaring relatively long
    entry and exit latencies for non-operational power states. This leads
    to unnecessary waste of power and, as a result, failure to pass
    mandatory power consumption tests on Chromebook platforms.
    
    The new algorithm is based on two predefined ITPT values and two
    predefined latency tolerances. Based on these values, as well as on
    exit and entry latencies reported by the device, the algorithm looks
    for up to 2 suitable non-operational power states to use as primary
    and secondary APST transition targets. The predefined values are
    supplied to the nvme driver as module parameters (an illustrative
    sketch of the declarations and the selection logic follows the list):
    
     - apst_primary_timeout_ms (default: 100)
     - apst_secondary_timeout_ms (default: 2000)
     - apst_primary_latency_tol_us (default: 15000)
     - apst_secondary_latency_tol_us (default: 100000)
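    
    To make the mechanics concrete, below is a minimal sketch of how the
    parameters can be declared and how the two-target selection could
    work. The parameter names and defaults come from the list above;
    struct nop_state and apst_pick_targets are hypothetical names used
    for exposition only, not the patch's actual code.
    
      #include <linux/module.h>
      #include <linux/types.h>
    
      /* Tunables described above; defaults per this commit message. */
      static unsigned long apst_primary_timeout_ms = 100;
      module_param(apst_primary_timeout_ms, ulong, 0644);
      MODULE_PARM_DESC(apst_primary_timeout_ms,
              "primary APST timeout in ms");
    
      static unsigned long apst_secondary_timeout_ms = 2000;
      module_param(apst_secondary_timeout_ms, ulong, 0644);
      MODULE_PARM_DESC(apst_secondary_timeout_ms,
              "secondary APST timeout in ms");
    
      static unsigned long apst_primary_latency_tol_us = 15000;
      module_param(apst_primary_latency_tol_us, ulong, 0644);
      MODULE_PARM_DESC(apst_primary_latency_tol_us,
              "primary APST latency tolerance in us");
    
      static unsigned long apst_secondary_latency_tol_us = 100000;
      module_param(apst_secondary_latency_tol_us, ulong, 0644);
      MODULE_PARM_DESC(apst_secondary_latency_tol_us,
              "secondary APST latency tolerance in us");
    
      /* Hypothetical per-state summary: the power state index and its
       * total (entry + exit) latency as reported by the device.
       */
      struct nop_state {
              unsigned int ps;
              u64 total_latency_us;
      };
    
      /*
       * Sketch of the two-target selection. st[] is assumed to hold
       * the non-operational states ordered from shallowest to deepest.
       * Walking from the deepest state down, the deepest state whose
       * total latency fits the secondary tolerance becomes the
       * secondary target (entered after the long secondary timeout);
       * the deepest state that also fits the tighter primary tolerance
       * becomes the primary target (entered after the short primary
       * timeout). The two may coincide, leaving a single target.
       */
      static void apst_pick_targets(const struct nop_state *st, int n,
                                    int *primary, int *secondary)
      {
              int i;
    
              *primary = -1;
              *secondary = -1;
              for (i = n - 1; i >= 0; i--) {
                      if (*secondary < 0 && st[i].total_latency_us <=
                          apst_secondary_latency_tol_us)
                              *secondary = st[i].ps;
                      if (*primary < 0 && st[i].total_latency_us <=
                          apst_primary_latency_tol_us)
                              *primary = st[i].ps;
              }
      }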
    
    The algorithm echoes the approach used by Intel's and Microsoft's drivers
    on Windows, and the specific default parameter values are also based on
    those drivers. Unlike them, however, this patch doesn't introduce the
    ability to dynamically regenerate the APST table when the power source
    switches between AC and battery. Adding this functionality may be
    considered in the future. In the meantime, the default timeouts and
    tolerances reflect a compromise between the values used by Microsoft
    for the AC and battery scenarios.
    
    On most NVMe devices the new algorithm results in a more aggressive
    power saving policy. While beneficial in most cases, this sometimes
    comes at the price of higher IO processing latency in certain scenarios,
    as well as of a potential impact on the drive's endurance (due to more
    frequent context saving when entering deep non-operational states). So
    in order to provide a fallback for systems where these regressions
    cannot be tolerated, the patch allows reverting to the legacy behavior
    by setting either the apst_primary_timeout_ms or the
    apst_primary_latency_tol_us parameter to 0 (see the sketch below).
    Eventually (and possibly after fine-tuning the default values of the
    module parameters) the legacy behavior can be removed.
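    
    As an illustration, the choice between the two paths could be gated
    along these lines (a minimal sketch; the helper name is hypothetical):
    
      /* Use the legacy table-building path when the new algorithm
       * is disabled via the module parameters.
       */
      static bool nvme_apst_use_new_algorithm(void)
      {
              return apst_primary_timeout_ms != 0 &&
                     apst_primary_latency_tol_us != 0;
      }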
    
    TESTING.
    
    The new algorithm has been extensively tested. Initially, simulations were
    used to compare APST tables generated by old and new algorithms for a wide
    range of devices. After that, power consumption, performance and latencies
    were measured under different workloads on devices from multiple vendors
    (WD, Intel, Samsung, Hynix, Kioxia). Below is a description of the tests
    and the findings.
    
    General observations.
    The effect the patch has on the APST table varies depending on the entry
    and exit latencies advertised by the devices. For some devices the effect
    is negligible (e.g. Kioxia KBG40ZNS); for others it is significant, either
    making the transitions to PS3 and PS4 much quicker (e.g. WD SN530,
    Intel 760P) or making the sleep deeper, PS4 rather than PS3, after a
    similar amount of time (e.g. SK Hynix BC511). For some devices (e.g.
    Samsung PM991) the effect is mixed: the initial transition happens after
    a longer idle time, but takes the device to a lower power state.
    
    Workflows.
    In order to evaluate the patch's effect on power consumption and latency,
    7 workflows were used for each device. The workflows were designed to test
    the scenarios where significant differences between the old and new
    behaviors are most likely. Each workflow was tested twice: with the new
    and with the old APST table generation implementation. Power consumption,
    performance and latency were measured in the process. The following
    workflows were used (a sketch of one burst/idle pattern follows the list):
    1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
    2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
       idle time
    3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
       idle time
    4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
       idle time
    5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
       idle time
    6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
       idle time
    7) Repeated pattern of a single random read of a 4K packet followed by 150ms
       idle time
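    
    For illustration, the burst/idle pattern of workflow 2 can be reproduced
    with a small user-space program along the following lines. This is a
    sketch under assumptions, not the actual test harness: the device path
    is a placeholder, and the real measurements may have used different
    tooling.
    
      #define _GNU_SOURCE             /* for O_DIRECT */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
    
      int main(int argc, char **argv)
      {
              /* placeholder target; pass the namespace under test */
              const char *path = argc > 1 ? argv[1] : "/dev/nvme0n1";
              int fd = open(path, O_WRONLY | O_DIRECT);
              void *buf;
              int i;
    
              if (fd < 0) {
                      perror("open");
                      return 1;
              }
              /* O_DIRECT requires an aligned buffer */
              if (posix_memalign(&buf, 4096, 4096))
                      return 1;
              memset(buf, 0xaa, 4096);
    
              for (;;) {
                      lseek(fd, 0, SEEK_SET);
                      /* burst: 1000 consecutive 4K writes */
                      for (i = 0; i < 1000; i++)
                              if (write(fd, buf, 4096) != 4096) {
                                      perror("write");
                                      return 1;
                              }
                      usleep(50 * 1000);      /* 50ms idle */
              }
      }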
    
    Power consumption.
    Actual power consumption measurements produced predictable results, in
    accordance with the APST mechanism's theory of operation.
    Devices with long entry and exit latencies, such as WD SN530, showed a
    huge improvement of up to 62% on scenarios 4, 5 and 6. Devices such as
    Kioxia KBG40ZNS, where the resulting APST table is virtually identical
    under both the legacy and the new algorithm, showed little or no change
    in average power consumption on all workflows. Devices with extra short
    latencies, such as Samsung PM991, showed a moderate increase in power
    consumption of up to 18% in worst-case scenarios.
    In addition, on Intel and Samsung devices a more complex impact was
    observed on scenarios 3, 4 and 7. Our understanding is that, due to the
    longer stays in deep non-operational states between the writes, the
    devices start performing background operations, leading to an increase
    in power consumption. With the old APST tables, part of these operations
    is delayed until the scenario is over and a longer idle period begins,
    but eventually this extra power is consumed anyway.
    
    Performance.
    In terms of performance measured on sustained write or read scenarios,
    the effect of the patch is minimal, since in those cases the device
    doesn't enter low power states.
    
    Latency.
    As expected, on devices where the patch causes a more aggressive power
    saving policy (e.g. WD SN530, Intel 760P), an increase in latency was
    observed in certain scenarios. Workflow number 7, specifically designed
    to simulate the worst case scenario as far as latency is concerned,
    indeed shows a sharp increase in average latency (~2ms -> ~53ms on
    Intel 760P and ~0.6ms -> ~10ms on WD SN530). The latency increase on
    other workloads and other devices is much milder or non-existent.
    
    Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>