    clocksource: Avoid accidental unstable marking of clocksources · c86ff8c5
    Waiman Long authored
    Since commit db3a34e1 ("clocksource: Retry clock read if long delays
    detected") and commit 2e27e793 ("clocksource: Reduce clocksource-skew
    threshold"), the tsc clocksource has sometimes fallen back to hpet on
    both Intel and AMD systems, especially when they are running stressful
    benchmarking workloads. Of the 23 systems tested with a v5.14 kernel,
    10 switched to the hpet clocksource during the test run.
    
    Falling back to hpet drastically reduces performance when running
    benchmarks. For example, fio performance can drop by up to 70% and
    iperf3 performance by up to 80%.
    
    Four hpet fallbacks happened during bootup:
    
      [    8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable
      [   12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable
      [   17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable
      [   17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
    
    Other fallbacks happened while the systems were running stressful
    benchmarks. For example:
    
      [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable
      [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
    
    Commit 2e27e793 ("clocksource: Reduce clocksource-skew threshold")
    changed the skew margin from 100us to 50us. I think this is too small
    and can easily be exceeded when running stressful workloads on a
    thermally stressed system, so it is switched back to 100us.
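
As a quick check of the numbers in the logs above, the following sketch compares the reported read-back delays against the two margins. The macro and function names are hypothetical and chosen for illustration only; the "greater than" comparison is an assumption about how the margin is applied.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_USEC   1000LL
/* Hypothetical names for the two margins discussed in this message. */
#define OLD_MAX_SKEW_NS (50 * NSEC_PER_USEC)   /* margin set by 2e27e793 */
#define NEW_MAX_SKEW_NS (100 * NSEC_PER_USEC)  /* margin restored here */

/* True if a watchdog read-back delay exceeds the given margin. */
bool delay_exceeds(int64_t delay_ns, int64_t max_skew_ns)
{
    return delay_ns > max_skew_ns;
}
```

With these values, the two benchmark-time delays (57269ns and 61460ns) exceed the 50us margin but stay within 100us, whereas the four boot-time delays (182291ns to 263750ns) exceed 100us as well.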
    
    Even a maximum skew margin of 100us may be too small for some systems
    when booting up, especially if those systems are under thermal stress.
    To rule out the case where a large skew is caused by a busy system
    slowing down the reads of both the watchdog and the clocksource, an
    extra consecutive read of the watchdog clock is done.
    
    The consecutive watchdog read delay is compared against
    WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that
    the system is just too busy. A warning will be printed to the console
    and the clock skew check is skipped for this round.
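
The decision described above can be sketched in userspace C as follows. This is a simplified illustration, not the code in clocksource.c: the type and function names, the use of precomputed nanosecond timestamps, and the choice to apply the busy check on each slow attempt are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_USEC        1000LL
#define WATCHDOG_MAX_SKEW_NS (100 * NSEC_PER_USEC)  /* 100us margin */

/* Illustrative outcome of one watchdog round (names are assumptions). */
enum wd_read_status {
    WD_READ_SUCCESS,   /* read-back delay within margin, skew check can run */
    WD_READ_UNSTABLE,  /* all attempts slow although the watchdog reads fast */
    WD_READ_SKIP       /* system too busy, skip the skew check this round */
};

/* Timestamps of one attempt, already converted to nanoseconds. */
struct wd_sample {
    int64_t wd_start_ns;  /* watchdog read before the clocksource read */
    int64_t wd_end_ns;    /* watchdog read-back after the clocksource read */
    int64_t wd_end2_ns;   /* extra consecutive watchdog read */
};

enum wd_read_status wd_classify(const struct wd_sample *samples,
                                size_t nattempts)
{
    for (size_t i = 0; i < nattempts; i++) {
        int64_t wd_delay = samples[i].wd_end_ns - samples[i].wd_start_ns;

        if (wd_delay <= WATCHDOG_MAX_SKEW_NS)
            return WD_READ_SUCCESS;

        /* Two back-to-back watchdog reads should be fast. If even they
         * take more than half the margin, blame system load. */
        int64_t wd_seq_delay = samples[i].wd_end2_ns - samples[i].wd_end_ns;

        if (wd_seq_delay > WATCHDOG_MAX_SKEW_NS / 2)
            return WD_READ_SKIP;
    }
    /* Every attempt exceeded the margin without signs of a busy system. */
    return WD_READ_UNSTABLE;
}
```

In this sketch, an attempt with a 263750ns read-back delay followed by a 60000ns consecutive-read delay yields `WD_READ_SKIP`, while the same read-back delay followed by a 1000ns consecutive-read delay counts as a failed attempt and, once all attempts are exhausted, yields `WD_READ_UNSTABLE`.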
    
    Fixes: db3a34e1 ("clocksource: Retry clock read if long delays detected")
    Fixes: 2e27e793 ("clocksource: Reduce clocksource-skew threshold")
    Signed-off-by: Waiman Long <longman@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
clocksource.c 41.1 KB