1. 12 Dec, 2020 1 commit
  2. 11 Dec, 2020 9 commits
    • Thomas Gleixner's avatar
      tick/sched: Make jiffies update quick check more robust · aa3b66f4
      Thomas Gleixner authored
      The quick check in tick_do_update_jiffies64() whether jiffies need to be
      updated is not really correct under all circumstances and on all
      architectures, especially not on 32bit systems.
      
      The quick check does:
      
          if (now < READ_ONCE(tick_next_period))
          	return;
      
      and the counterpart in the update is:
      
          WRITE_ONCE(tick_next_period, next_update_time);
      
      This has two problems:
      
        1) On weakly ordered architectures there is no guarantee that the stores
           before the WRITE_ONCE() are visible which means that other CPUs can
           operate on a stale jiffies value.
      
        2) On 32bit the store of tick_next_period which is an u64 is split into
           two 32bit stores. If the first 32bit store advances tick_next_period
           far out and the second 32bit store is delayed (virt, NMI ...) then
           jiffies will become stale until the second 32bit store happens.
      
      Address this by seperating the handling for 32bit and 64bit.
      
      On 64bit problem #1 is addressed by replacing READ_ONCE() / WRITE_ONCE()
      with smp_load_acquire() / smp_store_release().
      
      On 32bit problem #2 is addressed by protecting the quick check with the
      jiffies sequence counter. The load and stores can be plain because the
      sequence count mechanics provides the required barriers already.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarFrederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/87czzpc02w.fsf@nanos.tec.linutronix.de
      aa3b66f4
    • Thomas Gleixner's avatar
      ntp: Consolidate the RTC update implementation · 76e87d96
      Thomas Gleixner authored
      The code for the legacy RTC and the RTC class based update are pretty much
      the same. Consolidate the common parts into one function and just invoke
      the actual setter functions.
      
      For RTC class based devices the update code checks whether the offset is
      valid for the device, which is usually not the case for the first
      invocation. If it's not the same it stores the correct offset and lets the
      caller try again. That's not much different from the previous approach
      where the first invocation had a pretty low probability to actually hit the
      allowed window.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220542.355743355@linutronix.de
      76e87d96
    • Thomas Gleixner's avatar
      ntp: Make the RTC sync offset less obscure · 69eca258
      Thomas Gleixner authored
      The current RTC set_offset_nsec value is not really intuitive to
      understand. 
      
        tsched       twrite(t2.tv_sec - 1) 	 t2 (seconds increment)
      
      The offset is calculated from twrite based on the assumption that t2 -
      twrite == 1s. That means for the MC146818 RTC the offset needs to be
      negative so that the write happens 500ms before t2.
      
      It's easier to understand when the whole calculation is based on t2. That
      avoids negative offsets and the meaning is obvious:
      
       t2 - twrite:     The time defined by the chip when seconds increment
            		  after the write.
      
       twrite - tsched: The time for the transport to the point where the chip
       	  	  is updated. 
      
      ==> set_offset_nsec =  t2 - tsched
          ttransport      =  twrite - tsched
          tRTCinc         =  t2 - twrite
      ==> set_offset_nsec =  ttransport + tRTCinc
      
      tRTCinc is a chip property and can be obtained from the data sheet.
      
      ttransport depends on how the RTC is connected. It is close to 0 for
      directly accessible RTCs. For RTCs behind a slow bus, e.g. i2c, it's the
      time required to send the update over the bus. This can be estimated or
      even calibrated, but that's a different problem.
      
      Adjust the implementation and update comments accordingly.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220542.263204937@linutronix.de
      69eca258
    • Thomas Gleixner's avatar
      ntp, rtc: Move rtc_set_ntp_time() to ntp code · 33e62e83
      Thomas Gleixner authored
      rtc_set_ntp_time() is not really RTC functionality as the code is just a
      user of RTC. Move it into the NTP code which allows further cleanups.
      Requested-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220542.166871172@linutronix.de
      33e62e83
    • Thomas Gleixner's avatar
      ntp: Make the RTC synchronization more reliable · c9e6189f
      Thomas Gleixner authored
      Miroslav reported that the periodic RTC synchronization in the NTP code
      fails more often than not to hit the specified update window.
      
      The reason is that the code uses delayed_work to schedule the update which
      needs to be in thread context as the underlying RTC might be connected via
      a slow bus, e.g. I2C. In the update function it verifies whether the
      current time is correct vs. the requirements of the underlying RTC.
      
      But delayed_work is using the timer wheel for scheduling which is
      inaccurate by design. Depending on the distance to the expiry the wheel
      gets less granular to allow batching and to avoid the cascading of the
      original timer wheel. See 500462a9 ("timers: Switch to a non-cascading
      wheel") and the code for further details.
      
      The code already deals with this by splitting the 660 seconds period into a
      long 659 seconds timer and then retrying with a smaller delta.
      
      But looking at the actual granularities of the timer wheel (which depend on
      the HZ configuration) the 659 seconds timer ends up in an outer wheel level
      and is affected by a worst case granularity of:
      
      HZ          Granularity
      1000        32s
       250        16s
       100        40s
      
      So the initial timer can be already off by max 12.5% which is not a big
      issue as the period of the sync is defined as ~11 minutes.
      
      The fine grained second attempt schedules to the desired update point with
      a timer expiring less than a second from now. Depending on the actual delta
      and the HZ setting even the second attempt can end up in outer wheel levels
      which have a large enough granularity to make the correctness check fail.
      
      As this is a fundamental property of the timer wheel there is no way to
      make this more accurate short of iterating in one jiffies steps towards the
      update point.
      
      Switch it to an hrtimer instead which schedules the actual update work. The
      hrtimer will expire precisely (max 1 jiffie delay when high resolution
      timers are not available). The actual scheduling delay of the work is the
      same as before.
      
      The update is triggered from do_adjtimex() which is a bit racy but not much
      more racy than it was before:
      
           if (ntp_synced())
           	queue_delayed_work(system_power_efficient_wq, &sync_work, 0);
      
      which is racy when the work is currently executed and has not managed to
      reschedule itself.
      
      This becomes now:
      
           if (ntp_synced() && !hrtimer_is_queued(&sync_hrtimer))
           	queue_work(system_power_efficient_wq, &sync_work, 0);
      
      which is racy when the hrtimer has expired and the work is currently
      executed and has not yet managed to rearm the hrtimer.
      
      Not a big problem as it just schedules work for nothing.
      
      The new implementation has a safe guard in place to catch the case where
      the hrtimer is queued on entry to the work function and avoids an extra
      update attempt of the RTC that way.
      Reported-by: default avatarMiroslav Lichvar <mlichvar@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarMiroslav Lichvar <mlichvar@redhat.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220542.062910520@linutronix.de
      c9e6189f
    • Thomas Gleixner's avatar
      rtc: core: Make the sync offset default more realistic · 354c796b
      Thomas Gleixner authored
      The offset which is used to steer the start of an RTC synchronization
      update via rtc_set_ntp_time() is huge. The math behind this is:
      
        tsched       twrite(t2.tv_sec - 1) 	 t2 (seconds increment)
      
      twrite - tsched is the transport time for the write to hit the device.
      
      t2 - twrite depends on the chip and is for most chips one second.
      
      The rtc_set_ntp_time() calculation of tsched is:
      
          tsched = t2 - 1sec - (t2 - twrite)
      
      The default for the sync offset is 500ms which means that twrite - tsched
      is 500ms assumed that t2 - twrite is one second.
      
      This is 0.5 seconds off for RTCs which are directly accessible by IO writes
      and probably for the majority of i2C/SPI based RTC off by an order of
      magnitude. Set it to 5ms which should bring it closer to reality.
      
      The default can be adjusted by drivers (rtc_cmos does so) and could be
      adjusted further by a calibration method which is an orthogonal problem.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220541.960333166@linutronix.de
      354c796b
    • Thomas Gleixner's avatar
      rtc: cmos: Make rtc_cmos sync offset correct · b0ecd8e8
      Thomas Gleixner authored
      The offset for rtc_cmos must be -500ms to work correctly with the current
      implementation of rtc_set_ntp_time() due to the following:
      
        tsched       twrite(t2.tv_sec - 1) 	 t2 (seconds increment)
      
      twrite - tsched is the transport time for the write to hit the device,
      which is negligible for this chip because it's accessed directly.
      
      t2 - twrite = 500ms according to the datasheet.
      
      But rtc_set_ntp_time() calculation of tsched is:
      
          tsched = t2 - 1sec - (t2 - twrite)
      
      The default for the sync offset is 500ms which means that the write happens
      at t2 - 1.5 seconds which is obviously off by a second for this device.
      
      Make the offset -500ms so it works correct.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220541.830517160@linutronix.de
      b0ecd8e8
    • Thomas Gleixner's avatar
      dcf257e9
    • Thomas Gleixner's avatar
      rtc: mc146818: Prevent reading garbage · 05a0302c
      Thomas Gleixner authored
      The MC146818 driver is prone to read garbage from the RTC. There are
      several issues all related to the update cycle of the MC146818. The chip
      increments seconds obviously once per second and indicates that by a bit in
      a register. The bit goes high 244us before the actual update starts. During
      the update the readout of the time values is undefined.
      
      The code just checks whether the update in progress bit (UIP) is set before
      reading the clock. If it's set it waits arbitrary 20ms before retrying,
      which is ample because the maximum update time is ~2ms.
      
      But this check does not guarantee that the UIP bit goes high and the actual
      update happens during the readout. So the following can happen
      
       0.997 	       UIP = False
         -> Interrupt/NMI/preemption
       0.998	       UIP -> True
       0.999	       Readout	<- Undefined
      
      To prevent this rework the code so it checks UIP before and after the
      readout and if set after the readout try again.
      
      But that's not enough to cover the following:
      
       0.997 	       UIP = False
       	       Readout seconds
         -> NMI (or vCPU scheduled out)
       0.998	       UIP -> True
       	       update completes
      	       UIP -> False
       1.000	       Readout	minutes,....
       	       UIP check succeeds
      
      That can make the readout wrong up to 59 seconds.
      
      To prevent this, read the seconds value before the first UIP check,
      validate it after checking UIP and after reading out the rest.
      
      It's amazing that the original i386 code had this actually correct and
      the generic implementation of the MC146818 driver got it wrong in 2002 and
      it stayed that way until today.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201206220541.594826678@linutronix.de
      05a0302c
  3. 07 Dec, 2020 1 commit
    • Niklas Söderlund's avatar
      clocksource/drivers/sh_cmt: Fix potential deadlock when calling runtime PM · 8ae954ca
      Niklas Söderlund authored
      The ch->lock is used to protect the whole enable() and read() of
      sh_cmt's implementation of struct clocksource. The enable()
      implementation calls pm_runtime_get_sync() which may result in the clock
      source to be read() triggering a cyclic lockdep warning for the
      ch->lock.
      
      The sh_cmt driver implement its own balancing of calls to
      sh_cmt_{enable,disable}() with flags in sh_cmt_{start,stop}(). It does
      this to deal with that start and stop are shared between the clock
      source and clock event providers. While this could be improved on
      verifying corner cases based on any substantial rework on all devices
      this driver supports might prove hard.
      
      As a first step separate the PM handling for clock event and clock
      source. Always put/get the device when enabling/disabling the clock
      source but keep the clock event logic unchanged. This allows the sh_cmt
      implementation of struct clocksource to call PM without holding the
      ch->lock and avoiding the deadlock.
      
      Triggering and log of the deadlock warning,
      
        # echo e60f0000.timer > /sys/devices/system/clocksource/clocksource0/current_clocksource
        [   46.948370] ======================================================
        [   46.954730] WARNING: possible circular locking dependency detected
        [   46.961094] 5.10.0-rc6-arm64-renesas-00001-g0e5fd7414e8b #36 Not tainted
        [   46.967985] ------------------------------------------------------
        [   46.974342] migration/0/11 is trying to acquire lock:
        [   46.979543] ffff0000403ed220 (&dev->power.lock){-...}-{2:2}, at: __pm_runtime_resume+0x40/0x74
        [   46.988445]
        [   46.988445] but task is already holding lock:
        [   46.994441] ffff000040ad0298 (&ch->lock){....}-{2:2}, at: sh_cmt_start+0x28/0x210
        [   47.002173]
        [   47.002173] which lock already depends on the new lock.
        [   47.002173]
        [   47.010573]
        [   47.010573] the existing dependency chain (in reverse order) is:
        [   47.018262]
        [   47.018262] -> #3 (&ch->lock){....}-{2:2}:
        [   47.024033]        lock_acquire.part.0+0x120/0x330
        [   47.028970]        lock_acquire+0x64/0x80
        [   47.033105]        _raw_spin_lock_irqsave+0x7c/0xc4
        [   47.038130]        sh_cmt_start+0x28/0x210
        [   47.042352]        sh_cmt_clocksource_enable+0x28/0x50
        [   47.047644]        change_clocksource+0x9c/0x160
        [   47.052402]        multi_cpu_stop+0xa4/0x190
        [   47.056799]        cpu_stopper_thread+0x90/0x154
        [   47.061557]        smpboot_thread_fn+0x244/0x270
        [   47.066310]        kthread+0x154/0x160
        [   47.070175]        ret_from_fork+0x10/0x20
        [   47.074390]
        [   47.074390] -> #2 (tk_core.seq.seqcount){----}-{0:0}:
        [   47.081136]        lock_acquire.part.0+0x120/0x330
        [   47.086070]        lock_acquire+0x64/0x80
        [   47.090203]        seqcount_lockdep_reader_access.constprop.0+0x74/0x100
        [   47.097096]        ktime_get+0x28/0xa0
        [   47.100960]        hrtimer_start_range_ns+0x210/0x2dc
        [   47.106164]        generic_sched_clock_init+0x70/0x88
        [   47.111364]        sched_clock_init+0x40/0x64
        [   47.115853]        start_kernel+0x494/0x524
        [   47.120156]
        [   47.120156] -> #1 (hrtimer_bases.lock){-.-.}-{2:2}:
        [   47.126721]        lock_acquire.part.0+0x120/0x330
        [   47.136042]        lock_acquire+0x64/0x80
        [   47.144461]        _raw_spin_lock_irqsave+0x7c/0xc4
        [   47.153721]        hrtimer_start_range_ns+0x68/0x2dc
        [   47.163054]        rpm_suspend+0x308/0x5dc
        [   47.171473]        rpm_idle+0xc4/0x2a4
        [   47.179550]        pm_runtime_work+0x98/0xc0
        [   47.188209]        process_one_work+0x294/0x6f0
        [   47.197142]        worker_thread+0x70/0x45c
        [   47.205661]        kthread+0x154/0x160
        [   47.213673]        ret_from_fork+0x10/0x20
        [   47.221957]
        [   47.221957] -> #0 (&dev->power.lock){-...}-{2:2}:
        [   47.236292]        check_noncircular+0x128/0x140
        [   47.244907]        __lock_acquire+0x13b0/0x204c
        [   47.253332]        lock_acquire.part.0+0x120/0x330
        [   47.262033]        lock_acquire+0x64/0x80
        [   47.269826]        _raw_spin_lock_irqsave+0x7c/0xc4
        [   47.278430]        __pm_runtime_resume+0x40/0x74
        [   47.286758]        sh_cmt_start+0x84/0x210
        [   47.294537]        sh_cmt_clocksource_enable+0x28/0x50
        [   47.303449]        change_clocksource+0x9c/0x160
        [   47.311783]        multi_cpu_stop+0xa4/0x190
        [   47.319720]        cpu_stopper_thread+0x90/0x154
        [   47.328022]        smpboot_thread_fn+0x244/0x270
        [   47.336298]        kthread+0x154/0x160
        [   47.343708]        ret_from_fork+0x10/0x20
        [   47.351445]
        [   47.351445] other info that might help us debug this:
        [   47.351445]
        [   47.370225] Chain exists of:
        [   47.370225]   &dev->power.lock --> tk_core.seq.seqcount --> &ch->lock
        [   47.370225]
        [   47.392003]  Possible unsafe locking scenario:
        [   47.392003]
        [   47.405314]        CPU0                    CPU1
        [   47.413569]        ----                    ----
        [   47.421768]   lock(&ch->lock);
        [   47.428425]                                lock(tk_core.seq.seqcount);
        [   47.438701]                                lock(&ch->lock);
        [   47.447930]   lock(&dev->power.lock);
        [   47.455172]
        [   47.455172]  *** DEADLOCK ***
        [   47.455172]
        [   47.471433] 3 locks held by migration/0/11:
        [   47.479099]  #0: ffff8000113c9278 (timekeeper_lock){-.-.}-{2:2}, at: change_clocksource+0x2c/0x160
        [   47.491834]  #1: ffff8000113c8f88 (tk_core.seq.seqcount){----}-{0:0}, at: multi_cpu_stop+0xa4/0x190
        [   47.504727]  #2: ffff000040ad0298 (&ch->lock){....}-{2:2}, at: sh_cmt_start+0x28/0x210
        [   47.516541]
        [   47.516541] stack backtrace:
        [   47.528480] CPU: 0 PID: 11 Comm: migration/0 Not tainted 5.10.0-rc6-arm64-renesas-00001-g0e5fd7414e8b #36
        [   47.542147] Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
        [   47.554241] Call trace:
        [   47.560832]  dump_backtrace+0x0/0x190
        [   47.568670]  show_stack+0x14/0x30
        [   47.576144]  dump_stack+0xe8/0x130
        [   47.583670]  print_circular_bug+0x1f0/0x200
        [   47.592015]  check_noncircular+0x128/0x140
        [   47.600289]  __lock_acquire+0x13b0/0x204c
        [   47.608486]  lock_acquire.part.0+0x120/0x330
        [   47.616953]  lock_acquire+0x64/0x80
        [   47.624582]  _raw_spin_lock_irqsave+0x7c/0xc4
        [   47.633114]  __pm_runtime_resume+0x40/0x74
        [   47.641371]  sh_cmt_start+0x84/0x210
        [   47.649115]  sh_cmt_clocksource_enable+0x28/0x50
        [   47.657916]  change_clocksource+0x9c/0x160
        [   47.666165]  multi_cpu_stop+0xa4/0x190
        [   47.674056]  cpu_stopper_thread+0x90/0x154
        [   47.682308]  smpboot_thread_fn+0x244/0x270
        [   47.690560]  kthread+0x154/0x160
        [   47.697927]  ret_from_fork+0x10/0x20
        [   47.708447] clocksource: Switched to clocksource e60f0000.timer
      Signed-off-by: default avatarNiklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Reviewed-by: default avatarGeert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Link: https://lore.kernel.org/r/20201205021921.1456190-2-niklas.soderlund+renesas@ragnatech.se
      8ae954ca
  4. 05 Dec, 2020 3 commits
  5. 03 Dec, 2020 14 commits
  6. 19 Nov, 2020 7 commits
  7. 16 Nov, 2020 2 commits
  8. 15 Nov, 2020 3 commits