    Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (commit e7989789)
    Linus Torvalds authored
    Pull timers and timekeeping updates from Thomas Gleixner:
    
     - Improve the VDSO build time checks to cover all dynamic relocations
    
       VDSO does not allow dynamic relocations, but the build time check is
       incomplete and fragile.
    
       It's based on architectures specifying the relocation types to search
       for and does not handle R_*_NONE relocation entries correctly.
       R_*_NONE relocations are injected by some GNU ld variants if
       they fail to determine the exact .rel[a].dyn size, leaving
       trailing zeroed entries.
       R_*_NONE relocations must be ignored by dynamic loaders, so they
       should be ignored in the build time check too.
    
       Remove the architecture specific relocation types to check for
       and validate strictly that no relocations other than R_*_NONE
       end up in the VDSO .so file (a standalone sketch of such a
       relocation scan follows this list).
    
     - Prefer signal delivery to the current thread for
       CLOCK_PROCESS_CPUTIME_ID based posix-timers
    
       Such timers prefer to deliver the signal to the main thread of a
       process even if the context in which the timer expires is the current
       task. This has the downside that it might wake up an idle thread.
    
       As there is no requirement or guarantee that the signal has to be
       delivered to the main thread, avoid this by preferring the current
       task if it is part of the thread group which shares sighand.
    
       This not only avoids waking idle threads; it also distributes
       signal delivery better when multiple timers fire in the context
       of different threads close to each other (a sketch of the
       delivery preference follows this list).
    
     - Align the tick period properly (again)
    
       For a long time the tick was starting at CLOCK_MONOTONIC zero,
       which allowed user space applications to either align with the
       tick or to place a periodic computation so that it does not
       interfere with the tick. The alignment of the tick period was
       more by chance than by intention as the tick is set up before a
       high resolution clocksource is installed, i.e. timekeeping is
       still tick based and the tick period advances from there.
    
       The early enablement of sched_clock() broke this alignment as
       the time accumulated by sched_clock() is taken into account when
       timekeeping is initialized. So the base value now(CLOCK_MONOTONIC)
       is no longer a multiple of tick periods, which breaks
       applications that relied on that behaviour.
    
       Cure this by aligning the tick starting point to the next
       multiple of tick periods, i.e. 1000ms/CONFIG_HZ (the rounding
       arithmetic is sketched after this list).
    
     - A set of NOHZ fixes and enhancements:
    
         * Cure the concurrent writer race for idle and IO sleeptime
           statistics
    
           The statistics which are exposed via /proc/stat are updated
           from the CPU local idle exit and remotely by cpufreq, but that
           happens without any form of serialization. As a consequence
           sleeptimes can be accounted twice or worse.
    
           Prevent this by restricting the accumulation writeback to the CPU
           local idle exit and let the remote access compute the accumulated
           value.
    
         * Protect idle/iowait sleep time with a sequence count
    
           Reading idle/iowait sleep time, e.g. from /proc/stat, can race
           with idle exit updates. As a consequence the readout may
           produce random and potentially backwards-going values.
    
           Protect this by a sequence count, which fixes the idle time
           statistics issue, but cannot fix the iowait time problem
           because iowait time accounting races with remote wake ups
           decrementing the remote runqueue's nr_iowait counter. The
           latter is impossible to fix, so the only way to deal with
           that is to document it properly and to remove the assertion
           in the selftest which triggers occasionally due to it (a
           seqcount sketch for the idle statistics follows this list).
    
         * Restructure and reshuffle struct tick_sched for a better
           cache layout, along with some small cleanups
    
     - Implement the missing timer_wait_running() callback for POSIX CPU
       timers
    
       For unknown reasons the introduction of the timer_wait_running()
       callback failed to fix up posix CPU timers, which went unnoticed
       for almost four years.
    
       While initially only targeted to prevent livelocks between a timer
       deletion and the timer expiry function on PREEMPT_RT enabled kernels,
       it turned out that fixing this for mainline is not as trivial as just
       implementing a stub similar to the hrtimer/timer callbacks.
    
       The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled
       systems there is a livelock issue independent of RT.
    
       CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU
       timers out from hard interrupt context to task work, which is handled
       before returning to user space or to a VM. The expiry mechanism moves
       the expired timers to a stack local list head with sighand lock held.
       Once sighand is dropped the task can be preempted and a task which
       wants to delete a timer will spin-wait until the expiry task is
       scheduled back in. In the worst case this will end up in a livelock
       when the preempting task and the expiry task are pinned on the same
       CPU.
    
       The timer wheel has a timer_wait_running() mechanism for RT: it
       uses a per CPU timer-base expiry lock which is held by the
       expiry code, and the task waiting for the timer function to
       complete blocks on that lock.
    
       This does not work in the same way for posix CPU timers as there
       is no timer base, and the expiry of process wide timers can run
       in the context of any task belonging to that process. But the
       concept of waiting on an expiry lock can be reused in a slightly
       different way.
    
       Add a per task mutex to struct posix_cputimers_work, let the
       expiry task hold it across the expiry function and let the
       deleting task which waits for the expiry to complete block on
       the mutex.
    
       In the non-contended case this results in an extra
       mutex_lock()/unlock() pair on both sides.
    
       This avoids spin-waiting on a task which is scheduled out,
       prevents the livelock and cures the problem for RT and !RT
       systems (a pthread sketch of this pattern follows below).
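
    To make the strict VDSO check concrete, here is a hedged standalone
    userspace sketch (not the kernel's actual cmd_vdso_check makefile
    rule): it scans the relocation sections of a 64-bit ELF .so and
    fails on anything other than R_*_NONE, which is relocation type 0
    on all architectures.

      /* Hypothetical checker sketch. Build: cc -o vdso-check vdso-check.c */
      #include <elf.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          struct stat st;
          int fd, i, bad = 0;

          if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st))
              return 1;

          void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
          if (map == MAP_FAILED)
              return 1;

          Elf64_Ehdr *eh = map;
          Elf64_Shdr *sh = (Elf64_Shdr *)((char *)map + eh->e_shoff);

          for (i = 0; i < eh->e_shnum; i++) {
              if ((sh[i].sh_type != SHT_REL && sh[i].sh_type != SHT_RELA) ||
                  !sh[i].sh_entsize)
                  continue;
              /* Elf64_Rel is a prefix of Elf64_Rela, so reading r_info
                 through a Rela pointer is valid for both section types */
              for (Elf64_Xword off = 0; off < sh[i].sh_size;
                   off += sh[i].sh_entsize) {
                  Elf64_Rela *r =
                      (Elf64_Rela *)((char *)map + sh[i].sh_offset + off);
                  if (ELF64_R_TYPE(r->r_info) != 0) {   /* 0 == R_*_NONE */
                      fprintf(stderr, "relocation type %lu found\n",
                              (unsigned long)ELF64_R_TYPE(r->r_info));
                      bad = 1;
                  }
              }
          }
          return bad;
      }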
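
    The delivery preference for CLOCK_PROCESS_CPUTIME_ID timers boils
    down to a simple choice at delivery time. A minimal sketch with
    hypothetical names (the real logic lives in the kernel's signal
    delivery path and operates on struct task_struct):

      #include <stdbool.h>
      #include <stdio.h>

      struct task {
          int tgid;               /* thread group (process) id */
          int tid;                /* thread id */
          bool exiting;
      };

      /* Prefer the task which is running right now, provided it belongs
         to the target thread group; fall back to the group leader. */
      static struct task *pick_delivery_target(struct task *leader,
                                               struct task *current_task)
      {
          if (!current_task->exiting && current_task->tgid == leader->tgid)
              return current_task;    /* no extra wakeup needed */
          return leader;              /* classic behaviour: main thread */
      }

      int main(void)
      {
          struct task leader = { .tgid = 100, .tid = 100 };
          struct task worker = { .tgid = 100, .tid = 101 };

          /* the timer expired while tid 101 was running */
          printf("deliver to tid %d\n",
                 pick_delivery_target(&leader, &worker)->tid);
          return 0;
      }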
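
    The tick alignment cure is plain round-up arithmetic. A small
    sketch, assuming nanosecond timestamps and a hypothetical
    align_to_tick() helper (not the kernel's actual setup code):

      #include <inttypes.h>
      #include <stdio.h>

      #define NSEC_PER_SEC 1000000000ULL

      /* Round the current time up to the next tick period boundary,
         where the period is 1s/HZ (e.g. 4 ms for HZ=250). */
      static uint64_t align_to_tick(uint64_t now_ns, unsigned int hz)
      {
          uint64_t period = NSEC_PER_SEC / hz;

          return ((now_ns + period - 1) / period) * period;
      }

      int main(void)
      {
          /* sched_clock() accumulated 5.3 ms before timekeeping init */
          printf("tick starts at %" PRIu64 " ns\n",
                 align_to_tick(5300000, 250));    /* -> 8000000 */
          return 0;
      }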
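
    The NOHZ sleeptime scheme, writers restricted to the local idle
    exit plus a sequence count for readers, can be sketched in
    userspace C11. Names are hypothetical and the data fields are read
    racily here for brevity; the kernel uses seqcount_t, which handles
    this properly:

      #include <inttypes.h>
      #include <stdatomic.h>
      #include <stdint.h>
      #include <stdio.h>

      struct idle_stats {
          atomic_uint seq;      /* odd while an update is in flight */
          uint64_t sleeptime;   /* accumulated idle time, ns */
          uint64_t idle_entry;  /* entry timestamp while idle, else 0 */
      };

      /* Writer: only ever runs on the CPU owning the stats, at idle exit */
      static void idle_exit(struct idle_stats *s, uint64_t now)
      {
          atomic_fetch_add_explicit(&s->seq, 1, memory_order_acq_rel);
          s->sleeptime += now - s->idle_entry;
          s->idle_entry = 0;
          atomic_fetch_add_explicit(&s->seq, 1, memory_order_acq_rel);
      }

      /* Reader: any CPU, e.g. /proc/stat. Never writes back; an idle
         CPU's in-progress sleep is computed from the entry timestamp. */
      static uint64_t read_sleeptime(struct idle_stats *s, uint64_t now)
      {
          unsigned int seq;
          uint64_t val;

          do {
              seq = atomic_load_explicit(&s->seq, memory_order_acquire);
              val = s->sleeptime;
              if (s->idle_entry)          /* currently idle: add delta */
                  val += now - s->idle_entry;
              atomic_thread_fence(memory_order_acquire);
          } while ((seq & 1) ||
                   seq != atomic_load_explicit(&s->seq, memory_order_relaxed));
          return val;
      }

      int main(void)
      {
          struct idle_stats s = { .sleeptime = 1000000, .idle_entry = 5000000 };

          idle_exit(&s, 9000000);    /* local CPU leaves idle */
          printf("%" PRIu64 "\n", read_sleeptime(&s, 9000000));
          return 0;
      }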
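
    Finally, the per task expiry mutex maps naturally onto a pthread
    sketch (hypothetical names; the kernel embeds the mutex in struct
    posix_cputimers_work). Compile with -pthread:

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      struct cpu_timer {
          pthread_mutex_t expiry_lock;  /* held across the expiry function */
      };

      /* Expiry side: runs the timer function with the lock held. Being
         preempted here no longer livelocks a concurrent deletion. */
      static void *run_expiry(void *arg)
      {
          struct cpu_timer *t = arg;

          pthread_mutex_lock(&t->expiry_lock);
          usleep(1000);                 /* stand-in for the timer function */
          pthread_mutex_unlock(&t->expiry_lock);
          return NULL;
      }

      /* Deletion side: instead of spin-waiting until the expiry task is
         scheduled in again, block on the mutex. In the non-contended
         case this is just the extra lock/unlock pair mentioned above. */
      static void delete_timer(struct cpu_timer *t)
      {
          pthread_mutex_lock(&t->expiry_lock);
          pthread_mutex_unlock(&t->expiry_lock);
          /* expiry has completed; safe to tear the timer down */
      }

      int main(void)
      {
          struct cpu_timer t = { .expiry_lock = PTHREAD_MUTEX_INITIALIZER };
          pthread_t expiry;

          pthread_create(&expiry, NULL, run_expiry, &t);
          usleep(100);          /* give the expiry a chance to take the lock */
          delete_timer(&t);     /* sleeps, not spins, if the expiry runs */
          pthread_join(expiry, NULL);
          puts("deleted after expiry completed");
          return 0;
      }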
    
    * tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      posix-cpu-timers: Implement the missing timer_wait_running callback
      selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
      selftests/proc: Remove idle time monotonicity assertions
      MAINTAINERS: Remove stale email address
      timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
      timers/nohz: Add a comment about broken iowait counter update race
      timers/nohz: Protect idle/iowait sleep time under seqcount
      timers/nohz: Only ever update sleeptime from idle exit
      timers/nohz: Restructure and reshuffle struct tick_sched
      tick/common: Align tick period with the HZ tick.
      selftests/timers/posix_timers: Test delivery of signals across threads
      posix-timers: Prefer delivery of signals to the current thread
      vdso: Improve cmd_vdso_check to check all dynamic relocations