• Andrei Vagin's avatar
    ns: Introduce Time Namespace · 769071ac
    Andrei Vagin authored
    Time Namespace isolates clock values.
    
    The kernel provides access to several clocks CLOCK_REALTIME,
    CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
    
    CLOCK_REALTIME
          System-wide clock that measures real (i.e., wall-clock) time.
    
    CLOCK_MONOTONIC
          Clock that cannot be set and represents monotonic time since
          some unspecified starting point.
    
    CLOCK_BOOTTIME
          Identical to CLOCK_MONOTONIC, except it also includes any time
          that the system is suspended.
    
    For many users, the time namespace means the ability to changes date and
    time in a container (CLOCK_REALTIME). Providing per namespace notions of
    CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
    value.
    
    But in the context of checkpoint/restore functionality, monotonic and
    boottime clocks become interesting. Both clocks are monotonic with
    unspecified starting points. These clocks are widely used to measure time
    slices and set timers. After restoring or migrating processes, it has to be
    guaranteed that they never go backward. In an ideal case, the behavior of
    these clocks should be the same as for a case when a whole system is
    suspended. All this means that it is required to set CLOCK_MONOTONIC and
    CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
    offsets for clocks.
    
    A time namespace is similar to a pid namespace in the way how it is
    created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
    but doesn't set it to the current process. Then all children of the process
    will be born in the new time namespace, or a process can use the setns()
    system call to join a namespace.
    
    This scheme allows setting clock offsets for a namespace, before any
    processes appear in it.
    
    All available clone flags have been used, so CLONE_NEWTIME uses the highest
    bit of CSIGNAL. It means that it can be used only with the unshare() and
    the clone3() system calls.
    
    [ tglx: Adjusted paragraph about clone3() to reality and massaged the
      	changelog a bit. ]
    Co-developed-by: default avatarDmitry Safonov <dima@arista.com>
    Signed-off-by: default avatarAndrei Vagin <avagin@gmail.com>
    Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
    Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    Link: https://criu.org/Time_namespace
    Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
    Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
    
    769071ac
nsproxy.c 7.08 KB