    Commit 74858abb, authored by Linus Torvalds
    Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
    
    Pull checkpoint-restore updates from Christian Brauner:
     "This enables unprivileged checkpoint/restore of processes.
    
      Given that this work has been going on for quite some time, the first
      sentence in this summary is hopefully more exciting than the actual
      final code changes required. Interest in unprivileged
      checkpoint/restore has grown steadily over the last two years, and it
      has been one of the main topics for the combined containers &
      checkpoint/restore microconference since at least 2018 (cf. [1]).
    
      Here are just the three most frequent use-cases that were brought forward:
    
       - The JVM developers are integrating checkpoint/restore into a Java
         VM to significantly decrease the startup time.
    
       - In high-performance computing environments a resource manager will
         typically be distributing jobs where users are always running as
         non-root. Long-running and "large" processes with significant
         startup times are supposed to be checkpointed and restored with
         CRIU.
    
       - Container migration as a non-root user.
    
      In all of these scenarios it is either desirable or required to run
      without CAP_SYS_ADMIN. CRIU, the userspace implementation of
      checkpoint/restore, already has a pull request open that adds support
      for unprivileged checkpoint/restore (cf. [2]).
    
      To enable unprivileged checkpoint/restore, a new dedicated capability,
      CAP_CHECKPOINT_RESTORE, is introduced. This solution was last
      discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
      "Update on Task Migration at Google Using CRIU"), with Adrian and
      Nicolas providing the implementation over the last few months. In
      essence, this allows the CRIU binary to be installed with the
      CAP_CHECKPOINT_RESTORE vfs capability set, thereby enabling
      unprivileged users to restore processes.
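
      As an illustration of the paragraph above, a restore tool could check
      at startup whether it actually holds the new capability. This is a
      minimal sketch, not CRIU's code: the helper names are hypothetical, and
      it assumes CAP_CHECKPOINT_RESTORE is capability number 40 (as in the
      v5.9 uapi header) and that /proc/self/status carries a "CapEff:" line.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: capability number from include/uapi/linux/capability.h (v5.9+). */
#ifndef CAP_CHECKPOINT_RESTORE
#define CAP_CHECKPOINT_RESTORE 40
#endif

/* Return true if capability number `cap` is set in the bitmask `mask`. */
static bool cap_mask_has(uint64_t mask, int cap)
{
	return cap >= 0 && cap < 64 && ((mask >> cap) & 1);
}

/* Parse a "CapEff:\t<hex>" line; return false for any other line. */
static bool parse_capeff_line(const char *line, uint64_t *mask)
{
	unsigned long long v;

	if (strncmp(line, "CapEff:", 7) != 0)
		return false;
	if (sscanf(line + 7, "%llx", &v) != 1)
		return false;
	*mask = v;
	return true;
}

/* Hypothetical helper: does the caller hold CAP_CHECKPOINT_RESTORE (effective)? */
static bool have_checkpoint_restore_cap(void)
{
	char line[256];
	uint64_t mask = 0;
	bool found = false;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return false;
	while (fgets(line, sizeof(line), f)) {
		if (parse_capeff_line(line, &mask)) {
			found = true;
			break;
		}
	}
	fclose(f);
	return found && cap_mask_has(mask, CAP_CHECKPOINT_RESTORE);
}
```

      A tool would call have_checkpoint_restore_cap() once and fall back to
      asking for more privileges (or bail out) when it returns false.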
    
      To make this possible the following permissions are altered:
    
       - Selecting a specific PID via clone3() set_tid relaxed from userns
         CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
    
       - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
         from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
    
       - Accessing /proc/pid/map_files relaxed from init userns
         CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.
    
       - Changing /proc/self/exe relaxed from userns CAP_SYS_ADMIN to userns
         CAP_CHECKPOINT_RESTORE.
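
      The first item in the list above can be sketched from userspace as
      follows. This is an illustrative sketch under stated assumptions, not
      the selftest from this series: fork_with_pid() and the local struct are
      hypothetical names, the struct mirrors only the first 80 bytes of
      struct clone_args so the example does not depend on recent kernel
      headers, and the fallback syscall number 435 is assumed from the
      generic syscall table.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435 /* assumption: number from the generic syscall table */
#endif

/* Local mirror of struct clone_args up to set_tid_size (80 bytes). */
struct clone3_args_v1 {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;      /* pointer to an array of pid_t */
	uint64_t set_tid_size; /* number of entries in that array */
};

/*
 * Fork a child and ask the kernel to give it PID `want` in the current
 * PID namespace. Returns the child's PID in the parent, or -1 with errno
 * set (e.g. EPERM without the capability, EEXIST if the PID is taken).
 */
static pid_t fork_with_pid(pid_t want)
{
	pid_t set_tid[1] = { want };
	struct clone3_args_v1 args = {
		.exit_signal = SIGCHLD,
		.set_tid = (uint64_t)(uintptr_t)set_tid,
		.set_tid_size = 1,
	};
	long ret = syscall(SYS_clone3, &args, sizeof(args));

	if (ret == 0)
		_exit(0); /* child: a real restorer would rebuild the task here */
	return (pid_t)ret;
}
```

      On success the caller is expected to reap the child with waitpid().
      With this series, the call needs CAP_CHECKPOINT_RESTORE (or, as before,
      CAP_SYS_ADMIN) in the user namespace owning the target PID namespace.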
    
      Of these four changes the /proc/self/exe change deserves a few words
      because the reasoning behind even restricting /proc/self/exe changes
      in the first place is just full of historical quirks and tracking this
      down was a questionable version of fun that I'd like to spare others.
    
      In short, it is trivial to change /proc/self/exe as an unprivileged
      user, i.e. without userns CAP_SYS_ADMIN, right now: either via
      ptrace() or by simply intercepting the ELF loader in userspace during
      exec. Nicolas was nice enough to even provide a POC for the latter
      (cf. [3]) to illustrate this fact.
    
      The original patchset which introduced PR_SET_MM_MAP had no
      permission checks around changing the exe link. Its authors also
      argued that it is already trivial to spoof the exe link, which is
      true. The argument brought up against this was that the Tomoyo LSM
      uses the exe link in tomoyo_manager() to detect whether the calling
      process is a policy manager. This caused changing the exe link to be
      guarded by userns CAP_SYS_ADMIN.
    
      All in all this rather seems like a "better guard it with something
      rather than nothing" argument, which imho doesn't qualify as a great
      security policy. Again, the calling process can already spoof the exe
      link, so even if this were security relevant it was broken back then
      and would be broken today. So technically, dropping all permission
      checks around changing the exe link would probably be possible, and
      would send a clearer message to any userspace that relies on
      /proc/self/exe for security reasons that it should stop doing so. For
      now, though, we're only relaxing the exe link permissions from userns
      CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.
    
      There's a final uapi change in here. Changing the exe link used to
      accidentally return EINVAL when the caller lacked the necessary
      permissions instead of the more correct EPERM. This PR contains a
      commit fixing this. I assume that userspace won't notice or care and
      if they do I will revert this commit. But since we are changing the
      permissions anyway it seems like a good opportunity to try this fix.
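
      For userspace that has to run on kernels both before and after this
      fix, the errno change could be handled along these lines. This is a
      hedged sketch only: the enum and function names are hypothetical, and
      it assumes the caller has just attempted to change the exe link via
      prctl() and passes in the return value and errno.

```c
#include <errno.h>

enum exe_link_status {
	EXE_LINK_CHANGED,   /* the call succeeded */
	EXE_LINK_DENIED,    /* EPERM: missing permission (kernels with the fix) */
	EXE_LINK_AMBIGUOUS, /* EINVAL: bad arguments, or denial on older kernels */
	EXE_LINK_FAILED,    /* any other error */
};

/* Classify the outcome of an attempt to change the exe link. */
static enum exe_link_status classify_exe_link_result(int ret, int err)
{
	if (ret == 0)
		return EXE_LINK_CHANGED;
	switch (err) {
	case EPERM:
		return EXE_LINK_DENIED;
	case EINVAL:
		return EXE_LINK_AMBIGUOUS;
	default:
		return EXE_LINK_FAILED;
	}
}
```

      Treating EINVAL as ambiguous rather than as a denial keeps the caller
      correct on older kernels, where the same value covered both cases.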
    
      With these changes merged unprivileged checkpoint/restore will be
      possible and has already been tested by various users"
    
    [1] LPC 2018
         1. "Task Migration at Google Using CRIU"
            https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
         2. "Securely Migrating Untrusted Workloads with CRIU"
            https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
         LPC 2019
         1. "CRIU and the PID dance"
            https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
         2. "Update on Task Migration at Google Using CRIU"
            https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s
    
    [2] https://github.com/checkpoint-restore/criu/pull/1155
    
    [3] https://github.com/nviennot/run_as_exe
    
    * tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
      selftests: add clone3() CAP_CHECKPOINT_RESTORE test
      prctl: exe link permission error changed from -EINVAL to -EPERM
      prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
      proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
      pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
      pid: use checkpoint_restore_ns_capable() for set_tid
      capabilities: Introduce CAP_CHECKPOINT_RESTORE