Commit 74858abb authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'cap-checkpoint-restore-v5.9' of...

Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull checkpoint-restore updates from Christian Brauner:
 "This enables unprivileged checkpoint/restore of processes.

  Given that this work has been going on for quite some time the first
  sentence in this summary is hopefully more exciting than the actual
  final code changes required. Unprivileged checkpoint/restore has seen
  a frequent increase in interest over the last two years and has thus
  been one of the main topics for the combined containers &
  checkpoint/restore microconference since at least 2018 (cf. [1]).

  Here are just the three most frequent use-cases that were brought forward:

   - The JVM developers are integrating checkpoint/restore into a Java
     VM to significantly decrease the startup time.

   - In high-performance computing environment a resource manager will
     typically be distributing jobs where users are always running as
     non-root. Long-running and "large" processes with significant
     startup times are supposed to be checkpointed and restored with
     CRIU.

   - Container migration as a non-root user.

  In all of these scenarios it is either desirable or required to run
  without CAP_SYS_ADMIN. The userspace implementation of
  checkpoint/restore CRIU already has the pull request for supporting
  unprivileged checkpoint/restore up (cf. [2]).

  To enable unprivileged checkpoint/restore a new dedicated capability
  CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
  discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
  "Update on Task Migration at Google Using CRIU") with Adrian and
  Nicolas providing the implementation now over the last months. In
  essence, this allows the CRIU binary to be installed with the
  CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
  unprivileged users to restore processes.

  To make this possible the following permissions are altered:

   - Selecting a specific PID via clone3() set_tid relaxed from userns
     CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
     from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Accessing /proc/pid/map_files relaxed from init userns
     CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

   - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
     CAP_CHECKPOINT_RESTORE.

  Of these four changes the /proc/self/exe change deserves a few words
  because the reasoning behind even restricting /proc/self/exe changes
  in the first place is just full of historical quirks and tracking this
  down was a questionable version of fun that I'd like to spare others.

  In short, it is trivial to change /proc/self/exe as an unprivileged
  user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
  or by simply intercepting the elf loader in userspace during exec.
  Nicolas was nice enough to even provide a POC for the latter (cf. [3])
  to illustrate this fact.

  The original patchset which introduced PR_SET_MM_MAP had no
  permissions around changing the exe link. They too argued that it is
  trivial to spoof the exe link already which is true. The argument
  brought up against this was that the Tomoyo LSM uses the exe link in
  tomoyo_manager() to detect whether the calling process is a policy
  manager. This caused changing the exe links to be guarded by userns
  CAP_SYS_ADMIN.

  All in all this rather seems like a "better guard it with something
  rather than nothing" argument which imho doesn't qualify as a great
  security policy. Again, because spoofing the exe link is possible for
  the calling process so even if this were security relevant it was
  broken back then and would be broken today. So technically, dropping
  all permissions around changing the exe link would probably be
  possible and would send a clearer message to any userspace that relies
  on /proc/self/exe for security reasons that they should stop doing
  this but for now we're only relaxing the exe link permissions from
  userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

  There's a final uapi change in here. Changing the exe link used to
  accidently return EINVAL when the caller lacked the necessary
  permissions instead of the more correct EPERM. This pr contains a
  commit fixing this. I assume that userspace won't notice or care and
  if they do I will revert this commit. But since we are changing the
  permissions anyway it seems like a good opportunity to try this fix.

  With these changes merged unprivileged checkpoint/restore will be
  possible and has already been tested by various users"

[1] LPC 2018
     1. "Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
     2. "Securely Migrating Untrusted Workloads with CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
     LPC 2019
     1. "CRIU and the PID dance"
         https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
     2. "Update on Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

[2] https://github.com/checkpoint-restore/criu/pull/1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
parents 9ba27414 1d27a0be
...@@ -2189,16 +2189,16 @@ struct map_files_info { ...@@ -2189,16 +2189,16 @@ struct map_files_info {
}; };
/* /*
* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
* symlinks may be used to bypass permissions on ancestor directories in the * to concerns about how the symlinks may be used to bypass permissions on
* path to the file in question. * ancestor directories in the path to the file in question.
*/ */
static const char * static const char *
proc_map_files_get_link(struct dentry *dentry, proc_map_files_get_link(struct dentry *dentry,
struct inode *inode, struct inode *inode,
struct delayed_call *done) struct delayed_call *done)
{ {
if (!capable(CAP_SYS_ADMIN)) if (!checkpoint_restore_ns_capable(&init_user_ns))
return ERR_PTR(-EPERM); return ERR_PTR(-EPERM);
return proc_pid_get_link(dentry, inode, done); return proc_pid_get_link(dentry, inode, done);
......
...@@ -261,6 +261,12 @@ static inline bool bpf_capable(void) ...@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN); return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
} }
static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
{
return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
ns_capable(ns, CAP_SYS_ADMIN);
}
/* audit system wants to get cap info from files as well */ /* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
......
...@@ -408,7 +408,14 @@ struct vfs_ns_cap_data { ...@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
*/ */
#define CAP_BPF 39 #define CAP_BPF 39
#define CAP_LAST_CAP CAP_BPF
/* Allow checkpoint/restore related operations */
/* Allow PID selection during clone3() */
/* Allow writing to ns_last_pid */
#define CAP_CHECKPOINT_RESTORE 40
#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP) #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
......
...@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, ...@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
if (tid != 1 && !tmp->child_reaper) if (tid != 1 && !tmp->child_reaper)
goto out_free; goto out_free;
retval = -EPERM; retval = -EPERM;
if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN)) if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free; goto out_free;
set_tid_size--; set_tid_size--;
} }
......
...@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write, ...@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
struct ctl_table tmp = *table; struct ctl_table tmp = *table;
int ret, next; int ret, next;
if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN)) if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM; return -EPERM;
/* /*
......
...@@ -2007,12 +2007,15 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data ...@@ -2007,12 +2007,15 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
if (prctl_map.exe_fd != (u32)-1) { if (prctl_map.exe_fd != (u32)-1) {
/* /*
* Make sure the caller has the rights to * Check if the current user is checkpoint/restore capable.
* change /proc/pid/exe link: only local sys admin should * At the time of this writing, it checks for CAP_SYS_ADMIN
* be allowed to. * or CAP_CHECKPOINT_RESTORE.
* Note that a user with access to ptrace can masquerade an
* arbitrary program as any executable, even setuid ones.
* This may have implications in the tomoyo subsystem.
*/ */
if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN)) if (!checkpoint_restore_ns_capable(current_user_ns()))
return -EINVAL; return -EPERM;
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd); error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error) if (error)
......
...@@ -27,9 +27,10 @@ ...@@ -27,9 +27,10 @@
"audit_control", "setfcap" "audit_control", "setfcap"
#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \ #define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
"wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf" "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
"checkpoint_restore"
#if CAP_LAST_CAP > CAP_BPF #if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
#error New capability defined, please update COMMON_CAP2_PERMS. #error New capability defined, please update COMMON_CAP2_PERMS.
#endif #endif
......
...@@ -2,3 +2,4 @@ ...@@ -2,3 +2,4 @@
clone3 clone3
clone3_clear_sighand clone3_clear_sighand
clone3_set_tid clone3_set_tid
clone3_cap_checkpoint_restore
# SPDX-License-Identifier: GPL-2.0 # SPDX-License-Identifier: GPL-2.0
CFLAGS += -g -I../../../../usr/include/ CFLAGS += -g -I../../../../usr/include/
LDLIBS += -lcap
TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
clone3_cap_checkpoint_restore
include ../lib.mk include ../lib.mk
// SPDX-License-Identifier: GPL-2.0
/*
* Based on Christian Brauner's clone3() example.
* These tests are assuming to be running in the host's
* PID namespace.
*/
/* capabilities related code based on selftests/bpf/test_verifier.c */
#define _GNU_SOURCE
#include <errno.h>
#include <linux/types.h>
#include <linux/sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <sys/capability.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sched.h>
#include "../kselftest_harness.h"
#include "clone3_selftests.h"
#ifndef MAX_PID_NS_LEVEL
#define MAX_PID_NS_LEVEL 32
#endif
static void child_exit(int ret)
{
fflush(stdout);
fflush(stderr);
_exit(ret);
}
static int call_clone3_set_tid(struct __test_metadata *_metadata,
pid_t *set_tid, size_t set_tid_size)
{
int status;
pid_t pid = -1;
struct clone_args args = {
.exit_signal = SIGCHLD,
.set_tid = ptr_to_u64(set_tid),
.set_tid_size = set_tid_size,
};
pid = sys_clone3(&args, sizeof(struct clone_args));
if (pid < 0) {
TH_LOG("%s - Failed to create new process", strerror(errno));
return -errno;
}
if (pid == 0) {
int ret;
char tmp = 0;
TH_LOG("I am the child, my PID is %d (expected %d)", getpid(), set_tid[0]);
if (set_tid[0] != getpid())
child_exit(EXIT_FAILURE);
child_exit(EXIT_SUCCESS);
}
TH_LOG("I am the parent (%d). My child's pid is %d", getpid(), pid);
if (waitpid(pid, &status, 0) < 0) {
TH_LOG("Child returned %s", strerror(errno));
return -errno;
}
if (!WIFEXITED(status))
return -1;
return WEXITSTATUS(status);
}
static int test_clone3_set_tid(struct __test_metadata *_metadata,
pid_t *set_tid, size_t set_tid_size)
{
int ret;
TH_LOG("[%d] Trying clone3() with CLONE_SET_TID to %d", getpid(), set_tid[0]);
ret = call_clone3_set_tid(_metadata, set_tid, set_tid_size);
TH_LOG("[%d] clone3() with CLONE_SET_TID %d says:%d", getpid(), set_tid[0], ret);
return ret;
}
struct libcap {
struct __user_cap_header_struct hdr;
struct __user_cap_data_struct data[2];
};
static int set_capability(void)
{
cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
struct libcap *cap;
int ret = -1;
cap_t caps;
caps = cap_get_proc();
if (!caps) {
perror("cap_get_proc");
return -1;
}
/* Drop all capabilities */
if (cap_clear(caps)) {
perror("cap_clear");
goto out;
}
cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
cap = (struct libcap *) caps;
/* 40 -> CAP_CHECKPOINT_RESTORE */
cap->data[1].effective |= 1 << (40 - 32);
cap->data[1].permitted |= 1 << (40 - 32);
if (cap_set_proc(caps)) {
perror("cap_set_proc");
goto out;
}
ret = 0;
out:
if (cap_free(caps))
perror("cap_free");
return ret;
}
TEST(clone3_cap_checkpoint_restore)
{
pid_t pid;
int status;
int ret = 0;
pid_t set_tid[1];
test_clone3_supported();
EXPECT_EQ(getuid(), 0)
XFAIL(return, "Skipping all tests as non-root\n");
memset(&set_tid, 0, sizeof(set_tid));
/* Find the current active PID */
pid = fork();
if (pid == 0) {
TH_LOG("Child has PID %d", getpid());
child_exit(EXIT_SUCCESS);
}
ASSERT_GT(waitpid(pid, &status, 0), 0)
TH_LOG("Waiting for child %d failed", pid);
/* After the child has finished, its PID should be free. */
set_tid[0] = pid;
ASSERT_EQ(set_capability(), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
EXPECT_EQ(setgid(65534), 0)
TH_LOG("Failed to setgid(65534)");
ASSERT_EQ(setuid(65534), 0);
set_tid[0] = pid;
/* This would fail without CAP_CHECKPOINT_RESTORE */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
ASSERT_EQ(set_capability(), 0)
TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
}
TEST_HARNESS_MAIN
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment