    seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE · fb3c5386
    Christian Brauner authored
    This allows the seccomp notifier to continue a syscall. A positive
    discussion about this feature was triggered by a post to the
    ksummit-discuss mailing list (cf. [3]) and took place during KSummit
    (cf. [1]) and again at the containers/checkpoint-restore
    micro-conference at Linux Plumbers.
    
    Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
    which enables a process (watchee) to retrieve an fd for its seccomp
    filter. This fd can then be handed to another (usually more privileged)
    process (watcher). The watcher can then receive seccomp notifications
    about the syscalls performed by the watchee.
    
    This feature is heavily used in some userspace workloads. For example,
    it is currently used to intercept mknod() syscalls in user namespaces,
    i.e. in containers.
    The mknod() syscall can be easily filtered based on dev_t. This allows
    us to only intercept a very specific subset of mknod() syscalls.
    Furthermore, mknod() is not possible at all in user namespaces, so
    accidentally intercepting and denying syscalls that are not on the
    whitelist is not a big deal. The watchee won't notice a difference.
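
    A minimal sketch of what filtering on dev_t could look like in a
    classic BPF seccomp filter. The helper name build_dev_filter() and its
    parameters are hypothetical, the argument load assumes a little-endian
    machine, and a real filter should also check seccomp_data.arch first.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* fallback for older uapi headers */
#endif

#define DEV_FILTER_LEN 6

/*
 * Hypothetical helper: fill f[0..5] with a program that sends syscall `nr`
 * to the notifier only if the low 32 bits of argument `dev_arg` equal
 * `dev`. Every other syscall, and every other device number, is allowed.
 */
void build_dev_filter(struct sock_filter *f, int nr, int dev_arg, __u32 dev)
{
	assert(dev_arg >= 0 && dev_arg < 6);

	/* A = seccomp_data.nr */
	f[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			offsetof(struct seccomp_data, nr));
	/* Not the syscall we care about: skip 3 instructions to f[5]. */
	f[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			nr, 0, 3);
	/* A = low word of the dev_t argument (little-endian assumption). */
	f[2] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			offsetof(struct seccomp_data, args) +
			(size_t)dev_arg * sizeof(__u64));
	/* Device matches: fall through to f[4]; otherwise skip to f[5]. */
	f[3] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			dev, 0, 1);
	f[4] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
			SECCOMP_RET_USER_NOTIF);
	f[5] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
			SECCOMP_RET_ALLOW);
}
```

    For mknod(path, mode, dev) the device number is argument index 2, so a
    caller would pass dev_arg = 2 together with the syscall number.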
    
    In contrast to mknod(), a lot of other syscalls we intercept (e.g.
    setxattr()) cannot be easily filtered like mknod() because they have
    pointer arguments. Additionally, some of them might actually succeed in
    user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
    currently cannot tell seccomp to continue from a user notifier, we are
    stuck with performing all of the syscalls on behalf of the container.
    This is a huge security liability: it is extremely difficult to
    correctly assume all of the necessary privileges of the calling task
    such that the syscall can be successfully emulated without escaping
    other additional security restrictions (think missing CAP_MKNOD for
    mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
    telling seccomp to resume the syscall.
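
    As a rough sketch of how a watcher might use the new flag: the helper
    names resp_continue(), resp_deny() and send_resp() are hypothetical,
    and the fallback define only covers uapi headers that predate this
    change. To my understanding the kernel rejects a response that sets
    the flag together with a non-zero error or val, so both are zeroed.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: let the kernel execute the intercepted syscall. */
void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
	memset(resp, 0, sizeof(*resp)); /* error and val stay 0 */
	resp->id = id;
	resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Hypothetical helper: fail the intercepted syscall with errno `err`. */
void resp_deny(struct seccomp_notif_resp *resp, __u64 id, int err)
{
	assert(err > 0);
	memset(resp, 0, sizeof(*resp));
	resp->id = id;
	resp->error = -err; /* the notifier expects a negative errno */
}

/* Hand the response back to the kernel through the notifier fd. */
int send_resp(int notify_fd, struct seccomp_notif_resp *resp)
{
	return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

    The id must be the one taken from the matching notification received
    with SECCOMP_IOCTL_NOTIF_RECV.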
    
    One thing that came up in the discussion was the problem that another
    thread could change the memory after userspace has decided to let the
    syscall continue. This is a well-known TOCTOU with seccomp that is
    already present in other ways.
    The discussion showed that this feature is already very useful for any
    syscall without pointer arguments. For any accidentally intercepted
    non-pointer syscall it is safe to continue.
    For syscalls with pointer arguments there is a race, but for any
    cautious userspace and the main use cases the race doesn't matter. The
    notifier is intended to be used in a scenario where a more privileged
    watcher supervises the syscalls of a less privileged watchee to allow
    it to get around kernel-enforced limitations by performing the syscall
    for it whenever deemed safe by the watcher. Hence, if a user tricks the
    watcher into allowing a syscall they will either get a deny based on
    kernel-enforced restrictions later or they will have changed the
    arguments in such a way that they manage to perform a syscall with
    arguments that they would've been allowed to use anyway.
    In general, it is good to point out again that the notifier fd was not
    intended to allow userspace to implement a security policy but rather to
    work around kernel security mechanisms in cases where the watcher knows
    that a given action is safe to perform.
    
    /* References */
    [1]: https://linuxplumbersconf.org/event/4/contributions/560
    [2]: https://linuxplumbersconf.org/event/4/contributions/477
    [3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
    [4]: commit 6a21cc50 ("seccomp: add a return code to trap to userspace")
    Co-developed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Reviewed-by: Tycho Andersen <tycho@tycho.ws>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Will Drewry <wad@chromium.org>
    Cc: Tyler Hicks <tyhicks@canonical.com>
    Link: https://lore.kernel.org/r/20190920083007.11475-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook <keescook@chromium.org>