    kernel: Implement selective syscall userspace redirection · 1446e1df
    Gabriel Krisman Bertazi authored
    Introduce a mechanism to quickly disable/enable syscall handling for a
    specific process and redirect to userspace via SIGSYS.  This is useful
    for processes with parts that require syscall redirection and parts that
    don't, but that need to perform this boundary crossing really fast,
    without paying the cost of a system call to reconfigure syscall handling
    on each boundary transition.  This is particularly important for Windows
    games running over Wine.
    
    The proposed interface looks like this:
    
      prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
    
    The range [<offset>,<offset>+<length>) is a part of the process memory
    map that is allowed to bypass the redirection code and dispatch
    syscalls directly, such that in fast paths a process doesn't need to
    disable the trap, nor does the kernel have to check the selector.  This
    is essential to return from SIGSYS to a blocked area without triggering
    another SIGSYS from rt_sigreturn.
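
The decision described above can be modelled in plain C.  This is only a
userspace sketch of the behaviour as the text describes it, not the kernel
code: struct dispatch_config, would_redirect() and the SELECTOR_* names are
hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the selector values named in the text (assumed 0 and 1). */
#define SELECTOR_OFF 0  /* plays the role of PR_SYS_DISPATCH_OFF */
#define SELECTOR_ON  1  /* plays the role of PR_SYS_DISPATCH_ON  */

/* Hypothetical model of one thread's dispatch configuration. */
struct dispatch_config {
    bool enabled;                   /* turned on by the prctl() call */
    uintptr_t offset;               /* start of the always-allowed range */
    uintptr_t length;               /* size of the always-allowed range */
    const unsigned char *selector;  /* optional; NULL if not supplied */
};

/*
 * Returns true when, under this model, a syscall issued from instruction
 * pointer `ip` would be redirected to userspace via SIGSYS.
 */
bool would_redirect(const struct dispatch_config *cfg, uintptr_t ip)
{
    if (!cfg->enabled)
        return false;

    /*
     * Range check first: inside [offset, offset+length) the selector is
     * never consulted.  Unsigned wraparound makes this a single compare.
     */
    if (ip - cfg->offset < cfg->length)
        return false;

    /* Outside the range, the selector (if any) decides. */
    if (cfg->selector != NULL && *cfg->selector == SELECTOR_OFF)
        return false;

    return true;
}
```

With offset 0x1000 and length 0x100, a syscall issued at 0x10ff is never
redirected, while one at 0x1100 is redirected whenever the selector reads
SELECTOR_ON.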
    
    selector is an optional pointer to a char-sized userspace memory region
    that has a key switch for the mechanism. This key switch is set to
    either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable
    the redirection without calling the kernel.
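
A minimal sketch of how a thread might set this up from userspace follows.
The helper names (enable_dispatch, disable_dispatch, set_redirection) and
the dispatch_selector variable are hypothetical, and the fallback numeric
values are assumptions for headers that predate the feature; the sketch
reuses the PR_SYS_DISPATCH_* names from the text for the selector byte.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; the numeric values are assumptions. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* The char-sized key switch that the kernel reads from user memory. */
static volatile unsigned char dispatch_selector = PR_SYS_DISPATCH_OFF;

/*
 * Enable the mechanism for the calling thread.  Syscalls issued from
 * [offset, offset+length) always go straight to the kernel.  The selector
 * starts out OFF, so nothing is redirected until it is flipped.
 * Returns 0 on success or a negative errno value.
 */
int enable_dispatch(unsigned long offset, unsigned long length)
{
    dispatch_selector = PR_SYS_DISPATCH_OFF;
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
              offset, length, &dispatch_selector) != 0)
        return -errno;
    return 0;
}

/* Turn the mechanism off again for the calling thread. */
int disable_dispatch(void)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
              0UL, 0UL, 0UL) != 0)
        return -errno;
    return 0;
}

/* Toggle redirection with a plain store; no syscall is involved. */
void set_redirection(int on)
{
    dispatch_selector = on ? PR_SYS_DISPATCH_ON : PR_SYS_DISPATCH_OFF;
}
```

A real user would install a SIGSYS handler before calling
set_redirection(1), and would pick a range that covers the code through
which the handler returns; otherwise the first syscall outside the range
raises an unhandled SIGSYS.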
    
    The feature is meant to be set per-thread and it is disabled on
    fork/clone/execv.
    
    Internally, this doesn't add overhead to the syscall hot path, and it
    requires very little per-architecture support.  I avoided using seccomp,
    even though it duplicates some functionality, due to previous feedback
    that maybe it shouldn't mix with seccomp since it is not a security
    mechanism.  And obviously, this should never be considered a security
    mechanism, since any part of the program can bypass it by using the
    syscall dispatcher.
    
    For the sysinfo benchmark, which measures the overhead added to
    executing a native syscall that doesn't require interception, the
    overhead using only the direct dispatcher region to issue syscalls is
    negligible.  The overhead of using the selector is around 40ns for a
    native (unredirected) syscall on my system, and it is (as expected)
    dominated by the supervisor-mode user-address access.  In fact, with
    SMAP off, the overhead is consistently less than 5ns on my test box.

    Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Andy Lutomirski <luto@kernel.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com