    Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 4f30a60a
    Linus Torvalds authored
    Pull close_range() implementation from Christian Brauner:
     "This adds the close_range() syscall. It efficiently closes a range of
      file descriptors, up to and including all file descriptors of the
      calling task.
    
      This is coordinated with the FreeBSD folks, who have copied our
      version of this syscall and have in the meantime already merged it in
      April 2020:
    
        https://reviews.freebsd.org/D21627
        https://svnweb.freebsd.org/base?view=revision&revision=359836
    
      The syscall originally came up in a discussion around the new mount
      API and making new file descriptor types cloexec by default. During
      this discussion, Al suggested the close_range() syscall.
    
      First, it helps to close all file descriptors of an exec()ing task.
      This can be done safely via (quoting Al's example from [1] verbatim):
    
            /* that exec is sensitive */
            unshare(CLONE_FILES);
            /* we don't want anything past stderr here */
            close_range(3, ~0U);
            execve(....);
    
      The code snippet above is one way of working around the problem that
      file descriptors are not cloexec by default. This is aggravated by the
      fact that we can't just switch them over without massively regressing
      userspace. For a whole class of programs, having an in-kernel method
      of closing all file descriptors is very helpful (e.g. daemons, service
      managers, programming language standard libraries, container managers,
      etc.).
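
The quoted snippet uses the two-argument form from the original proposal; the merged syscall takes a third flags argument. A minimal sketch of a wrapper follows. The helper name close_fd_range() is hypothetical, and the fallback number 436 is an assumption that only holds on architectures using the unified syscall table. The raw syscall(2) interface is used so the sketch does not depend on a libc wrapper being available.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 436 is the close_range() number on architectures that use the
 * unified syscall table. Only used when the libc headers predate v5.9. */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif

/* Hypothetical helper: close every descriptor in [first, last].
 * Returns 0 on success, or -1 with errno set (ENOSYS on kernels that
 * lack close_range()). Flags are 0, so the descriptor table is not
 * unshared first. */
static int close_fd_range(unsigned int first, unsigned int last)
{
    return (int)syscall(SYS_close_range, first, last, 0U);
}
```

With this helper, the call in the quoted example corresponds to close_fd_range(3, ~0U), issued after unshare(CLONE_FILES) and before execve().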
    
      Second, it spares userspace from closing all file descriptors by
      walking /proc/<pid>/fd/* and calling close() on each file descriptor,
      or by other hacks. Judging from various large(ish) userspace code
      bases, this or similar patterns are very common in service managers,
      container runtimes, and programming language runtimes/standard
      libraries such as Python or Rust.
    
      In addition, the syscall works for tasks that do not have procfs
      mounted and on kernels that do not have procfs support compiled in.
      In such situations the only way to make sure that all file
      descriptors are closed is to call close() on each file descriptor up
      to UINT_MAX, or to resort to RLIMIT_NOFILE or OPEN_MAX trickery.
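
For comparison, the userspace pattern that close_range() replaces can be sketched roughly as below. This is an illustration only, not code from any particular project: close_fds_slow() is a hypothetical name, and the bound of 1024 used when the limit is unlimited is an arbitrary assumption.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: close every open descriptor in [first, last]
 * without close_range(). Returns the number of descriptors closed. */
static int close_fds_slow(int first, int last)
{
    int closed = 0;
    DIR *dir = opendir("/proc/self/fd");

    if (dir) {
        int self = dirfd(dir);      /* must stay open while iterating */
        struct dirent *ent;

        while ((ent = readdir(dir)) != NULL) {
            char *end;
            long fd = strtol(ent->d_name, &end, 10);

            if (end == ent->d_name || *end != '\0')
                continue;           /* skips "." and ".." */
            if (fd < first || fd > last || fd == self)
                continue;
            if (close((int)fd) == 0)
                closed++;
        }
        closedir(dir);              /* also releases the directory fd */
        return closed;
    }

    /* No procfs: probe every number below the soft RLIMIT_NOFILE. */
    struct rlimit rl;
    long limit = 1024;              /* arbitrary bound, an assumption */

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        limit = (long)rl.rlim_cur;
    for (long fd = first; fd < limit && fd <= last; fd++)
        if (close((int)fd) == 0)
            closed++;
    return closed;
}
```

The second branch is only a best effort: a descriptor opened before the limit was lowered can sit above the current RLIMIT_NOFILE and be missed, which is part of why an in-kernel method is preferable.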
    
      Based on Linus' suggestion, close_range() also comes with a new flag,
      CLOSE_RANGE_UNSHARE, to more elegantly handle file descriptor dropping
      right before exec. This would usually be expressed in the sequence:
    
            unshare(CLONE_FILES);
            close_range(3, ~0U);
    
      As Linus pointed out, it might be desirable to make this part of
      close_range() itself under a new flag, CLOSE_RANGE_UNSHARE, which is
      especially handy when we're closing all file descriptors above a
      certain threshold.
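
The two-call sequence above collapses into a single call when the flag is used. A minimal sketch, under the same assumptions as any raw-syscall wrapper: close_fd_range_unshared() is a hypothetical name, 436 is assumed to be the syscall number on architectures with the unified table, and the flag value is the one defined in the uapi header <linux/close_range.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_close_range
#define SYS_close_range 436             /* assumption: unified syscall number */
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)   /* value from <linux/close_range.h> */
#endif

/* Hypothetical helper: unshare the descriptor table, then close
 * [first, last] in the private copy. One syscall instead of
 * unshare(CLONE_FILES) followed by close_range().
 * Returns 0 on success, or -1 with errno set. */
static int close_fd_range_unshared(unsigned int first, unsigned int last)
{
    return (int)syscall(SYS_close_range, first, last, CLOSE_RANGE_UNSHARE);
}
```

Calling close_fd_range_unshared(3, ~0U) right before execve() leaves only stdin, stdout, and stderr open in the calling task, without affecting other tasks that shared the original table.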
    
      As always, a test suite is included"
    
    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
      tests: add CLOSE_RANGE_UNSHARE tests
      close_range: add CLOSE_RANGE_UNSHARE
      tests: add close_range() tests
      arch: wire-up close_range()
      open: add close_range()