An error occurred fetching the project authors.
  1. 01 Aug, 2024 1 commit
    • Al Viro's avatar
      protect the fetch of ->fd[fd] in do_dup2() from mispredictions · 8aa37bde
      Al Viro authored
      both callers have verified that fd is not greater than ->max_fds;
      however, misprediction might end up with
              tofree = fdt->fd[fd];
      being speculatively executed.  That's wrong for the same reasons
      why it's wrong in close_fd()/file_close_fd_locked(); the same
      solution applies - array_index_nospec(fd, fdt->max_fds) could differ
      from fd only in case of speculative execution on mispredicted path.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      8aa37bde
  2. 30 May, 2024 1 commit
  3. 15 Apr, 2024 3 commits
  4. 12 Dec, 2023 4 commits
  5. 25 Oct, 2023 1 commit
    • Christian Brauner's avatar
      file, i915: fix file reference for mmap_singleton() · 61d4fb0b
      Christian Brauner authored
      Today we got a report at [1] for rcu stalls on the i915 testsuite in [2]
      due to the conversion of files to SLAB_TYPSSAFE_BY_RCU. Afaict,
      get_file_rcu() goes into an infinite loop trying to carefully verify
      that i915->gem.mmap_singleton hasn't changed - see the splat below.
      
      So I stared at this code to figure out what it actually does. It seems
      that the i915->gem.mmap_singleton pointer itself never had rcu semantics.
      
      The i915->gem.mmap_singleton is replaced in
      file->f_op->release::singleton_release():
      
              static int singleton_release(struct inode *inode, struct file *file)
              {
                      struct drm_i915_private *i915 = file->private_data;
      
                      cmpxchg(&i915->gem.mmap_singleton, file, NULL);
                      drm_dev_put(&i915->drm);
      
                      return 0;
              }
      
      The cmpxchg() is ordered against a concurrent update of
      i915->gem.mmap_singleton from mmap_singleton(). IOW, when
      mmap_singleton() fails to get a reference on i915->gem.mmap_singleton:
      
      While mmap_singleton() does
      
              rcu_read_lock();
              file = get_file_rcu(&i915->gem.mmap_singleton);
              rcu_read_unlock();
      
      it allocates a new file via anon_inode_getfile() and does
      
              smp_store_mb(i915->gem.mmap_singleton, file);
      
      So, then what happens in the case of this bug is that at some point
      fput() is called and drops the file->f_count to zero leaving the pointer
      in i915->gem.mmap_singleton in tact.
      
      Now, there might be delays until
      file->f_op->release::singleton_release() is called and
      i915->gem.mmap_singleton is set to NULL.
      
      Say concurrently another task hits mmap_singleton() and does:
      
              rcu_read_lock();
              file = get_file_rcu(&i915->gem.mmap_singleton);
              rcu_read_unlock();
      
      When get_file_rcu() fails to get a reference via atomic_inc_not_zero()
      it will try the reload from i915->gem.mmap_singleton expecting it to be
      NULL, assuming it has comparable semantics as we expect in
      __fget_files_rcu().
      
      But it hasn't so it reloads the same pointer again, trying the same
      atomic_inc_not_zero() again and doing so until
      file->f_op->release::singleton_release() of the old file has been
      called.
      
      So, in contrast to __fget_files_rcu() here we want to not retry when
      atomic_inc_not_zero() has failed. We only want to retry in case we
      managed to get a reference but the pointer did change on reload.
      
      <3> [511.395679] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      <3> [511.395716] rcu:   Tasks blocked on level-1 rcu_node (CPUs 0-9): P6238
      <3> [511.395934] rcu:   (detected by 16, t=65002 jiffies, g=123977, q=439 ncpus=20)
      <6> [511.395944] task:i915_selftest   state:R  running task     stack:10568 pid:6238  tgid:6238  ppid:1001   flags:0x00004002
      <6> [511.395962] Call Trace:
      <6> [511.395966]  <TASK>
      <6> [511.395974]  ? __schedule+0x3a8/0xd70
      <6> [511.395995]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
      <6> [511.396003]  ? lockdep_hardirqs_on+0xc3/0x140
      <6> [511.396013]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
      <6> [511.396029]  ? get_file_rcu+0x10/0x30
      <6> [511.396039]  ? get_file_rcu+0x10/0x30
      <6> [511.396046]  ? i915_gem_object_mmap+0xbc/0x450 [i915]
      <6> [511.396509]  ? i915_gem_mmap+0x272/0x480 [i915]
      <6> [511.396903]  ? mmap_region+0x253/0xb60
      <6> [511.396925]  ? do_mmap+0x334/0x5c0
      <6> [511.396939]  ? vm_mmap_pgoff+0x9f/0x1c0
      <6> [511.396949]  ? rcu_is_watching+0x11/0x50
      <6> [511.396962]  ? igt_mmap_offset+0xfc/0x110 [i915]
      <6> [511.397376]  ? __igt_mmap+0xb3/0x570 [i915]
      <6> [511.397762]  ? igt_mmap+0x11e/0x150 [i915]
      <6> [511.398139]  ? __trace_bprintk+0x76/0x90
      <6> [511.398156]  ? __i915_subtests+0xbf/0x240 [i915]
      <6> [511.398586]  ? __pfx___i915_live_setup+0x10/0x10 [i915]
      <6> [511.399001]  ? __pfx___i915_live_teardown+0x10/0x10 [i915]
      <6> [511.399433]  ? __run_selftests+0xbc/0x1a0 [i915]
      <6> [511.399875]  ? i915_live_selftests+0x4b/0x90 [i915]
      <6> [511.400308]  ? i915_pci_probe+0x106/0x200 [i915]
      <6> [511.400692]  ? pci_device_probe+0x95/0x120
      <6> [511.400704]  ? really_probe+0x164/0x3c0
      <6> [511.400715]  ? __pfx___driver_attach+0x10/0x10
      <6> [511.400722]  ? __driver_probe_device+0x73/0x160
      <6> [511.400731]  ? driver_probe_device+0x19/0xa0
      <6> [511.400741]  ? __driver_attach+0xb6/0x180
      <6> [511.400749]  ? __pfx___driver_attach+0x10/0x10
      <6> [511.400756]  ? bus_for_each_dev+0x77/0xd0
      <6> [511.400770]  ? bus_add_driver+0x114/0x210
      <6> [511.400781]  ? driver_register+0x5b/0x110
      <6> [511.400791]  ? i915_init+0x23/0xc0 [i915]
      <6> [511.401153]  ? __pfx_i915_init+0x10/0x10 [i915]
      <6> [511.401503]  ? do_one_initcall+0x57/0x270
      <6> [511.401515]  ? rcu_is_watching+0x11/0x50
      <6> [511.401521]  ? kmalloc_trace+0xa3/0xb0
      <6> [511.401532]  ? do_init_module+0x5f/0x210
      <6> [511.401544]  ? load_module+0x1d00/0x1f60
      <6> [511.401581]  ? init_module_from_file+0x86/0xd0
      <6> [511.401590]  ? init_module_from_file+0x86/0xd0
      <6> [511.401613]  ? idempotent_init_module+0x17c/0x230
      <6> [511.401639]  ? __x64_sys_finit_module+0x56/0xb0
      <6> [511.401650]  ? do_syscall_64+0x3c/0x90
      <6> [511.401659]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      <6> [511.401684]  </TASK>
      
      Link: [1]: https://lore.kernel.org/intel-gfx/SJ1PR11MB6129CB39EED831784C331BAFB9DEA@SJ1PR11MB6129.namprd11.prod.outlook.com
      Link: [2]: https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@live@mman.html#dmesg-warnings10963
      Cc: Jann Horn <jannh@google.com>,
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lore.kernel.org/r/20231025-formfrage-watscheln-84526cd3bd7d@braunerSigned-off-by: default avatarChristian Brauner <brauner@kernel.org>
      61d4fb0b
  6. 19 Oct, 2023 2 commits
    • Christian Brauner's avatar
      backing file: free directly · 6cf41fcf
      Christian Brauner authored
      Backing files as used by overlayfs are never installed into file
      descriptor tables and are explicitly documented as such. They aren't
      subject to rcu access conditions like regular files are.
      
      Their lifetime is bound to the lifetime of the overlayfs file, i.e.,
      they're stashed in ovl_file->private_data and go away otherwise. If
      they're set as vma->vm_file - which is their main purpose - then they're
      subject to regular refcount rules and vma->vm_file can't be installed
      into an fdtable after having been set. All in all I don't see any need
      for rcu delay here. So free it directly.
      
      This all hinges on such hybrid beasts to never actually be installed
      into fdtables which - as mentioned before - is not allowed. So add an
      explicit WARN_ON_ONCE() so we catch any case where someone is suddenly
      trying to install one of those things into a file descriptor table so we
      can have a nice long chat with them.
      
      Link: https://lore.kernel.org/r/20231005-sakralbau-wappnen-f5c31755ed70@braunerAcked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      6cf41fcf
    • Christian Brauner's avatar
      file: convert to SLAB_TYPESAFE_BY_RCU · 0ede61d8
      Christian Brauner authored
      In recent discussions around some performance improvements in the file
      handling area we discussed switching the file cache to rely on
      SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based
      freeing for files completely. This is a pretty sensitive change overall
      but it might actually be worth doing.
      
      The main downside is the subtlety. The other one is that we should
      really wait for Jann's patch to land that enables KASAN to handle
      SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this
      exists.
      
      With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times
      which requires a few changes. So it isn't sufficient anymore to just
      acquire a reference to the file in question under rcu using
      atomic_long_inc_not_zero() since the file might have already been
      recycled and someone else might have bumped the reference.
      
      In other words, callers might see reference count bumps from newer
      users. For this reason it is necessary to verify that the pointer is the
      same before and after the reference count increment. This pattern can be
      seen in get_file_rcu() and __files_get_rcu().
      
      In addition, it isn't possible to access or check fields in struct file
      without first aqcuiring a reference on it. Not doing that was always
      very dodgy and it was only usable for non-pointer data in struct file.
      With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a
      reference under rcu or they must hold the files_lock of the fdtable.
      Failing to do either one of this is a bug.
      
      Thanks to Jann for pointing out that we need to ensure memory ordering
      between reallocations and pointer check by ensuring that all subsequent
      loads have a dependency on the second load in get_file_rcu() and
      providing a fixup that was folded into this patch.
      
      Cc: Jann Horn <jannh@google.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      0ede61d8
  7. 19 Aug, 2023 1 commit
  8. 06 Aug, 2023 1 commit
  9. 04 Aug, 2023 2 commits
    • Linus Torvalds's avatar
      file: reinstate f_pos locking optimization for regular files · 79796425
      Linus Torvalds authored
      In commit 20ea1e7d ("file: always lock position for
      FMODE_ATOMIC_POS") we ended up always taking the file pos lock, because
      pidfd_getfd() could get a reference to the file even when it didn't have
      an elevated file count due to threading of other sharing cases.
      
      But Mateusz Guzik reports that the extra locking is actually measurable,
      so let's re-introduce the optimization, and only force the locking for
      directory traversal.
      
      Directories need the lock for correctness reasons, while regular files
      only need it for "POSIX semantics".  Since pidfd_getfd() is about
      debuggers etc special things that are _way_ outside of POSIX, we can
      relax the rules for that case.
      Reported-by: default avatarMateusz Guzik <mjguzik@gmail.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Link: https://lore.kernel.org/linux-fsdevel/20230803095311.ijpvhx3fyrbkasul@f/Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79796425
    • Mateusz Guzik's avatar
      file: mostly eliminate spurious relocking in __range_close · ed192c59
      Mateusz Guzik authored
      Stock code takes a lock trip for every fd in range, but this can be
      trivially avoided and real-world consumers do have plenty of already
      closed cases.
      
      Just booting Debian 12 with a debug printk shows:
      (sh) min 3 max 17 closed 15 empty 0
      (sh) min 19 max 63 closed 31 empty 14
      (sh) min 4 max 63 closed 0 empty 60
      (spawn) min 3 max 63 closed 13 empty 48
      (spawn) min 3 max 63 closed 13 empty 48
      (mount) min 3 max 17 closed 15 empty 0
      (mount) min 19 max 63 closed 32 empty 13
      
      and so on.
      
      While here use more idiomatic naming.
      
      An avoidable relock is left in place to avoid uglifying the code.
      The code was not switched to bitmap traversal for the same reason.
      
      Tested with ltp kernel/syscalls/close_range
      Signed-off-by: default avatarMateusz Guzik <mjguzik@gmail.com>
      Message-Id: <20230727113809.800067-1-mjguzik@gmail.com>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      ed192c59
  10. 24 Jul, 2023 1 commit
    • Christian Brauner's avatar
      file: always lock position for FMODE_ATOMIC_POS · 20ea1e7d
      Christian Brauner authored
      The pidfd_getfd() system call allows a caller with ptrace_may_access()
      abilities on another process to steal a file descriptor from this
      process. This system call is used by debuggers, container runtimes,
      system call supervisors, networking proxies etc. So while it is a
      special interest system call it is used in common tools.
      
      That ability ends up breaking our long-time optimization in fdget_pos(),
      which "knew" that if we had exclusive access to the file descriptor
      nobody else could access it, and we didn't need the lock for the file
      position.
      
      That check for file_count(file) was always fairly subtle - it depended
      on __fdget() not incrementing the file count for single-threaded
      processes and thus included that as part of the rule - but it did mean
      that we didn't need to take the lock in all those traditional unix
      process contexts.
      
      So it's sad to see this go, and I'd love to have some way to re-instate
      the optimization. At the same time, the lock obviously isn't ever
      contended in the case we optimized, so all we were optimizing away is
      the atomics and the cacheline dirtying. Let's see if anybody even
      notices that the optimization is gone.
      
      Link: https://lore.kernel.org/linux-fsdevel/20230724-vfs-fdget_pos-v1-1-a4abfd7103f3@kernel.org/
      Fixes: 8649c322 ("pid: Implement pidfd_getfd syscall")
      Cc: stable@kernel.org
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20ea1e7d
  11. 10 Mar, 2023 1 commit
  12. 31 Oct, 2022 1 commit
    • Jann Horn's avatar
      fs: use acquire ordering in __fget_light() · 7ee47dcf
      Jann Horn authored
      We must prevent the CPU from reordering the files->count read with the
      FD table access like this, on architectures where read-read reordering is
      possible:
      
          files_lookup_fd_raw()
                                        close_fd()
                                        put_files_struct()
          atomic_read(&files->count)
      
      I would like to mark this for stable, but the stable rules explicitly say
      "no theoretical races", and given that the FD table pointer and
      files->count are explicitly stored in the same cacheline, this sort of
      reordering seems quite unlikely in practice...
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      7ee47dcf
  13. 29 Jun, 2022 1 commit
    • Andreas Gruenbacher's avatar
      gfs2: Add glockfd debugfs file · 4480c27c
      Andreas Gruenbacher authored
      When a process has a gfs2 file open, the file is keeping a reference on the
      underlying gfs2 inode, and the inode is keeping the inode's iopen glock held in
      shared mode.  In other words, the process depends on the iopen glock of each
      open gfs2 file.  Expose those dependencies in a new "glockfd" debugfs file.
      
      The new debugfs file contains one line for each gfs2 file descriptor,
      specifying the tgid, file descriptor number, and glock name, e.g.,
      
        1601 6 5/816d
      
      This list is compiled by iterating all tasks on the system using find_ge_pid(),
      and all file descriptors of each task using task_lookup_next_fd_rcu().  To make
      that work from gfs2, export those two functions.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      4480c27c
  14. 05 Jun, 2022 1 commit
  15. 14 May, 2022 2 commits
    • Al Viro's avatar
      Unify the primitives for file descriptor closing · 6319194e
      Al Viro authored
      Currently we have 3 primitives for removing an opened file from descriptor
      table - pick_file(), __close_fd_get_file() and close_fd_get_file().  Their
      calling conventions are rather odd and there's a code duplication for no
      good reason.  They can be unified -
      
      1) have __range_close() cap max_fd in the very beginning; that way
      we don't need separate way for pick_file() to report being past the end
      of descriptor table.
      
      2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
      than returning it via struct file ** argument.  Don't bother with
      (bogus) return value - nobody wants that -ENOENT.
      
      3) make pick_file() return NULL on unopened descriptor - the only caller
      that used to care about the distinction between descriptor past the end
      of descriptor table and finding NULL in descriptor table doesn't give
      a damn after (1).
      
      4) lift ->files_lock out of pick_file()
      
      That actually simplifies the callers, as well as the primitives themselves.
      Code duplication is also gone...
      Reviewed-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      6319194e
    • Gou Hao's avatar
      fs: remove fget_many and fput_many interface · 81132a39
      Gou Hao authored
      These two interface were added in 091141a4 commit,
      but now there is no place to call them.
      
      The only user of fput/fget_many() was removed in commit
      62906e89 ("io_uring: remove file batch-get optimisation").
      
      A user of get_file_rcu_many() were removed in commit
      f0735310 ("init: add an init_dup helper").
      
      And replace atomic_long_sub/add to atomic_long_dec/inc
      can improve performance.
      
      Here are the test results of unixbench:
      
      Cmd: ./Run -c 64 context1
      
      Without patch:
      System Benchmarks Partial Index              BASELINE       RESULT    INDEX
      Pipe-based Context Switching                   4000.0    2798407.0   6996.0
                                                                         ========
      System Benchmarks Index Score (Partial Only)                         6996.0
      
      With patch:
      System Benchmarks Partial Index              BASELINE       RESULT    INDEX
      Pipe-based Context Switching                   4000.0    3486268.8   8715.7
                                                                         ========
      System Benchmarks Index Score (Partial Only)                         8715.7
      Signed-off-by: default avatarGou Hao <gouhao@uniontech.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      81132a39
  16. 30 Mar, 2022 1 commit
    • Linus Torvalds's avatar
      fs: fix fd table size alignment properly · d888c83f
      Linus Torvalds authored
      Jason Donenfeld reports that my commit 1c24a186 ("fs: fd tables have
      to be multiples of BITS_PER_LONG") doesn't work, and the reason is an
      embarrassing brown-paper-bag bug.
      
      Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the
      reason they might not be aligned is because the incoming 'max_fd'
      argument might not be aligned.
      
      But aligining the argument - while simple - will cause a "infinitely
      big" maxfd (eg NR_OPEN_MAX) to just overflow to zero.  Which most
      definitely isn't what we want either.
      
      The obvious fix was always just to do the alignment last, but I had
      moved it earlier just to make the patch smaller and the code look
      simpler.  Duh.  It certainly made _me_ look simple.
      
      Fixes: 1c24a186 ("fs: fd tables have to be multiples of BITS_PER_LONG")
      Reported-and-tested-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Fedor Pchelkin <aissur0002@gmail.com>
      Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Christian Brauner <brauner@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d888c83f
  17. 29 Mar, 2022 1 commit
    • Linus Torvalds's avatar
      fs: fd tables have to be multiples of BITS_PER_LONG · 1c24a186
      Linus Torvalds authored
      This has always been the rule: fdtables have several bitmaps in them,
      and as a result they have to be sized properly for bitmaps.  We walk
      those bitmaps in chunks of 'unsigned long' in serveral cases, but even
      when we don't, we use the regular kernel bitops that are defined to work
      on arrays of 'unsigned long', not on some byte array.
      
      Now, the distinction between arrays of bytes and 'unsigned long'
      normally only really ends up being noticeable on big-endian systems, but
      Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps()
      could be called with an argument that wasn't even a multiple of
      BITS_PER_BYTE.  And then it fails to do the proper copy even on
      little-endian machines.
      
      The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which
      didn't actually sanitize the fdtable size sufficiently, and never made
      sure it had the proper BITS_PER_LONG alignment.
      
      That's partly because the alignment historically came not from having to
      explicitly align things, but simply from previous fdtable sizes, and
      from count_open_files(), which counts the file descriptors by walking
      them one 'unsigned long' word at a time and thus naturally ends up doing
      sizing in the proper 'chunks of unsigned long'.
      
      But with the introduction of close_range(), we now have an external
      source of "this is how many files we want to have", and so
      sane_fdtable_size() needs to do a better job.
      
      This also adds that explicit alignment to alloc_fdtable(), although
      there it is mainly just for documentation at a source code level.  The
      arithmetic we do there to pick a reasonable fdtable size already aligns
      the result sufficiently.
      
      In fact,clang notices that the added ALIGN() in that function doesn't
      actually do anything, and does not generate any extra code for it.
      
      It turns out that gcc ends up confusing itself by combining a previous
      constant-sized shift operation with the variable-sized shift operations
      in roundup_pow_of_two().  And probably due to that doesn't notice that
      the ALIGN() is a no-op.  But that's a (tiny) gcc misfeature that doesn't
      matter.  Having the explicit alignment makes sense, and would actually
      matter on a 128-bit architecture if we ever go there.
      
      This also adds big comments above both functions about how fdtable sizes
      have to have that BITS_PER_LONG alignment.
      
      Fixes: 60997c3d ("close_range: add CLOSE_RANGE_UNSHARE")
      Reported-by: default avatarFedor Pchelkin <aissur0002@gmail.com>
      Reported-by: default avatarAlexey Khoroshilov <khoroshilov@ispras.ru>
      Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/Tested-and-acked-by: default avatarChristian Brauner <brauner@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c24a186
  18. 13 Dec, 2021 1 commit
    • Linus Torvalds's avatar
      fget: clarify and improve __fget_files() implementation · e386dfc5
      Linus Torvalds authored
      Commit 054aa8d4 ("fget: check that the fd still exists after getting
      a ref to it") fixed a race with getting a reference to a file just as it
      was being closed.  It was a fairly minimal patch, and I didn't think
      re-checking the file pointer lookup would be a measurable overhead,
      since it was all right there and cached.
      
      But I was wrong, as pointed out by the kernel test robot.
      
      The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
      quite noticeably.  Admittedly it seems to be a very artificial test:
      doing "poll()" system calls on regular files in a very tight loop in
      multiple threads.
      
      That means that basically all the time is spent just looking up file
      descriptors without ever doing anything useful with them (not that doing
      'poll()' on a regular file is useful to begin with).  And as a result it
      shows the extra "re-check fd" cost as a sore thumb.
      
      Happily, the regression is fixable by just writing the code to loook up
      the fd to be better and clearer.  There's still a cost to verify the
      file pointer, but now it's basically in the noise even for that
      benchmark that does nothing else - and the code is more understandable
      and has better comments too.
      
      [ Side note: this patch is also a classic case of one that looks very
        messy with the default greedy Myers diff - it's much more legible with
        either the patience of histogram diff algorithm ]
      
      Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
      Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Tested-by: default avatarCarel Si <beibei.si@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e386dfc5
  19. 03 Dec, 2021 1 commit
    • Linus Torvalds's avatar
      fget: check that the fd still exists after getting a ref to it · 054aa8d4
      Linus Torvalds authored
      Jann Horn points out that there is another possible race wrt Unix domain
      socket garbage collection, somewhat reminiscent of the one fixed in
      commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK").
      
      See the extended comment about the garbage collection requirements added
      to unix_peek_fds() by that commit for details.
      
      The race comes from how we can locklessly look up a file descriptor just
      as it is in the process of being closed, and with the right artificial
      timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
      Unix domain socket garbage collector could see the reference count
      decrement of the close() happen before fget() took its reference to the
      file and the file was attached onto a new file descriptor.
      
      This is all (intentionally) correct on the 'struct file *' side, with
      RCU lookups and lockless reference counting very much part of the
      design.  Getting that reference count out of order isn't a problem per
      se.
      
      But the garbage collector can get confused by seeing this situation of
      having seen a file not having any remaining external references and then
      seeing it being attached to an fd.
      
      In commit cbcf0112 ("af_unix: fix garbage collect vs MSG_PEEK") the
      fix was to serialize the file descriptor install with the garbage
      collector by taking and releasing the unix_gc_lock.
      
      That's not really an option here, but since this all happens when we are
      in the process of looking up a file descriptor, we can instead simply
      just re-check that the file hasn't been closed in the meantime, and just
      re-do the lookup if we raced with a concurrent close() of the same file
      descriptor.
      Reported-and-tested-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      054aa8d4
  20. 06 Sep, 2021 1 commit
  21. 16 Apr, 2021 1 commit
  22. 02 Apr, 2021 3 commits
    • Christian Brauner's avatar
      file: simplify logic in __close_range() · 03ba0fe4
      Christian Brauner authored
      It never looked too pleasant and it doesn't really buy us anything
      anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the
      current maximum under the lock for it anyway. This also makes the logic
      easier to follow.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      03ba0fe4
    • Christian Brauner's avatar
      file: fix close_range() for unshare+cloexec · 9b5b8722
      Christian Brauner authored
      syzbot reported a bug when putting the last reference to a tasks file
      descriptor table. Debugging this showed we didn't recalculate the
      current maximum fd number for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
      after we unshared the file descriptors table. So max_fd could exceed the
      current fdtable maximum causing us to set excessive bits. As a concrete
      example, let's say the user requested everything from fd 4 to ~0UL to be
      closed and their current fdtable size is 256 with their highest open fd
      being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new
      fdtable which has room for 64 file descriptors since that is the lowest
      fdtable size we accept. But now max_fd will still point to 255 and needs
      to be adjusted. Fix this by retrieving the correct maximum fd value in
      __range_cloexec().
      
      Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Fixes: fec8a6a6 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      9b5b8722
    • Christian Brauner's avatar
      file: let pick_file() tell caller it's done · f49fd6d3
      Christian Brauner authored
      Let pick_file() report back that the fd it was passed exceeded the
      maximum fd in that fdtable. This allows us to simplify the caller of
      this helper because it doesn't need to care anymore whether the passed
      in max_fd is excessive. It can rely on pick_file() telling it that it's
      past the last valid fd.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      f49fd6d3
  23. 01 Feb, 2021 1 commit
  24. 31 Dec, 2020 1 commit
    • Pavel Begunkov's avatar
      kernel/io_uring: cancel io_uring before task works · b1b6b5a3
      Pavel Begunkov authored
      For cancelling io_uring requests it needs either to be able to run
      currently enqueued task_works or having it shut down by that moment.
      Otherwise io_uring_cancel_files() may be waiting for requests that won't
      ever complete.
      
      Go with the first way and do cancellations before setting PF_EXITING and
      so before putting the task_work infrastructure into a transition state
      where task_work_run() would better not be called.
      
      Cc: stable@vger.kernel.org # 5.5+
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b1b6b5a3
  25. 19 Dec, 2020 1 commit
    • Christian Brauner's avatar
      close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC · fec8a6a6
      Christian Brauner authored
      After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
      CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
      When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
      file descriptor table in case their file descriptor table is currently
      shared.
      
      For the case where the caller has requested all file descriptors to be
      actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
      the caller does not need any of the file descriptors anymore and will
      optimize the close operation by only copying all files in the range from
      0 to 3 and no others.
      
      However, if the caller requested CLOSE_RANGE_CLOEXEC together with
      CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
      descriptors so the kernel needs to copy all of them and can't optimize.
      
      The original patch didn't account for this and thus could cause oopses
      as evidenced by the syzbot report because it assumed that all fds had
      been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.
      
      syzbot reported
      ==================================================================
      BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
      BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
      BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
      BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
      Read of size 8 at addr 0000000000000077 by task syz-executor511/8522
      
      CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:120
       __kasan_report mm/kasan/report.c:549 [inline]
       kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
       check_memory_region_inline mm/kasan/generic.c:186 [inline]
       check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
       atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
       filp_close+0x22/0x170 fs/open.c:1274
       close_files fs/file.c:402 [inline]
       put_files_struct fs/file.c:417 [inline]
       put_files_struct+0x1cc/0x350 fs/file.c:414
       exit_files+0x12a/0x170 fs/file.c:435
       do_exit+0xb4f/0x2a00 kernel/exit.c:818
       do_group_exit+0x125/0x310 kernel/exit.c:920
       get_signal+0x428/0x2100 kernel/signal.c:2792
       arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
       handle_signal_work kernel/entry/common.c:147 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
       __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
       syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x447039
      Code: Unable to access opcode bytes at RIP 0x44700f.
      RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
      RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
      RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
      RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
      R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
      ==================================================================
      
      syzbot has tested the proposed patch and the reproducer did not trigger any issue:
      
      Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      
      Tested on:
      
      commit:         10f7cddd selftests/core: add regression test for CLOSE_RAN..
      git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
      kernel config:  https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
      dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
      compiler:       gcc (GCC) 10.1.0-syz 20200507
      
      Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
      Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
      Cc: Giuseppe Scrivano <gscrivan@redhat.com>
      Cc: linux-fsdevel@vger.kernel.org
      Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.comSigned-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      fec8a6a6
  26. 10 Dec, 2020 5 commits