1. 06 Nov, 2021 1 commit
  2. 09 Aug, 2021 1 commit
  3. 01 Jun, 2021 1 commit
  4. 22 Apr, 2021 1 commit
  5. 25 Jan, 2021 1 commit
  6. 21 Jan, 2021 1 commit
  7. 01 Dec, 2020 2 commits
  8. 11 Nov, 2020 3 commits
    • vfs: move __sb_{start,end}_write* to fs.h · 9b852342
      Darrick J. Wong authored
      
      Now that we've straightened out the callers, move these three functions
      to fs.h since they're fairly trivial.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • vfs: separate __sb_start_write into blocking and non-blocking helpers · 8a3c84b6
      Darrick J. Wong authored
      
      Break this function into two helpers so that it's obvious that the
      trylock versions return a value that must be checked, and the blocking
      versions don't require that.  While we're at it, clean up the return
      type mismatch.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • vfs: remove lockdep bogosity in __sb_start_write · 22843291
      Darrick J. Wong authored
      __sb_start_write has some weird looking lockdep code that claims to
      exist to handle nested freeze locking requests from xfs.  The code as
      written seems broken -- if we think we hold a read lock on any of the
      higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to
      lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a
      trylock.
      
      However, it's not correct to downgrade a blocking lock attempt to a
      trylock unless the downgrading code or the callers are prepared to deal
      with that situation.  Neither __sb_start_write nor its callers handle
      this at all.  For example:
      
      sb_start_pagefault ignores the return value completely, with the result
      that if xfs_filemap_fault loses a race with a different thread trying to
      fsfreeze, it will proceed without pagefault freeze protection (thereby
      breaking locking rules) and then unlock the pagefault freeze lock that
      it doesn't own on its way out (thereby corrupting the lock state), which
      leads to a system hang shortly afterwards.
      
      Normally, this won't happen because our ownership of a read lock on a
      higher freeze protection level blocks fsfreeze from grabbing a write
      lock on that higher level.  *However*, if lockdep is offline,
      lock_is_held_type unconditionally returns 1, which means that
      percpu_rwsem_is_held returns 1, which means that __sb_start_write
      unconditionally converts blocking freeze lock attempts into trylocks,
      even when we *don't* hold anything that would block a fsfreeze.
      
      Apparently this all held together until 5.10-rc1, when bugs in lockdep
      caused lockdep to shut itself off early in an fstests run, and once
      fstests gets to the "race writes with freezer" tests, kaboom.  This
      might explain the long trail of vanishingly infrequent livelocks in
      fstests after lockdep goes offline that I've never been able to
      diagnose.
      
      We could fix it by spinning on the trylock if wait==true, but AFAICT the
      locking works fine if lockdep is not built at all (and I didn't see any
      complaints running fstests overnight), so remove this snippet entirely.
      
      NOTE: Commit f4b554af in 2015 created the current weird logic (which
      used to exist in a different form in commit 5accdf82 from 2012) in
      __sb_start_write.  XFS solved this whole problem in the late 2.6 era by
      creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't
      grab intwrite freeze protection, thus making lockdep's solution
      unnecessary.  The commit claims that Dave Chinner explained that the
      trylock hack + comment could be removed, but nobody ever did.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
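The failure mode described in 22843291 (a trylock whose failure is ignored, followed by an unconditional unlock) can be reproduced with a toy counter. This is only a sketch under simplifying assumptions: `toy_level` is a hypothetical single-threaded model, not the kernel's percpu rw-semaphore, and it shows only how the lock state is corrupted, not the resulting hang.

```c
#include <stdbool.h>

/* Toy freeze level: a reader count plus a "frozen" flag (hypothetical model). */
struct toy_level {
    int  readers;  /* number of read-side holders */
    bool frozen;   /* true while a freezer holds the write side */
};

/* Take a read-side hold unless the level is frozen. */
static bool toy_read_trylock(struct toy_level *l)
{
    if (l->frozen)
        return false;
    l->readers++;
    return true;
}

static void toy_read_unlock(struct toy_level *l)
{
    l->readers--;
}

/* Buggy caller shape: the trylock result is dropped, then unlock runs anyway. */
static void toy_fault_unchecked(struct toy_level *l)
{
    (void)toy_read_trylock(l);  /* failure is silently ignored */
    /* ... work that assumes freeze protection ... */
    toy_read_unlock(l);         /* may release a hold that was never taken */
}

/* Checked caller shape: unlock only what was actually locked. */
static bool toy_fault_checked(struct toy_level *l)
{
    if (!toy_read_trylock(l))
        return false;
    /* ... protected work ... */
    toy_read_unlock(l);
    return true;
}
```

Running `toy_fault_unchecked()` against a frozen level drives `readers` below zero, which is the toy equivalent of unlocking a freeze lock the caller does not own.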
  9. 24 Sep, 2020 1 commit
  10. 29 May, 2020 1 commit
  11. 09 May, 2020 2 commits
  12. 28 Apr, 2020 1 commit
    • Fix use after free in get_tree_bdev() · dd7bc815
      David Howells authored
      Commit 6fcf0c72, a fix to get_tree_bdev(), put a missing blkdev_put() in
      the wrong place: before a warnf() that displays the bdev under
      consideration rather than after it.
      
      This results in a silent lockup in printk("%pg") called via warnf() from
      get_tree_bdev() under some circumstances when there's a race with the
      blockdev being frozen.  This can be caused by xfstests/tests/generic/085 in
      combination with Lukas Czerner's ext4 mount API conversion patchset.  It
      looks like it ought to occur with other users of get_tree_bdev() such as
      XFS, but apparently doesn't.
      
      Fix this by switching the order of the lines.
      
      Fixes: 6fcf0c72 ("vfs: add missing blkdev_put() in get_tree_bdev()")
      Reported-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Ian Kent <raven@themaw.net>
      cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
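The ordering rule behind dd7bc815 (use the object for the warning first, drop the reference afterwards) can be sketched in user space. This is an illustration only: `toy_bdev`, `toy_bdev_get`, `toy_bdev_put` and `toy_reject_frozen` are hypothetical names, and the message text is made up rather than copied from the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted device handle, freed when the last reference drops. */
struct toy_bdev {
    int  refs;
    char name[16];
};

static struct toy_bdev *toy_bdev_get(const char *name)
{
    struct toy_bdev *b = calloc(1, sizeof(*b));
    if (!b)
        return NULL;
    b->refs = 1;
    /* calloc() zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(b->name, name, sizeof(b->name) - 1);
    return b;
}

static void toy_bdev_put(struct toy_bdev *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0)
        free(b);
}

/* Error path in the fixed order: format the warning, then drop the reference. */
static int toy_reject_frozen(struct toy_bdev *b, char *msg, size_t len)
{
    snprintf(msg, len, "%s: blockdev is frozen", b->name);
    toy_bdev_put(b);  /* b must not be dereferenced after this line */
    return -EBUSY;
}
```

Swapping the two statements in `toy_reject_frozen()` would read `b->name` after the final put, which is the shape of the bug the commit fixes.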
  13. 18 Dec, 2019 1 commit
    • fs: call fsnotify_sb_delete after evict_inodes · 1edc8eb2
      Eric Sandeen authored
      
      When a filesystem is unmounted, we currently call fsnotify_sb_delete()
      before evict_inodes(), which means that fsnotify_unmount_inodes()
      must iterate over all inodes on the superblock looking for any inodes
      with watches.  This is inefficient and can lead to livelocks as it
      iterates over many unwatched inodes.
      
      At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
      the inode out immediately, so anything processed by
      fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.
      
      After that, the call to evict_inodes will evict everything else with a
      zero refcount.
      
      This should speed things up overall, and avoid livelocks in
      fsnotify_unmount_inodes().
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 10 Oct, 2019 1 commit
  15. 06 Sep, 2019 1 commit
  16. 05 Sep, 2019 3 commits
  17. 30 Aug, 2019 1 commit
  18. 13 Aug, 2019 1 commit
    • fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl · 22d94f49
      Eric Biggers authored
      Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY.  This ioctl adds an
      encryption key to the filesystem's fscrypt keyring ->s_master_keys,
      making any files encrypted with that key appear "unlocked".
      
      Why we need this
      ~~~~~~~~~~~~~~~~
      
      The main problem is that the "locked/unlocked" (ciphertext/plaintext)
      status of encrypted files is global, but the fscrypt keys are not.
      fscrypt only looks for keys in the keyring(s) the process accessing the
      filesystem is subscribed to: the thread keyring, process keyring, and
      session keyring, where the session keyring may contain the user keyring.
      
      Therefore, userspace has to put fscrypt keys in the keyrings for
      individual users or sessions.  But this means that when a process with a
      different keyring tries to access encrypted files, whether they appear
      "unlocked" or not is nondeterministic.  This is because it depends on
      whether the files are currently present in the inode cache.
      
      Fixing this by consistently providing each process its own view of the
      filesystem depending on whether it has the key or not isn't feasible due
      to how the VFS caches work.  Furthermore, while sometimes users expect
      this behavior, it is misguided for two reasons.  First, it would be an
      OS-level access control mechanism largely redundant with existing access
      control mechanisms such as UNIX file permissions, ACLs, LSMs, etc.
      Encryption is actually for protecting the data at rest.
      
      Second, almost all users of fscrypt actually do need the keys to be
      global.  The largest users of fscrypt, Android and Chromium OS, achieve
      this by having PID 1 create a "session keyring" that is inherited by
      every process.  This works, but it isn't scalable because it prevents
      session keyrings from being used for any other purpose.
      
      On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't
      similarly abuse the session keyring, so to make 'sudo' work on all
      systems it has to link all the user keyrings into root's user keyring
      [2].  This is ugly and raises security concerns.  Moreover it can't make
      the keys available to system services, such as sshd trying to access the
      user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to
      read certificates from the user's home directory (see [5]); or to Docker
      containers (see [6], [7]).
      
      By having an API to add a key to the *filesystem* we'll be able to fix
      the above bugs, remove userspace workarounds, and clearly express the
      intended semantics: the locked/unlocked status of an encrypted directory
      is global, and encryption is orthogonal to OS-level access control.
      
      Why not use the add_key() syscall
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      We use an ioctl for this API rather than the existing add_key() system
      call because the ioctl gives us the flexibility needed to implement
      fscrypt-specific semantics that will be introduced in later patches:
      
      - Supporting key removal with the semantics such that the secret is
        removed immediately and any unused inodes using the key are evicted;
        also, the eviction of any in-use inodes can be retried.
      
      - Calculating a key-dependent cryptographic identifier and returning it
        to userspace.
      
      - Allowing keys to be added and removed by non-root users, but only keys
        for v2 encryption policies; and to prevent denial-of-service attacks,
        users can only remove keys they themselves have added, and a key is
        only really removed after all users who added it have removed it.
      
      Trying to shoehorn these semantics into the keyrings syscalls would be
      very difficult, whereas the ioctls make things much easier.
      
      However, to reuse code the implementation still uses the keyrings
      service internally.  Thus we get lockless RCU-mode key lookups without
      having to re-implement it, and the keys automatically show up in
      /proc/keys for debugging purposes.
      
      References:
      
          [1] https://github.com/google/fscrypt
          [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb
          [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939
          [4] https://github.com/google/fscrypt/issues/116
          [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715
          [6] https://github.com/google/fscrypt/issues/128
          [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem
      
      Reviewed-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
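From userspace, the ioctl introduced by 22d94f49 might be called roughly as follows. This is a hedged sketch, assuming the UAPI as exposed by `<linux/fscrypt.h>` in kernel headers 5.4 or later (`struct fscrypt_add_key_arg`, `FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER`), which includes pieces the commit message says arrive in later patches. The wrapper name `add_fscrypt_key` is hypothetical.

```c
#include <assert.h>
#include <linux/fscrypt.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Add a raw key to the filesystem containing the open descriptor fd.
 * Returns 0 and copies the kernel-computed identifier to id_out on success,
 * or -1 on failure.
 */
static int add_fscrypt_key(int fd, const uint8_t *raw, uint32_t raw_size,
                           uint8_t id_out[FSCRYPT_KEY_IDENTIFIER_SIZE])
{
    struct fscrypt_add_key_arg *arg;
    int ret;

    assert(raw != NULL && id_out != NULL);

    /* The raw key bytes follow the fixed-size header in one allocation. */
    arg = calloc(1, sizeof(*arg) + raw_size);
    if (!arg)
        return -1;

    /* Identifier-type specifier: the kernel fills in u.identifier on success. */
    arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
    arg->raw_size = raw_size;
    memcpy(arg->raw, raw, raw_size);

    ret = ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg);
    if (ret == 0)
        memcpy(id_out, arg->key_spec.u.identifier,
               FSCRYPT_KEY_IDENTIFIER_SIZE);

    /* Best-effort wipe; a compiler may elide a memset() before free(). */
    memset(arg, 0, sizeof(*arg) + raw_size);
    free(arg);
    return ret == 0 ? 0 : -1;
}
```

The descriptor passed as `fd` can refer to any file or directory on the target filesystem, typically the mount point opened read-only. Because the key is attached to the filesystem rather than to a process keyring, every process then sees the same "unlocked" state.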
  19. 31 Jul, 2019 1 commit
  20. 05 Jul, 2019 2 commits
  21. 25 May, 2019 9 commits
    • vfs: Kill sget_userns() · 023d066a
      David Howells authored
      
      Kill sget_userns(), folding it into sget() as that's the only remaining
      user.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-fsdevel@vger.kernel.org
    • vfs: Provide sb->s_iflags settings in fs_context struct · c80fa7c8
      David Howells authored
      
      Provide a field in the fs_context struct through which bits in the
      sb->s_iflags superblock field can be set.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-fsdevel@vger.kernel.org
    • move mount_capable() further out · c3aabf07
      Al Viro authored
      
      Call graph of vfs_get_tree():
      	vfs_fsconfig_locked()	# neither kernmount, nor submount
      	do_new_mount()		# neither kernmount, nor submount
      	fc_mount()
      		afs_mntpt_do_automount()	# submount
      		mount_one_hugetlbfs()		# kernmount
      		pid_ns_prepare_proc()		# kernmount
      		mq_create_mount()		# kernmount
      		vfs_kern_mount()
      			simple_pin_fs()		# kernmount
      			vfs_submount()		# submount
      			kern_mount()		# kernmount
      			init_mount_tree()
      			btrfs_mount()
      			nfs_do_root_mount()
      
      	The first two need the check (unconditionally).
      init_mount_tree() is setting rootfs up; any capability
      checks make zero sense for that one.  And btrfs_mount()/
      nfs_do_root_mount() have the checks already done in their
      callers.
      
      	IOW, we can shift mount_capable() handling into
      the two callers - one in the normal case of mount(2),
      another - in fsconfig(2) handling of FSCONFIG_CMD_CREATE.
      I.e. the syscalls that set a new filesystem up.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • move mount_capable() calls to vfs_get_tree() · 059338aa
      Al Viro authored
      
      sget_fc() is called only from ->get_tree() instances and
      the only instance not calling it is legacy_get_tree(),
      which calls mount_capable() directly.
      
      In all sget_fc() callers the checks could be moved to the
      very beginning of ->get_tree() - ->user_ns is not changed
      in between.  So lifting the checks to the only caller of
      ->get_tree() is OK.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • switch mount_capable() to fs_context · 20284ab7
      Al Viro authored
      
      	now both callers of mount_capable() have access to fs_context;
      the only difference is that for sget_fc() we have the possibility
      of fc->global being true, while for legacy_get_tree() it's guaranteed
      to be impossible.  Unify to more generic variant...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • move the capability checks from sget_userns() to legacy_get_tree() · 2527b284
      Al Viro authored
      
      1) all call chains leading to sget_userns() pass through ->mount()
      instances.
      2) none of ->mount() instances is ever called directly - the only
      call site is legacy_get_tree()
      3) all remaining ->mount() instances end up calling sget_userns()
      
      IOW, we might as well do the capability checks just before calling
      ->mount().  As for the arguments passed to mount_capable(),
      in case of call chains to sget_userns() going through sget(),
      we either don't call mount_capable() at all, or pass current_user_ns()
      to it.  The call chains going through mount_pseudo_xattr() don't
      call mount_capable() at all (SB_KERNMOUNT in flags on those).
      
      That could've been split into smaller steps (lifting the checks
      into sget(), then callers of sget(), then all the way to the
      entries of every ->mount() out there, then to the sole caller),
      but that would be too much churn for little benefit...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: Kill mount_ns() · bb7b6b2b
      David Howells authored
      
      Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount
      API.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • consolidate the capability checks in sget_{fc,userns}() · 0ce0cf12
      Al Viro authored
      
      ... into a common helper - mount_capable(type, userns)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
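The commit text for 0ce0cf12 names only the helper's signature, `mount_capable(type, userns)`. A toy model of what such a consolidated check might look like is sketched below. It assumes the usual rule that a filesystem flagged as mountable from user namespaces accepts CAP_SYS_ADMIN held in the target namespace, while every other filesystem requires it in the initial namespace. All `toy_` names are hypothetical, and the two booleans stand in for the kernel's `capable()` and `ns_capable()` results.

```c
#include <stdbool.h>

/* Models the FS_USERNS_MOUNT opt-in flag (value chosen for this sketch). */
#define TOY_FS_USERNS_MOUNT 0x1u

struct toy_fs_type {
    unsigned int fs_flags;
};

struct toy_creds {
    bool admin_in_init_ns;    /* models capable(CAP_SYS_ADMIN) */
    bool admin_in_target_ns;  /* models ns_capable(userns, CAP_SYS_ADMIN) */
};

/* One helper replaces the per-caller copies of the permission check. */
static bool toy_mount_capable(const struct toy_fs_type *type,
                              const struct toy_creds *creds)
{
    if (!(type->fs_flags & TOY_FS_USERNS_MOUNT))
        return creds->admin_in_init_ns;
    return creds->admin_in_target_ns;
}
```

With a single helper, callers differ only in whether they invoke it at all (kernel-internal and submounts skip it), which is what lets the later commits in this series hoist the call outward.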
    • start massaging the checks in sget_...(): move to sget_userns() · feb8ae43
      Al Viro authored
      
      there are 3 remaining callers of sget_userns() - sget(), mount_ns()
      and mount_pseudo_xattr().  Extra check in sget() is conditional
      upon mount being neither KERNMOUNT nor SUBMOUNT, the identical one
      in mount_ns() - upon being not KERNMOUNT; mount_pseudo_xattr()
      has no such checks at all.
      
      However, mount_ns() is never used with SUBMOUNT and mount_pseudo_xattr()
      is used only for KERNMOUNT, so both would be fine with the same logic
      as currently done in sget(), which allows consolidating the entire thing
      in sget_userns() itself.
      
      That's not where these checks will end up in the long run, though -
      the whole reason why they'd been done so deep in the bowels of
      mount(2) was that there had been no way for a filesystem to specify
      which userns to look at until it has entered ->mount().
      
      Now there is a place where filesystem could override the defaults -
      ->init_fs_context().  Which allows to pull the checks out into
      the callers of vfs_get_tree().  That'll take quite a bit of
      massage, but that mess is possible to tease apart.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 29 Apr, 2019 1 commit
  23. 28 Feb, 2019 2 commits
  24. 30 Jan, 2019 1 commit