1. 04 Jul, 2017 1 commit
    • ext4: change fast symlink test to not rely on i_blocks · 407cd7fb
      Tahsin Erdogan authored
      ext4_inode_info->i_data is the storage area for 4 types of data:
      
        a) Extents data
        b) Inline data
        c) Block map
        d) Fast symlink data (symlink length < 60)
      
      Extents data case is positively identified by EXT4_INODE_EXTENTS flag.
      Inline data case is also obvious because of EXT4_INODE_INLINE_DATA
      flag.
      
      Distinguishing c) and d) however requires additional logic. This
      currently relies on i_blocks count. After subtracting external xattr
      block from i_blocks, if it is greater than 0 then we know that some
      data blocks exist, so there must be a block map.
      
      This logic got broken after ea_inode feature was added. That feature
      charges the data blocks of external xattr inodes to the referencing
      inode and so adds them to the i_blocks. To fix this, we could subtract
      ea_inode blocks by iterating through all xattr entries and then check
      whether remaining i_blocks count is zero. Besides being complicated,
      this won't change the fact that the current way of distinguishing
      between c) and d) is fragile.
      
      The alternative solution is to test whether i_size is less than 60 to
      determine fast symlink case. ext4_symlink() uses the same test to decide
      whether to store the symlink in i_data. There is one caveat to address
      before this can work though.
      
      If an inode's i_nlink is zero during eviction, its i_size is set to
      zero and its data is truncated. If the system crashes before the inode
      is removed from the orphan list, orphan cleanup on the next boot may
      find the inode with zero i_size. So, a symlink that had its data stored
      in a block may now appear to be a fast symlink. The solution used in
      this patch is to treat i_size = 0 as a non-fast symlink case. A
      zero-sized symlink is not legal, so the only time this can happen is
      the scenario mentioned above. This is also logically correct because
      an i_size = 0 symlink has no data stored in i_data.
      Suggested-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
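The size-based test described in the message above can be sketched in plain C. This is a minimal illustration, not the kernel code: `is_fast_symlink` and `I_DATA_BYTES` are hypothetical names, and the sketch assumes the caller has already established that the inode is a symlink without the extents or inline-data flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the i_data area in bytes, matching the "< 60" limit in the message. */
#define I_DATA_BYTES 60

/*
 * Hypothetical helper: a symlink is treated as "fast" only when its size is
 * non-zero and small enough to fit in i_data. A size of zero is taken to mean
 * a truncated orphan whose data used to live in a block.
 */
bool is_fast_symlink(uint64_t i_size)
{
    return i_size > 0 && i_size < I_DATA_BYTES;
}
```

The explicit `i_size > 0` term is what covers the orphan-cleanup caveat: without it, a truncated block-backed symlink would be misread as a fast symlink.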
  2. 23 Jun, 2017 9 commits
    • ext4: require key for truncate(2) of encrypted file · 63136858
      Eric Biggers authored
      Currently, filesystems allow truncate(2) on an encrypted file without
      the encryption key.  However, it's impossible to correctly handle the
      case where the size being truncated to is not a multiple of the
      filesystem block size, because that would require decrypting the final
      block, zeroing the part beyond i_size, then encrypting the block.
      
      As other modifications to encrypted file contents are prohibited without
      the key, just prohibit truncate(2) as well, making it fail with ENOKEY.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: don't bother checking for encryption key in ->mmap() · 66e0aaad
      Eric Biggers authored
      Since only an open file can be mmap'ed, and we only allow open()ing an
      encrypted file when its key is available, there is no need to check for
      the key again before permitting each mmap().
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: check return value of kstrtoull correctly in reserved_clusters_store · 1ea1516f
      Chao Yu authored
      kstrtoull() returns 0 on success. However, reserved_clusters_store()
      returns -EINVAL when kstrtoull() returns 0, so updates to the
      reserved_clusters value through sysfs fail.
      
      Fixes: 76d33bca
      Cc: stable@vger.kernel.org # 4.4
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
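The return-value convention at issue can be shown with a userspace sketch. It assumes a stand-in parser, `parse_u64`, that mimics the "0 on success, negative errno on failure" contract of kstrtoull(); both function names are hypothetical and the real sysfs handler does more than this.

```c
#include <errno.h>
#include <stdlib.h>

/* Userspace stand-in for kstrtoull(): returns 0 on success, -EINVAL on failure. */
int parse_u64(const char *s, unsigned long long *out)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(s, &end, 0);
    if (errno != 0 || end == s || *end != '\0')
        return -EINVAL;
    *out = v;
    return 0;
}

/* Hypothetical store handler: a non-zero parser result is the error case. */
int reserved_clusters_store_sketch(const char *buf, unsigned long long *reserved)
{
    unsigned long long val;
    int ret = parse_u64(buf, &val);

    if (ret)        /* testing !ret here would reject every valid input */
        return ret;
    *reserved = val;
    return 0;
}
```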
    • ext4: fix off-by-one fsmap error on 1k block filesystems · 4a495624
      Darrick J. Wong authored
      For 1k-block filesystems, the filesystem starts at block 1, not block 0.
      This fact is recorded in s_first_data_block, so use that to bump up the
      start_fsb before we start querying the filesystem for its space map.
      Without this, ext4/026 fails on 1k block ext4 because various functions
      (notably ext4_get_group_no_and_offset) don't know what to do with an
      fsblock that is "before" the start of the filesystem and return garbage
      results (blockgroup 2^32-1, etc.) that confuse fsmap.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
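Why a block "before" the start of the filesystem yields garbage can be seen from a small arithmetic sketch. The helper names below are hypothetical, and the group calculation is a simplified model of what a function like ext4_get_group_no_and_offset has to compute, not its actual code.

```c
#include <stdint.h>

/* Hypothetical helper: raise a query start to the first data block. */
uint64_t clamp_start_fsb(uint64_t start_fsb, uint64_t first_data_block)
{
    return start_fsb < first_data_block ? first_data_block : start_fsb;
}

/*
 * Simplified block-group lookup. It is only meaningful when
 * block >= first_data_block, because the unsigned subtraction
 * wraps around to a huge value otherwise.
 */
uint64_t group_of(uint64_t block, uint64_t first_data_block,
                  uint64_t blocks_per_group)
{
    return (block - first_data_block) / blocks_per_group;
}
```

On a 1k-block filesystem with first data block 1, clamping a start of 0 up to 1 keeps the subtraction from wrapping.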
    • ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry() · bdddf342
      Theodore Ts'o authored
      Previously, a directory block with a bad checksum was skipped; we
      should be returning EFSBADCRC (aka EBADMSG).
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: return EIO on read error in ext4_find_entry · 6febe6f2
      Khazhismel Kumykov authored
      Previously, a read error would be ignored and we would eventually return
      NULL from ext4_find_entry, which signals "no such file or directory". We
      should be returning EIO.
      Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
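The two ext4_find_entry() changes above share one idea: a lookup should distinguish "name not present" from "could not tell". A toy model of that distinction follows; the types and `find_entry_sketch` are hypothetical, and the real function returns a buffer head or an error pointer rather than an int.

```c
#include <errno.h>
#include <stddef.h>

enum block_state { BLOCK_OK, BLOCK_READ_ERROR, BLOCK_BAD_CSUM };

/* Toy directory block: its read status and whether it holds the searched name. */
struct dir_block {
    enum block_state state;
    int has_name;
};

/*
 * Returns 1 if the name is found, 0 if it is absent, -EIO on a read error,
 * and -EBADMSG (the value the message equates with EFSBADCRC) on a bad checksum.
 */
int find_entry_sketch(const struct dir_block *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (blocks[i].state == BLOCK_READ_ERROR)
            return -EIO;
        if (blocks[i].state == BLOCK_BAD_CSUM)
            return -EBADMSG;
        if (blocks[i].has_name)
            return 1;
    }
    return 0;
}
```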
    • ext4: forbid encrypting root directory · 9ce0151a
      Eric Biggers authored
      Currently it's possible to encrypt all files and directories on an ext4
      filesystem by deleting everything, including lost+found, then setting an
      encryption policy on the root directory.  However, this is incompatible
      with e2fsck because e2fsck expects to find, create, and/or write to
      lost+found and does not have access to any encryption keys.  Especially
      problematic is that if e2fsck can't find lost+found, it will create it
      without regard for whether the root directory is encrypted.  This is
      wrong for obvious reasons, and it causes a later run of e2fsck to
      consider the lost+found directory entry to be corrupted.
      
      Encrypting the root directory may also be of limited use because it is
      the "all-or-nothing" use case, for which dm-crypt can be used instead.
      (By design, encryption policies are inherited and cannot be overridden;
      so the root directory having an encryption policy implies that all files
      and directories on the filesystem have that same encryption policy.)
      
      In any case, encrypting the root directory is broken currently and must
      not be allowed; so start returning an error if userspace requests it.
      For now only do this in ext4, because f2fs and ubifs do not appear to
      have the lost+found requirement.  We could move it into
      fscrypt_ioctl_set_policy() later if desired, though.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
    • ext4: send parallel discards on commit completions · a0154344
      Daeho Jeong authored
      Currently, when an ext4 filesystem is mounted with the '-o discard'
      option, we have to issue all the discard commands for the blocks to be
      deallocated and wait for their completion during the commit complete
      phase. Because this procedure may involve many sequential rounds of
      issuing a discard command and waiting for it, the delay can become
      very long (up to 17.0s in our test), which results in long commit
      delays and fsync() performance degradation.
      
      To reduce this kind of delay, instead of adding callback for each
      extent and handling all of them in a sequential manner on commit phase,
      we instead add a separate list of extents to free to the superblock and
      then process this list at once after transaction commits so that
      we can issue all the discard commands in a parallel manner like XFS
      filesystem.
      
      With this change, discard command handling performs better: the
      worst-case delay of a single commit improved from 17.0s to 4.8s.
      Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Tested-by: Hobin Woo <hobin.woo@samsung.com>
      Tested-by: Kitae Lee <kitae87.lee@samsung.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • ext4: avoid unnecessary stalls in ext4_evict_inode() · 3abb1a0f
      Jan Kara authored
      These days inode reclaim calls evict_inode() only when it has no pages
      in the mapping.  In that case it is not necessary to wait for transaction
      commit in ext4_evict_inode() as there can be no pages waiting to be
      committed.  So avoid unnecessary transaction waiting in that case.
      
      We still have to keep the check for the case where ext4_evict_inode()
      gets called from other paths (e.g. umount) where inode still can have
      some page cache pages.
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  3. 22 Jun, 2017 30 commits
    • ext4: add nombcache mount option · cdb7ee4c
      Tahsin Erdogan authored
      The main purpose of mb cache is to achieve deduplication in
      extended attributes. In use cases where opportunity for deduplication
      is unlikely, it only adds overhead.
      
      Add a mount option to explicitly turn off mb cache.
      Suggested-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: strong binding of xattr inode references · b9fc761e
      Tahsin Erdogan authored
      To verify that a xattr entry is not pointing to the wrong xattr inode,
      we currently check that the target inode has EXT4_EA_INODE_FL flag set and
      also the entry size matches the target inode size.
      
      For stronger validation, also incorporate crc32c hash of the value into
      the e_hash field. This is done regardless of whether the entry lives in
      the inode body or external attribute block.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
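A rough sketch of the three-part validation described above, using a textbook bitwise CRC-32C. The helper `xattr_ref_is_valid` is hypothetical; the seed the kernel uses and the way the value hash is folded into e_hash are not modelled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Hypothetical check: accept the target inode only if it carries the EA flag,
 * its size matches the entry, and the stored value hash matches the data.
 */
int xattr_ref_is_valid(int target_has_ea_flag, size_t entry_size,
                       uint32_t entry_value_hash,
                       const void *value, size_t value_size)
{
    return target_has_ea_flag &&
           entry_size == value_size &&
           entry_value_hash == crc32c(value, value_size);
}
```

The flag and size checks alone would accept any EA inode of the right length; the hash term is what ties the entry to one specific value.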
    • ext4: eliminate xattr entry e_hash recalculation for removes · daf83281
      Tahsin Erdogan authored
      When an extended attribute block is modified, ext4_xattr_hash_entry()
      recalculates e_hash for the entry that is pointed to by s->here. This is
      unnecessary if the modification is to remove an entry.
      
      Currently, if the removed entry is the last one and there are other
      entries remaining, hash calculation targets the just erased entry which
      has been filled with zeroes and effectively does nothing.  If the removed
      entry is not the last one and there are more entries, this time it will
      recalculate hash on the next entry which is totally unnecessary.
      
      Fix these by moving the decision on when to recalculate hash to
      ext4_xattr_set_entry().
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: reserve space for xattr entries/names · 9c6e7853
      Tahsin Erdogan authored
      New ea_inode feature allows putting large xattr values into external
      inodes.  struct ext4_xattr_entry and the attribute name however have to
      remain in the inode extra space or external attribute block.  Once that
      space is exhausted, no further entries can be added.  Some of that space
      could also be used by values that fit in there at the time of addition.
      
      So, a single xattr entry whose value barely fits in the external block
      could prevent further entries being added.
      
      To mitigate the problem, this patch introduces a notion of reserved
      space in the external attribute block that cannot be used by value data.
      This reserve is enforced when ea_inode feature is enabled.  The amount
      of reserve is arbitrarily chosen to be min(block_size/8, 1024).  The
      table below shows how much space is reserved for each block size and the
      guaranteed minimum number of entries that can be placed in the external
      attribute block.
      
      block size     reserved bytes  entries (name length = 16)
       1k            128              3
       2k            256              7
       4k            512             15
       8k            1024            31
      16k            1024            31
      32k            1024            31
      64k            1024            31
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
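The "reserved bytes" column of the table follows directly from the min(block_size/8, 1024) formula in the message. A small sketch, with `xattr_block_reserve` as a hypothetical name:

```c
#include <stddef.h>

/* Reserved bytes in the external attribute block: min(block_size / 8, 1024). */
size_t xattr_block_reserve(size_t block_size)
{
    size_t eighth = block_size / 8;

    return eighth < 1024 ? eighth : 1024;
}
```

For block sizes of 8k and above the cap of 1024 bytes applies, which is why the last four rows of the table are identical.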
    • quota: add get_inode_usage callback to transfer multi-inode charges · 7a9ca53a
      Tahsin Erdogan authored
      Ext4 ea_inode feature allows storing xattr values in external inodes to
      be able to store values that are bigger than a block in size. Ext4 also
      has deduplication support for this type of inode. With deduplication,
      the actual storage waste is eliminated but the users of such inodes are
      still charged full quota for the inodes as if there was no sharing
      happening in the background.
      
      This design requires ext4 to manually charge the users because the
      inodes are shared.
      
      An implication of this is that, if someone calls chown on a file that
      has such references we need to transfer the quota for the file and xattr
      inodes. Current dquot_transfer() function implicitly transfers one inode
      charge. With ea_inode feature, we would like to transfer multiple inode
      charges.
      
      Add get_inode_usage callback which can interrogate the total number of
      inodes that were charged for a given inode.
      
      [ Applied fix from Colin King to make sure the 'ret' variable is
        initialized on the successful return path.  Detected by
        CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jan Kara <jack@suse.cz>
    • ext4: xattr inode deduplication · dec214d0
      Tahsin Erdogan authored
      Ext4 now supports xattr values that are up to 64k in size (vfs limit).
      Large xattr values are stored in external inodes each one holding a
      single value. Once written the data blocks of these inodes are immutable.
      
      The real world use cases are expected to have a lot of value duplication
      such as inherited acls etc. To reduce data duplication on disk, this patch
      implements a deduplicator that allows sharing of xattr inodes.
      
      The deduplication is based on an in-memory hash lookup that is a best
      effort sharing scheme. When a xattr inode is read from disk (i.e.
      getxattr() call), its crc32c hash is added to a hash table. Before
      creating a new xattr inode for a value being set, the hash table is
      checked to see if an existing inode holds an identical value. If such an
      inode is found, the ref count on that inode is incremented. On value
      removal the ref count is decremented and if it reaches zero the inode is
      deleted.
      
      The quota charging for such inodes is manually managed. Every reference
      holder is charged the full size as if there was no sharing happening.
      This is consistent with how xattr blocks are also charged.
      
      [ Fixed up journal credits calculation to handle inline data and the
        rare case where a shared xattr block can get freed when two threads
        race on breaking the xattr block sharing. --tytso ]
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
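The lookup-then-share scheme with reference counts can be modelled with a toy in-memory table. Everything here is hypothetical: the fixed-size table, the names, and the FNV-1a hash standing in for crc32c. Unlike the patch, which populates its hash table when an xattr inode is read, this sketch registers values at creation time to stay short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_EA_INODES 16
#define MAX_VALUE_LEN 64

/* Toy stand-in for an xattr inode: one immutable value plus a ref count. */
struct ea_inode_sketch {
    uint32_t hash;
    size_t len;
    unsigned char value[MAX_VALUE_LEN];
    unsigned int refs;              /* 0 means the slot is free */
};

struct ea_inode_sketch ea_table[MAX_EA_INODES];

/* FNV-1a, used here only as a simple stand-in for the crc32c hash. */
uint32_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/*
 * Take a reference on a slot holding `value`, sharing an existing slot when
 * an identical value is found. Returns the slot index, or -1 if the value is
 * too large or the table is full.
 */
int ea_value_get(const void *value, size_t len)
{
    uint32_t h;
    int free_slot = -1;

    if (len > MAX_VALUE_LEN)
        return -1;
    h = toy_hash(value, len);
    for (int i = 0; i < MAX_EA_INODES; i++) {
        struct ea_inode_sketch *e = &ea_table[i];

        if (e->refs == 0) {
            if (free_slot < 0)
                free_slot = i;
        } else if (e->hash == h && e->len == len &&
                   memcmp(e->value, value, len) == 0) {
            e->refs++;              /* identical value found: share it */
            return i;
        }
    }
    if (free_slot < 0)
        return -1;
    ea_table[free_slot].hash = h;
    ea_table[free_slot].len = len;
    memcpy(ea_table[free_slot].value, value, len);
    ea_table[free_slot].refs = 1;
    return free_slot;
}

/* Drop a reference; a slot whose count reaches zero becomes free ("deleted"). */
void ea_value_put(int slot)
{
    if (ea_table[slot].refs > 0)
        ea_table[slot].refs--;
}
```

The byte comparison after the hash match matters: a hash hit alone only suggests that two values might be identical.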
    • ext4: cleanup transaction restarts during inode deletion · 30a7eb97
      Tahsin Erdogan authored
      During inode deletion, the number of journal credits that will be
      needed is hard to determine.  For that reason we have journal
      extend/restart calls in several places.  Whenever a transaction is
      restarted, filesystem must be in a consistent state because there is
      no atomicity guarantee beyond a restart call.
      
      Add ext4_xattr_ensure_credits() helper function which takes care of
      journal extend/restart logic.  It also handles getting jbd2 write
      access and dirty metadata calls.  This function is called at every
      iteration of handling an ea_inode reference.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add ext4_is_quota_file() · 02749a4c
      Tahsin Erdogan authored
      IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
      also uses it to check whether an inode is for a quota file. The
      distinction currently doesn't matter because quota is disabled only
      for the quota files. When we start disabling quota for other inodes
      in the future, we will want to make the distinction clear.
      
      Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
      we are checking for quota files.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext2, ext4: make mb block cache names more explicit · 47387409
      Tahsin Erdogan authored
      There will be a second mb_cache instance that tracks ea_inodes. Make
      existing names more explicit so that it is clear that they refer to
      xattr block cache.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • mbcache: make mbcache naming more generic · c07dfcb4
      Tahsin Erdogan authored
      Make names more generic so that mbcache usage is not limited to
      block sharing. In a subsequent patch in the series
      ("ext4: xattr inode deduplication"), we start using the mbcache code
      for sharing xattr inodes. With that patch, old mb_cache_entry.e_block
      field could be holding either a block number or an inode number.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: move struct ext4_xattr_inode_array to xattr.h · b6d9029d
      Tahsin Erdogan authored
      Since this is a xattr specific data structure it is cleaner to keep it in
      xattr header file.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: modify ext4_xattr_ino_array to hold struct inode * · 0421a189
      Tahsin Erdogan authored
      Tracking struct inode * rather than the inode number eliminates the
      repeated ext4_xattr_inode_iget() call later. The second call cannot
      fail in practice but still requires explanation when it wants to ignore
      the return value. Avoid the trouble and make things simple.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve journal credit handling in set xattr paths · c1a5d5f6
      Tahsin Erdogan authored
      Both ext4_set_acl() and ext4_set_context() need to be made aware of
      ea_inode feature when it comes to credits calculation.
      
      Also add a sufficient credits check in ext4_xattr_set_handle() right
      after xattr write lock is grabbed. Original credits calculation is done
      outside the lock so there is a possibility that the initially calculated
      credits are not sufficient anymore.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: ext4_xattr_delete_inode() should return accurate errors · 65d30005
      Tahsin Erdogan authored
      In a few places the function returns without trying to pass the actual
      error code to the caller. Fix those.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: retry storing value in external inode with xattr block too · b347e2bc
      Tahsin Erdogan authored
      When value size is <= EXT4_XATTR_MIN_LARGE_EA_SIZE(), and it
      doesn't fit in either inline or xattr block, a second try is made to
      store it in an external inode while storing the entry itself in inline
      area. There should also be an attempt to store the entry in xattr block.
      
      This patch adds a retry loop to do that. It also makes the caller the
      sole decider on whether to store a value in an external inode.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix credits calculation for xattr inode · b3155298
      Tahsin Erdogan authored
      When there is no space for a value in xattr block, it may be stored
      in an xattr inode even if the value length is less than
      EXT4_XATTR_MIN_LARGE_EA_SIZE(). So the current assumption in credits
      calculation is wrong.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix ext4_xattr_cmp() · 7cec1918
      Tahsin Erdogan authored
      When a xattr entry refers to an external inode, the value data is not
      available in the inline area so we should not attempt to read it using
      value offset.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix ext4_xattr_move_to_block() · f6109100
      Tahsin Erdogan authored
      When moving xattr entries from inline area to a xattr block, entries
      that refer to external xattr inodes need special handling because
      value data is not available in the inline area but rather should be
      read from its external inode.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix ext4_xattr_make_inode_space() value size calculation · 9bb21ced
      Tahsin Erdogan authored
      ext4_xattr_make_inode_space() is interested in calculating the inline
      space used in an inode. When a xattr entry refers to an external inode
      the value size indicates the external inode size, not the value size in
      the inline area. Change the function to take this into account.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: ext4_xattr_value_same() should return false for external data · 0bd454c0
      Tahsin Erdogan authored
      ext4_xattr_value_same() is used as a quick optimization in case the new
      xattr value is identical to the previous value. When xattr value is
      stored in a xattr inode the check becomes expensive so it is better to
      just assume that they are not equal.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
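The shortcut described above can be sketched as an early return before the byte comparison. The struct and function names are hypothetical simplifications; the real entry layout stores an offset into the block rather than a pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified xattr entry: either an inline value or a reference to an inode. */
struct xattr_entry_sketch {
    uint32_t value_inum;        /* non-zero: value lives in an external inode */
    size_t value_size;
    const void *inline_value;   /* only valid when value_inum == 0 */
};

/* Returns 1 only when the old value is inline and byte-identical to the new one. */
int xattr_value_same_sketch(const struct xattr_entry_sketch *e,
                            const void *new_value, size_t new_size)
{
    if (e->value_inum != 0)
        return 0;   /* external value: skip the costly read, assume different */
    return e->value_size == new_size &&
           memcmp(e->inline_value, new_value, new_size) == 0;
}
```

Returning "not the same" for external values is always safe here: the cost is a redundant rewrite, never a wrong result.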
    • ext4: add missing le32_to_cpu(e_value_inum) conversions · 990461dd
      Tahsin Erdogan authored
      Two places in code missed converting xattr inode number using
      le32_to_cpu().
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
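For readers unfamiliar with the conversion, on-disk ext4 fields are little-endian, so a raw read is only correct on little-endian CPUs. A portable userspace stand-in for what le32_to_cpu() achieves might look like the following; `le32_decode` is a hypothetical name, and the kernel macro itself operates on a typed __le32 value rather than a byte array.

```c
#include <stdint.h>

/* Decode four little-endian bytes into a host-order 32-bit value. */
uint32_t le32_decode(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```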
    • ext4: clean up ext4_xattr_inode_get() · 90966693
      Tahsin Erdogan authored
      The input and output values of *size parameter are equal on successful
      return from ext4_xattr_inode_get().  On error return, the callers ignore
      the output value so there is no need to update it.
      
      Also check for NULL return from ext4_bread().  If the actual xattr inode
      size happens to be smaller than the expected size, ext4_bread() may
      return NULL which would indicate data corruption.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: change ext4_xattr_inode_iget() signature · bab79b04
      Tahsin Erdogan authored
      In general, kernel functions indicate success/failure through their return
      values. This function returns the status as an output parameter and reserves
      the return value for the inode. Make it follow the general convention.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
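The two calling conventions contrasted in the message can be put side by side. This is an illustrative sketch with hypothetical names and a single hard-coded inode; it does not reproduce the actual parameters of ext4_xattr_inode_iget().

```c
#include <errno.h>
#include <stddef.h>

struct inode_sketch {
    unsigned long ino;
};

/* The only inode this toy lookup knows about. */
struct inode_sketch known_ea_inode = { 12 };

/* Old shape: the inode is the return value, the status an out-parameter. */
struct inode_sketch *ea_iget_old(unsigned long ino, int *err)
{
    if (ino != known_ea_inode.ino) {
        *err = -ENOENT;
        return NULL;
    }
    *err = 0;
    return &known_ea_inode;
}

/* New shape: the status is the return value, the inode an out-parameter. */
int ea_iget_new(unsigned long ino, struct inode_sketch **inode)
{
    if (ino != known_ea_inode.ino)
        return -ENOENT;
    *inode = &known_ea_inode;
    return 0;
}
```

With the second shape a caller can write `err = ea_iget_new(...); if (err) return err;`, which matches the error-handling pattern used by most kernel functions.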
    • ext4: extended attribute value size limit is enforced by vfs · 0eefb107
      Tahsin Erdogan authored
      EXT4_XATTR_MAX_LARGE_EA_SIZE definition in ext4 is currently unused.
      Besides, vfs enforces its own 64k limit which makes the 1MB limit in
      ext4 redundant. Remove it.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix ref counting for ea_inode · 1e7d359d
      Tahsin Erdogan authored
      The ref count on ea_inode is incremented by
      ext4_xattr_inode_orphan_add() which is supposed to be decremented by
      ext4_xattr_inode_array_free(). The decrement is conditioned on whether
      the ea_inode is currently on the orphan list. However, the orphan list
      addition only happens when journaling is enabled. In the non-journaled case,
      we fail to release the ref count causing an error message like below.
      
      "VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.
      Have a nice day..."
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: call journal revoke when freeing ea_inode blocks · ddfa17e4
      Tahsin Erdogan authored
      ea_inode contents are treated as metadata, that's why it is journaled
      during initial writes. Failing to call revoke during freeing could cause
      user data to be overwritten with original ea_inode contents during journal
      replay.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: ea_inode owner should be the same as the inode owner · 9e1ba001
      Tahsin Erdogan authored
      Quota charging is based on the ownership of the inode. Currently, the
      xattr inode owner is set to the caller which may be different from the
      parent inode owner. This is inconsistent with how quota is charged for
      xattr block and regular data block writes.
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: attach jinode after creation of xattr inode · bd3b963b
      Tahsin Erdogan authored
      In data=ordered mode jinode needs to be attached to the xattr inode when
      writing data to it. Attachment normally occurs during file open for regular
      files. Since we are not using file interface to write to the xattr inode,
      the jinode attach needs to be done manually.
      
      Otherwise the following crash occurs in data=ordered mode.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: jbd2_journal_file_inode+0x37/0x110
       PGD 13b3c0067
       P4D 13b3c0067
       PUD 137660067
       PMD 0
      
       Oops: 0000 [#1] SMP
       CPU: 3 PID: 1877 Comm: python Not tainted 4.12.0-rc1+ #749
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       task: ffff88010e368980 task.stack: ffffc90000374000
       RIP: 0010:jbd2_journal_file_inode+0x37/0x110
       RSP: 0018:ffffc90000377980 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff880123b06230 RCX: 0000000000280000
       RDX: 0000000000000006 RSI: 0000000000000000 RDI: ffff88012c8585d0
       RBP: ffffc900003779b0 R08: 0000000000000202 R09: 0000000000000001
       R10: 0000000000000000 R11: 0000000000000400 R12: ffff8801111f81c0
       R13: ffff88013b2b6800 R14: ffffc90000377ab0 R15: 0000000000000001
       FS:  00007f0c99b77740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000000136d91000 CR4: 00000000000006e0
       Call Trace:
        jbd2_journal_inode_add_write+0xe/0x10
        ext4_map_blocks+0x59e/0x620
        ext4_xattr_set_entry+0x501/0x7d0
        ext4_xattr_block_set+0x1b2/0x9b0
        ext4_xattr_set_handle+0x322/0x4f0
        ext4_xattr_set+0x144/0x1a0
        ext4_xattr_user_set+0x34/0x40
        __vfs_setxattr+0x66/0x80
        __vfs_setxattr_noperm+0x69/0x1c0
        vfs_setxattr+0xa2/0xb0
        setxattr+0x12e/0x150
        path_setxattr+0x87/0xb0
        SyS_setxattr+0xf/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: do not set posix acls on xattr inodes · 1b917ed8
      Tahsin Erdogan authored
      We don't need acls on xattr inodes because they are not directly
      accessible from user mode.
      
      Besides, lockdep complains about recursive locking of xattr_sem, as
      seen below.
      
        =============================================
        [ INFO: possible recursive locking detected ]
        4.11.0-rc8+ #402 Not tainted
        ---------------------------------------------
        python/1894 is trying to acquire lock:
         (&ei->xattr_sem){++++..}, at: [<ffffffff804878a6>] ext4_xattr_get+0x66/0x270
      
        but task is already holding lock:
         (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&ei->xattr_sem);
          lock(&ei->xattr_sem);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        3 locks held by python/1894:
         #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff803d829f>] mnt_want_write+0x1f/0x50
         #1:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803dda27>] vfs_setxattr+0x57/0xb0
         #2:  (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
      
        stack backtrace:
        CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         dump_stack+0x67/0x99
         __lock_acquire+0x5f3/0x1830
         lock_acquire+0xb5/0x1d0
         down_read+0x2f/0x60
         ext4_xattr_get+0x66/0x270
         ext4_get_acl+0x43/0x1e0
         get_acl+0x72/0xf0
         posix_acl_create+0x5e/0x170
         ext4_init_acl+0x21/0xc0
         __ext4_new_inode+0xffd/0x16b0
         ext4_xattr_set_entry+0x5ea/0xb70
         ext4_xattr_block_set+0x1b5/0x970
         ext4_xattr_set_handle+0x351/0x5d0
         ext4_xattr_set+0x124/0x180
         ext4_xattr_user_set+0x34/0x40
         __vfs_setxattr+0x66/0x80
         __vfs_setxattr_noperm+0x69/0x1c0
         vfs_setxattr+0xa2/0xb0
         setxattr+0x129/0x160
         path_setxattr+0x87/0xb0
         SyS_setxattr+0xf/0x20
         entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
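      The lockdep report above is keyed on the lock class (&ei->xattr_sem),
      not on a particular lock instance. A minimal userspace sketch of that
      idea follows. It is not kernel code: every toy_* name, the single
      lock class, and the init_acl flag are hypothetical and exist only for
      illustration. It assumes the fix amounts to skipping ACL setup when
      the new inode is an xattr inode, as the commit title suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical lock classes; the checker keys on the class, not the instance. */
enum toy_lock_class { TOY_CLASS_XATTR_SEM, TOY_CLASS_COUNT };

struct toy_lock {
    enum toy_lock_class cls;
    const char *name;
};

/* How many locks of each class the (single) task currently holds. */
static int held_depth[TOY_CLASS_COUNT];

/* Take a lock; return true if a lock of the same class is already held. */
static bool toy_acquire(const struct toy_lock *l)
{
    bool recursive = held_depth[l->cls] > 0;

    if (recursive)
        printf("possible recursive locking detected: %s\n", l->name);
    held_depth[l->cls]++;
    return recursive;
}

static void toy_release(const struct toy_lock *l)
{
    assert(held_depth[l->cls] > 0);
    held_depth[l->cls]--;
}

/*
 * Stand-in for inode creation. When init_acl is true, the ACL lookup takes
 * another lock of the xattr_sem class, mirroring the
 * ext4_init_acl -> ext4_xattr_get part of the trace.
 */
static bool toy_new_inode(const struct toy_lock *inner, bool init_acl)
{
    bool warned = false;

    if (init_acl) {
        warned = toy_acquire(inner);
        toy_release(inner);
    }
    return warned;
}

/* Stand-in for the xattr set path, which already holds an xattr_sem. */
static bool toy_set_xattr(bool skip_acl_for_xattr_inode)
{
    struct toy_lock outer = { TOY_CLASS_XATTR_SEM, "xattr_sem (outer)" };
    struct toy_lock inner = { TOY_CLASS_XATTR_SEM, "xattr_sem (inner)" };
    bool warned;

    toy_acquire(&outer);
    warned = toy_new_inode(&inner, !skip_acl_for_xattr_inode);
    toy_release(&outer);
    return warned;
}
```

      Calling toy_set_xattr(false) reports the nested acquisition, while
      toy_set_xattr(true) never takes the inner lock, so nothing is
      reported. The sketch makes no claim about whether the two kernel lock
      instances can ever be the same inode.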
    • ext4: lock inode before calling ext4_orphan_add() · 0de5983d
      Tahsin Erdogan authored
      ext4_orphan_add() requires the caller to hold the inode lock.
      Add the missing lock statements.
      
       WARNING: CPU: 3 PID: 1806 at fs/ext4/namei.c:2731 ext4_orphan_add+0x4e/0x240
       CPU: 3 PID: 1806 Comm: python Not tainted 4.12.0-rc1+ #746
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       task: ffff880135d466c0 task.stack: ffffc900014b0000
       RIP: 0010:ext4_orphan_add+0x4e/0x240
       RSP: 0018:ffffc900014b3d50 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff8801348fe1f0 RCX: ffffc900014b3c64
       RDX: 0000000000000000 RSI: ffff8801348fe1f0 RDI: ffff8801348fe1f0
       RBP: ffffc900014b3da0 R08: 0000000000000000 R09: ffffffff80e82025
       R10: 0000000000004692 R11: 000000000000468d R12: ffff880137598000
       R13: ffff880137217000 R14: ffff880134ac58d0 R15: 0000000000000000
       FS:  00007fc50f09e740(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00000000008bc2e0 CR3: 00000001375ac000 CR4: 00000000000006e0
       Call Trace:
        ext4_xattr_inode_orphan_add.constprop.19+0x9d/0xf0
        ext4_xattr_delete_inode+0x1c4/0x2f0
        ext4_evict_inode+0x15a/0x7f0
        evict+0xc0/0x1a0
        iput+0x16a/0x270
        do_unlinkat+0x172/0x290
        SyS_unlink+0x11/0x20
        entry_SYSCALL_64_fastpath+0x18/0xad
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
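      The "caller must hold the lock" contract behind the warning above can
      be modelled in userspace as follows. This is a sketch under stated
      assumptions, not the kernel implementation: the toy_* names, the
      boolean lock, and the warning counter are all hypothetical, and the
      real check in ext4_orphan_add() may have additional conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an inode: a lock flag plus orphan-list state. */
struct toy_inode {
    bool locked;
    bool on_orphan_list;
};

/* Number of contract violations seen so far. */
static int toy_warnings;

static void toy_inode_lock(struct toy_inode *inode)
{
    assert(!inode->locked);
    inode->locked = true;
}

static void toy_inode_unlock(struct toy_inode *inode)
{
    assert(inode->locked);
    inode->locked = false;
}

/*
 * Stand-in for ext4_orphan_add(): it expects the caller to hold the inode
 * lock. Like a kernel WARNING, a violation is reported but execution
 * continues.
 */
static void toy_orphan_add(struct toy_inode *inode)
{
    if (!inode->locked) {
        toy_warnings++;
        fprintf(stderr, "WARNING: toy_orphan_add() called without inode lock\n");
    }
    inode->on_orphan_list = true;
}

/* The shape of the fix: take the lock around the call. */
static void toy_orphan_add_locked(struct toy_inode *inode)
{
    toy_inode_lock(inode);
    toy_orphan_add(inode);
    toy_inode_unlock(inode);
}
```

      Calling toy_orphan_add() on an unlocked inode increments
      toy_warnings, while toy_orphan_add_locked() completes without a
      warning and leaves the inode unlocked afterwards.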