1. 25 Jan, 2020 5 commits
    • Dmitry Monakhov's avatar
      ext4: fix extent_status trace points · 52144d89
      Dmitry Monakhov authored
      Show pblock only if it has meaningful value.
      
      # before
         ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 576460752303423487 H
         ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 576460752303423487 HR
      # after
         ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [1/4294967294) 0 H
         ext4:ext4_es_lookup_extent_exit: dev 253,0 ino 12 found 1 [2/4294967293) 0 HR
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@gmail.com>
      Link: https://lore.kernel.org/r/20191114200147.1073-2-dmonakhov@gmail.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      52144d89
    • Dmitry Monakhov's avatar
      ext4: fix symbolic enum printing in trace output · 459c8074
      Dmitry Monakhov authored
      Trace's macro __print_flags() produce raw event's decraration w/o knowing actual
      flags value
      
      cat /sys/kernel/debug/tracing/events/ext4/ext4_ext_map_blocks_exit/format
      ..
      __print_flags(REC->mflags, "", { (1 << BH_New),
      
      For that reason we have to explicitly define it via special macro TRACE_DEFINE_ENUM()
      Also add missed EXTENT_STATUS_REFERENCED flag.
      
      #Before patch
      ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 2 flags  lblk 0 pblk 4177 len 1 mflags 0x20 ret 1
      ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 12 flags CREATE lblk 0 pblk 34304 len 1 mflags 0x60 ret 1
      
      #With patch
      ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 2 flags  lblk 0 pblk 4177 len 1 mflags M ret 1
      ext4:ext4_ext_map_blocks_exit: dev 253,0 ino 12 flags CREATE lblk 0 pblk 34816 len 1 mflags NM ret 1
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@gmail.com>
      Link: https://lore.kernel.org/r/20191114200147.1073-1-dmonakhov@gmail.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      459c8074
    • Chengguang Xu's avatar
      ext4: choose hardlimit when softlimit is larger than hardlimit in ext4_statfs_project() · 57c32ea4
      Chengguang Xu authored
      Setting softlimit larger than hardlimit seems meaningless
      for disk quota but currently it is allowed. In this case,
      there may be a bit of comfusion for users when they run
      df comamnd to directory which has project quota.
      
      For example, we set 20M softlimit and 10M hardlimit of
      block usage limit for project quota of test_dir(project id 123).
      
      [root@hades mnt_ext4]# repquota -P -a
      *** Report for project quotas on device /dev/loop0
      Block grace time: 7days; Inode grace time: 7days
                              Block limits                File limits
      Project         used    soft    hard  grace    used  soft  hard  grace
      ----------------------------------------------------------------------
       0        --      13       0       0              2     0     0
       123      --   10237   20480   10240              5   200   100
      
      The result of df command as below:
      
      [root@hades mnt_ext4]# df -h test_dir
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/loop0       20M   10M   10M  50% /home/cgxu/test/mnt_ext4
      
      Even though it looks like there is another 10M free space to use,
      if we write new data to diretory test_dir(inherit project id),
      the write will fail with errno(-EDQUOT).
      
      After this patch, the df result looks like below.
      
      [root@hades mnt_ext4]# df -h test_dir
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/loop0       10M   10M  3.0K 100% /home/cgxu/test/mnt_ext4
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.netSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      57c32ea4
    • Eric Biggers's avatar
      ext4: fix race conditions in ->d_compare() and ->d_hash() · ec772f01
      Eric Biggers authored
      Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
      ->d_parent and ->d_inode can be concurrently modified, and in
      particular, ->d_inode may be changed to NULL.  For ext4_d_hash() this
      resulted in a reproducible NULL dereference if a lookup is done in a
      directory being deleted, e.g. with:
      
      	int main()
      	{
      		if (fork()) {
      			for (;;) {
      				mkdir("subdir", 0700);
      				rmdir("subdir");
      			}
      		} else {
      			for (;;)
      				access("subdir/file", 0);
      		}
      	}
      
      ... or by running the 't_encrypted_d_revalidate' program from xfstests.
      Both repros work in any directory on a filesystem with the encoding
      feature, even if the directory doesn't actually have the casefold flag.
      
      I couldn't reproduce a crash in ext4_d_compare(), but it appears that a
      similar crash is possible there.
      
      Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
      falling back to the case sensitive behavior if the inode is NULL.
      Reported-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Fixes: b886ee3e ("ext4: Support case-insensitive file name lookups")
      Cc: <stable@vger.kernel.org> # v5.2+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ec772f01
    • Theodore Ts'o's avatar
      ext4: make dioread_nolock the default · 244adf64
      Theodore Ts'o authored
      This fixes the direct I/O versus writeback race which can reveal stale
      data, and it improves the tail latency of commits on slow devices.
      
      Link: https://lore.kernel.org/r/20200125022254.1101588-1-tytso@mit.eduSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      244adf64
  2. 23 Jan, 2020 1 commit
    • Dmitry Monakhov's avatar
      ext4: fix extent_status fragmentation for plain files · 4068664e
      Dmitry Monakhov authored
      Extents are cached in read_extent_tree_block(); as a result, extents
      are not cached for inodes with depth == 0 when we try to find the
      extent using ext4_find_extent().  The result of the lookup is cached
      in ext4_map_blocks() but is only a subset of the extent on disk.  As a
      result, the contents of extents status cache can get very badly
      fragmented for certain workloads, such as a random 4k read workload.
      
      File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes)
       ext:     logical_offset:        physical_offset: length:   expected: flags:
         0:        0..    8191:      40960..     49151:   8192:             last,eof
      
      $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test
      $ perf script | grep ext4_es_insert_extent | head -n 10
                   fio   131 [000]    13.975421:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W
                   fio   131 [000]    13.975939:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W
                   fio   131 [000]    13.976467:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W
                   fio   131 [000]    13.976937:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W
                   fio   131 [000]    13.977440:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W
                   fio   131 [000]    13.977931:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W
                   fio   131 [000]    13.978376:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W
                   fio   131 [000]    13.978957:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W
                   fio   131 [000]    13.979474:           ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W
      
      Fix this by caching the extents for inodes with depth == 0 in
      ext4_find_extent().
      
      [ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this
        newly added function is not in extents_cache.c, and to avoid
        potential visual confusion with ext4_es_cache_extent().  -TYT ]
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@gmail.com>
      Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      4068664e
  3. 17 Jan, 2020 27 commits
  4. 26 Dec, 2019 5 commits
  5. 23 Dec, 2019 2 commits
    • Ritesh Harjani's avatar
      ext4: Move to shared i_rwsem even without dioread_nolock mount opt · bc6385da
      Ritesh Harjani authored
      We were using shared locking only in case of dioread_nolock mount option in case
      of DIO overwrites. This mount condition is not needed anymore with current code,
      since:-
      
      1. No race between buffered writes & DIO overwrites. Since buffIO writes takes
      exclusive lock & DIO overwrites will take shared locking. Also DIO path will
      make sure to flush and wait for any dirty page cache data.
      
      2. No race between buffered reads & DIO overwrites, since there is no block
      allocation that is possible with DIO overwrites. So no stale data exposure
      should happen. Same is the case between DIO reads & DIO overwrites.
      
      3. Also other paths like truncate is protected, since we wait there for any DIO
      in flight to be over.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Tested-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      bc6385da
    • Ritesh Harjani's avatar
      ext4: Start with shared i_rwsem in case of DIO instead of exclusive · aa9714d0
      Ritesh Harjani authored
      Earlier there was no shared lock in DIO read path. But this patch
      (16c54688: ext4: Allow parallel DIO reads)
      simplified some of the locking mechanism while still allowing for parallel DIO
      reads by adding shared lock in inode DIO read path.
      
      But this created problem with mixed read/write workload. It is due to the fact
      that in DIO path, we first start with exclusive lock and only when we determine
      that it is a ovewrite IO, we downgrade the lock. This causes the problem, since
      we still have shared locking in DIO reads.
      
      So, this patch tries to fix this issue by starting with shared lock and then
      switching to exclusive lock only when required based on ext4_dio_write_checks().
      
      Other than that, it also simplifies below cases:-
      
      1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was
      abused in the sense that it was not really checking for AIO anywhere also it
      used to check for extending writes. So this API was renamed and simplified to
      ext4_unaligned_io() which actully only checks if the IO is really unaligned.
      
      Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial
      block and that will require serialization against other direct IOs in the same
      block. So we take a exclusive inode lock for any unaligned DIO. In case of AIO
      we also need to wait for any outstanding IOs to complete so that conversion from
      unwritten to written is completed before anyone try to map the overlapping block.
      Hence we take exclusive inode lock and also wait for inode_dio_wait() for
      unaligned DIO case. Please note since we are anyway taking an exclusive lock in
      unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.
      
      2. Added ext4_extending_io(). This checks if the IO is extending the file.
      
      3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
      only switch to exclusive lock if required. So in most cases with aligned,
      non-extending, dioread_nolock & overwrites, it tries to write with a shared
      lock. If not, then we restart the operation in ext4_dio_write_checks(), after
      acquiring exclusive lock.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Tested-by: default avatarJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      aa9714d0