1. 03 Aug, 2015 2 commits
    • ext4: fix reservation release on invalidatepage for delalloc fs · 3959e2a6
      Lukas Czerner authored
      commit 9705acd6 upstream.
      
      On a delalloc-enabled file system, on the invalidatepage operation
      in ext4_da_page_release_reservation(), we want to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However, there is currently a bug: on systems with page size >
      block size we always remove extents from the start of the page,
      regardless of where the actual delayed buffers are positioned in the page.
      This leads to errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      However, this can cause data loss at writeback time if the file system is
      in an ENOSPC condition, because we're releasing the reservation for someone
      else's delayed buffer.
      
      Fix this by only removing extents that correspond to the part of the
      page we want to invalidate.
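As an illustration only, here is a simplified Python sketch of the intended behaviour, not the kernel code. Given the byte range being invalidated, it reports only the runs of delayed buffers that lie fully inside that range, with logical block numbers computed from their real position in the page. The function name, its parameters and the per-buffer `delayed` flags are assumptions made for this sketch.

```python
def delayed_runs_to_release(page_index, page_size, block_size,
                            offset, length, delayed):
    """Return (first_logical_block, count) runs of delayed buffers that
    lie fully inside the invalidated byte range [offset, offset + length)."""
    blocks_per_page = page_size // block_size
    first_lblk = page_index * blocks_per_page
    stop = offset + length
    runs = []
    run_start = None  # index (within the page) where the current run began
    for i in range(blocks_per_page):
        curr_off = i * block_size
        next_off = curr_off + block_size
        inside = offset <= curr_off and next_off <= stop
        if inside and delayed[i]:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended; record it at its real position in the page.
            runs.append((first_lblk + run_start, i - run_start))
            run_start = None
    if run_start is not None:
        runs.append((first_lblk + run_start, blocks_per_page - run_start))
    return runs


# 4k page, 1k blocks, page index 3, invalidating the second half of the page.
print(delayed_runs_to_release(3, 4096, 1024, 2048, 2048,
                              [True, False, True, True]))  # -> [(14, 2)]
```

With the bug described above, the extent would instead have been removed starting at logical block 12 (the start of the page), which belongs to a buffer outside the invalidated range.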
      
      This problem is reproducible with the following fio recipe (however, I was
      only able to reproduce it with fio-2.1 or older):
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix race between truncate and __ext4_journalled_writepage() · 8fb9ff99
      Theodore Ts'o authored
      commit bdf96838 upstream.
      
      The commit cf108bca: "ext4: Invert the locking order of page_lock
      and transaction start" caused __ext4_journalled_writepage() to drop
      the page lock before the page was written back, as part of changing
      the locking order to jbd2_journal_start -> page_lock.  However, this
      introduced a potential race if there was a truncate racing with the
      data=journalled writeback mode.
      
      Fix this by grabbing the page lock after starting the journal handle,
      and then checking whether the page was truncated out from under
      us.
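The ordering described above can be sketched as follows. This is a toy Python model under stated assumptions, not the kernel implementation: the `Page` class, the `journalled_writepage()` function and its callback parameters are all hypothetical names. The handle is started without the page lock, the lock is then taken, and the page is re-checked before any writeback work is done.

```python
import threading


class Page:
    """Toy page: mapping is set to None once the page is truncated."""

    def __init__(self, mapping):
        self.lock = threading.Lock()
        self.mapping = mapping


def journalled_writepage(page, mapping, start_handle, write_back):
    # Start the journal handle first, without holding the page lock.
    handle = start_handle()
    with page.lock:
        # A truncate may have run while the page lock was not held,
        # so re-check that the page still belongs to this mapping.
        if page.mapping is not mapping:
            return "truncated"
        write_back(handle, page)
        return "written"
```

In the sketch a truncated page is simply skipped; real code would also have to stop the journal handle it started.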
      
      This fixes a number of different warnings or BUG_ON's when running
      xfstests generic/086 in data=journalled mode, including:
      
      jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
      c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0
      
      	      	      	  - and -
      
      kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
          ...
      Call Trace:
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c027d883>] ? lock_buffer+0x36/0x36
       [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
       [<c0229139>] do_invalidatepage+0x22/0x26
       [<c0229198>] truncate_inode_page+0x5b/0x85
       [<c022934b>] truncate_inode_pages_range+0x156/0x38c
       [<c0229592>] truncate_inode_pages+0x11/0x15
       [<c022962d>] truncate_pagecache+0x55/0x71
       [<c02b913b>] ext4_setattr+0x4a9/0x560
       [<c01ca542>] ? current_kernel_time+0x10/0x44
       [<c026c4d8>] notify_change+0x1c7/0x2be
       [<c0256a00>] do_truncate+0x65/0x85
       [<c0226f31>] ? file_ra_state_init+0x12/0x29
      
      	      	      	  - and -
      
      WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
      irty_metadata+0x14a/0x1ae()
          ...
      Call Trace:
       [<c01b879f>] ? console_unlock+0x3a1/0x3ce
       [<c082cbb4>] dump_stack+0x48/0x60
       [<c0178b65>] warn_slowpath_common+0x89/0xa0
       [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c0178bef>] warn_slowpath_null+0x14/0x18
       [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
       [<c02b2f44>] write_end_fn+0x40/0x53
       [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
       [<c02b59e7>] ext4_writepage+0x354/0x3b8
       [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
       [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b5a5b>] __writepage+0x10/0x2e
       [<c0225956>] write_cache_pages+0x22d/0x32c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b6ee8>] ext4_writepages+0x102/0x607
       [<c019adfe>] ? sched_clock_local+0x10/0x10e
       [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
       [<c01a8ad5>] ? lock_is_held+0x43/0x51
       [<c0226dff>] do_writepages+0x1c/0x29
       [<c0276bed>] __writeback_single_inode+0xc3/0x545
       [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
          ...
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 13 May, 2015 1 commit
    • ext4: fix data corruption caused by unwritten and delayed extents · f84a0b54
      Lukas Czerner authored
      commit d2dc317d upstream.
      
      Currently it is possible to lose a whole file system block worth of data
      when we hit a specific interaction between unwritten and delayed extents
      in the extent status tree.
      
      The problem is that once we insert a delayed extent into the extent status
      tree, the only way to get rid of it is to write out the delayed buffer.
      However, there is a limitation in the extent status tree implementation:
      when inserting an unwritten extent, should there be even a single
      delayed block, the whole unwritten extent is marked as delayed.
      
      At this point there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when we write
      into said unwritten extent we convert it to written, but it still
      remains delayed.
      
      When we try to write into that block later, ext4_da_map_blocks() will set
      the buffer new and delayed and map it to an invalid block, which causes
      the rest of the block to be zeroed, losing already written data.
      
      For now we can fix this by simply not allowing the delayed status to be set
      on a written extent in the extent status tree. Also add WARN_ON() to make
      sure that we notice if this happens in the future.
      
      This problem can be easily reproduced by running the following xfs_io commands:
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
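To see why the last write lands in a block that already holds data, the byte offsets in the commands above can be mapped to file system blocks. The Python sketch below assumes a 4096-byte block size, which the commit message does not state; `blocks_touched()` is a helper invented for this illustration.

```python
BLOCK_SIZE = 4096  # assumption: 4k file system blocks


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Return the list of block numbers covered by [offset, offset + length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


print(blocks_touched(4096, 2048))    # first pwrite (0xaa)  -> [1]
print(blocks_touched(65536, 2048))   # second pwrite (0xbb) -> [16]
print(blocks_touched(67584, 2048))   # final pwrite (0xdd)  -> [16]
print(67584 % BLOCK_SIZE)            # offset inside block 16 -> 2048
```

Under that assumption, the 0xbb and 0xdd writes cover the first and second half of the same block, so zeroing "the rest of the block" during the final write destroys the 0xbb data written earlier.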
      
      In theory this can also be reproduced at random by running fsx,
      but that is not very reliable, though on machines with a bigger page size
      (like ppc) it can be seen more often (especially with xfstests generic/127).
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 14 Nov, 2014 4 commits
  4. 05 Sep, 2014 1 commit
  5. 01 Jul, 2014 1 commit
    • ext4: fix data integrity sync in ordered mode · dc2acd78
      Namjae Jeon authored
      commit 1c8349a1 upstream.
      
      When we perform a data integrity sync we tag all the dirty pages with
      PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.  Later we check
      for this tag in write_cache_pages_da and create a struct
      mpage_da_data containing contiguously indexed pages tagged with this
      tag, and sync these pages with a call to mpage_da_map_and_submit.  This
      process is done in a while loop until all the PAGECACHE_TAG_TOWRITE
      pages are synced. We also do journal start and stop in each iteration.
      journal_stop could initiate a journal commit, which would call
      ext4_writepage, which in turn will call ext4_bio_write_page even for
      delayed OR unwritten buffers. When ext4_bio_write_page is called for
      such buffers, it does not sync them, but it does clear the
      PAGECACHE_TAG_TOWRITE of the corresponding page, and hence these pages
      are not synced by the currently running data integrity sync either. We
      will end up with dirty pages although the sync is completed.
      
      This could cause a potential data loss when the sync call is followed
      by a truncate_pagecache call, which is exactly the case in
      collapse_range.  (It will cause generic/127 failure in xfstests)
      
      To avoid this issue, we can use set_page_writeback_keepwrite instead of
      set_page_writeback, which doesn't clear TOWRITE tag.
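The difference between the two calls can be modelled with a toy Python sketch. It represents a page's tags as a set of strings and uses a single hypothetical function with a `keep_write` flag; it only mirrors the behaviour described above and is not the kernel API.

```python
TOWRITE = "towrite"      # stand-in for PAGECACHE_TAG_TOWRITE
WRITEBACK = "writeback"  # stand-in for the writeback tag


def set_page_writeback(tags, keep_write=False):
    """Return the page's tags after it is marked as under writeback.

    With keep_write=False the TOWRITE tag is dropped, so a running data
    integrity sync no longer finds the page. With keep_write=True the
    tag survives and the sync will still pick the page up.
    """
    tags = set(tags)
    tags.add(WRITEBACK)
    if not keep_write:
        tags.discard(TOWRITE)
    return tags


print(TOWRITE in set_page_writeback({TOWRITE}))                   # -> False
print(TOWRITE in set_page_writeback({TOWRITE}, keep_write=True))  # -> True
```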
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 06 May, 2014 2 commits
  7. 31 Mar, 2014 1 commit
  8. 26 Jan, 2014 1 commit
  9. 07 Jan, 2014 1 commit
    • ext4: don't pass freed handle to ext4_walk_page_buffers · 8c9367fd
      Theodore Ts'o authored
      
      This is harmless, since ext4_walk_page_buffers only passes the handle
      on to the callback function, and in this call site the function in
      question, bput_one(), doesn't actually use the handle.  But there's no
      point passing in an invalid handle, and it creates a Coverity warning,
      so let's just clean it up.
      
      Addresses-Coverity-Id: #1091168
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  10. 06 Jan, 2014 2 commits
  11. 18 Dec, 2013 1 commit
    • ext4: fix deadlock when writing in ENOSPC conditions · 34cf865d
      Jan Kara authored
      
      Akira-san has been reporting rare deadlocks on his machine when running
      xfstests test 269 on an ext4 filesystem. The problem turned out to be in
      ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
      ext4_should_retry_alloc() while holding i_data_sem. Since
      ext4_should_retry_alloc() can force a transaction commit, this is a
      lock ordering violation and leads to deadlocks.
      
      Fix the problem by just removing the retry loops. These functions should
      just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
      function must take care of retrying after dropping all necessary locks.
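The resulting division of labour can be sketched in Python as below. This is a toy model, not the kernel code: the lock object, the `fs` dictionary, `reserve_space()`, `write_begin()` and the `force_commit` callback are all stand-ins invented for illustration. The reservation helper reports ENOSPC without retrying, and the caller retries only after the lock has been dropped.

```python
import errno
import threading

i_data_sem = threading.Lock()  # stand-in for the inode's i_data_sem


def reserve_space(fs):
    """Try to reserve one block; never retries while holding the lock."""
    with i_data_sem:
        if fs["free_blocks"] == 0:
            return -errno.ENOSPC
        fs["free_blocks"] -= 1
        return 0


def write_begin(fs, force_commit, max_retries=3):
    """Caller-side retry: runs only after reserve_space() dropped the lock."""
    retries = 0
    while True:
        ret = reserve_space(fs)
        if ret != -errno.ENOSPC or retries >= max_retries:
            return ret
        retries += 1
        # May wait for a transaction commit; no lock is held at this point.
        force_commit(fs)
```

Because the commit is forced outside the locked region, the lock ordering problem described above cannot arise in this arrangement.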
      Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  12. 05 Dec, 2013 1 commit
  13. 12 Nov, 2013 1 commit
  14. 30 Oct, 2013 1 commit
  15. 17 Oct, 2013 1 commit
    • ext4: fix performance regression in ext4_writepages · aeac589a
      Ming Lei authored
      Commit 4e7ea81d (ext4: restructure writeback path) introduces another
      performance regression on random write:
      
      - one more page may be added to ext4 extent in
        mpage_prepare_extent_to_map, and will be submitted for I/O so
        nr_to_write will become -1 before 'done' is set
      
      - worse, dirty pages may still be retrieved from the page
        cache after nr_to_write becomes negative, so lots of small chunks
        can be submitted to the block device when page writeback is catching up
        with the write path, and performance is hurt.
      
      On one ARM A15 board with a SATA 3.0 SSD (CPU: 1.5GHz dual core, RAM:
      2GB, SATA controller: 3.0Gbps), this patch improves the result of the
      test below from 157MB/sec to 174MB/sec (>10%):
      
      	dd if=/dev/zero of=./z.img bs=8K count=512K
      
      The above test is actually a prototype of block write in the bonnie++
      utility.
      
      This patch makes sure no more pages than nr_to_write can be added to
      extent for mapping, so that nr_to_write won't become negative.
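The capping rule can be illustrated with a minimal Python sketch. `pages_for_extent()` and its arguments are hypothetical and stand in for the page-gathering loop; the sketch only shows that collection stops once the budget is reached, so the remaining budget cannot drop below zero.

```python
def pages_for_extent(dirty_pages, nr_to_write):
    """Collect dirty page indices for one extent, never more than nr_to_write."""
    extent = []
    for index in dirty_pages:
        if len(extent) >= nr_to_write:
            break  # budget exhausted: stop before adding one page too many
        extent.append(index)
    return extent


extent = pages_for_extent([10, 11, 12, 13, 14], nr_to_write=3)
print(extent)           # -> [10, 11, 12]
print(3 - len(extent))  # remaining budget -> 0
```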
      
      Cc: linux-ext4@vger.kernel.org
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  16. 16 Oct, 2013 1 commit
  17. 16 Sep, 2013 1 commit
    • ext4: fix performance regression in writeback of random writes · 9c12a831
      Jan Kara authored
      The Linux Kernel Performance project guys have reported that commit
      4e7ea81d introduces a performance regression for the following fio
      workload:
      
      [global]
      direct=0
      ioengine=mmap
      size=1500M
      bs=4k
      pre_read=1
      numjobs=1
      overwrite=1
      loops=5
      runtime=300
      group_reporting
      invalidate=0
      directory=/mnt/
      file_service_type=random:36
      file_service_type=random:36
      
      [job0]
      startdelay=0
      rw=randrw
      filename=data0/f1:data0/f2
      
      [job1]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      ...
      
      [job7]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      
      The culprit of the problem is that after the commit ext4_writepages()
      is more aggressive in writing back pages. Thus we have fewer consecutive
      dirty pages, resulting in more seeking.
      
      This increased aggressiveness is caused by a bug in the condition
      terminating ext4_writepages(). We start writing from the beginning of
      the file even if we should have terminated ext4_writepages() because
      wbc->nr_to_write <= 0.
      
      After fixing the condition the throughput of the fio workload is about 20%
      better than before writeback reorganization.
      Reported-by: "Yan, Zheng" <zheng.z.yan@intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  18. 12 Sep, 2013 1 commit
  19. 04 Sep, 2013 1 commit
    • direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig authored
      
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces open-coded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 28 Aug, 2013 2 commits
  21. 17 Aug, 2013 6 commits
    • ext4: fix lost truncate due to race with writeback · 90e775b7
      Jan Kara authored
      
      The following race can lead to the loss of an i_disksize update from truncate,
      thus resulting in a wrong inode size if the inode size isn't updated
      again before the inode is reclaimed:
      
      ext4_setattr()				mpage_map_and_submit_extent()
        EXT4_I(inode)->i_disksize = attr->ia_size;
        ...					  ...
      					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
      					  /* False because i_size isn't
      					   * updated yet */
      					  if (disksize > i_size_read(inode))
      					  /* True, because i_disksize is
      					   * already truncated */
      					  if (disksize > EXT4_I(inode)->i_disksize)
      					    /* Overwrite i_disksize
      					     * update from truncate */
      					    ext4_update_i_disksize()
        i_size_write(inode, attr->ia_size);
      
      For other places updating i_disksize such race cannot happen because
      i_mutex prevents these races. Writeback is the only place where we do
      not hold i_mutex and we cannot grab it there because of lock ordering.
      
      We fix the race by doing both i_disksize and i_size update in truncate
      atomically under i_data_sem and in mpage_map_and_submit_extent() we move
      the check against i_size under i_data_sem as well.
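The locking scheme described above can be modelled in Python as follows. This is a toy sketch, not the kernel code: the `Inode` class, its fields and the two functions are simplified stand-ins, and a plain mutex takes the place of i_data_sem. Truncate updates both sizes under the lock, and writeback clamps its candidate disk size against i_size under the same lock.

```python
import threading


class Inode:
    def __init__(self, i_size, i_disksize):
        self.i_data_sem = threading.Lock()  # stand-in for i_data_sem
        self.i_size = i_size
        self.i_disksize = i_disksize


def truncate(inode, new_size):
    # Both sizes change together, so writeback never sees only one of them.
    with inode.i_data_sem:
        inode.i_disksize = new_size
        inode.i_size = new_size


def writeback_update_disksize(inode, disksize):
    with inode.i_data_sem:
        # Never move i_disksize past the current i_size.
        disksize = min(disksize, inode.i_size)
        if disksize > inode.i_disksize:
            inode.i_disksize = disksize


inode = Inode(i_size=8192, i_disksize=8192)
truncate(inode, 4096)
writeback_update_disksize(inode, 8192)  # stale value computed before truncate
print(inode.i_disksize)                 # -> 4096
```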
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: simplify truncation code in ext4_setattr() · 5208386c
      Jan Kara authored
      
      Merge the conditions in ext4_setattr() that handle inode size changes. Also
      move the ext4_begin_ordered_truncate() call somewhat earlier, because it
      simplifies error recovery in case of failure, and add error handling in
      case the i_disksize update fails.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix ext4_writepages() in presence of truncate · 5f1132b2
      Jan Kara authored
      
      Inode size can arbitrarily change while writeback is in progress. When
      ext4_writepages() has prepared a long extent for mapping and truncate
      then reduces i_size, mpage_map_and_submit_buffers() will always map just
      one buffer in a page instead of all of them due to lblk < blocks check.
      So we end up not using all blocks we've allocated (thus leaking them)
      and also delalloc accounting goes wrong manifesting as a warning like:
      
      ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
      with only 0 reserved data blocks
      
      Note that the problem can happen only when blocksize < pagesize because
      otherwise we have only a single buffer in the page.
      
      Fix the problem by removing the size check from the mapping loop. We
      have an extent allocated so we have to use it all before checking for
      i_size. We also rename add_page_bufs_to_extent() to
      mpage_process_page_bufs() and make that function submit the page for IO
      if all buffers (up to EOF) in it are mapped.
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: move test whether extent to map can be extended to one place · 09930042
      Jan Kara authored
      
      Currently the logic deciding whether the current buffer can be added to an
      extent of buffers to map is split between mpage_add_bh_to_extent() and
      add_page_bufs_to_extent(). Move the whole logic to
      mpage_add_bh_to_extent(), which makes things a bit more straightforward
      and makes the following i_size fixes easier.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: use unsigned int for es_status values · 3be78c73
      Theodore Ts'o authored
      
      Don't use an unsigned long long for the es_status flags; this requires
      that we pass 64-bit values around which is painful on 32-bit systems.
      Instead pass the extent status flags around using the low 4 bits of an
      unsigned int, and shift them into place when we are reading or writing
      es_pblk.
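The packing idea can be sketched in Python as below. The layout is an assumption made for illustration: a 64-bit es_pblk value whose top 4 bits carry the status flags. The constant and function names are hypothetical, not the kernel's definitions.

```python
ES_SHIFT = 60                    # assumption: flags live in the top 4 of 64 bits
ES_FLAGS_MASK = 0xF              # the low 4 bits of the small status value
PBLK_MASK = (1 << ES_SHIFT) - 1  # the remaining 60 bits hold the block number


def es_store(pblk, status):
    """Shift the 4 flag bits into place when writing the combined value."""
    return ((status & ES_FLAGS_MASK) << ES_SHIFT) | (pblk & PBLK_MASK)


def es_status(es_pblk):
    """Shift the flags back down so callers handle a small integer."""
    return (es_pblk >> ES_SHIFT) & ES_FLAGS_MASK


def es_pblock(es_pblk):
    """Extract the physical block number without the flag bits."""
    return es_pblk & PBLK_MASK


packed = es_store(0x1234, 0b1010)
print(es_status(packed))       # -> 10
print(hex(es_pblock(packed)))  # -> 0x1234
```

Callers only ever see the 4-bit status value; the 64-bit quantity appears solely at the point where es_pblk is read or written.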
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
    • jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara authored
      Commit 0713ed0c added a
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However, that function gets called from the truncate path and thus the inode
      needn't have a jinode attached - that happens in ext4_file_open(), but
      the file needn't ever have been opened since mount. Calling
      jbd2_journal_file_inode() without a jinode attached results in the oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  22. 29 Jul, 2013 1 commit
  23. 16 Jul, 2013 1 commit
    • ext4: call ext4_es_lru_add() after handling cache miss · 63b99968
      Theodore Ts'o authored
      
      If there are no items in the extent status tree, ext4_es_lru_add() is
      a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
      try to lookup an entry in the extent status tree.  We also need to
      call it at the end of ext4_ext_map_blocks(), after items have been
      added to the extent status tree.
      
      This could lead to inodes that have extent status trees but which
      are not in the LRU list, which means they won't get considered for
      eviction by the es_shrinker.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
  24. 13 Jul, 2013 1 commit
  25. 06 Jul, 2013 1 commit
  26. 01 Jul, 2013 3 commits