1. 22 Oct, 2012 2 commits
  2. 15 Oct, 2012 1 commit
  3. 10 Oct, 2012 2 commits
    • Theodore Ts'o's avatar
      ext4: fix metadata checksum calculation for the superblock · 06db49e6
      Theodore Ts'o authored
      The function ext4_handle_dirty_super() was calculating the superblock
      on the wrong block data.  As a result, when the superblock is modified
      while it is mounted (most commonly, when inodes are added or removed
      from the orphan list), the superblock checksum would be wrong.  We
      didn't notice because the superblock *was* being correctly calculated
      in ext4_commit_super(), and this would get called when the file system
      was unmounted.  So the problem only became obvious if the system
      crashed while the file system was mounted.
      
      Fix this by removing the poorly designed function signature for
      ext4_superblock_csum_set(); if it only took a single argument, the
      pointer to a struct superblock, the ambiguity which caused this
      mistake would have been impossible.
      Reported-by: default avatarGeorge Spelvin <linux@horizon.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      06db49e6
    • Dmitry Monakhov's avatar
      ext4: race-condition protection for ext4_convert_unwritten_extents_endio · dee1f973
      Dmitry Monakhov authored
      We assumed that at the time we call ext4_convert_unwritten_extents_endio()
      extent in question is fully inside [map.m_lblk, map->m_len] because
      it was already split during submission.  But this may not be true due to
      a race between writeback vs fallocate.
      
      If extent in question is larger than requested we will split it again.
      Special precautions should being done if zeroout required because
      [map.m_lblk, map->m_len] already contains valid data.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      dee1f973
  4. 05 Oct, 2012 2 commits
    • Dmitry Monakhov's avatar
      ext4: serialize fallocate with ext4_convert_unwritten_extents · 60d4616f
      Dmitry Monakhov authored
      Fallocate should wait for pended ext4_convert_unwritten_extents()
      otherwise following race may happen:
      
      ftruncate( ,12288);
      fallocate( ,0, 4096)
      io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */
      fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */
      
      Later kwork completion will do:
       ->ext4_convert_unwritten_extents (0, 4096)
         ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
          ->ext4_ext_map_blocks() /* Will find new extent:  ex = [0,2] !!!!!! */
            ->ext4_ext_handle_uninitialized_extents()
              ->ext4_convert_unwritten_extents_endio()
              /* convert [0,2] extent to initialized, but only[0,1] was written */
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      60d4616f
    • Dmitry Monakhov's avatar
      ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Dmitry Monakhov authored
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO goes through three stages
          1) submitted io,
          2) completed io (in i_completed_io_list) conversion pended
          3) finished  io (conversion done)
          And by calling ext4_flush_completed_IO we will flush only
          requests which were in (2) stage, which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
            regardless to it's state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io()
              As result integrity fsync is broken in case of buffered write
              to fallocated region:
              fsync                                      blkdev_completion
      	 ->filemap_write_and_wait_range
                                                         ->ext4_end_bio
                                                           ->end_page_writeback
                <-- filemap_write_and_wait_range return
      	 ->ext4_flush_completed_IO
         	 sees empty i_completed_io_list but pended
         	 conversion still exist
                                                           ->ext4_add_complete_io
      
      BUG #2) Race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series
      
      This patch make following changes:
      1) ext4_flush_completed_io() now first try to flush completed io and when
         wait for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename function to more appropriate name.
      3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
         prevent endless wait
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      c278531d
  5. 01 Oct, 2012 3 commits
    • Theodore Ts'o's avatar
      ext4: fix mtime update in nodelalloc mode · 041bbb6d
      Theodore Ts'o authored
      Commits 5e8830dc and 41c4d25f introduced a regression into
      v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
      take place for files modified via mmap if the page was already in the
      page cache.  This would also affect ext3 file systems mounted using
      the ext4 file system driver.
      
      The problem was that ext4_page_mkwrite() had a shortcut which would
      avoid calling __block_page_mkwrite() under some circumstances, and the
      above two commit transferred the responsibility of calling
      file_update_time() to __block_page_mkwrite --- which woudln't get
      called in some circumstances.
      
      Since __block_page_mkwrite() only has three callers,
      block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
      best way to solve this is to move the responsibility for calling
      file_update_time() to its caller.
      
      This problem was found via xfstests #215 with a file system mounted
      with -o nodelalloc.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@vger.kernel.org
      041bbb6d
    • Dmitry Monakhov's avatar
      ext4: fix ext_remove_space for punch_hole case · 6f2080e6
      Dmitry Monakhov authored
      Inode is allowed to have empty leaf only if it this is blockless inode.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      6f2080e6
    • Dmitry Monakhov's avatar
      ext4: punch_hole should wait for DIO writers · 02d262df
      Dmitry Monakhov authored
      punch_hole is the place where we have to wait for all existing writers
      (writeback, aio, dio), but currently we simply flush pended end_io request
      which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
      held which obviously result in dangerous data corruption due to
      write-after-free.
      
      This patch performs following changes:
      - Guard punch_hole with i_mutex
      - Recheck inode flags under i_mutex
      - Block all new dio readers in order to prevent information leak caused by
        read-after-free pattern.
      - punch_hole now wait for all writers in flight
        NOTE: XXX write-after-free race is still possible because new dirty pages
        may appear due to mmap(), and currently there is no easy way to stop
        writeback while punch_hole is in progress.
      
      [ Fixed error return from ext4_ext_punch_hole() to make sure that we
        release i_mutex before returning EPERM or ETXTBUSY -- Ted ]
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      02d262df
  6. 29 Sep, 2012 8 commits
    • Dmitry Monakhov's avatar
      ext4: serialize truncate with owerwrite DIO workers · 1f555cfa
      Dmitry Monakhov authored
      Jan Kara have spotted interesting issue:
      There are  potential data corruption issue with  direct IO overwrites
      racing with truncate:
       Like:
        dio write                      truncate_task
        ->ext4_ext_direct_IO
         ->overwrite == 1
          ->down_read(&EXT4_I(inode)->i_data_sem);
          ->mutex_unlock(&inode->i_mutex);
                                     ->ext4_setattr()
                                      ->inode_dio_wait()
                                      ->truncate_setsize()
                                      ->ext4_truncate()
                                       ->down_write(&EXT4_I(inode)->i_data_sem);
          ->__blockdev_direct_IO
           ->ext4_get_block
           ->submit_io()
          ->up_read(&EXT4_I(inode)->i_data_sem);
                                       # truncate data blocks, allocate them to
                                       # other inode - bad stuff happens because
                                       # dio is still in flight.
      
      In order to serialize with truncate dio worker should grab extra i_dio_count
      reference before drop i_mutex.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1f555cfa
    • Dmitry Monakhov's avatar
      ext4: endless truncate due to nonlocked dio readers · 1b65007e
      Dmitry Monakhov authored
      If we have enough aggressive DIO readers, truncate and other dio
      waiters will wait forever inside inode_dio_wait(). It is reasonable
      to disable nonlock DIO read optimization during truncate.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1b65007e
    • Dmitry Monakhov's avatar
      ext4: serialize unlocked dio reads with truncate · 1c9114f9
      Dmitry Monakhov authored
      Current serialization will works only for DIO which holds
      i_mutex, but nonlocked DIO following race is possible:
      
      dio_nolock_read_task            truncate_task
      				->ext4_setattr()
      				 ->inode_dio_wait()
      ->ext4_ext_direct_IO
        ->ext4_ind_direct_IO
          ->__blockdev_direct_IO
            ->ext4_get_block
      				 ->truncate_setsize()
      				 ->ext4_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to serialize with unlocked DIO reads we have to
      rearrange wait sequence
      1) update i_size first
      2) if i_size about to be reduced wait for outstanding DIO requests
      3) and only after that truncate inode blocks
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1c9114f9
    • Dmitry Monakhov's avatar
      ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Dmitry Monakhov authored
      Inode's block defrag and ext4_change_inode_journal_flag() may
      affect nonlocked DIO reads result, so proper synchronization
      required.
      
      - Add missed inode_dio_wait() calls where appropriate
      - Check inode state under extra i_dio_count reference.
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      17335dcc
    • Dmitry Monakhov's avatar
      ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov authored
      Current unwritten extent conversion state-machine is very fuzzy.
      - For unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect extent tree with i_data_sem, truncate and punch_hole
        should wait for DIO, so the only data we have to protect is end_io->flags
        modification, but only flush_completed_IO and end_io_work modified this
        flags and we can serialize them via i_completed_io_lock.
      
        Currently all these games with mutex_trylock result in the following deadlock
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
      
      This patch makes state machine simple and clean:
      
      (1) xxx_end_io schedule final extent conversion simply by calling
          ext4_add_complete_io(), which append it to ei->i_completed_io_list
          NOTE1: because of (2A) work should be queued only if
          ->i_completed_io_list was empty, otherwise the work is scheduled already.
      
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          Flushing sequence consists of following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             	     This logic depends on context which we was called from.
          D) Final end_io context destruction
          NOTE1: i_mutex is no longer required because end_io->flags modification
          is protected by ei->ext4_complete_io_lock
      
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing useless i_mutex which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing list walk.
      - Rename ext4_end_io_nolock to end4_end_io and make it static
      - Check flush completion status to ext4_ext_punch_hole(). Because it is
        not good idea to punch blocks from corrupted inode.
      
      Changes since V3 (in request to Jan's comments):
        Fall back to active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by race truncate vs end_io_work
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      28a535f9
    • Dmitry Monakhov's avatar
      ext4: fix unwritten counter leakage · 82e54229
      Dmitry Monakhov authored
      ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
      once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
      on error path.
      
       - add missed error checks to prevent counter leakage
       - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
         that conversion finished.
       - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.
      
      Visible effect of this bug is that unaligned aio_stress may deadlock
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      82e54229
    • Dmitry Monakhov's avatar
      ext4: give i_aiodio_unwritten a more appropriate name · e27f41e1
      Dmitry Monakhov authored
      AIO/DIO prefix is wrong because it account unwritten extents which
      also may be scheduled from buffered write endio
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      e27f41e1
    • Dmitry Monakhov's avatar
      ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov authored
      Generic inode has unused i_private pointer which may be used as cur_aio_dio
      storage.
      
      TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
            to have concurent AIO_DIO requests.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f45ee3a1
  7. 27 Sep, 2012 12 commits
  8. 26 Sep, 2012 6 commits
    • Dmitry Monakhov's avatar
      ext4: reimplement uninit extent optimization for move_extent_per_page() · 8c854473
      Dmitry Monakhov authored
      Uninitialized extent may became initialized(parallel writeback task)
      at any moment after we drop i_data_sem, so we have to recheck extent's
      state after we hold page's lock and i_data_sem.
      
      If we about to change page's mapping we must hold page's lock in order to
      serialize other users.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      8c854473
    • Dmitry Monakhov's avatar
      ext4: clean up online defrag bugs in move_extent_per_page() · bb557488
      Dmitry Monakhov authored
      Non-full list of bugs:
      1) uninitialized extent optimization does not hold page's lock,
         and simply replace brunches after that writeback code goes
         crazy because block mapping changed under it's feets
         kernel BUG at fs/ext4/inode.c:1434!  ( 288'th xfstress)
      
      2) uninitialized extent may became initialized right after we
         drop i_data_sem, so extent state must be rechecked
      
      3) Locked pages goes uptodate via following sequence:
         ->readpage(page); lock_page(page); use_that_page(page)
         But after readpage() one may invalidate it because it is
         uptodate and unlocked (reclaimer does that)
         As result kernel bug at include/linux/buffer_head.c:133!
      
      4) We call write_begin() with already opened stansaction which
         result in following deadlock:
      ->move_extent_per_page()
        ->ext4_journal_start()-> hold journal transaction
        ->write_begin()
          ->ext4_da_write_begin()
            ->ext4_nonda_switch()
              ->writeback_inodes_sb_if_idle()  --> will wait for journal_stop()
      
      5) try_to_release_page() may fail and it does fail if one of page's bh was
         pinned by journal
      
      6) If we about to change page's mapping we MUST hold it's lock during entire
         remapping procedure, this is true for both pages(original and donor one)
      
      Fixes:
      
      - Avoid (1) and (2) simply by temproraly drop uninitialized extent handling
        optimization, this will be reimplemented later.
      
      - Fix (3) by manually forcing page to uptodate state w/o dropping it's lock
      
      - Fix (4) by rearranging existing locking:
        from: journal_start(); ->write_begin
        to: write_begin(); journal_extend()
      - Fix (5) simply by checking retvalue
      - Fix (6) by locking both (original and donor one) pages during extent swap
        with help of mext_page_double_lock()
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      bb557488
    • Dmitry Monakhov's avatar
      ext4: online defrag is not supported for journaled files · f066055a
      Dmitry Monakhov authored
      Proper block swap for inodes with full journaling enabled is
      truly non obvious task. In order to be on a safe side let's
      explicitly disable it for now.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f066055a
    • Dmitry Monakhov's avatar
      ext4: move_extent code cleanup · 03bd8b9b
      Dmitry Monakhov authored
      - Remove usless checks, because it is too late to check that inode != NULL
        at the moment it was referenced several times.
      - Double lock routines looks very ugly and locking ordering relays on
        order of i_ino, but other kernel code rely on order of pointers.
        Let's make them simple and clean.
      - check that inodes belongs to the same SB as soon as possible.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      03bd8b9b
    • Tao Ma's avatar
      ext4: don't call update_backups() multiple times for the same bg · 0acdb887
      Tao Ma authored
      When performing an online resize, we add a bunch of groups at one time
      in ext4_flex_group_add, so in most cases a lot of group descriptors
      will be in the same group block. But in the end of this function,
      update_backups will be called for every group descriptor and the same
      block will be copied and journalled again and again.  It is really a
      waste.
      
      Fix things so we only update a particular bg descriptor block once and
      skip subsequent updates of the same block.
      Signed-off-by: default avatarTao Ma <boyu.mt@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      0acdb887
    • Dmitry Monakhov's avatar
      ext4: fix double unlock buffer mess during fs-resize · 7f1468d1
      Dmitry Monakhov authored
      bh_submit_read() is responsible for unlock bh on endio.  In addition,
      we need to use bh_uptodate_or_lock() to avoid races.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      7f1468d1
  9. 24 Sep, 2012 3 commits
  10. 20 Sep, 2012 1 commit
    • Tao Ma's avatar
      ext4: remove erroneous ext4_superblock_csum_set() in update_backups() · bef53b01
      Tao Ma authored
      The update_backups() function is used to backup all the metadata
      blocks, so we should not take it for granted that 'data' is pointed to
      a super block and use ext4_superblock_csum_set to calculate the
      checksum there.  In case where the data is a group descriptor block,
      it will corrupt the last group descriptor, and then e2fsck will
      complain about it it.
      
      As all the metadata checksums should already be OK when we do the
      backup, remove the wrong ext4_superblock_csum_set and it should be
      just fine.
      Reported-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarTao Ma <boyu.mt@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bef53b01