1. 14 Jun, 2009 1 commit
    • Aneesh Kumar K.V's avatar
      ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation · c364b22c
      Aneesh Kumar K.V authored
      It is possible to see buffer_heads which are not mapped in the
      writepage callback in the following scneario (where the fs blocksize
      is 1k and the page size is 4k):
      
      1) truncate(f, 1024)
      2) mmap(f, 0, 4096)
      3) a[0] = 'a'
      4) truncate(f, 4096)
      5) writepage(...)
      
      Now if we get a writepage callback immediately after (4) and before an
      attempt to write at any other offset via mmap address (which implies we
      are yet to get a pagefault and do a get_block) what we would have is the
      page which is dirty have first block allocated and the other three
      buffer_heads unmapped.
      
      In the above case the writepage should go ahead and try to write the
      first blocks and clear the page_dirty flag. Further attempts to write
      to the page will again create a fault and result in allocating blocks
      and marking page dirty.  If we don't write any other offset via mmap
      address we would still have written the first block to the disk and
      rest of the space will be considered as a hole.
      
      So to address this, we change all of the places where we look for
      delayed, unmapped, or unwritten buffer heads, and only check for
      delayed or unwritten buffer heads instead.
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      c364b22c
  2. 04 Jun, 2009 1 commit
  3. 06 Jul, 2009 1 commit
  4. 08 Jul, 2009 1 commit
    • Theodore Ts'o's avatar
      ext4: fix no journal corruption with locale-gen · 5adfee9c
      Theodore Ts'o authored
      If there is no journal, ext4_should_writeback_data() should return
      TRUE.  This will fix ext4_set_aops() to set ext4_da_ops in the case of
      delayed allocation; otherwise ext4_journaled_aops gets used by
      default, which doesn't handle delayed allocation properly.
      
      The advantage of using ext4_should_writeback_data() approach is that
      it should handle nobh better as well.
      
      Thanks to Curt Wohlgemuth for investigating this problem, and Aneesh
      Kumar for suggesting this approach.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      5adfee9c
  5. 06 Jul, 2009 1 commit
  6. 13 Jul, 2009 2 commits
    • Jan Kara's avatar
      ext4: Fix truncation of symlinks after failed write · ffacfa7a
      Jan Kara authored
      Contents of long symlinks is written via standard write methods. So
      when the write fails, we add inode to orphan list. But symlinks don't
      have .truncate method defined so nobody properly removes them from the
      on disk orphan list.
      
      Fix this by calling ext4_truncate() directly instead of calling
      vmtruncate() (which is saner anyway since we don't need anything
      vmtruncate() does except from calling .truncate in these paths).  We
      also add inode to orphan list only if ext4_can_truncate() is true
      (currently, it can be false for symlinks when there are no blocks
      allocated) - otherwise orphan list processing will complain and
      ext4_truncate() will not remove inode from on-disk orphan list.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      ffacfa7a
    • Jan Kara's avatar
      jbd2: Fix a race between checkpointing code and journal_get_write_access() · f91d1d04
      Jan Kara authored
      The following race can happen:
      
       CPU1                          CPU2
                                     checkpointing code checks the buffer, adds
                                       it to an array for writeback
       do_get_write_access()
       ...
       lock_buffer()
       unlock_buffer()
                                     flush_batch() submits the buffer for IO
       __jbd2_journal_file_buffer()
      
      So a buffer under writeout is returned from
      do_get_write_access(). Since the filesystem code relies on the fact
      that journaled buffers cannot be written out, it does not take the
      buffer lock and so it can modify buffer while it is under
      writeout. That can lead to a filesystem corruption if we crash at the
      right moment.
      
      We fix the problem by clearing the buffer dirty bit under buffer_lock
      even if the buffer is on BJ_None list. Actually, we clear the dirty
      bit regardless the list the buffer is in and warn about the fact if
      the buffer is already journalled.
      
      Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.
      Reported-by: default avatardingdinghua <dingdinghua85@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f91d1d04
  7. 06 Jul, 2009 1 commit
  8. 13 Jul, 2009 1 commit
    • Eric Sandeen's avatar
      ext4: naturally align struct ext4_allocation_request · 726447d8
      Eric Sandeen authored
      As Ted noted, the ext4_allocation_request isn't well aligned.  Looking
      at it with pahole we're wasting space on 64-bit arches:
      
      struct ext4_allocation_request {
              struct inode *             inode;              /*     0     8 */
              ext4_lblk_t                logical;            /*     8     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              ext4_fsblk_t               goal;               /*    16     8 */
              ext4_lblk_t                lleft;              /*    24     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              ext4_fsblk_t               pleft;              /*    32     8 */
              ext4_lblk_t                lright;             /*    40     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              ext4_fsblk_t               pright;             /*    48     8 */
              unsigned int               len;                /*    56     4 */
              unsigned int               flags;              /*    60     4 */
              /* --- cacheline 1 boundary (64 bytes) --- */
      
              /* size: 64, cachelines: 1, members: 9 */
              /* sum members: 52, holes: 3, sum holes: 12 */
      };
      
      Grouping 32-bit members together closes these holes and shrinks the
      structure by 12 bytes. which is important since ext4 can get on the
      hairy edge of stack overruns.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      726447d8
  9. 06 Jul, 2009 2 commits
    • Eric Sandeen's avatar
      ext4: mark several more functions in mballoc.c as noinline · 089ceecc
      Eric Sandeen authored
      Ted noticed a stack-deep callchain through
      writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ...
      
      With all the static functions in mballoc.c, gcc helpfully
      inlines for us, and we get something like this:
      
      ext4_mb_regular_allocator	(232 bytes stack)
      	ext4_mb_init_cache	(232 bytes stack)
      		submit_bh	(starts 464 deeper)
      
      the 2 ext4 functions here get several others inlined; by telling
      gcc not to inline them, we can save stack space for when we
      head off into submit_bh land and associated block layer callchains.
      The following noinlined functions are only called once, so this
      won't impact any other callchains:
      
      ext4_mb_regular_allocator 			(104) (was 232)
      	ext4_mb_find_by_goal			 (56) (noinlined)
      	ext4_mb_init_group			 (24) (noinlined)
      		ext4_mb_init_cache		(136) (was 232)
      			ext4_mb_generate_buddy	 (88) (noinlined)
      			ext4_mb_generate_from_pa (40) (noinlined)
      			submit_bh
      	ext4_mb_simple_scan_group		 (24) (noinlined)
      	ext4_mb_scan_aligned			 (56) (noinlined)
      	ext4_mb_complex_scan_group		 (40) (noinlined)
      	ext4_mb_try_best_found			 (24) (noinlined)
      
      now when we head off into submit_bh() we're only 264 bytes deeper
      in stack than when we entered ext4_mb_regular_allocator()
      (vs. 464 bytes before).  Every 200 bytes helps.  :)
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      089ceecc
    • Theodore Ts'o's avatar
      ext4: Fix potential reclaim deadlock when truncating partial block · f4a01017
      Theodore Ts'o authored
      The ext4_block_truncate_page() function previously called
      grab_cache_page(), which called find_or_create_page() with the
      __GFP_FS flag potentially set.  This could cause a deadlock if the
      system is low on memory and it attempts a memory reclaim, which could
      potentially call back into ext4.  So we need to call
      find_or_create_page() directly, and remove the __GFP_FP flag to avoid
      this potential deadlock.
      
      Thanks to Roland Dreier for reporting a lockdep warning which showed
      this problem.
      
      [20786.363249] =================================
      [20786.363257] [ INFO: inconsistent lock state ]
      [20786.363265] 2.6.31-2-generic #14~rbd4gitd960eea9
      [20786.363270] ---------------------------------
      [20786.363276] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
      [20786.363285] http/8397 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [20786.363291]  (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
      [20786.363314] {IN-RECLAIM_FS-W} state was registered at:
      [20786.363320]   [<ffffffff8108bef6>] mark_irqflags+0xc6/0x1a0
      [20786.363334]   [<ffffffff8108d347>] __lock_acquire+0x287/0x430
      [20786.363345]   [<ffffffff8108d595>] lock_acquire+0xa5/0x150
      [20786.363355]   [<ffffffff812008da>] jbd2_journal_start+0xfa/0x150
      [20786.363365]   [<ffffffff811d98a8>] ext4_journal_start_sb+0x58/0x90
      [20786.363377]   [<ffffffff811cce85>] ext4_delete_inode+0xc5/0x2c0
      [20786.363389]   [<ffffffff81146fa3>] generic_delete_inode+0xd3/0x1a0
      [20786.363401]   [<ffffffff81147095>] generic_drop_inode+0x25/0x30
      [20786.363411]   [<ffffffff81145ce2>] iput+0x62/0x70
      [20786.363420]   [<ffffffff81142878>] dentry_iput+0x98/0x110
      [20786.363429]   [<ffffffff81142a00>] d_kill+0x50/0x80
      [20786.363438]   [<ffffffff811444c5>] dput+0x95/0x180
      [20786.363447]   [<ffffffff8120de4b>] ecryptfs_d_release+0x2b/0x70
      [20786.363459]   [<ffffffff81142978>] d_free+0x28/0x60
      [20786.363468]   [<ffffffff81142a18>] d_kill+0x68/0x80
      [20786.363477]   [<ffffffff81142ad3>] prune_one_dentry+0xa3/0xc0
      [20786.363487]   [<ffffffff81142d61>] __shrink_dcache_sb+0x271/0x290
      [20786.363497]   [<ffffffff81142e89>] prune_dcache+0x109/0x1b0
      [20786.363506]   [<ffffffff81142f6f>] shrink_dcache_memory+0x3f/0x50
      [20786.363516]   [<ffffffff810f6d3d>] shrink_slab+0x12d/0x190
      [20786.363527]   [<ffffffff810f97d7>] balance_pgdat+0x4d7/0x640
      [20786.363537]   [<ffffffff810f9a57>] kswapd+0x117/0x170
      [20786.363546]   [<ffffffff810773ce>] kthread+0x9e/0xb0
      [20786.363558]   [<ffffffff8101430a>] child_rip+0xa/0x20
      [20786.363569]   [<ffffffffffffffff>] 0xffffffffffffffff
      [20786.363598] irq event stamp: 15997
      [20786.363603] hardirqs last  enabled at (15997): [<ffffffff81125f9d>] kmem_cache_alloc+0xfd/0x1a0
      [20786.363617] hardirqs last disabled at (15996): [<ffffffff81125f01>] kmem_cache_alloc+0x61/0x1a0
      [20786.363628] softirqs last  enabled at (15966): [<ffffffff810631ea>] __do_softirq+0x14a/0x220
      [20786.363641] softirqs last disabled at (15861): [<ffffffff8101440c>] call_softirq+0x1c/0x30
      [20786.363651] 
      [20786.363653] other info that might help us debug this:
      [20786.363660] 3 locks held by http/8397:
      [20786.363665]  #0:  (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<ffffffff8112ed24>] do_truncate+0x64/0x90
      [20786.363685]  #1:  (&sb->s_type->i_alloc_sem_key#5){+++++.}, at: [<ffffffff81147f90>] notify_change+0x250/0x350
      [20786.363707]  #2:  (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
      [20786.363724] 
      [20786.363726] stack backtrace:
      [20786.363734] Pid: 8397, comm: http Tainted: G         C 2.6.31-2-generic #14~rbd4gitd960eea9
      [20786.363741] Call Trace:
      [20786.363752]  [<ffffffff8108ad7c>] print_usage_bug+0x18c/0x1a0
      [20786.363763]  [<ffffffff8108b0c0>] ? check_usage_backwards+0x0/0xb0
      [20786.363773]  [<ffffffff8108bad2>] mark_lock_irq+0xf2/0x280
      [20786.363783]  [<ffffffff8108bd97>] mark_lock+0x137/0x1d0
      [20786.363793]  [<ffffffff8108c03c>] mark_held_locks+0x6c/0xa0
      [20786.363803]  [<ffffffff8108c11f>] lockdep_trace_alloc+0xaf/0xe0
      [20786.363813]  [<ffffffff810efbac>] __alloc_pages_nodemask+0x7c/0x180
      [20786.363824]  [<ffffffff810e9411>] ? find_get_page+0x91/0xf0
      [20786.363835]  [<ffffffff8111d3b7>] alloc_pages_current+0x87/0xd0
      [20786.363845]  [<ffffffff810e9827>] __page_cache_alloc+0x67/0x70
      [20786.363856]  [<ffffffff810eb7df>] find_or_create_page+0x4f/0xb0
      [20786.363867]  [<ffffffff811cb3be>] ext4_block_truncate_page+0x3e/0x460
      [20786.363876]  [<ffffffff812008da>] ? jbd2_journal_start+0xfa/0x150
      [20786.363885]  [<ffffffff812008bb>] ? jbd2_journal_start+0xdb/0x150
      [20786.363895]  [<ffffffff811c6415>] ? ext4_meta_trans_blocks+0x75/0xf0
      [20786.363905]  [<ffffffff811e8d8b>] ext4_ext_truncate+0x1bb/0x1e0
      [20786.363916]  [<ffffffff811072c5>] ? unmap_mapping_range+0x75/0x290
      [20786.363926]  [<ffffffff811ccc28>] ext4_truncate+0x498/0x630
      [20786.363938]  [<ffffffff8129b4ce>] ? _raw_spin_unlock+0x5e/0xb0
      [20786.363947]  [<ffffffff81107306>] ? unmap_mapping_range+0xb6/0x290
      [20786.363957]  [<ffffffff8108c3ad>] ? trace_hardirqs_on+0xd/0x10
      [20786.363966]  [<ffffffff811ffe58>] ? jbd2_journal_stop+0x1f8/0x2e0
      [20786.363976]  [<ffffffff81107690>] vmtruncate+0xb0/0x110
      [20786.363986]  [<ffffffff81147c05>] inode_setattr+0x35/0x170
      [20786.363995]  [<ffffffff811c9906>] ext4_setattr+0x186/0x370
      [20786.364005]  [<ffffffff81147eab>] notify_change+0x16b/0x350
      [20786.364014]  [<ffffffff8112ed30>] do_truncate+0x70/0x90
      [20786.364021]  [<ffffffff8112f48b>] T.657+0xeb/0x110
      [20786.364021]  [<ffffffff8112f4be>] sys_ftruncate+0xe/0x10
      [20786.364021]  [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
      Reported-by: default avatarRoland Dreier <roland@digitalvampire.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f4a01017
  10. 21 Jun, 2009 2 commits
  11. 04 Jul, 2009 9 commits
  12. 03 Jul, 2009 18 commits