- 09 Feb, 2013 2 commits
-
-
Theodore Ts'o authored
The grab_cache_page_write_begin() function can potentially sleep for a long time, since it may need to do memory allocation which can block if the system is under significant memory pressure, and because it may be blocked on page writeback. If it does take a long time to grab the page, it's better that we not hold an active jbd2 handle. So grab a handle on the page first, and _then_ start the transaction handle. This commit fixes the following long transaction handle hold time: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
-
Theodore Ts'o authored
So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 08 Feb, 2013 2 commits
-
-
Theodore Ts'o authored
Move the jbd2 wrapper functions which start and stop handles out of super.c, where they don't really logically belong, and into ext4_jbd2.c. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Theodore Ts'o authored
Handles which stay open a long time are problematic when it comes time to close down a transaction so it can be committed. These tracepoints will help us determine which ones are the problematic ones, and to validate whether changes makes things better or worse. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 07 Feb, 2013 2 commits
-
-
Theodore Ts'o authored
This reverts commit 93737456. The cow-snapshots effort is no longer active, so remove these extra fields to shrink down the handle structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
-
Theodore Ts'o authored
Track the delay between when we first request that the commit begin and when it actually begins, so we can see how much of a gap exists. In theory, this should just be the remaining scheduling quantuum of the thread which requested the commit (assuming it was not a synchronous operation which triggered the commit request) plus scheduling overhead; however, it's possible that real time processes might get in the way of letting the kjournald thread from executing. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 04 Feb, 2013 1 commit
-
-
Theodore Ts'o authored
The ext4 block allocator only maintains buddy bitmaps for chunks which are less than or equal to one quarter of a block group. That is, for a file aystem with a 1k blocksize, and where the number of blocks in a block group is 8192 blocks, the largest chunk size tracked by buddy bitmaps is 2048 blocks. For a file system with a 4k blocksize, and where the number of blocks in a block group is 32768 blocks, the largest chunk size tracked by buddy bitmaps is 8192 blocks. To work around this code, mballoc.c before this commit would truncate allocation requests to the number of blocks in a block group minus 10. Why 10? Aside from being a completely arbitrary number, it avoids block allocation to be a power of two larger than 25% of the block group. If you try to explicitly fallocate 50% of the block group size, this will demonstrate the problem; the block allocation code will scan the all of the blocks in the file system with cr==0 (since the request is for a natural power of two), but then completely fail for all blocks groups, since the buddy bitmaps don't track chunk sizes of 50% of the block group. To fix this, in these we use ext4_mb_complex_scan_group() instead of ext4_mb_simple_scan_group(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca>
-
- 03 Feb, 2013 4 commits
-
-
Theodore Ts'o authored
Check for incompatible mount options when using the ext4 file system driver to mount ext2 or ext3 file systems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
If argument of inode_readahead_blk is too big, we just bail out without printing any error. Fix this since it could confuse users. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
The loop looking for correct mount option entry is more logical if it is written rewritten as an empty loop looking for correct option entry and then code handling the option. It also saves one level of indentation for a lot of code so we can join a couple of split lines. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are currently handled before we enter standard option handling loop. I don't see a reason for this so move them to normal handling loop to make things more regular. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 02 Feb, 2013 4 commits
-
-
Cong Ding authored
It is unnecessary to check i<4 after the loop; just do it before the break. Signed-off-by: Cong Ding <dinggnu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Niu Yawei authored
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when changing the lg_prealloc_list. Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
-
Akria Fujita authored
Commit 2147b1a6 resulted in a new smatch warning: > fs/ext4/move_extent.c:693 mext_replace_branches() > warn: variable dereferenced before check 'dext' (see line 683) Fix this by adding a check to make sure dext is non-NULL before we derefrence it. Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com> [ modified by tytso to make sure an ext4_error is called ] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Julia Lawall authored
Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 30 Jan, 2013 2 commits
-
-
Eric Sandeen authored
Don't send an extra wakeup to kjournald in the case where we already have the proper target in j_commit_request, i.e. that transaction has already been requested for commit. commit deeeaf13 "jbd2: fix fsync() tid wraparound bug" changed the logic leading to a wakeup, but it caused some extra wakeups which were found to lead to a measurable performance regression. Signed-off-by: Eric Sandeen <sandeen@redhat.com> [tytso@mit.edu: reworked check to make it clearer] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: stable@vger.kernel.org Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 29 Jan, 2013 8 commits
-
-
Guo Chao authored
brelse() and ext4_journal_force_commit() are both inlined and able to handle NULL. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Guo Chao authored
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Guo Chao authored
After commit 978fef91 (create __ext4_insert_dentry for dir entry insertion), 'reclen' is not used anymore. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Guo Chao authored
Commit b0336e8d (ext4: calculate and verify checksums of directory leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums for htree nodes) forget to release buffer when checksum failed, at some places. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Lukas Czerner authored
In two places we call WARN_ON() before we print out the debug message, however we agreed that the WARN_ON() is unnecessary at those places so remove them. Also use ext4_warning() instead of ext4_msg() and printk(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Lukas Czerner authored
Remove unused variable flags from dump_completed_IO(). The code is only exercised when EXT4FS_DEBUG is defined. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
-
Jan Kara authored
So far ext4_writepage() skipped writing pages that had any delayed or unwritten buffers attached. When blocksize < pagesize this breaks data=ordered mode guarantees as we can have a page with one freshly allocated buffer whose allocation is part of the committing transaction and another buffer in the page which is delayed or unwritten. So fix this problem by calling ext4_bio_writepage() anyway. It will submit mapped buffers and leave others alone. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
So far ext4_bio_writepage() unconditionally cleared dirty bit on all buffers underlying the page. That implicitely assumes we can write all buffers. So far that is true because callers call into ext4_bio_writepage() make sure all buffers in the page are mapped but: a) it's a data corruption bug waiting to happen b) in data=ordered mode when blocksize < pagesize we do need to write pages that may have only some of dirty buffers mapped. So change ext4_bio_writepage() to skip buffers that cannot be written without clearing their dirty bit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 28 Jan, 2013 8 commits
-
-
Jan Kara authored
The argument b_size of mpage_add_bh_to_extent() was bogus since it was always == blocksize (which we can easily derive from inode->i_blkbits). Also second branch of condition: if (nrblocks >= EXT4_MAX_TRANS_DATA) { } else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) > EXT4_MAX_TRANS_DATA) { } was never taken because (b_size >> mpd->inode->i_blkbits) == 1. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io() doesn't have to deal with the case when page doesn't have buffers. We attach buffers to a page in ->write_begin() and ->page_mkwrite() which covers all places where a page can become dirty. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
The function splices i_completed_io_list to its private list first. From that moment on we don't need any lock for working with io_end structures because all io_end structure on the list are only our own. So we can remove the other two lists in the function and free io_end immediately after we are done with it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
It does not make much sense to have struct work in ext4_io_end_t because we always use it for only one ext4_io_end_t per inode (the first one in the i_completed_io list). So just move the structure to inode itself. This also allows for a small simplification in processing io_end structures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
We don't support delayed allocation in data=journal mode. So checking for it in mpage_da_submit_io() doesn't make really sence. If we ever decide to extend delayed allocation support to data=journal mode, adding __ext4_journalled_writepage() call will be the least of problems we have to solve. Most likely we'd have to implement separate writepages call anyways because we don't have transaction credits for writing more than a single page so mapping of page buffers would have to be done differently. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
When we cannot write a page we should use redirty_page_for_writepage() instead of plain set_page_dirty(). That tells writeback code we have problems, redirties only the page (redirtying buffers is not needed), and updates mm accounting of failed page writes. Also move clearing of buffer dirty flag after io_submit_add_bh(). At that moment we are sure buffer will be going to disk. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Jan Kara authored
Currently we sometimes used block_write_full_page() and sometimes ext4_bio_write_page() for writeback (depending on mount options and call path). Let's always use ext4_bio_write_page() to simplify things a bit. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Zheng Liu authored
This patch add supports for indirect file support punching hole. It is almost the same as ext4_ext_punch_hole. First, we invalidate all pages between this hole, and then we try to deallocate all blocks of this hole. A recursive function is used to handle deallocation of blocks. In this function, it iterates over the entries in inode's i_blocks or indirect blocks, and try to free the block for each one of them. After applying this patch, xfstest #255 will not pass w/o extent because indirect-based file doesn't support unwritten extents. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 25 Jan, 2013 3 commits
-
-
Chen Gang authored
When usrjquota or grpjquota mount options are specified several times, we leak memory storing the names. Free the memory correctly. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
-
Theodore Ts'o authored
Otherwise, ext4 file systems with the quota featured enable will get a very confusing "No such process" error message if the quota code is built as a module and the quota_v2 module has not been loaded. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
-
Theodore Ts'o authored
In addition, print the error returned from ext4_enable_quotas() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Cc: stable@vger.kernel.org
-
- 17 Jan, 2013 1 commit
-
-
Zheng Liu authored
This patch adds a tracepoint in ext4_punch_hole. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 13 Jan, 2013 1 commit
-
-
Theodore Ts'o authored
After we have finished extending the file system, we need to trigger a the lazy inode table thread to zero out the inode tables. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
- 12 Jan, 2013 2 commits
-
-
Eryu Guan authored
Validate the bh pointer before using it, since ext4_read_block_bitmap_nowait() might return NULL. I've seen this in fsfuzz testing. EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0 ... Call Trace: [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60 [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80 [<ffffffff811d0d36>] ? __getblk+0x36/0x70 [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210 [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140 [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0 [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100 [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0 [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230 [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490 [<ffffffff811b8cd7>] evict+0xa7/0x1c0 [<ffffffff811b8ed3>] iput_final+0xe3/0x170 [<ffffffff811b8f9e>] iput+0x3e/0x50 [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90 [<ffffffff81231d0b>] ext4_create+0xeb/0x170 [<ffffffff811aae9c>] vfs_create+0xac/0xd0 [<ffffffff811ac845>] lookup_open+0x185/0x1c0 [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170 [<ffffffff811acb54>] do_last+0x2d4/0x7a0 [<ffffffff811af743>] path_openat+0xb3/0x480 [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0 [<ffffffff811afc49>] do_filp_open+0x49/0xa0 [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150 [<ffffffff8119da28>] do_sys_open+0x108/0x1f0 [<ffffffff8119db51>] sys_open+0x21/0x30 [<ffffffff81618959>] system_call_fastpath+0x16/0x1b Also fix comment for ext4_read_block_bitmap_nowait() Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
-
Wang Shilong authored
Because the function 'sb_getblk' seldomly fails to return NULL value,it will be better to use 'unlikely' to optimize it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-