- 18 May, 2010 11 commits
-
-
Joel Becker authored
ocfs2_block_group_claim_bits() is never called with min_bits=0, but we shouldn't leave status undefined if it ever is. Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
In normal xattr set, the set sequence is inode, xattr block and finally xattr bucket if we meet with a ENOSPC. But there is a corner case. So consider we will set a xattr whose value will be stored in a cluster, and there is no xattr block by now. So we will reserve 1 xattr block and 1 cluster for setting it. Now if we fail in value extension(in case the volume is almost full and we can't allocate the cluster because the check in ocfs2_test_bg_bit_allocatable), ENOSPC will be returned. So we will try to create a bucket(this time there is a chance that the reserved cluster will be used), and when we try value extension again, kernel bug happens. We did meet with it. Check the bug below. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251 This patch just try to avoid this by adding a set_abort in ocfs2_xattr_set_ctxt, so in case ENOSPC happens in value extension, we will check whether it is caused by the real ENOSPC or just the full of inode or xattr block. If it is the first case, we set set_abort so that we don't try any further. we are safe to exit directly here ince it is really ENOSPC. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Wengang Wang authored
Currently we process a dirty lockres with the lockres->spinlock taken. While during the process, we may need to lock on dlm->ast_lock. This breaks the dependency of dlm->ast_lock(lock first) and lockres->spinlock(lock second). This patch fixes the problem. Since we can't release lockres->spinlock, we have to take dlm->ast_lock just before taking the lockres->spinlock and release it after lockres->spinlock is released. And use __dlm_queue_bast()/__dlm_queue_ast(), the nolock version, in dlm_shuffle_lists(). There are no too many locks on a lockres, so there is no performance harm. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
In ocfs2_prepare_xattr_entry, if we fail to grow an existing value, xa_cleanup_value_truncate() will leave the old entry in place. Thus, we reset its value size. However, if we were allocating a new value, we must not reset the value size or we will BUG(). This resolves oss.oracle.com bug 1247. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
-
Julia Lawall authored
Use kstrdup when the goal of an allocation is copy a string into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to; expression flag,E1,E2; statement S; @@ - to = kmalloc(strlen(from) + 1,flag); + to = kstrdup(from, flag); ... when != \(from = E1 \| to = E1 \) if (to==NULL || ...) S ... when != \(from = E2 \| to = E2 \) - strcpy(to, from); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Julia Lawall authored
Drop cast on the result of kmalloc and similar functions. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; @@ - (T *) (\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\| kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...)) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tristan Ye authored
This patch simplifies the logic of handling existing holes and skipping extent blocks and removes some confusing comments. The patch survived the fill_verify_holes testcase in ocfs2-test. It also passed my manual sanity check and stress tests with enormous extent records. Currently punching a hole on a file with 3+ extent tree depth was really a performance disaster. It can even take several hours, though we may not hit this in real life with such a huge extent number. One simple way to improve the performance is quite straightforward. From the logic of truncate, we can punch the hole from hole_end to hole_start, which reduces the overhead of btree operations in a significant way, such as tree rotation and moving. Following is the testing result when punching hole from 0 to file end in bytes, on a 1G file, 1G file consists of 256k extent records, each record cover 4k data(just one cluster, clustersize is 4k): =========================================================================== * Original punching-hole mechanism: =========================================================================== I waited 1 hour for its completion, unfortunately it's still ongoing. =========================================================================== * Patched punching-hode mechanism: =========================================================================== real 0m2.518s user 0m0.000s sys 0m2.445s That means we've gained up to 1000 times improvement on performance in this case, whee! It's fairly cool. and it looks like that performance gain will be raising when extent records grow. The patch was based on my former 2 patches, which were about truncating codes optimization and fixup to handle CoW on punching hole. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tristan Ye authored
The original idea to pull ocfs2_find_cpos_for_left_leaf() out of alloc.c is to benefit punching-holes optimization patch, it however, can also be referred by other funcs in the future who want to do the same job. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tristan Ye authored
Based on the previous patch of optimizing truncate, the bugfix for refcount trees when punching holes can be fairly easy and straightforward since most of work we should take into account for refcounting have been completed already in ocfs2_remove_btree_range(). This patch performs CoW for refcounted extents when a hole being punched whose start or end offset were in the middle of a cluster, which means partial zeroing of the cluster will be performed soon. The patch has been tested fixing the following bug: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1216Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tristan Ye authored
Truncate is just a special case of punching holes(from new i_size to end), we therefore could take advantage of the existing ocfs2_remove_btree_range() to reduce the comlexity and redundancy in alloc.c. The goal here is to make truncate more generic and straightforward. Several functions only used by ocfs2_commit_truncate() will smiply be removed. ocfs2_remove_btree_range() was originally used by the hole punching code, which didn't take refcount trees into account (definitely a bug). We therefore need to change that func a bit to handle refcount trees. It must take the refcount lock, calculate and reserve blocks for refcount tree changes, and decrease refcounts at the end. We replace ocfs2_lock_allocators() here by adding a new func ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to reserve. This will not hurt any other code using ocfs2_remove_btree_range() (such as dir truncate and hole punching). I merged the following steps into one patch since they may be logically doing one thing, though I know it looks a little bit fat to review. 1). Remove redundant code used by ocfs2_commit_truncate(), since we're moving to ocfs2_remove_btree_range anyway. 2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of accepting some extra blocks to reserve. 3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our needs. It's safe to do this since it's only being called by truncate. 4). Change ocfs2_remove_btree_range() a bit to take refcount case into account. 5). Finally, we change ocfs2_commit_truncate() to call ocfs2_remove_btree_range() in a proper way. The patch has been tested normally for sanity check, stress tests with heavier workload will be expected. Based on this patch, fixing the punching holes bug will be fairly easy. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 10 May, 2010 2 commits
-
-
Joel Becker authored
Once file or link creation gets going, it can't be interrupted by a signal. They're not idempotent. This blocks signals in ocfs2_mknod(), ocfs2_link(), and ocfs2_symlink() once we start actually changing things. ocfs2_mknod() covers mknod(), creat(), mkdir(), and open(O_CREAT). Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Joel Becker authored
ocfs2 sometimes needs to block signals around dlm operations, but it currently does it with sigprocmask(). Even worse, it's checking the error code of sigprocmask(). The in-kernel sigprocmask() can only error if you get the SIG_* argument wrong. We don't. Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These functions are void, but they will BUG() if somehow sigprocmask() returns an error. Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 06 May, 2010 18 commits
-
-
Sunil Mushran authored
Lockres hash size of 16KB is far too small for large filesystems (where we have hundreds of thousands of lock resources stored in the table). This patch increases it to 128KB. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's blocks. But if jbd2_journal_extend() fails, it will only restart with the the new number of blocks. This tends to be awkward since in most cases we want additional reserved blocks. It makes our code harder to mantain since the caller can't be sure all the original blocks will not be accessed and dirtied again. There are 15 callers of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add h_buffer_credits before they call ocfs2_extend_trans(). This makes ocfs2_extend_trans() really extend atop the original block count. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
Two tiny cleanup for allocation reservation. 1. Remove some extra codes in ocfs2_local_alloc_find_clear_bits. 2. Remove an unuseful variables in ocfs2_find_resv_lhs. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
When we allocate some bits from the reservation, we always allocate from the r_start(see ocfs2_resmap_resv_bits). So there should be no reason to check between r_start and start. And I don't think we will change this behaviour later by allocating from some bits after r_start. Why not make ocfs2_adjust_resv_from_alloc simple for now? The only chance we have to adjust the reservation is when we haven't reached the end. With this patch, the function is more readable. Note: btw, this patch also fixes an original bug in the function which I haven't found before. if (end < ocfs2_resv_end(resv)) rhs = end - ocfs2_resv_end(resv); This code is of course buggy. ;) Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Sunil Mushran authored
OCFS2 has never really supported intr. This patch acknowledges this reality and makes nointr the default mount option. In a later patch, we intend to support intr. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Sunil Mushran authored
o2dlm join and leave messages are more than informational as they are required for debugging locking issues. This patch changes them from KERN_INFO to KERN_NOTICE. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Srinivas Eeda authored
This patch logs socket state changes that lead to socket shutdown. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Wengang Wang authored
Print the node number of a peer node if sending it a message failed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Mark Fasheh authored
The default behavior for directory reservations stays the same, but we add a mount option so people can tweak the size of directory reservations according to their workloads. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Mark Fasheh authored
The default reservation size of 4 (32-bit windows) is a bit too ambitious. Scale it back to 16 bits (resv_level=2). I have been testing various sizes on a 4-node cluster which runs a mixed workload that is heavily threaded. With a 256MB local alloc, I get *roughly* the following levels of average file fragmentation: resv_level=0 70% resv_level=1 21% resv_level=2 23% resv_level=3 24% resv_level=4 60% resv_level=5 did not test resv_level=6 60% resv_level=2 seemed like a good compromise between not letting windows be too small, but not so big that heavier workloads will immediately suffer without tuning. This patch also change the behavior of directory reservations - they now track file reservations. The previous compromise of giving directory windows only 8 bits wound up fragmenting more at some window sizes because file allocations had smaller unused windows to poach from. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Mark Fasheh authored
I have observed that the current size of 8M gives us pretty poor fragmentation on multi-threaded workloads which do lots of writes. Generally, I can increase the size of local alloc windows and observe a marked decrease in fragmentation, even up and beyond window sizes of 512 megabytes. This makes sense for a couple reasons - larger local alloc means more room for reservation windows. On multi-node workloads the larger local alloc helps as well because we don't have to do window slides as often. Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no longer used and the comment above it was out of date. To test fragmentation, I used a workload which launched 4 threads that did 4k writes into a series of about 140 alternating files. With resv_level=2, and a 4k/4k file system I observed the following average fragmentation for various localalloc= parameters: localalloc= avg. fragmentation 8 48 32 16 64 10 120 7 On larger cluster sizes, the difference is more dramatic. The new default size top out at 256M, which we'll only get for cluster sizes of 32K and above. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Mark Fasheh authored
This patch pulls the local alloc sizing code into localalloc.c and provides a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged except that I correctly calculate the maximum local alloc size. The old code in ocfs2_parse_options() calculated the max size as: ocfs2_local_alloc_size(sb) * 8 which is correct, in bits. Unfortunately though the option passed in is in megabytes. Ultimately, this bug made no real difference - the shrink code would catch a too-large size and bring it down to something reasonable. Still, it's less than efficient as-is. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Mark Fasheh authored
Inodes are always allocated from the global bitmap now so we don't need this any more. Also, the existing implementation bounces reservations around needlessly. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
-
Mark Fasheh authored
Otherwise, the need for a very large contiguous allocation tends to wreak havoc on many inode allocation reservations on the local alloc, thus ruining any chances for contiguousness. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
-
Mark Fasheh authored
Use the reservations system for unindexed dir tree allocations. We don't bother with the indexed tree as reads from it are mostly random anyway. Directory reservations are marked seperately, to allow the reservations code a chance to optimize their window sizes. This patch allocates only 8 bits for directory windows as they generally are not expected to grow as quickly as file data. Future improvements to dir window sizing can trivially be made. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
-
Mark Fasheh authored
Add a per-inode reservations structure and pass it through to the reservations code. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
-
Mark Fasheh authored
This patch improves Ocfs2 allocation policy by allowing an inode to reserve a portion of the local alloc bitmap for itself. The reserved portion (allocation window) is advisory in that other allocation windows might steal it if the local alloc bitmap becomes full. Otherwise, the reservations are honored and guaranteed to be free. When the local alloc window is moved to a different portion of the bitmap, existing reservations are discarded. Reservation windows are represented internally by a red-black tree. Within that tree, each node represents the reservation window of one inode. An LRU of active reservations is also maintained. When new data is written, we allocate it from the inodes window. When all bits in a window are exhausted, we allocate a new one as close to the previous one as possible. Should we not find free space, an existing reservation is pulled off the LRU and cannibalized. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
-
Joel Becker authored
jbd[2]_journal_dirty_metadata() only returns 0. It's been returning 0 since before the kernel moved to git. There is no point in checking this error. ocfs2_journal_dirty() has been faithfully returning the status since the beginning. All over ocfs2, we have blocks of code checking this can't fail status. In the past few years, we've tried to avoid adding these checks, because they are pointless. But anyone who looks at our code assumes they are needed. Finally, ocfs2_journal_dirty() is made a void function. All error checking is removed from other files. We'll BUG_ON() the status of jbd2_journal_dirty_metadata() just in case they change it someday. They won't. Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 24 Mar, 2010 1 commit
-
-
Mark Fasheh authored
When the local alloc file changes windows, unused bits are freed back to the global bitmap. By defnition, those bits can not be in use by any file. Also, the local alloc will never have been able to allocate those bits if they were part of a previous truncate. Therefore it makes sense that we should clear unused local alloc bits in the undo buffer so that they can be used immediatly. [ Modified to call it ocfs2_release_clusters() -- Joel ] Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 19 Mar, 2010 2 commits
-
-
Tao Ma authored
You can't store a pointer that you haven't filled in yet and expect it to work. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
When replacing a xattr's value, in some case we wipe its name/value first and then re-add it. The wipe is done by ocfs2_xa_block_wipe_namevalue() when the xattr is in the inode or block. We currently adjust name_offset for all the entries which have (offset < name_offset). This does not adjust the entrie we're replacing. Since we are replacing the entry, we don't adjust the total entry count. When we calculate a new namevalue location, we trust the entries now-wrong offset in ocfs2_xa_get_free_start(). The solution is to also adjust the name_offset for the replaced entry, allowing ocfs2_xa_get_free_start() to calculate the new namevalue location correctly. The following script can trigger a kernel panic easily. echo 'y'|mkfs.ocfs2 --fs-features=local,xattr -b 4K $DEVICE mount -t ocfs2 $DEVICE $MNT_DIR FILE=$MNT_DIR/$RANDOM for((i=0;i<76;i++)) do string_76="a$string_76" done string_78="aa$string_76" string_82="aaaa$string_78" touch $FILE setfattr -n 'user.test1234567890' -v $string_76 $FILE setfattr -n 'user.test1234567890' -v $string_78 $FILE setfattr -n 'user.test1234567890' -v $string_82 $FILE Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 18 Mar, 2010 2 commits
-
-
Mark Fasheh authored
What we were doing before was to ask for the current window size as the maximum allocation. This had the effect of limiting the amount of allocation we could get for the local alloc during times when the window size was shrunk due to fragmentation. In some cases, that could actually *increase* fragmentation by artificially limiting the number of bits we can accept. So while we still want to ask for a minimum number of bits equal to window size, there is no reason why we should limit the number of bits the local alloc should accept. Hence always allow the maximum number of local alloc bits. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Tao Ma authored
Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
- 27 Apr, 2010 1 commit
-
-
Tao Ma authored
ac_last_group is used to record the last block group we used during allocation. But the initialization process only calls ocfs2_which_suballoc_group and fails to use suballoc_loc properly. So let us do it. Another function ocfs2_test_suballoc_bit also needs fix. I have searched all the callers of ocfs2_which_suballoc_group, and all the callers notices suballoc_loc now. Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
- 22 Mar, 2010 1 commit
-
-
Tao Ma authored
In case the block we are going to free is allocated from a discontiguous block group, we have to use suballoc_loc to be the right group. Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
- 17 May, 2010 1 commit
-
-
Tao Ma authored
Add ocfs2_gd_is_discontig so that we can test whether a group descriptor is discontiguous or not. Signed-off-by: Tao Ma <tao.ma@oracle.com>
-
- 13 Apr, 2010 1 commit
-
-
Tao Ma authored
ocfs2_group_bitmap_size has to handle the case when the volume don't have discontiguous block group support. So pass the feature_incompat in and check it. Signed-off-by: Tao Ma <tao.ma@oracle.com>
-