- 03 Apr, 2014 40 commits
-
Gu Zheng authored
The return value of bio_get_nr_vecs() cannot be bigger than BIO_MAX_PAGES, so we can remove the redundant comparison between nr_pages and BIO_MAX_PAGES. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wengang Wang authored
This patch fixes the following crash:

kernel BUG at fs/ocfs2/uptodate.c:530!
Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F) ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
CPU 5
Pid: 21303, comm: xattr-test Tainted: GF W 3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
Call Trace:
 ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
 ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
 ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
 ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
 ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
 ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
 __ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
 ocfs2_xattr_set+0x691/0x880 [ocfs2]
 ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
 generic_setxattr+0x96/0xa0
 __vfs_setxattr_noperm+0x7b/0x170
 vfs_setxattr+0xbc/0xc0
 setxattr+0xde/0x230
 sys_fsetxattr+0xc6/0xf0
 system_call_fastpath+0x16/0x1b
Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 <0f> 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
RIP ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]

It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():

void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
                                   struct buffer_head *bh)
{
        /* This should definitely *not* exist in our cache */
        if (ocfs2_buffer_cached(ci, bh))
                printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
        BUG_ON(ocfs2_buffer_cached(ci, bh));

        set_buffer_uptodate(bh);

        ocfs2_metadata_cache_io_lock(ci);
        ocfs2_set_buffer_uptodate(ci, bh);
        ocfs2_metadata_cache_io_unlock(ci);
}

The problem here is: we cached a block, but the buffer_head got reused. When we pick up this block again, a new buffer_head is created with the UPTODATE flag cleared. ocfs2_buffer_uptodate() returns false since UPTODATE is not set on the buffer_head, so we add this block to the cache as a NEW block, and the assertion that the block is not in the cache fails.

The fix is to add a new parameter to ocfs2_init_xattr_bucket() indicating whether the bucket is newly allocated. ocfs2_init_xattr_bucket() asserts that the block is not cached only in that case.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joe Jin <joe.jin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
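The idea of asserting "not cached" only for newly allocated buckets can be sketched in user-space C. This is a minimal model under stated assumptions, not the ocfs2 code: `struct fake_cache`, `cache_contains()`, `cache_add()` and `init_bucket_block()` are hypothetical names, and `assert()` stands in for the kernel's BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_SLOTS 8

/* Hypothetical metadata cache: a small set of cached block numbers. */
struct fake_cache {
	unsigned long blocks[CACHE_SLOTS];
	int count;
};

static bool cache_contains(const struct fake_cache *c, unsigned long blkno)
{
	for (int i = 0; i < c->count; i++)
		if (c->blocks[i] == blkno)
			return true;
	return false;
}

/* Add a block to the cache; adding an already cached block is a no-op. */
static void cache_add(struct fake_cache *c, unsigned long blkno)
{
	if (cache_contains(c, blkno))
		return;
	assert(c->count < CACHE_SLOTS);
	c->blocks[c->count++] = blkno;
}

/*
 * 'is_new' tells us whether the caller has just allocated the block.
 * Only then is "must not be cached yet" a valid assumption; a block
 * that is being picked up again may legitimately be in the cache.
 */
static void init_bucket_block(struct fake_cache *c, unsigned long blkno,
			      bool is_new)
{
	if (is_new)
		assert(!cache_contains(c, blkno));	/* models the BUG_ON() */
	cache_add(c, blkno);
}
```

Calling `init_bucket_block()` twice on the same block, first with `is_new` true and then false, leaves a single cache entry and does not trip the assertion.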
-
jiangyiwen authored
The following case may leave the reference count of the same system inode inconsistent.

A thread                            B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
                                    because of *arr == NULL,
                                    ocfs2_get_system_file_inode
                                    ->get_local_system_inode
                                    ->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
                                    at the moment, B thread also gets
                                    two refs, so lead to one more
                                    inode ref.

So add a mutex to prevent multiple threads from taking the two inode refs at the same time.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
jiangyiwen authored
In ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit(), after ocfs2_get_system_file_inode() has taken an inode reference, if ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() fails, we should iput() the inode_alloc inode to avoid leaking the reference. Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
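The leak and its fix follow the usual "single exit label" pattern, which can be sketched in user-space C. This is only an illustration: `struct fake_inode`, `fake_iget()`, `fake_iput()`, `scan_alloc()` and `handle_freeinode()` are hypothetical stand-ins, not the ocfs2 functions.

```c
/* Hypothetical stand-in for an inode with a reference count. */
struct fake_inode {
	int refcount;
};

/* Take a reference (plays the role of ocfs2_get_system_file_inode()). */
static struct fake_inode *fake_iget(struct fake_inode *inode)
{
	inode->refcount++;
	return inode;
}

/* Drop a reference (plays the role of iput()). */
static void fake_iput(struct fake_inode *inode)
{
	inode->refcount--;
}

/* A step that may fail after the reference has been taken. */
static int scan_alloc(int should_fail)
{
	return should_fail ? -1 : 0;
}

/*
 * Every exit path after fake_iget() goes through the "out" label, so
 * the reference is dropped on failure as well as on success.
 */
static int handle_freeinode(struct fake_inode *alloc, int should_fail)
{
	struct fake_inode *inode = fake_iget(alloc);
	int status;

	status = scan_alloc(should_fail);
	if (status < 0)
		goto out;	/* a leaky version would return here directly */

	/* ... use the inode ... */
out:
	fake_iput(inode);
	return status;
}
```

After either a failing or a succeeding call, the reference count is back at its starting value.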
-
Tariq Saeed authored
Orabug: 17330860 When accepting an incoming connection, o2net_accept_one clones a child data socket from the parent listening socket. It then proceeds to set up the child with callback o2net_data_ready() and sk_user_data set to NULL. If data arrives in this window, o2net_listen_data_ready will be called with some non-deterministic value in sk_user_data (not inherited). We panic when we page fault on sk_user_data -- in parent it is sock_def_readable(). The fix is to recognize that this is a data socket being set up by looking at the socket state and do nothing. Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com> Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Younger Liu authored
After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(), if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that some space may be lost. So, roll back alloc_dinode counts when ocfs2_block_group_set_bits() failed. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Younger Liu <younger.liucn@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
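The "update counts, then roll back if the second step fails" idea can be sketched as follows. This is a simplified user-space model: `struct alloc_state` and the helper names are hypothetical and do not mirror the real ocfs2 structures or signatures.

```c
/* Hypothetical, simplified allocator bookkeeping. */
struct alloc_state {
	unsigned int used;	/* bits accounted as allocated */
	unsigned int total;	/* bits available overall */
};

/* Step 1: account for num_bits before touching the bitmap. */
static void update_counts(struct alloc_state *st, unsigned int num_bits)
{
	st->used += num_bits;
}

/* Undo step 1 when a later step fails. */
static void rollback_counts(struct alloc_state *st, unsigned int num_bits)
{
	st->used -= num_bits;
}

/* Step 2: models the bitmap update, which may fail. */
static int set_bits(int should_fail)
{
	return should_fail ? -5 : 0;	/* -5 is an arbitrary error code here */
}

/*
 * Without the rollback, a failure in step 2 would leave 'used' too
 * high, and the corresponding space would look allocated forever.
 */
static int claim_bits(struct alloc_state *st, unsigned int num_bits,
		      int should_fail)
{
	int status;

	update_counts(st, num_bits);
	status = set_bits(should_fail);
	if (status < 0)
		rollback_counts(st, num_bits);
	return status;
}
```

A failed `claim_bits()` leaves `used` unchanged; a successful one increases it by `num_bits`.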
-
Wengang Wang authored
ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node lock and then calls flock_lock_file_wait() to compete with local processes. If flock_lock_file_wait() fails, say with -ENOMEM, the cleanup work is not done. This patch adds the cleanup: drop the cross-node lock which was just granted. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Darrick J. Wong authored
Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch an inode in a given transaction. This is a follow-on to the previous patch to reduce lock contention and deadlocking during an fsync operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Wengang <wen.gang.wang@oracle.com> Cc: Greg Marsden <greg.marsden@oracle.com> Cc: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
Commit 9548906b ('xattr: Constify ->name member of "struct xattr"') missed that ocfs2 is calling kfree(xattr->name). As a result, a kernel panic occurs upon calling kfree(xattr->name) because xattr->name refers to static constant names. This patch removes kfree(xattr->name) from ocfs2_mknod() and ocfs2_symlink(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Tariq Saeed <tariq.x.saeed@oracle.com> Tested-by: Tariq Saeed <tariq.x.saeed@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
alex chen authored
Do not put bh when buffer_uptodate failed in ocfs2_write_block and ocfs2_write_super_or_backup, because it will put bh in b_end_io. Otherwise it will hit a warning "VFS: brelse: Trying to free free buffer". Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Xue jiufei authored
When ocfs2_create_new_inode_locks() returns an error, the inode open lock may not have been obtained for this inode. So other nodes can remove this file and free the dinode while the inode still remains in memory on this node, which is not correct and may trigger a BUG. So __ocfs2_mknod_locked should return an error when ocfs2_create_new_inode_locks() fails.

Node_1                                      Node_2
create fileA, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeA
  -> ocfs2_claim_new_inode(), claim dinode(dinodeA)
  -> call ocfs2_create_new_inode_locks(),
     create open lock failed, return error
  -> __ocfs2_mknod_locked return success

                                            unlink fileA
                                            try open lock succeed,
                                            and free dinodeA

create another file, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeB
  -> ocfs2_claim_new_inode(), as Node_2 had freed dinodeA,
     so claim dinodeA and update generation for dinodeA

call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
to free inodeA, and finally triggers BUG
on(inode->i_generation != le32_to_cpu(fe->i_generation))
in function ocfs2_inode_lock_update().

Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tariq Saeed authored
Orabug: 18108070 ocfs2_xattr_extend_allocation() hits panic when creating xattr during data extent alloc phase. The problem occurs if due to local alloc fragmentation, clusters are spread over multiple extents. In this case ocfs2_add_clusters_in_btree() finds no space to store more than one extent record and therefore fails returning RESTART_META. The situation is anticipated for xattr update case but not xattr create case. This fix simply ports that code to create case. Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhonghua Guo authored
In dlm_query_region_handler(), if kmalloc() fails, the error path unlocks dlm_domain_lock without having locked it first, and a deadlock follows. Signed-off-by: Zhonghua Guo <guozhonghua@h3c.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Tested-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
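One way to keep lock and unlock balanced is to let the allocation failure return before the lock is ever taken. The sketch below models that in user-space C and is not the actual patch: `struct fake_lock`, `lock_acquire()`, `lock_release()` and `query_region()` are hypothetical, and the `fail_alloc` parameter exists only to simulate an allocation failure.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical lock model: 'depth' tracks the lock/unlock balance. */
struct fake_lock {
	int depth;
};

static void lock_acquire(struct fake_lock *l)
{
	l->depth++;
}

static void lock_release(struct fake_lock *l)
{
	l->depth--;
}

/*
 * The allocation happens before the lock is taken, so its failure
 * path must not release the lock. Jumping to a shared exit label that
 * unlocks would release a lock that was never acquired.
 */
static int query_region(struct fake_lock *domain_lock, size_t len,
			int fail_alloc)
{
	char *buf = fail_alloc ? NULL : malloc(len);

	if (!buf)
		return -ENOMEM;	/* lock not held yet: plain return */

	lock_acquire(domain_lock);
	buf[0] = 0;		/* ... work done under the lock ... */
	lock_release(domain_lock);

	free(buf);
	return 0;
}
```

In both the failing and the succeeding case the lock depth ends at zero, which is the invariant the unbalanced error path violated.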
-
Jensen authored
llseek requires the ocfs2 inode lock in order to refresh the file size for SEEK_END, because the file size may be updated on another node.

This bug can be reproduced with the following scenario. At first, we dd a test fileA; the file size is 10k.

on NodeA:
---------
1) open the test fileA, lseek to the end of the file, and print the position.
2) close the test fileA

on NodeB:
1) open the test fileA, append 5k of data to the test fileA.
2) lseek to the end of the file, and print the position.
3) close the file.

At first we run test program1 on NodeA; the result is 10k. Then we run test program2 on NodeB; the result is 15k. At last, we run test program1 on NodeA again; the result is 10k.

After applying this patch, the result of the third step is 15k.

test result: 1000000 lseek calls;

index    lseek with inode lock (unit: us)    lseek without inode lock (unit: us)
1        1168162                             555383
2        1168011                             549504
3        1170538                             549396
4        1170375                             551685
5        1170444                             556719
6        1174364                             555307
7        1163294                             551552
8        1170080                             549350
9        1162464                             553700
10       1165441                             552594
avg      1168317                             552519

avg with lock - avg without lock = 615798
(avg with lock - avg without lock) / 1000000 = 0.615798 us

Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
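The averages and the per-call overhead quoted above can be rechecked with a few lines of C. The timing values are copied from the commit message; the function names and the use of truncating integer averages are assumptions of this sketch.

```c
#include <stddef.h>

#define NRUNS 10

/* Microseconds per 1,000,000 lseek() calls, from the commit message. */
static const long with_lock[NRUNS] = {
	1168162, 1168011, 1170538, 1170375, 1170444,
	1174364, 1163294, 1170080, 1162464, 1165441,
};

static const long without_lock[NRUNS] = {
	555383, 549504, 549396, 551685, 556719,
	555307, 551552, 549350, 553700, 552594,
};

/* Truncating integer average of n samples. */
static long average(const long *v, size_t n)
{
	long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (long)n;
}

/* Extra cost of taking the inode lock, in microseconds per call. */
static double extra_us_per_call(void)
{
	long diff = average(with_lock, NRUNS) - average(without_lock, NRUNS);

	return (double)diff / 1000000.0;
}
```

With these inputs the two averages come out as 1168317 and 552519 microseconds, so the lock adds roughly 0.6 microseconds per lseek() call.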
-
Joseph Qi authored
In o2nm_cluster, cl_idle_timeout_ms, cl_keepalive_delay_ms, as well as cl_reconnect_delay_ms, are defined as type of unsigned int. So we should also use unsigned int in the helper functions. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Goldwyn Rodrigues authored
The following patches are reverted in this patch because they caused a performance regression in remote unlink() calls:

ea455f8a - ocfs2: Push out dropping of dentry lock to ocfs2_wq
f7b1aa69 - ocfs2: Fix deadlock on umount
5fd13189 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

Previous patches in this series removed the possible deadlocks from the downconvert thread, so the above patches shouldn't be needed anymore.

The regression is caused because these patches delay the iput() in case of dentry unlocks. This also delays the unlocking of the open lockres. The open lockresource is required to test if the inode can be wiped from disk or not. When the deleting node does not get the open lock, it marks it as orphan (even though it is not in use by another node/process) and causes a journal checkpoint. This delays operations following the inode eviction. This also moves the inode to the orphaned inode, which further causes more I/O and a lot of unnecessary orphans.

The following script can be used to generate the load causing issues:

declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do
  mkdir -p test/dir\${n}
  eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"

function fcreate ()
{
  exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}

function fremove ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}

function fcp ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}

echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++)) do
  create[$x]="`fcreate ${iterations[$x]}`"
  copy[$x]="`fcp ${iterations[$x]}`"
  remove[$x]="`fremove`"
  printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
If we are dropping last inode reference from downconvert thread, we will end up calling ocfs2_mark_lockres_freeing() which can block if the lock we are freeing is queued thus creating an A-A deadlock. Luckily, since we are the downconvert thread, we can immediately dequeue the lock and thus avoid waiting in this case. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
We cannot drop last dquot reference from downconvert thread as that creates the following deadlock:

NODE 1                                  NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                        dquot_initialize(bar)
                                          ocfs2_dquot_acquire()
                                            ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                            ...
downconvert thread (triggered from another
node or a different process from NODE2)
  ocfs2_dentry_post_unlock()
    ...
    iput(foo)
      ocfs2_evict_inode(foo)
        ocfs2_clear_inode(foo)
          dquot_drop(inode)
            ...
            ocfs2_dquot_release()
              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
               - blocks
                                            finds we need more space in
                                            quota file
                                            ...
                                            ocfs2_extend_no_holes()
                                              ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                - deadlocks waiting for
                                                  downconvert thread

We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread.

Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Provide dqgrab() function to get quota structure reference when we are sure it already has at least one active reference. Make use of this function inside quota code. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
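The contract of such a "grab" helper, taking an extra reference only when the caller already holds one, can be sketched in user-space C. This is a single-threaded model with hypothetical names (`struct fake_dquot`, `fake_dqgrab()`, `fake_dqput()`); the kernel uses atomic counters and warnings rather than a plain int and `assert()`.

```c
#include <assert.h>

/* Hypothetical quota structure with a plain (non-atomic) refcount. */
struct fake_dquot {
	int count;
};

/*
 * Take an additional reference. The caller must already hold an
 * active reference, so a zero count here indicates a bug: the
 * structure could be freed or reused while we are looking at it.
 */
static struct fake_dquot *fake_dqgrab(struct fake_dquot *dq)
{
	assert(dq->count > 0);
	dq->count++;
	return dq;
}

/* Drop one reference. */
static void fake_dqput(struct fake_dquot *dq)
{
	assert(dq->count > 0);
	dq->count--;
}
```

Because the precondition is checked inside the helper, callers cannot silently resurrect a structure whose last reference has already been dropped.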
-
Jan Kara authored
Move dquot_initialize() call in ocfs2_delete_inode() after the moment we verify inode is actually a sane one to delete. We certainly don't want to initialize quota for system inodes etc. This also avoids calling into quota code from downconvert thread. Add more details into the comment why bailing out from ocfs2_delete_inode() when we are in downconvert thread is OK. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
The flag was never set, delete it. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Goldwyn Rodrigues authored
This is a part of the nocontrold feature which was incorporated sometime back. This is required for backward compatibility of the tools, specifically the scenario where the tools with recovery callback is used with a kernel not using the recovery callbacks (older kernel + newer tools). The tools look for this file to understand if the kernel supports DLM recovery callbacks. For kernels which support recovery callbacks but will miss this patch, ocfs2 will continue to use the older API and would still be able to mount the filesystem. [akpm@linux-foundation.org: simplify] [sfr@canb.auug.org.au: VERIFY_OCTAL_PERMISSIONS fix up] Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Junxiao Bi authored
There is a race window in dlm_do_recovery() between dlm_remaster_locks() and dlm_reset_recovery() when the recovery master has nearly finished the recovery process for a dead node. After the master sends the FINALIZE_RECO message in dlm_remaster_locks(), another node may become the recovery master for another dead node and then send the BEGIN_RECO message to all the nodes, including the old master. In the handler of this message, dlm_begin_reco_handler(), on the old master, dlm->reco.dead_node and dlm->reco.new_master will be set to the second dead node and the new master; then in dlm_reset_recovery(), these two variables will be reset to their default values. As a result the new recovery master cannot finish the recovery process and hangs, and eventually the whole cluster hangs waiting for recovery.

old recovery master:                              new recovery master:
dlm_remaster_locks()
                                                  become recovery master for
                                                  another dead node.
                                                  dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
    if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
        return -EAGAIN;
    }
    dlm_set_reco_master(dlm, br->node_idx);
    dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
    dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
    dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
                                                  will hang in dlm_remaster_locks() for
                                                  request dlm locks info

Before sending the FINALIZE_RECO message, the recovery master should set DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery is done. This closes the race window, as BEGIN_RECO messages will not be handled before the DLM_RECO_STATE_FINALIZE flag is cleared.

A similar race may happen between the new recovery master and a normal node which is in dlm_finalize_reco_handler(); fix that as well.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
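The effect of holding the FINALIZE flag across the reset can be modelled with a small state machine. This is a user-space sketch under stated assumptions: `struct reco`, `RECO_STATE_FINALIZE`, `INVALID_NODE` and all function names are hypothetical simplifications, with no locking or messaging.

```c
#include <errno.h>

#define RECO_STATE_FINALIZE	0x2	/* hypothetical flag bit */
#define INVALID_NODE		255	/* hypothetical "no node" value */

/* Simplified recovery bookkeeping of one node. */
struct reco {
	unsigned int state;
	int new_master;
	int dead_node;
};

/* Handler for an incoming BEGIN_RECO message. */
static int begin_reco_handler(struct reco *r, int master, int dead)
{
	if (r->state & RECO_STATE_FINALIZE)
		return -EAGAIN;		/* sender has to retry later */
	r->new_master = master;
	r->dead_node = dead;
	return 0;
}

/* Old master: set the flag before sending FINALIZE_RECO. */
static void start_finalize(struct reco *r)
{
	r->state |= RECO_STATE_FINALIZE;
}

/* Old master: reset its own recovery state. */
static void reset_recovery(struct reco *r)
{
	r->dead_node = INVALID_NODE;
	r->new_master = INVALID_NODE;
}

/* Old master: recovery is done, BEGIN_RECO may be handled again. */
static void end_finalize(struct reco *r)
{
	r->state &= ~RECO_STATE_FINALIZE;
}
```

A BEGIN_RECO that arrives between `start_finalize()` and `end_finalize()` is rejected with -EAGAIN, so `reset_recovery()` can no longer wipe values written on behalf of the new master; the retried message is accepted once the flag is cleared.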
-
Junxiao Bi authored
This issue was introduced by commit 800deef3 ("ocfs2: use list_for_each_entry where benefical") in 2007, which replaced list_for_each with list_for_each_entry. The variable "lock" will point to invalid data if the "tmpq" list is empty, and a panic will be triggered because of this.

Sunil advised reverting it, but the old version was also not right. At the end of the outer for loop, that list_for_each_entry will also set "lock" to invalid data; then in the next iteration, if the "tmpq" list is empty, "lock" will hold stale invalid data and cause the panic. So revert to list_for_each and reset "lock" to NULL to fix this issue.

Another concern is that this seems impossible, because the "tmpq" list should not be empty. Let me describe how it can happen.

old lock resource owner(node 1):                                  migration target(node 2):
imagine there's a lockres with an EX lock from node 2 in
the granted list, and an NR lock from node x with convert_type
EX in the converting list.
dlm_empty_lockres() {
 dlm_pick_migration_target() {
   pick node 2 as target as its lock is the first one
   in granted list.
 }
 dlm_migrate_lockres() {
   dlm_mark_lockres_migrating() {
     res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
     wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
     //after the above code, we can not dirty lockres any more,
     // so dlm_thread shuffle list will not run
                                                                   downconvert lock from EX to NR
                                                                   upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.

     // will set lockres RES_MIGRATING flag, the following
     // lock/unlock can not run
     dlm_lockres_release_ast(dlm, res);
   }

   dlm_send_one_lockres()
                                                                 dlm_process_recovery_data()
                                                                   for (i=0; i<mres->num_locks; i++)
                                                                     if (ml->node == dlm->node_num)
                                                                       for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                        list_for_each_entry(lock, tmpq, list)
                                                                        if (lock) break; <<< lock is invalid as grant list is empty.
                                                                       }
                                                                       if (lock->ml.node != ml->node)
                                                                         BUG() >>> crash here
 }

I saw the above lock status in a vmcore from an internal bug.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
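The stale-cursor problem and the "iterate over list heads, reset the cursor to NULL" remedy can be reproduced with a minimal user-space list. This is a sketch, not the o2dlm code: `struct fake_lock`, `find_lock()` and the tiny list helpers are hypothetical and only imitate the shape of the kernel's `list_head` API.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical lock record that lives on one of several queues. */
struct fake_lock {
	int node;
	unsigned long cookie;
	struct list_head list;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Search every queue for a lock with the given cookie. The cursor
 * 'lock' is only ever non-NULL when it points at a real match: it is
 * derived from a list node that is not the head, and it is reset to
 * NULL after each non-matching entry. An empty queue therefore cannot
 * leave a bogus or stale pointer behind.
 */
static struct fake_lock *find_lock(struct list_head *queues, int nqueues,
				   unsigned long cookie)
{
	struct fake_lock *lock = NULL;
	struct list_head *iter;
	int j;

	for (j = 0; j < nqueues; j++) {
		for (iter = queues[j].next; iter != &queues[j];
		     iter = iter->next) {
			lock = container_of(iter, struct fake_lock, list);
			if (lock->cookie == cookie)
				break;
			lock = NULL;	/* no match: drop the cursor */
		}
		if (lock)
			break;
	}
	return lock;
}
```

With an entry-based iterator, the cursor after an empty or exhausted list is computed from the list head itself, so it is non-NULL but does not point at a real entry; that is why a plain `if (lock)` test is not a safe "found" check there.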
-
Darrick J. Wong authored
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal transaction to complete. This isn't terribly efficient, since sync_file really only needs to wait for the last transaction involving that inode to complete, and this doesn't require i_mutex. Therefore, implement the necessary bits to track the newest tid associated with an inode, and teach sync_file to wait for that instead of waiting for everything in the journal to commit. Furthermore, only issue the flush request to the drive if jbd2 hasn't already done so. This also eliminates the deadlock between ocfs2_file_aio_write() and ocfs2_sync_file(). aio_write takes i_mutex then calls ocfs2_aiodio_wait() to wait for unaligned dio writes to finish. However, if that dio completion involves calling fsync, then we can get into trouble when some ocfs2_sync_file tries to take i_mutex. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
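The idea of waiting only for the newest transaction that touched an inode can be sketched as follows. This is a user-space model with hypothetical names (`struct fake_journal`, `struct fake_inode`, `touch_inode()`, `sync_file()`); only the wraparound-safe comparison is modelled on jbd2's `tid_gt()` helper, and "waiting for a commit" is reduced to advancing a counter.

```c
typedef unsigned int tid_t;

/* Wraparound-safe "a is newer than b", in the style of jbd2's tid_gt(). */
static int tid_gt(tid_t a, tid_t b)
{
	int difference = (int)(a - b);

	return difference > 0;
}

/* Hypothetical journal: remembers the newest committed transaction. */
struct fake_journal {
	tid_t committed;
};

/* Hypothetical inode: remembers the newest transaction that touched it. */
struct fake_inode {
	tid_t sync_tid;
};

/* Record that the running transaction 'tid' modified this inode. */
static void touch_inode(struct fake_inode *inode, tid_t tid)
{
	inode->sync_tid = tid;
}

/*
 * Returns 1 if a commit had to be waited for, 0 if the inode's last
 * transaction was already committed and nothing needed to be done.
 */
static int sync_file(struct fake_journal *journal,
		     const struct fake_inode *inode)
{
	if (!tid_gt(inode->sync_tid, journal->committed))
		return 0;
	journal->committed = inode->sync_tid;	/* models waiting for commit */
	return 1;
}
```

A second `sync_file()` on an untouched inode returns immediately, which is the saving compared with forcing the whole journal to commit on every fsync.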
-
joyce.xue authored
Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it up. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wengang Wang authored
There is a problem that waitqueue_active() may check stale data and thus miss a wakeup of threads waiting on ip_unaligned_aio. The only valid values of ip_unaligned_aio are 0 and 1, so we can change it to a mutex, which avoids the above problem. Another benefit is that a mutex, which works as FIFO, is fairer than wake_up_all(). Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zongxun Wang authored
When mounting an ocfs2 volume, it first creates the file /sys/kernel/debug/o2dlm/<uuid>/dlm_state and then launches the dlm thread. So the following sequence will cause a null pointer dereference:

dlm_debug_init
-> access file dlm_state which will call dlm_state_print
-> dlm_launch_thread

Moving dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread fixes this issue.

Signed-off-by: Zongxun Wang <wangzongxun@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Geert Uytterhoeven authored
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Geert Uytterhoeven authored
Switch the RSPI MSTP clock on SH7757 from a con ID match to a dev ID match, so we can start looking it up using clk_get() with a NULL ID. Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org> Tested-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kuninori Morimoto authored
sh7757lcr SDHI register size is 0x100 Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Cc: Simon Horman <horms@verge.net.au> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Bobby Bingham authored
The compiler is permitted to generate code which overwrites the parameters to a function. If those parameters include the only saved copy we have of userspace's registers, we're in trouble. Signed-off-by: Bobby Bingham <koorogi@koorogi.info> Cc: Paul Mundt <paul.mundt@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Bobby Bingham authored
This does not appear to have been used since commit 74d99a5e ("sh: SH-2A FPU support") in 2007. Signed-off-by: Bobby Bingham <koorogi@koorogi.info> Cc: Paul Mundt <paul.mundt@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Bobby Bingham authored
When invoking syscall handlers on sh32, the saved userspace registers are at the top of the stack. This seems to have been intentional, as it is an easy way to pass r0, r1, ... to the handler as parameters 5, 6, ...

It causes problems, however, because the compiler is allowed to generate code for a function which clobbers that function's own parameters. For example, gcc generates the following code for clone:

<SyS_clone>:
        mov.l   8c020714 <SyS_clone+0xc>,r1  ! 8c020540 <do_fork>
        mov.l   r7,@r15
        mov     r6,r7
        jmp     @r1
        mov     #0,r6
        nop
        .word 0x0540
        .word 0x8c02

The `mov.l r7,@r15` clobbers the saved value of r0 passed from userspace. For most system calls, this might not be a problem, because we'll be overwriting r0 with the return value anyway. But in the case of clone, copy_thread will need the original value of r0 if the CLONE_SETTLS flag was specified.

The first patch in this series fixes this issue for system calls by pushing to the stack an extra copy of r0-r2 before invoking the handler. We discard this copy before restoring the userspace registers, so it is not a problem if they are clobbered.

Exception handlers also receive the userspace register values in a similar manner, and may hit the same problem. The second patch removes the do_fpu_error handler, which looks susceptible to this problem and which, as far as I can tell, has not been used in some time. The third patch addresses other exception handlers.

This patch (of 3):

The userspace registers are stored at the top of the stack when the syscall handler is invoked, which allows r0-r2 to act as parameters 5-7. Parameters passed on the stack may be clobbered by the syscall handler. The solution is to push an extra copy of the registers which might be used as syscall parameters to the stack, so that the authoritative set of saved register values does not get clobbered.

A few system call handlers are also updated to get the userspace registers using current_pt_regs() instead of from the stack.
Signed-off-by: Bobby Bingham <koorogi@koorogi.info> Cc: Paul Mundt <paul.mundt@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michael Opdenacker authored
This removes the CPU_SCORE7 Kconfig parameter, which is no longer used anywhere in the source code and Makefiles. Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Lennox Wu <lennox.wu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Beulich authored
Recent increased use of typeof() throughout the tree resulted in a number of symbols (25 in a typical distro config of ours) not getting a proper CRC calculated for them anymore, due to the parser in genksyms not coping with several of these uses (interestingly in the majority of [if not all] cases the problem is due to the use of typeof() in code preceding a certain export, not in the declaration/definition of the exported function/object itself; I wasn't able to find a way to address this more general parser shortcoming). The use of parameter_declaration is a little more relaxed than would be ideal (permitting not just a bare type specification, but also one with identifier), but since the same code is being passed through an actual compiler, there's no apparent risk of allowing through any broken code. On the other hand, using parameter_declaration instead of the ad hoc "decl_specifier_seq '*'" / "decl_specifier_seq" pair allows all types to be handled rather than just plain ones and pointers to plain ones. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Move the code that moves the event structure to access_list from copy_event_to_user() to fanotify_read(), where it is more logical (so that we can immediately see in the main loop that we either move the event to a different list or free it). Also move the special error handling for permission events from copy_event_to_user() to the main loop, so that it sits in one place with the error handling for normal events. This makes copy_event_to_user() really only copy the event to user without any side effects. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Swap the error / "read ok" branches in the main loop of fanotify_read(). We will grow the "read ok" part in the next patch and this makes the indentation easier. Also it is more common to have error conditions inside an 'if' instead of the fast path. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
access_mutex is used only to guard operations on access_list. There's no need for sleeping within this lock so just make a spinlock out of it. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
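As an illustration of why a spinlock is enough here, the following userspace sketch guards a short list with a busy-wait lock built on C11 atomic_flag. It is not the fanotify code: struct spinlock_sketch, struct group_sketch, the field name access_lock and all helper functions are assumptions made for the example, and the kernel itself would use spinlock_t with spin_lock()/spin_unlock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical minimal spinlock; callers must never sleep while holding it. */
struct spinlock_sketch {
    atomic_flag locked;
};

void spin_lock_sketch(struct spinlock_sketch *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void spin_unlock_sketch(struct spinlock_sketch *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical event and group types; only the list linkage matters here. */
struct event_sketch {
    struct event_sketch *next;
    int id;
};

struct group_sketch {
    struct spinlock_sketch access_lock; /* guards access_list only */
    struct event_sketch *access_list;
};

void group_init(struct group_sketch *g)
{
    atomic_flag_clear(&g->access_lock.locked);
    g->access_list = NULL;
}

/* Short critical section with no blocking calls inside. */
void access_list_add(struct group_sketch *g, struct event_sketch *ev)
{
    spin_lock_sketch(&g->access_lock);
    ev->next = g->access_list;
    g->access_list = ev;
    spin_unlock_sketch(&g->access_lock);
}

/* Unlink and return the event with the given id, or NULL if it is absent. */
struct event_sketch *access_list_remove(struct group_sketch *g, int id)
{
    struct event_sketch **pp, *found = NULL;

    spin_lock_sketch(&g->access_lock);
    for (pp = &g->access_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            found = *pp;
            *pp = found->next;
            found->next = NULL;
            break;
        }
    }
    spin_unlock_sketch(&g->access_lock);
    return found;
}
```

Every critical section is a handful of pointer updates, which is the property the commit message relies on: nothing inside the lock needs to sleep.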
-
Jan Kara authored
Currently, fanotify creates a new structure to track the fact that a permission event has been reported to userspace and someone is waiting for a response to it. As event structures are now completely in the hands of each notification framework, we can use the event structure for this tracking instead of allocating a new structure. Since this makes the event structures for normal events and permission events even more different and the structures have different lifetime rules, we split them into two separate structures (where the permission event structure contains the structure for a normal event). This makes normal events 8 bytes smaller and the code a bit cleaner. [akpm@linux-foundation.org: fix build] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
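The split described above, with the permission event structure containing the normal one, is the usual C embedding pattern. A minimal sketch follows; the struct names, the fields, the SKETCH_PERM_BIT flag and the container_of_sketch() macro are all assumptions for illustration and do not reproduce the actual fanotify definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Local equivalent of the kernel's container_of(), defined here for the sketch. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define SKETCH_PERM_BIT 0x1u /* hypothetical "this is a permission event" flag */

/* Normal event: carries no response-tracking state. */
struct event_info_sketch {
    unsigned int mask;
};

/* Permission event: embeds the normal event and adds the pending response. */
struct perm_event_info_sketch {
    struct event_info_sketch fae;
    int response;
};

int is_perm_event(const struct event_info_sketch *ev)
{
    return (ev->mask & SKETCH_PERM_BIT) != 0;
}

/* Valid only when is_perm_event(ev) is true, i.e. ev really is embedded. */
struct perm_event_info_sketch *to_perm_event(struct event_info_sketch *ev)
{
    return container_of_sketch(ev, struct perm_event_info_sketch, fae);
}

/* Usage: generic code holds the embedded pointer and recovers the outer struct. */
void perm_event_demo(void)
{
    struct perm_event_info_sketch p = { { SKETCH_PERM_BIT }, 7 };
    struct event_info_sketch *generic = &p.fae;

    assert(is_perm_event(generic));
    assert(to_perm_event(generic) == &p);
    assert(to_perm_event(generic)->response == 7);
}
```

With this layout, normal events pay nothing for the response field, which matches the size saving the commit message mentions, while code that only sees the embedded pointer can still reach the permission-specific state.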
-