    ocfs2: initialize ip_next_orphan · f5785283
    Wengang Wang authored
    Though the problem was found on an older 4.1.12 kernel, I think upstream
    has the same issue.
    
    On one node in the cluster, there is the following call trace:
    
       # cat /proc/21473/stack
       __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
       ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
       ocfs2_evict_inode+0x152/0x820 [ocfs2]
       evict+0xae/0x1a0
       iput+0x1c6/0x230
       ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
       ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
       ocfs2_dir_foreach+0x29/0x30 [ocfs2]
       ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
       ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
       process_one_work+0x169/0x4a0
       worker_thread+0x5b/0x560
       kthread+0xcb/0xf0
       ret_from_fork+0x61/0x90
    
    The above stack is not reasonable: the final iput() shouldn't happen in
    ocfs2_orphan_filldir().  Looking at the code,
    
      2067         /* Skip inodes which are already added to recover list, since dio may
      2068          * happen concurrently with unlink/rename */
      2069         if (OCFS2_I(iter)->ip_next_orphan) {
      2070                 iput(iter);
      2071                 return 0;
      2072         }
      2073
    
    The logic assumes the inode is already in the recover list when it sees
    that ip_next_orphan is non-NULL, so it skips this inode after dropping
    the reference that was taken in ocfs2_iget().
    
    However, if the inode were already in the recover list, it would hold
    another reference, and the iput() at line 2070 would not be the final
    iput (the one dropping the last reference).  So I don't think the inode
    is really in the recover list (there is no vmcore to confirm this).
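
    The reference-count argument above can be sketched with a small
    userspace toy model.  This is only an illustration under stated
    assumptions: toy_inode, toy_iget(), toy_iput() and toy_skip_if_queued()
    are hypothetical names, not ocfs2 or VFS code, and the eviction work is
    reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	int i_count;                       /* reference count */
	int evicted;                       /* set when the last reference is dropped */
	struct toy_inode *ip_next_orphan;  /* non-NULL is read as "already queued" */
};

/* Take one reference, as the lookup does before the skip check. */
static void toy_iget(struct toy_inode *inode)
{
	inode->i_count++;
}

/* Drop one reference; return 1 if it was the final one. */
static int toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);
	if (--inode->i_count == 0) {
		inode->evicted = 1;  /* stands in for the expensive eviction */
		return 1;
	}
	return 0;
}

/*
 * Mirrors only the skip branch quoted above.  Returns 1 when the skip
 * branch performed the final iput, 0 otherwise.
 */
static int toy_skip_if_queued(struct toy_inode *iter)
{
	toy_iget(iter);
	if (iter->ip_next_orphan)
		return toy_iput(iter);
	return 0;  /* not skipped: the caller keeps the reference */
}
```

    In this model, an inode that is genuinely queued starts with i_count == 1
    (the reference held by the list), so the skip branch never drops the
    last reference.  An inode whose ip_next_orphan is merely stale starts
    with i_count == 0, and the same branch ends up evicting it.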
    
    Note that ocfs2_queue_orphans(), though it does not show up in the call
    trace, holds the cluster lock on the orphan directory while looking up
    unlinked inodes.  Evicting the on-disk inode can involve a lot of I/O,
    which may take a long time to finish.  That means this node can hold
    the cluster lock for a very long time, which can cause lock requests
    from other nodes on the orphan directory to hang for a long time.
    
    Looking further at ip_next_orphan, I found that it is not initialized
    when a new ocfs2_inode_info structure is allocated.
    
    This causes reflink operations from some nodes to hang for a very long
    time waiting for the cluster lock on the orphan directory.
    
    Fix: initialize ip_next_orphan to NULL.
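
    The effect of the missing initialization can be sketched in userspace
    as follows.  This is a minimal model, not the actual patch: the struct
    and every toy_* function are hypothetical, and toy_poison() merely
    imitates an allocator that hands out objects without zeroing them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct ocfs2_inode_info. */
struct toy_inode_info {
	unsigned long long ip_blkno;
	struct toy_inode_info *ip_next_orphan;
};

/* Some unrelated object that a leftover pointer might still refer to. */
static struct toy_inode_info toy_stale_neighbor;

/* Imitate memory that still carries values from an earlier use. */
static void toy_poison(struct toy_inode_info *oi)
{
	assert(oi != NULL);
	oi->ip_blkno = 0xdeadbeefULL;
	oi->ip_next_orphan = &toy_stale_neighbor;
}

/* Initialization that forgets ip_next_orphan (the situation described). */
static void toy_init_buggy(struct toy_inode_info *oi)
{
	oi->ip_blkno = 0ULL;
	/* ip_next_orphan is left with whatever value it had */
}

/* Initialization with the field reset, in the spirit of the fix. */
static void toy_init_fixed(struct toy_inode_info *oi)
{
	oi->ip_blkno = 0ULL;
	oi->ip_next_orphan = NULL;
}

/* The check the skip logic relies on: non-NULL means "already queued". */
static int toy_looks_queued(const struct toy_inode_info *oi)
{
	return oi->ip_next_orphan != NULL;
}
```

    After toy_poison() followed by toy_init_buggy(), toy_looks_queued()
    reports the object as queued although it never was; with
    toy_init_fixed() it does not.
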
    Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    Cc: Mark Fasheh <mark@fasheh.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Changwei Ge <gechangwei@live.cn>
    Cc: Gang He <ghe@suse.com>
    Cc: Jun Piao <piaojun@huawei.com>
    Cc: <stable@vger.kernel.org>
    Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
super.c 69 KB