1. 27 Aug, 2016 7 commits
    • Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 28687b93
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "We've queued up a few different fixes in here.  These range from
        enospc corners to fsync and quota fixes, and a few targeted at error
        handling for corrupt metadata/fuzzing"
      
      * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix lockdep warning on deadlock against an inode's log mutex
        Btrfs: detect corruption when non-root leaf has zero item
        Btrfs: check btree node's nritems
        btrfs: don't create or leak aliased root while cleaning up orphans
        Btrfs: fix em leak in find_first_block_group
        btrfs: do not background blkdev_put()
        Btrfs: clarify do_chunk_alloc()'s return value
        btrfs: fix fsfreeze hang caused by delayed iputs deal
        btrfs: update btrfs_space_info's bytes_may_use timely
        btrfs: divide btrfs_update_reserved_bytes() into two functions
        btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
        btrfs: qgroup: Fix qgroup incorrectness caused by log replay
        btrfs: relocation: Fix leaking qgroups numbers on data extents
        btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
        btrfs: waiting on qgroup rescan should not always be interruptible
        btrfs: properly track when rescan worker is running
        btrfs: flush_space: treat return value of do_chunk_alloc properly
        Btrfs: add ASSERT for block group's memory leak
        btrfs: backref: Fix soft lockup in __merge_refs function
        Btrfs: fix memory leak of reloc_root
    • Merge tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm · 370f6017
      Linus Torvalds authored
      Pull dlm fix from David Teigland:
       "This fixes a bug introduced by recent debugfs cleanup"
      
      * tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
        dlm: fix malfunction of dlm_tool caused by debugfs changes
    • Merge tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6ec675ed
      Linus Torvalds authored
      Pull device mapper fixes from Mike Snitzer:
      
       - another stable fix for DM flakey (that tweaks the previous fix that
         didn't factor in expected 'drop_writes' behavior for read IO).
      
       - a dm-log bio operation flags fix for the broader block changes that
         were merged during the 4.8 merge window.
      
      * tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm log: fix unitialized bio operation flags
        dm flakey: fix reads to be issued if drop_writes configured
    • Merge tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 67a8c7d6
      Linus Torvalds authored
      Pull IOMMU fixes from Joerg Roedel:
       "Fixes from Will Deacon:
      
         - fix a couple of thinkos in the CMDQ error handling and
           short-descriptor page table code that have been there since day one
      
         - disable stalling faults, since they may result in hardware deadlock
      
         - fix an accidental BUG() when passing disable_bypass=1 on the
           cmdline"
      
      * tag 'iommu-fixes-v4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass
        iommu/arm-smmu: Disable stalling faults for all endpoints
        iommu/arm-smmu: Fix CMDQ error handling
        iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · fd1ae514
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Here's a set of block fixes for the current 4.8-rc release.  This
        contains:
      
         - a fix for a secure erase regression, from Adrian.
      
         - a fix for an mmc use-after-free bug regression, also from Adrian.
      
         - potential zero pointer dereference in bdev freezing, from Andrey.
      
         - a race fix for blk_set_queue_dying() from Bart.
      
         - a set of xen blkfront fixes from Bob Liu.
      
         - three small fixes for bcache, from Eric and Kent.
      
         - a fix for a potential invalid NVMe state transition, from Gabriel.
      
         - blk-mq CPU offline fix, preventing us from issuing and completing a
           request on the wrong queue.  From me.
      
         - revert two previous floppy changes, since they caused a user
           visible regression.  A better fix is in the works.
      
         - ensure that we don't send down bios that have more than 256
           elements in them.  Fixes a crash with bcache, for example.  From
           Ming.
      
         - a fix for dereferencing an error pointer with cgroup writeback.
           Fixes a regression.  From Vegard"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        mmc: fix use-after-free of struct request
        Revert "floppy: refactor open() flags handling"
        Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
        fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
        blk-mq: improve warning for running a queue on the wrong CPU
        blk-mq: don't overwrite rq->mq_ctx
        block: make sure a big bio is split into at most 256 bvecs
        nvme: Fix nvme_get/set_features() with a NULL result pointer
        bdev: fix NULL pointer dereference
        xen-blkfront: free resources if xlvbd_alloc_gendisk fails
        xen-blkfront: introduce blkif_set_queue_limits()
        xen-blkfront: fix places not updated after introducing 64KB page granularity
        bcache: pr_err: more meaningful error message when nr_stripes is invalid
        bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
        bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
        block: Fix race triggered by blk_set_queue_dying()
        block: Fix secure erase
        nvme: Prevent controller state invalid transition
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · b09c412a
      Linus Torvalds authored
      Pull input subsystem fixes from Dmitry Torokhov:
       "Simply small driver fixups"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: ads7846 - remove redundant regulator_disable call
        Input: synaptics-rmi4 - fix register descriptor subpacket map construction
        Input: tegra-kbc - fix inverted reset logic
        Input: silead - use devm_gpiod_get
        Input: i8042 - set up shared ps2_cmd_mutex for AUX ports
    • Merge tag 'pci-v4.8-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 219c04ce
      Linus Torvalds authored
      Pull PCI fixes from Bjorn Helgaas:
       "Resource management:
         - Update "pci=resource_alignment" documentation (Mathias Koehrer)
      
        MSI:
         - Use positive flags in pci_alloc_irq_vectors() (Christoph Hellwig)
         - Call pci_intx() when using legacy interrupts in pci_alloc_irq_vectors() (Christoph Hellwig)
      
        Intel VMD host bridge driver:
         - Fix infinite loop executing irq's (Keith Busch)"
      
      * tag 'pci-v4.8-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        x86/PCI: VMD: Fix infinite loop executing irq's
        PCI: Call pci_intx() when using legacy interrupts in pci_alloc_irq_vectors()
        PCI: Use positive flags in pci_alloc_irq_vectors()
        PCI: Update "pci=resource_alignment" documentation
  2. 26 Aug, 2016 1 commit
    • dlm: fix malfunction of dlm_tool caused by debugfs changes · 079d37df
      Eric Ren authored
      With the current kernel, `dlm_tool lockdebug` fails as below:
      
      "dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3
      can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3:
      Operation not permitted"
      
      This is because table_open() depends on file->f_op to tell which
      seq_file ops should be passed down. But, the original file ops in
      file->f_op is replaced by "debugfs_full_proxy_file_operations" with
      commit 49d200de ("debugfs: prevent access to removed files'
      private data").
      
      Currently, I can think up 2 solutions: 1st, replace
      debugfs_create_file() with debugfs_create_file_unsafe();
      2nd, make different table_open#() accordingly. The 1st one
      is neat, but I don't thoroughly understand its risk. Maybe
      someone has a better one.
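
      The failure mode and the second option above can be sketched in
      userspace C. This is only an illustration under stated assumptions:
      every name below (sketch_fops, sketch_open_by_fop, sketch_open_format1
      and so on) is hypothetical and none of it is the dlm or debugfs code.
      It shows why selecting the seq ops by comparing the f_op pointer stops
      working once a proxy replaces f_op, and how one open helper per format
      avoids the comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct sketch_fops { const char *name; };
struct sketch_seq_ops { int format; };

static const struct sketch_fops format1_fops = { "format1" };
static const struct sketch_fops format2_fops = { "format2" };
/* What the open routine sees in f_op once a proxy wraps the real fops. */
static const struct sketch_fops proxy_fops = { "proxy" };

static const struct sketch_seq_ops format1_seq_ops = { 1 };
static const struct sketch_seq_ops format2_seq_ops = { 2 };

/* Fragile: selects the seq ops by comparing the f_op pointer. */
static const struct sketch_seq_ops *
sketch_open_by_fop(const struct sketch_fops *f_op)
{
    assert(f_op != NULL);
    if (f_op == &format1_fops)
        return &format1_seq_ops;
    if (f_op == &format2_fops)
        return &format2_seq_ops;
    return NULL;    /* a proxy matches nothing, so the open fails */
}

/* Robust: one open helper per format, nothing depends on f_op. */
static const struct sketch_seq_ops *sketch_open_format1(void)
{
    return &format1_seq_ops;
}

static const struct sketch_seq_ops *sketch_open_format2(void)
{
    return &format2_seq_ops;
}
```

      With the proxy in place, sketch_open_by_fop(&proxy_fops) returns NULL,
      while the per-format helpers keep returning the right ops.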
      Signed-off-by: Eric Ren <zren@suse.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
  3. 25 Aug, 2016 27 commits
    • mmc: fix use-after-free of struct request · 869c5548
      Adrian Hunter authored
      We call mmc_req_is_special() after having processed a request, but
      it could be freed after that. Check that ahead of time, and use
      the cached value.
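
      The ordering described above can be sketched in userspace C. This is
      a minimal illustration, not the mmc code: fake_request,
      fake_req_is_special(), fake_process() and handle_request() are
      hypothetical stand-ins. The point is only that the predicate is
      evaluated before the call that may free the request, and the cached
      result is used afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block request; not the kernel's struct request. */
struct fake_request {
    int op;    /* 0 = normal I/O, 1 = "special" operation */
};

/* Allocate a request for the sketch. */
static struct fake_request *make_request(int op)
{
    struct fake_request *req = malloc(sizeof(*req));

    assert(req != NULL);
    req->op = op;
    return req;
}

/* Hypothetical predicate, playing the role mmc_req_is_special() has above. */
static bool fake_req_is_special(const struct fake_request *req)
{
    return req->op == 1;
}

/* Processing completes the request; its memory is gone afterwards. */
static void fake_process(struct fake_request *req)
{
    free(req);
}

/*
 * Safe ordering: read what is needed from the request first, then
 * process it, then rely only on the cached copy.
 */
static bool handle_request(struct fake_request *req)
{
    bool special = fake_req_is_special(req);    /* cached ahead of time */

    fake_process(req);                          /* req is freed here */
    return special;                             /* req is never touched again */
}
```

      Calling fake_req_is_special(req) after fake_process(req) instead
      would read freed memory, which is the bug class this commit fixes.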
      Reported-by: Hans de Goede <hdegoede@redhat.com>
      Tested-by: Hans de Goede <hdegoede@redhat.com>
      Fixes: c2df40df ("drivers: use req op accessor")
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Revert "floppy: refactor open() flags handling" · f2791e7e
      Jens Axboe authored
      This reverts commit 09954bad.
    • Revert "floppy: fix open(O_ACCMODE) for ioctl-only open" · 468c298a
      Jens Axboe authored
      This reverts commit ff06db1e.
    • fs/block_dev: fix potential NULL ptr deref in freeze_bdev() · 5bb53c0f
      Andrey Ryabinin authored
      When freeze_bdev() is called twice on the same block device without a
      mounted filesystem, get_super() returns NULL, which leads to a NULL
      pointer dereference later in drop_super().

      Check the get_super() result to fix that.

      Note that this is a purely theoretical issue. There are only 3
      freeze_bdev() callers. Two of them are in filesystem code and are used
      on a device with a mounted fs. The third one, in lock_fs(), is protected
      by upper-layer code against freezing a block device a second time
      without thawing it first.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • Btrfs: fix lockdep warning on deadlock against an inode's log mutex · 28a23593
      Filipe Manana authored
      Commit 44f714da ("Btrfs: improve performance on fsync against new
      inode after rename/unlink"), which landed in 4.8-rc2, introduced a
      possibility for a deadlock due to double locking of an inode's log mutex
      by the same task, which lockdep reports with:
      
      [23045.433975] =============================================
      [23045.434748] [ INFO: possible recursive locking detected ]
      [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
      [23045.436044] ---------------------------------------------
      [23045.436044] xfs_io/3688 is trying to acquire lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     but task is already holding lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     other info that might help us debug this:
      [23045.436044]  Possible unsafe locking scenario:
      
      [23045.436044]        CPU0
      [23045.436044]        ----
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]
                      *** DEADLOCK ***
      
      [23045.436044]  May be due to missing lock nesting notation
      
      [23045.436044] 3 locks held by xfs_io/3688:
      [23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
      [23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
      [23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     stack backtrace:
      [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
      [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
      [23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
      [23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
      [23045.436044] Call Trace:
      [23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
      [23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
      [23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
      [23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
      [23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
      [23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
      [23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
      [23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
      [23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
      [23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
      [23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
      [23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
      [23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
      [23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
      [23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa
      
      An example reproducer for this is:
      
         $ mkfs.btrfs -f /dev/sdb
         $ mount /dev/sdb /mnt
         $ mkdir /mnt/dir
         $ touch /mnt/dir/foo
         $ sync
         $ mv /mnt/dir/foo /mnt/dir/bar
         $ touch /mnt/dir/foo
         $ xfs_io -c "fsync" /mnt/dir/bar
      
      This is because while logging the inode of file bar we end up logging its
      parent directory (since its inode has an unlink_trans field matching the
      current transaction id due to the rename operation), which in turn logs
      the inodes for all its new dentries, so that the new inode for the new
      file named foo gets logged which in turn triggered another logging attempt
      for the inode we are fsync'ing, since that inode had an old name that
      corresponds to the name of the new inode.
      
      So fix this by ensuring that when logging the inode for a new dentry that
      has a name matching an old name of some other inode, we don't log again
      the original inode that we are fsync'ing.
      
      Fixes: 44f714da ("Btrfs: improve performance on fsync against new inode after rename/unlink")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: detect corruption when non-root leaf has zero item · 1ba98d08
      Liu Bo authored
      Right now we treat a leaf with zero items as valid because we can
      have an empty tree, that is, a root that is also a leaf without any
      items. However, when such a leaf is not a root, we can end up hitting
      the BUG_ON(1) in btrfs_extend_item() called by
      setup_inline_extent_backref().

      This makes us treat that situation as corruption if the leaf is not
      its own root.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check btree node's nritems · 053ab70f
      Liu Bo authored
      When a btree node (level = 1) has nritems equal to zero,
      we can end up with a panic due to insert_ptr()'s
      
      BUG_ON(slot > nritems);
      
      where slot is 1 and nritems is 0, as copy_for_split() calls
      insert_ptr(.., path->slots[1] + 1, ...);
      
      An invalid value causes the whole mess. This adds a check of the
      btree node's nritems so that we stop reading the block when
      something is wrong.
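
      As a rough userspace sketch of this kind of read-time sanity check
      (the names sketch_check_node() and sketch_insert_slot_ok() and the
      max_ptrs parameter are hypothetical, and the upper bound is an
      assumption that the message above does not state):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Reject an internal node whose item count cannot be valid.
 * level > 0 means an internal node; max_ptrs is the most pointers
 * that fit in one block (assumed to be supplied by the caller).
 */
static int sketch_check_node(unsigned int level, unsigned int nritems,
                             unsigned int max_ptrs)
{
    assert(max_ptrs > 0);
    if (level == 0)
        return 0;       /* leaves are validated elsewhere */
    if (nritems == 0 || nritems > max_ptrs)
        return -EIO;    /* treat as corruption, stop reading the block */
    return 0;
}

/* The condition that insert_ptr()'s BUG_ON(slot > nritems) enforces. */
static bool sketch_insert_slot_ok(unsigned int slot, unsigned int nritems)
{
    return slot <= nritems;
}
```

      With nritems == 0, slot 1 fails sketch_insert_slot_ok(), which is the
      BUG_ON case above; rejecting the node when it is read means that
      point is never reached.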
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: don't create or leak aliased root while cleaning up orphans · 35bbb97f
      Jeff Mahoney authored
      commit 909c3a22 (Btrfs: fix loading of orphan roots leading to BUG_ON)
      avoids the BUG_ON but can add an aliased root to the dead_roots list or
      leak the root.
      
      Since we've already been loading roots into the radix tree, we should
      use it before looking the root up on disk.
      
      Cc: <stable@vger.kernel.org> # 4.5
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix em leak in find_first_block_group · 187ee58c
      Josef Bacik authored
      We need to call free_extent_map() on the em we look up.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: do not background blkdev_put() · 14238819
      Anand Jain authored
      At the end of unmount or device delete, if the exclusive open of the
      device has not actually been closed, there can be a race with another
      userland program that tries to open the device in exclusive mode, and
      that open may fail, e.g.:
            unmount /btrfs; fsck /dev/x
            btrfs dev del /dev/x /btrfs; fsck /dev/x
      so backgrounding blkdev_put() is not an option here.
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: clarify do_chunk_alloc()'s return value · 28b737f6
      Liu Bo authored
      Function start_transaction() can return ERR_PTR(1) when flush is
      BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
      
      start_transaction (return ERR_PTR(1))
        -> btrfs_block_rsv_add (return 1)
           -> reserve_metadata_bytes (return 1)
              -> flush_space (return 1)
                 -> do_chunk_alloc  (return 1)
      
      With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
      flush_state of ALLOC_CHUNK and it successfully allocates a new
      chunk, then instead of trying to reserve space again,
      reserve_metadata_bytes returns 1 immediately.
      
      Callers of start_transaction() usually only do the IS_ERR() check,
      which ERR_PTR(1) passes, so they then panic when dereferencing a
      pointer that is really ERR_PTR(1).
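
      Why ERR_PTR(1) gets past IS_ERR() can be shown with a userspace
      re-creation of the error-pointer convention. This is a sketch: the
      sketch_* helpers below are written for illustration and only mirror
      the idea that the top 4095 addresses encode negative errno values.
      They are not the kernel's own definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode an error code in a pointer value. */
static inline void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the error code again. */
static inline long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Only values in the range [-SKETCH_MAX_ERRNO, -1] count as errors. */
static inline bool sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-(long)SKETCH_MAX_ERRNO);
}

/* Usage sketch: a positive "error" slips through an IS_ERR()-only check. */
static void sketch_demo(void)
{
    void *good_error = sketch_err_ptr(-ENOSPC);
    void *bad_error = sketch_err_ptr(1);

    assert(sketch_is_err(good_error));     /* caught by the caller */
    assert(!sketch_is_err(bad_error));     /* looks like a valid pointer */
    assert(sketch_ptr_err(bad_error) == 1);
}
```

      A caller that only tests sketch_is_err() would go on to dereference
      address 1, which corresponds to the panic described above.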
      
      The following patch fixes the above problem.
      "btrfs: flush_space: treat return value of do_chunk_alloc properly"
      https://patchwork.kernel.org/patch/7778651/
      
      This adds comments to clarify do_chunk_alloc()'s return value.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix fsfreeze hang caused by delayed iputs deal · 9e7cc91a
      Wang Xiaoguang authored
      When running fstests generic/068, we sometimes get the deadlock below:
        xfs_io          D ffff8800331dbb20     0  6697   6693 0x00000080
        ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
        ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
        ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
        Call Trace:
        [<ffffffff816a9045>] schedule+0x35/0x80
        [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
        [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
        [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
        [<ffffffff810d32b5>] percpu_down_read+0x35/0x50
        [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
        [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
        [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
        [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
        [<ffffffff81230a1a>] evict+0xba/0x1a0
        [<ffffffff812316b6>] iput+0x196/0x200
        [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
        [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
        [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
        [<ffffffff81218040>] freeze_super+0xf0/0x190
        [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
        [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
        [<ffffffff81229409>] SyS_ioctl+0x79/0x90
        [<ffffffff81003c12>] do_syscall_64+0x62/0x110
        [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25
      
      From this warning, freeze_super() already holds SB_FREEZE_FS, but
      btrfs_freeze() calls btrfs_commit_transaction() again. If
      btrfs_commit_transaction() finds that it has delayed iputs to handle,
      it calls start_transaction(), which tries to take the SB_FREEZE_FS
      lock again, and a deadlock occurs.

      The root cause is that in btrfs, sync_filesystem(sb) does not make
      sure all metadata is updated. Some code may still be adding delayed
      iputs; see the sample race window below:
      
               CPU1                                  |         CPU2
      |-> freeze_super()                             |
          |-> sync_filesystem(sb);                   |
          |                                          |-> cleaner_kthread()
          |                                          |   |-> btrfs_delete_unused_bgs()
          |                                          |       |-> btrfs_remove_chunk()
          |                                          |           |-> btrfs_remove_block_group()
          |                                          |               |-> btrfs_add_delayed_iput()
          |                                          |
          |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
          |-> sb_wait_write(sb, SB_FREEZE_FS);       |
          |   acquire SB_FREEZE_FS lock.             |
          |                                          |
          |-> btrfs_freeze()                         |
              |-> btrfs_commit_transaction()         |
                  |-> btrfs_run_delayed_iputs()      |
                  |   will handle delayed iputs,     |
                  |   that means start_transaction() |
                  |   will be called, which will try |
                  |   to get SB_FREEZE_FS lock.      |
      
      To fix this issue, introduce an "int fs_frozen" to record internally whether
      the fs has been frozen. If the fs has been frozen, we cannot handle delayed iputs.
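
      The idea of the flag can be sketched in userspace C. Apart from the
      fs_frozen field named above, everything here is hypothetical
      (sketch_fs_info, the delayed_iputs counter, the helper names), and
      the sketch says nothing about where the real check lives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem-wide state. */
struct sketch_fs_info {
    int fs_frozen;                  /* set while the fs is frozen */
    unsigned int delayed_iputs;     /* queued iputs not yet run */
};

static void sketch_freeze(struct sketch_fs_info *fs)
{
    fs->fs_frozen = 1;
}

static void sketch_unfreeze(struct sketch_fs_info *fs)
{
    fs->fs_frozen = 0;
}

/*
 * Called at the end of a commit. Running a delayed iput may need a new
 * transaction, which would block on the freeze lock, so nothing is run
 * while frozen. Returns how many delayed iputs were run.
 */
static unsigned int sketch_run_delayed_iputs_on_commit(struct sketch_fs_info *fs)
{
    unsigned int ran;

    assert(fs != NULL);
    if (fs->fs_frozen)
        return 0;       /* leave them queued until the fs is thawed */
    ran = fs->delayed_iputs;
    fs->delayed_iputs = 0;
    return ran;
}
```

      While frozen, the queued iputs stay pending; a commit after
      sketch_unfreeze() picks them up.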
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to btrfs_freeze ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang authored
      This patch fixes some false ENOSPC errors. The test script below
      reproduces one such false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The script above fails with ENOSPC even though the fs still has free
      space to satisfy the request. Please see the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      On CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc().
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data
      and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of
      data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk, call
      btrfs_start_delalloc_roots(), or commit the current transaction in order
      to reserve some free space, which is obviously a lot of work. But none of
      that is necessary: if bytes_may_use is decreased in time (by 128M here),
      we still have free space.
      
      To fix this issue, this patch updates bytes_may_use for both data and
      metadata in btrfs_add_reserved_bytes(). On the compress path, the real
      extent length may not be equal to the file content length, and
      bytes_may_use is increased by the file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(). The compress path can then update
      bytes_may_use correctly. We can also now discard
      RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.
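
      The double counting in the fallocate case can be sketched with a toy
      accounting model. The struct and helpers below are hypothetical and
      keep only the three counters involved; the 100 MiB total is an
      assumed figure for illustration, not a value taken from the test
      script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIB (1024ULL * 1024ULL)

/* Toy model with only the counters discussed above. */
struct sketch_space_info {
    uint64_t total_bytes;
    uint64_t bytes_may_use;
    uint64_t bytes_reserved;
};

/* True when the counters claim more space than exists. */
static bool sketch_over_committed(const struct sketch_space_info *s)
{
    return s->bytes_may_use + s->bytes_reserved > s->total_bytes;
}

/* Old no-account style: bytes_may_use is left for a later step to fix up. */
static void sketch_reserve_no_account(struct sketch_space_info *s,
                                      uint64_t num_bytes)
{
    s->bytes_reserved += num_bytes;
}

/*
 * Timely style: drop the earlier bytes_may_use claim (ram_bytes, the
 * file content length) at the moment the extent (num_bytes) is reserved.
 */
static void sketch_reserve_timely(struct sketch_space_info *s,
                                  uint64_t ram_bytes, uint64_t num_bytes)
{
    assert(s->bytes_may_use >= ram_bytes);
    s->bytes_may_use -= ram_bytes;
    s->bytes_reserved += num_bytes;
}
```

      Starting from bytes_may_use = 64 MiB on an assumed 100 MiB space,
      the no-account path ends at 128 MiB of claims and looks full, while
      the timely path ends at 64 MiB of claims.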
      
      As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
      run_delalloc_nocow(), for an inode marked as NODATACOW or an extent
      marked as PREALLOC, we also need to update bytes_may_use, but cannot pass
      EXTENT_DO_ACCOUNTING because it also clears the metadata reservation. So
      here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally on both the successful and
      the failed path, so btrfs_prealloc_file_range()'s callers do not need to
      call btrfs_free_reserved_data_space() any more.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: divide btrfs_update_reserved_bytes() into two functions · 4824f1f4
      Wang Xiaoguang authored
      This patch divides btrfs_update_reserved_bytes() into
      btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(). The
      next patch will extend btrfs_add_reserved_bytes() to fix some
      false ENOSPC errors; please see that patch for details.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() · dcb40c19
      Wang Xiaoguang authored
      In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the
      wrong file offset for reloc_inode: it uses cluster->start and cluster->end,
      which are in fact the extents' bytenr. The correct value is
      cluster->[start|end] minus the block group's start bytenr.
      
      start bytenr   cluster->start
      |              |     extent      |   extent   | ...| extent |
      |----------------------------------------------------------------|
      |                block group reloc_inode                         |
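
      The corrected offset is plain subtraction, sketched here in userspace
      C. The helper name and the example numbers are hypothetical; only the
      relation "bytenr minus the block group's start bytenr" comes from the
      description above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate an extent's bytenr into a file offset inside the reloc
 * inode, which maps the block group starting at file offset 0.
 */
static uint64_t sketch_reloc_file_offset(uint64_t bytenr,
                                         uint64_t block_group_start)
{
    assert(bytenr >= block_group_start);
    return bytenr - block_group_start;
}
```

      For example, with an assumed block group starting at bytenr
      1103101952 and a cluster starting at bytenr 1104150528, the reloc
      inode file offset is 1048576, not 1104150528.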
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: Fix qgroup incorrectness caused by log replay · df2c95f3
      Qu Wenruo authored
      When doing log replay at mount time (after power loss), qgroup will leak
      the numbers of replayed data extents.

      The cause is almost the same as for balance.
      So fix it by manually informing qgroup about extents whose owner changed.

      The bug can be detected by the btrfs/119 test case.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      df2c95f3
    • Qu Wenruo's avatar
      btrfs: relocation: Fix leaking qgroups numbers on data extents · 62b99540
      Qu Wenruo authored
      This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
      rework.
      
      When balancing data extents, qgroup will leak all its numbers for
      relocated data extents.
      
      The relocation is done in the following steps for data extents:
      1) Create data reloc tree and inode
      2) Copy all data extents to data reloc tree
         And commit transaction
      3) Create tree reloc tree (special snapshot) for any related subvolumes
      4) Replace file extent in tree reloc tree with new extents in data reloc
         tree
         And commit transaction
      5) Merge tree reloc tree with original fs, by swapping tree blocks
      
      For 1)~4), since the tree reloc tree and the data reloc tree don't count
      toward qgroup, everything is OK.

      But for 5), the swapping of tree blocks will only inform qgroup to track
      metadata extents.

      If metadata extents contain file extents, the qgroup numbers for those
      file extents will get lost, leading to corrupted qgroup accounting.

      The fix is, before the commit transaction of step 5), to manually inform
      qgroup to track all file extents in the data reloc tree.
      By commit transaction time the tree swapping is done, so qgroup
      will account these data extents correctly.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Reported-by: Mark Fasheh <mfasheh@suse.de>
      Reported-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      62b99540
    • Qu Wenruo's avatar
      btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo authored
      Refactor the btrfs_qgroup_insert_dirty_extent() function into two
      functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same as the original code.
         For delayed_ref usage, where the caller already holds the delayed
         refs lock.

         Change the return value type to int, since the caller never needs
         the pointer, but only needs to know whether it must free the
         allocated memory.

      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.

         It takes the delayed_refs lock, allocates memory, checks whether
         quota is enabled, and so on.

      The original design was to keep exported functions to a minimum, but
      since more btrfs hacks have been exposed, like replacing paths in
      balance, we need to record dirty extents manually, so we have to add
      such functions.

      Also, add comments to both functions to inform developers how to keep
      qgroup correct when doing such hacks.
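
The split into a "_nolock" primitive and a self-locking wrapper can be sketched in userspace C as follows. Everything here is hypothetical: the type and function names, the fixed-size array, and the toy boolean lock are stand-ins for illustration, and the return convention (0 = inserted, 1 = already present so the caller frees the record) is an assumption modelled on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_RECORDS_SKETCH 16

struct record_sketch {
	uint64_t bytenr;
	uint64_t num_bytes;
};

struct tracker_sketch {
	bool locked;  /* toy stand-in for the delayed refs lock */
	bool enabled; /* stand-in for the "quota enabled" check */
	struct record_sketch *records[MAX_RECORDS_SKETCH];
	int nr;
};

/* Caller must hold the lock. Returns 0 if rec was inserted (the tracker
 * now owns it), or 1 if the bytenr is already tracked (caller frees rec). */
static int insert_dirty_extent_nolock(struct tracker_sketch *t,
				      struct record_sketch *rec)
{
	assert(t->locked);
	for (int i = 0; i < t->nr; i++)
		if (t->records[i]->bytenr == rec->bytenr)
			return 1;
	assert(t->nr < MAX_RECORDS_SKETCH);
	t->records[t->nr++] = rec;
	return 0;
}

/* Encapsulated variant: does the enabled check, allocation, locking and
 * cleanup itself. Returns 0 on success, -1 on allocation failure. */
static int insert_dirty_extent(struct tracker_sketch *t, uint64_t bytenr,
			       uint64_t num_bytes)
{
	struct record_sketch *rec;
	int ret;

	if (!t->enabled)
		return 0;

	rec = malloc(sizeof(*rec));
	if (!rec)
		return -1;
	rec->bytenr = bytenr;
	rec->num_bytes = num_bytes;

	t->locked = true;
	ret = insert_dirty_extent_nolock(t, rec);
	t->locked = false;

	if (ret > 0)
		free(rec); /* duplicate: the tracker did not take ownership */
	return 0;
}
```

Returning an int rather than a pointer keeps the ownership question explicit: the only thing the caller learns is whether it still owns the record.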
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      cb93b52c
    • Jeff Mahoney's avatar
      btrfs: waiting on qgroup rescan should not always be interruptible · d06f23d6
      Jeff Mahoney authored
      We wait on qgroup rescan completion in three places: file system
      shutdown, the quota disable ioctl, and the rescan wait ioctl.  If the
      user sends a signal while we're waiting, we continue happily along.  This
      is expected behavior for the rescan wait ioctl.  It's racy in the shutdown
      path but mostly works due to other unrelated synchronization points.
      In the quota disable path, it Oopses the kernel pretty much immediately.
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d06f23d6
    • Jeff Mahoney's avatar
      btrfs: properly track when rescan worker is running · d2c609b8
      Jeff Mahoney authored
      The qgroup_flags field is overloaded such that it reflects the on-disk
      status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
      flag is used to indicate that a rescan operation is in progress, but if
      the file system is unmounted while a rescan is running, the rescan
      operation is paused.  If the file system is then mounted read-only,
      the flag will still be present but the rescan operation will not have
      been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
      will see the flag and interpret it to mean that the rescan worker is
      still running and will wait for a completion that will never come.
      
      This patch uses a separate flag to indicate when the worker is
      running.  The locking and state surrounding the qgroup rescan worker
      need a lot of attention beyond this patch, but this is enough to
      avoid a hung umount.
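
The idea of keeping runtime state apart from the persisted flag can be sketched as below. This is a hedged userspace illustration only: the struct, the flag name and its bit value, and the helper functions are all hypothetical and do not reproduce the kernel's fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag mirroring the persisted "rescan in progress" bit. */
#define STATUS_FLAG_RESCAN_SKETCH (1ULL << 1)

struct qgroup_state_sketch {
	uint64_t status_flags; /* mirrors the persisted on-disk status */
	bool rescan_running;   /* runtime only: is the worker active now? */
};

static void rescan_worker_start(struct qgroup_state_sketch *s)
{
	s->status_flags |= STATUS_FLAG_RESCAN_SKETCH;
	s->rescan_running = true;
}

/* Pausing (e.g. at unmount) stops the worker but leaves the persisted
 * flag set, because the rescan itself has not finished. */
static void rescan_worker_pause(struct qgroup_state_sketch *s)
{
	assert(s->rescan_running);
	s->rescan_running = false;
}

/* Waiters consult the runtime state, not the persisted flag, so a stale
 * flag from a previous mount cannot make them wait forever. */
static bool must_wait_for_rescan(const struct qgroup_state_sketch *s)
{
	return s->rescan_running;
}
```

A state loaded from disk with the flag set but no worker started reports that no wait is needed, which is the read-only mount case described above.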
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d2c609b8
    • Alex Lyakas's avatar
      btrfs: flush_space: treat return value of do_chunk_alloc properly · eecba891
      Alex Lyakas authored
      do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
      But flush_space does not convert this to 0, and also returns 1.
      As a result, reserve_metadata_bytes thinks that flush_space failed,
      and may return this value "1" to the caller (depending on how
      reserve_metadata_bytes was called). The caller will also treat this as
      an error. For example, btrfs_block_rsv_refill does:
      
      int ret = -ENOSPC;
      ...
      ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
      if (!ret) {
              block_rsv_add_bytes(block_rsv, num_bytes, 0);
              return 0;
      }
      
      return ret;
      
      So it will return -ENOSPC.
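
The mismatch between a "1 means success" helper and callers that test `if (!ret)` can be reproduced in a few lines of userspace C. This is a simplified sketch, not the kernel code: all function names are hypothetical, and normalizing a positive return to 0 is shown as one plausible way to restore the usual "0 or negative errno" convention.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical allocator: 1 = a new chunk was allocated,
 * -ENOSPC = no space was available. */
static int chunk_alloc_sketch(int have_space)
{
	return have_space ? 1 : -ENOSPC;
}

/* Passes the positive value through unchanged (the problematic shape). */
static int flush_step_raw(int have_space)
{
	return chunk_alloc_sketch(have_space);
}

/* Normalizes "allocated" (positive) to plain success. */
static int flush_step_normalized(int have_space)
{
	int ret = chunk_alloc_sketch(have_space);

	if (ret > 0)
		ret = 0;
	return ret;
}

/* A caller in the style quoted above: only 0 counts as success. */
static int refill_sketch(int have_space)
{
	int ret = flush_step_normalized(have_space);

	/* After normalization ret is either 0 or a negative errno. */
	assert(ret <= 0);
	if (!ret)
		return 0;
	return ret;
}
```

With the raw variant, a successful allocation would reach the `if (!ret)` test as 1 and be handled as a failure.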
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      eecba891
    • Liu Bo's avatar
      Btrfs: add ASSERT for block group's memory leak · f3bca802
      Liu Bo authored
      This adds several ASSERT()s to report memory leaks of the block group
      cache.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f3bca802
    • Qu Wenruo's avatar
      btrfs: backref: Fix soft lockup in __merge_refs function · d8422ba3
      Qu Wenruo authored
      When over 1000 file extents refer to one extent, find_parent_nodes()
      becomes noticeably slow, due to the O(n^2)~O(n^3) loops inside
      __merge_refs().
      
      The following ftrace shows the cubic growth of execution time:
      
      256 refs
       5) + 91.768 us   |  __add_keyed_refs.isra.12 [btrfs]();
       5)   1.447 us    |  __add_missing_keys.isra.13 [btrfs]();
       5) ! 114.544 us  |  __merge_refs [btrfs]();
       5) ! 136.399 us  |  __merge_refs [btrfs]();
      
      512 refs
       6) ! 279.859 us  |  __add_keyed_refs.isra.12 [btrfs]();
       6)   3.164 us    |  __add_missing_keys.isra.13 [btrfs]();
       6) ! 442.498 us  |  __merge_refs [btrfs]();
       6) # 2091.073 us |  __merge_refs [btrfs]();
      
      and 1024 refs
       7) ! 368.683 us  |  __add_keyed_refs.isra.12 [btrfs]();
       7)   4.810 us    |  __add_missing_keys.isra.13 [btrfs]();
       7) # 2043.428 us |  __merge_refs [btrfs]();
       7) * 18964.23 us |  __merge_refs [btrfs]();
      
      Sorted into the following chart:
      (Unit: us)
      ------------------------------------------------------------------------
       Trace function        | 256 refs       | 512 refs      | 1024 refs    |
      ------------------------------------------------------------------------
       __add_keyed_refs      | 91             | 279           | 368          |
       __add_missing_keys    | 1              | 3             | 4            |
       __merge_refs 1st call | 114            | 442           | 2043         |
       __merge_refs 2nd call | 136            | 2091          | 18964        |
      ------------------------------------------------------------------------

      We can see that __add_keyed_refs() grows almost linearly, and
      __add_missing_keys() in this case neither changes much nor takes much
      time.

      The 1st __merge_refs() call, however, shows quadratic growth, and
      the 2nd __merge_refs() call shows cubic growth.

      There is no doubt that __merge_refs() will take a very long time to
      execute if the number of refs continues to grow.

      So add a cond_resched() into the loop of __merge_refs().
      
      Although this will solve the problem of soft lockup, we need to use the
      new rb_tree based structure introduced by Lu Fengqi to really solve the
      long execution time.
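
The general pattern, a quadratic merge loop with a periodic yield point, can be sketched in userspace C. This is an analogy under stated assumptions, not the kernel function: merge_dups_sketch is a hypothetical name, refs are reduced to plain integers, and sched_yield() stands in for the kernel's cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/* Remove duplicate values in place using pairwise comparisons, which is
 * O(n^2) in the worst case. Returns the new length. The relative order
 * of the surviving elements is not preserved. */
static size_t merge_dups_sketch(uint64_t *refs, size_t n)
{
	assert(refs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		size_t j = i + 1;

		while (j < n) {
			if (refs[j] == refs[i]) {
				/* Drop the duplicate by moving the last
				 * element into its slot, then re-check j. */
				refs[j] = refs[n - 1];
				n--;
			} else {
				j++;
			}
		}
		/* Give other runnable tasks a chance once per outer pass so
		 * a long merge does not monopolize the CPU. */
		sched_yield();
	}
	return n;
}
```

As the commit message notes, yielding only addresses the lockup; the loop still does the same amount of work, so a better data structure is needed to cut the execution time itself.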
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d8422ba3
    • Liu Bo's avatar
      Btrfs: fix memory leak of reloc_root · 1c1ea4f7
      Liu Bo authored
      When a critical error occurs and the FS is flipped to RO while a
      balance is in progress, we can end up leaking root->reloc_root,
      since btrfs_drop_snapshots() bails out very early without freeing
      reloc_root.

      However, we're not able to free reloc_root in btrfs_drop_snapshots()
      because its caller, merge_reloc_roots(), still needs to access it to
      clean up reloc_root's rbtree.

      So free reloc_root when we're going to free fs/file roots instead.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1c1ea4f7
    • Linus Torvalds's avatar
      Merge branch 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 61c04572
      Linus Torvalds authored
      Pull thermal fixes from Zhang Rui:
      
       - Fix cpu_cooling to have separate thermal_cooling_device_ops
         structures for cpus with and without power model, to avoid NULL
         dereference in cpufreq_state2power.  From Brendan Jackman.
      
       - Fix a possible NULL dereference in imx_thermal driver.  From Corentin
         LABBE.
      
       - Another two trivial fixes, one typo fix and one deleting module
         owner.  From Caesar Wang and Markus Elfring.
      
      * 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
        thermal: imx: fix a possible NULL dereference
        thermal: trivial: fix the typo
        Thermal-INT3406: Delete owner assignment
        thermal: cpu_cooling: Fix NULL dereference in cpufreq_state2power
      61c04572
    • Heinz Mauelshagen's avatar
      dm log: fix unitialized bio operation flags · 9c5a559d
      Heinz Mauelshagen authored
      Commit e6047149 ("dm: use bio op accessors") switched DM over to
      using bio_set_op_attrs() but didn't take care to initialize
      lc->io_req.bi_op_flags in dm-log.c:rw_header().  This caused
      rw_header()'s call to dm_io() to make bio->bi_op_flags be uninitialized
      in dm-io.c:do_region(), which ultimately resulted in a SCSI BUG() in
      sd_init_command().
      
      Also, adjust rw_header() and its callers to use REQ_OP_{READ|WRITE}.
      
      Fixes: e6047149 ("dm: use bio op accessors")
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      9c5a559d
    • Mike Snitzer's avatar
      dm flakey: fix reads to be issued if drop_writes configured · 299f6230
      Mike Snitzer authored
      v4.8-rc3 commit 99f3c90d ("dm flakey: error READ bios during the
      down_interval") overlooked the 'drop_writes' feature, which is meant to
      allow reads to be issued rather than errored, during the down_interval.
      
      Fixes: 99f3c90d ("dm flakey: error READ bios during the down_interval")
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      299f6230
  4. 24 Aug, 2016 5 commits