1. 13 Nov, 2018 40 commits
    • Josef Bacik's avatar
      btrfs: reset max_extent_size properly · e982beca
      Josef Bacik authored
      commit 21a94f7a upstream.
      
      If we use up our block group before allocating a new one we'll easily
      get a max_extent_size that's set really really low, which will result in
      a lot of fragmentation.  We need to make sure we're resetting the
      max_extent_size when we add a new chunk or add new space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e982beca
    • Filipe Manana's avatar
      Btrfs: fix deadlock when writing out free space caches · ea9c846f
      Filipe Manana authored
      commit 5ce55557 upstream.
      
      When writing out a block group free space cache we can end deadlocking
      with ourselves on an extent buffer lock resulting in a warning like the
      following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
          W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallowing the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block creation
      modifies. Once the new extent buffer is allocated, it allows creation of
      pending block groups to happen again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/Reported-by: default avatarE V <eliventer@gmail.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea9c846f
    • Filipe Manana's avatar
      Btrfs: fix assertion on fsync of regular file when using no-holes feature · e17af96e
      Filipe Manana authored
      commit 7ed586d0 upstream.
      
      When using the NO_HOLES feature and logging a regular file, we were
      expecting that if we find an inline extent, that either its size in RAM
      (uncompressed and unenconded) matches the size of the file or if it does
      not, that it matches the sector size and it represents compressed data.
      This assertion does not cover a case where the length of the inline extent
      is smaller than the sector size and also smaller the file's size, such
      case is possible through fallocate. Example:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
        $ xfs_io -c "falloc 40 40" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
      In the above example we trigger the assertion because the inline extent's
      length is 21 bytes while the file size is 80 bytes. The fallocate() call
      merely updated the file's size and did not touch the existing inline
      extent, as expected.
      
      So fix this by adjusting the assertion so that an inline extent length
      smaller than the file size is valid if the file size is smaller than the
      filesystem's sector size.
      
      A test case for fstests follows soon.
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: a89ca6f2 ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
      CC: stable@vger.kernel.org # 4.14+
      Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e17af96e
    • Filipe Manana's avatar
      Btrfs: fix null pointer dereference on compressed write path error · 85c5f244
      Filipe Manana authored
      commit 3527a018 upstream.
      
      At inode.c:compress_file_range(), under the "free_pages_out" label, we can
      end up dereferencing the "pages" pointer when it has a NULL value. This
      case happens when "start" has a value of 0 and we fail to allocate memory
      for the "pages" pointer. When that happens we jump to the "cont" label and
      then enter the "if (start == 0)" branch where we immediately call the
      cow_file_range_inline() function. If that function returns 0 (success
      creating an inline extent) or an error (like -ENOMEM for example) we jump
      to the "free_pages_out" label and then access "pages[i]" leading to a NULL
      pointer dereference, since "nr_pages" has a value greater than zero at
      that point.
      
      Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
      the "pages" pointer.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85c5f244
    • Qu Wenruo's avatar
      btrfs: qgroup: Dirty all qgroups before rescan · 8181a8f8
      Qu Wenruo authored
      commit 9c7b0c2e upstream.
      
      [BUG]
      In the following case, rescan won't zero out the number of qgroup 1/0:
      
        $ mkfs.btrfs -fq $DEV
        $ mount $DEV /mnt
      
        $ btrfs quota enable /mnt
        $ btrfs qgroup create 1/0 /mnt
        $ btrfs sub create /mnt/sub
        $ btrfs qgroup assign 0/257 1/0 /mnt
      
        $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
        $ btrfs sub snap /mnt/sub /mnt/snap
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     0/257
      
      So far so good, but:
      
        $ btrfs qgroup remove 0/257 1/0 /mnt
        WARNING: quotas may be inconsistent, rescan needed
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre  /mnt
        qgoupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none ---     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     ---
      	     ^^^^^^^^^^     ^^^^^^^^ not cleared
      
      [CAUSE]
      Before rescan we call qgroup_rescan_zero_tracking() to zero out all
      qgroups' accounting numbers.
      
      However we don't mark all qgroups dirty, but rely on rescan to do so.
      
      If we have any high level qgroup without children, it won't be marked
      dirty during rescan, since we cannot reach that qgroup.
      
      This will cause QGROUP_INFO items of childless qgroups never get updated
      in the quota tree, thus their numbers will stay the same in "btrfs
      qgroup show" output.
      
      [FIX]
      Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
      we have childless qgroups, their QGROUP_INFO items will still get
      updated during rescan.
      Reported-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Tested-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8181a8f8
    • Filipe Manana's avatar
      Btrfs: fix wrong dentries after fsync of file that got its parent replaced · d2c6df39
      Filipe Manana authored
      commit 0f375eed upstream.
      
      In a scenario like the following:
      
        mkdir /mnt/A               # inode 258
        mkdir /mnt/B               # inode 259
        touch /mnt/B/bar           # inode 260
      
        sync
      
        mv /mnt/B/bar /mnt/A/bar
        mv -T /mnt/A /mnt/B
        fsync /mnt/B/bar
      
        <power fail>
      
      After replaying the log we end up with file bar having 2 hard links, both
      with the name 'bar' and one in the directory with inode number 258 and the
      other in the directory with inode number 259. Also, we end up with the
      directory inode 259 still existing and with the directory inode 258 still
      named as 'A', instead of 'B'. In this scenario, file 'bar' should only
      have one hard link, located at directory inode 258, the directory inode
      259 should not exist anymore and the name for directory inode 258 should
      be 'B'.
      
      This incorrect behaviour happens because when attempting to log the old
      parents of an inode, we skip any parents that no longer exist. Fix this
      by forcing a full commit if an old parent no longer exists.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2c6df39
    • Filipe Manana's avatar
      Btrfs: fix warning when replaying log after fsync of a tmpfile · 55f21e16
      Filipe Manana authored
      commit f2d72f42 upstream.
      
      When replaying a log which contains a tmpfile (which necessarily has a
      link count of 0) we end up calling inc_nlink(), at
      fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
      the following:
      
        [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
        [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
        [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
        [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
        [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
        [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
        [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
        [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
        [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
        [195191.943735] FS:  00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
        [195191.943736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
        [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [195191.943741] Call Trace:
        [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
        [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
        [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
        [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
        [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
        [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
        [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
        [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943906]  ? mount_fs+0x3b/0x140
        [195191.943908]  mount_fs+0x3b/0x140
        [195191.943912]  ? __init_waitqueue_head+0x36/0x50
        [195191.943916]  vfs_kern_mount+0x62/0x160
        [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
        [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
        [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
        [195191.943943]  ? mount_fs+0x3b/0x140
        [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
        [195191.943954]  mount_fs+0x3b/0x140
        [195191.943956]  ? __init_waitqueue_head+0x36/0x50
        [195191.943960]  vfs_kern_mount+0x62/0x160
        [195191.943963]  do_mount+0x1f9/0xd40
        [195191.943967]  ? memdup_user+0x4b/0x70
        [195191.943971]  ksys_mount+0x7e/0xd0
        [195191.943974]  __x64_sys_mount+0x21/0x30
        [195191.943977]  do_syscall_64+0x60/0x1b0
        [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [195191.943983] RIP: 0033:0x7f570e4e524a
        [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
        [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
        [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
        [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
        [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
        [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
        [195191.944002] irq event stamp: 8688
        [195191.944010] hardirqs last  enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
        [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
        [195191.944018] softirqs last  enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
        [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
        [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
      
      This happens because the inode does not have the flag I_LINKABLE set,
      which is a runtime only flag, not meant to be persisted, set when the
      inode is created through open(2) if the flag O_EXCL is not passed to it.
      Except for the warning, there are no other consequences (like corruptions
      or metadata inconsistencies).
      
      Since it's pointless to replay a tmpfile as it would be deleted in a
      later phase of the log replay procedure (it has a link count of 0), fix
      this by not logging tmpfiles and if a tmpfile is found in a log (created
      by a kernel without this change), skip the replay of the inode.
      
      A test case for fstests follows soon.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.18+
      Reported-by: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      55f21e16
    • Josef Bacik's avatar
      btrfs: make sure we create all new block groups · 1d6d4a03
      Josef Bacik authored
      commit 545e3366 upstream.
      
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d6d4a03
    • Josef Bacik's avatar
      btrfs: reset max_extent_size on clear in a bitmap · 9aabbb2e
      Josef Bacik authored
      commit 553cceb4 upstream.
      
      We need to clear the max_extent_size when we clear bits from a bitmap
      since it could have been from the range that contains the
      max_extent_size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9aabbb2e
    • Josef Bacik's avatar
      btrfs: protect space cache inode alloc with GFP_NOFS · 9006ad16
      Josef Bacik authored
      commit 84de76a2 upstream.
      
      If we're allocating a new space cache inode it's likely going to be
      under a transaction handle, so we need to use memalloc_nofs_save() in
      order to avoid deadlocks, and more importantly lockdep messages that
      make xfstests fail.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9006ad16
    • Josef Bacik's avatar
      btrfs: release metadata before running delayed refs · 0de8cf3f
      Josef Bacik authored
      commit f45c752b upstream.
      
      We want to release the unused reservation we have since it refills the
      delayed refs reserve, which will make everything go smoother when
      running the delayed refs if we're short on our reservation.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0de8cf3f
    • Chris Mason's avatar
      Btrfs: don't clean dirty pages during buffered writes · 8f2ecee5
      Chris Mason authored
      commit 7703bdd8 upstream.
      
      During buffered writes, we follow this basic series of steps:
      
      again:
      	lock all the pages
      	wait for writeback on all the pages
      	Take the extent range lock
      	wait for ordered extents on the whole range
      	clean all the pages
      
      	if (copy_from_user_in_atomic() hits a fault) {
      		drop our locks
      		goto again;
      	}
      
      	dirty all the pages
      	release all the locks
      
      The extra waiting, cleaning and locking are there to make sure we don't
      modify pages in flight to the drive, after they've been crc'd.
      
      If some of the pages in the range were already dirty when the write
      began, and we need to goto again, we create a window where a dirty page
      has been cleaned and unlocked.  It may be reclaimed before we're able to
      lock it again, which means we'll read the old contents off the drive and
      lose any modifications that had been pending writeback.
      
      We don't actually need to clean the pages.  All of the other locking in
      place makes sure we don't start IO on the pages, so we can just leave
      them dirty for the duration of the write.
      
      Fixes: 73d59314 (the original btrfs merge)
      CC: stable@vger.kernel.org # v4.4+
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f2ecee5
    • Josef Bacik's avatar
      btrfs: wait on caching when putting the bg cache · 2974abff
      Josef Bacik authored
      commit 3aa7c7a3 upstream.
      
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroy all of our
      workqueues, and then we'd go to queue the endio work on those free'd
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2974abff
    • Jeff Mahoney's avatar
      btrfs: keep trim from interfering with transaction commits · 0a7f6c7e
      Jeff Mahoney authored
      commit fee7acc3 upstream.
      
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both a incremented
      refcount and holding the commit root sem for the duration of a single
      trim operation.
      
      This was to ensure safety but it's unnecessary.  We already hold the the
      chunk mutex so we know that the chunk we're using can't be allocated
      while we're trimming it.
      
      In order to check against chunks allocated already in this transaction,
      we need to check the pending chunks list.  To to that safely without
      joining the transaction (or attaching than then having to commit it) we
      need to ensure that the dev root's commit root doesn't change underneath
      us and the pending chunk lists stays around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a7f6c7e
    • Jeff Mahoney's avatar
      btrfs: don't attempt to trim devices that don't support it · 4d0dfd8f
      Jeff Mahoney authored
      commit 0be88e36 upstream.
      
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d0dfd8f
    • Jeff Mahoney's avatar
      btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · 76e59a62
      Jeff Mahoney authored
      commit d4e329de upstream.
      
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      76e59a62
    • Qu Wenruo's avatar
      btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · d147f4dc
      Qu Wenruo authored
      commit 6ba9fc8e upstream.
      
      [BUG]
      fstrim on some btrfs only trims the unallocated space, not trimming any
      space in existing block groups.
      
      [CAUSE]
      Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
      range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
      able to trim block groups in range [0, super->total_bytes).
      
      While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
      uses its logical address space, there is nothing limiting the location
      where we put block groups.
      
      For filesystem with frequent balance, it's quite easy to relocate all
      block groups and bytenr of block groups will start beyond
      super->total_bytes.
      
      In that case, btrfs will not trim existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d147f4dc
    • Qu Wenruo's avatar
      btrfs: Enhance btrfs_trim_fs function to handle error better · c9ee7109
      Qu Wenruo authored
      commit 93bba24d upstream.
      
      Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
      error happens when trimming existing block groups, it will skip the
      remaining blocks and continue to trim unallocated space for each device.
      
      The return value will only reflect the final error from device trimming.
      
      This patch will fix such behavior by:
      
      1) Recording the last error from block group or device trimming
         The return value will also reflect the last error during trimming.
         Make developer more aware of the problem.
      
      2) Continuing trimming if possible
         If we failed to trim one block group or device, we could still try
         the next block group or device.
      
      3) Report number of failures during block group and device trimming
         It would be less noisy, but still gives user a brief summary of
         what's going wrong.
      
      Such behavior can avoid confusion for cases like failure to trim the
      first block group and then only unallocated space is trimmed.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9ee7109
    • Jeff Mahoney's avatar
      btrfs: fix error handling in btrfs_dev_replace_start · 18bdce0e
      Jeff Mahoney authored
      commit 5c061471 upstream.
      
      When we fail to start a transaction in btrfs_dev_replace_start, we leave
      dev_replace->replace_start set to STARTED but clear ->srcdev and
      ->tgtdev.  Later, that can result in an Oops in
      btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
      implies that ->srcdev is valid.
      
      Also fix error handling when the state is already STARTED or SUSPENDED
      while starting.  That, too, will clear ->srcdev and ->tgtdev even though
      it doesn't own them.  This should be an impossible case to hit since we
      should be protected by the BTRFS_FS_EXCL_OP bit being set.  Let's add an
      ASSERT there while we're at it.
      
      Fixes: e93c89c1 (Btrfs: add new sources for device replace code)
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18bdce0e
    • Jeff Mahoney's avatar
      btrfs: fix error handling in free_log_tree · cdecd48a
      Jeff Mahoney authored
      commit 374b0e2d upstream.
      
      When we hit an I/O error in free_log_tree->walk_log_tree during file system
      shutdown we can crash due to there not being a valid transaction handle.
      
      Use btrfs_handle_fs_error when there's no transaction handle to use.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
        IP: free_log_tree+0xd2/0x140 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        Modules linked in: <modules>
        CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
        RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
        RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
        RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
        RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
        R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
        R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
        FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? wait_for_writer+0xb0/0xb0 [btrfs]
         btrfs_free_log+0x17/0x30 [btrfs]
         btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
         btrfs_free_fs_roots+0xc0/0x130 [btrfs]
         ? wait_for_completion+0xf2/0x100
         close_ctree+0xea/0x2e0 [btrfs]
         ? kthread_stop+0x161/0x260
         generic_shutdown_super+0x6c/0x120
         kill_anon_super+0xe/0x20
         btrfs_kill_super+0x13/0x100 [btrfs]
         deactivate_locked_super+0x3f/0x70
         cleanup_mnt+0x3b/0x70
         task_work_run+0x78/0x90
         exit_to_usermode_loop+0x77/0xa6
         do_syscall_64+0x1c5/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
        RIP: 0033:0x7f1784f90827
        RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
        RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
        R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
        R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
        RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
        CR2: 0000000000000060
      
      Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error")
      CC: <stable@vger.kernel.org> # v3.13
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdecd48a
    • Qu Wenruo's avatar
      btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · d4f56c44
      Qu Wenruo authored
      commit b72c3aba upstream.
      
      [BUG]
      For certain crafted image, whose csum root leaf has missing backref, if
      we try to trigger write with data csum, it could cause deadlock with the
      following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has missing backref for csum tree root leaf.  And
      when we try to allocate new tree block, since there is no
      EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
      and use it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
      alloc tree block, it's an valid slot for btrfs.
      
      And for finish_ordered_write, when we need to insert csum, we try to CoW
      csum tree root.
      
      By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
      BG_OFFSET + 12K is already used by tree block COW for other trees, the
      next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
      tree.
      
      But due to the bad type, btrfs can recognize it and still consider it as
      an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                  | Now it returns tree block 29376512, which extent tree
                  | shows its empty slot, but it's already hold by csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                           Wait lock owner to release the lock, but it's
                           locked by ourself, so it will deadlock
      
      [FIX]
      This patch will do the lock_owner and current->pid check at
      btrfs_init_new_buffer().
      So above deadlock can be avoided.
      
      Since such problem can only happen in crafted image, we will still
      trigger kernel warning for later aborted transaction, but with a little
      more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405Reported-by: default avatarXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f56c44
    • Qu Wenruo's avatar
      btrfs: Handle owner mismatch gracefully when walking up tree · 4f2a4e02
      Qu Wenruo authored
      commit 65c6e82b upstream.
      
      [BUG]
      When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we needs to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: default avatarXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f2a4e02
    • Qu Wenruo's avatar
      btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled · c25c6665
      Qu Wenruo authored
      commit 3628b4ca upstream.
      
      Some qgroup trace events like btrfs_qgroup_release_data() and
      btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
      not enabled.
      
      This is caused by the lack of qgroup status check before calling some
      qgroup functions.  Thankfully the functions can handle quota disabled
      case well and just do nothing for qgroup disabled case.
      
      This patch will do earlier check before triggering related trace events.
      
      And for enabled <-> disabled race case:
      
      1) For enabled->disabled case
         Disable will wipe out all qgroups data including reservation and
         excl/rfer. Even if we leak some reservation or numbers, it will
         still be cleared, so nothing will go wrong.
      
      2) For disabled -> enabled case
         Current btrfs_qgroup_release_data() will use extent_io tree to ensure
         we won't underflow reservation. And for delayed_ref we use
         head->qgroup_reserved to record the reserved space, so in that case
         head->qgroup_reserved should be 0 and we won't underflow.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c25c6665
    • Masami Hiramatsu's avatar
      tracing: Return -ENOENT if there is no target synthetic event · 2b3171ec
      Masami Hiramatsu authored
      commit 18858511 upstream.
      
      Return -ENOENT error if there is no target synthetic event.
      This notices an operation failure to user as below;
      
        # echo 'wakeup_latency u64 lat; pid_t pid;' > synthetic_events
        # echo '!wakeup' >> synthetic_events
        sh: write error: No such file or directory
      
      Link: http://lkml.kernel.org/r/154013449986.25576.9487131386597290172.stgit@devboxAcked-by: default avatarTom Zanussi <zanussi@linux.intel.com>
      Tested-by: default avatarTom Zanussi <zanussi@linux.intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Rajvi Jingar <rajvi.jingar@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 4b147936 ('tracing: Add support for 'synthetic' events')
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b3171ec
    • Breno Leitao's avatar
      selftests/powerpc: Fix ptrace tm failure · 1ebadf5e
      Breno Leitao authored
      commit 48dc0ef1 upstream.
      
      Test ptrace-tm-spd-gpr fails on current kernel (4.19) due to a segmentation
      fault that happens on the child process prior to setting cptr[2] = 1. This
      causes the parent process to wait forever at 'while (!pptr[2])' and the test to
      be killed by the test harness framework by timeout, thus, failing.
      
      The segmentation fault happens because of a inline assembly being
      generated as:
      
      	0x10000355c <tm_spd_gpr+492>    lfs    f0, 0(0)
      
      This is reading memory position 0x0 and causing the segmentation fault.
      
      This code is being generated by ASM_LOAD_FPR_SINGLE_PRECISION(flt_4), where
      flt_4 is passed to the inline assembly block as:
      
      	[flt_4] "r" (&d)
      
      Since the inline assembly 'r' constraint means any GPR, gpr0 is being
      chosen, thus causing this issue when issuing a Load Floating-Point Single
      instruction.
      
      This patch simply changes the constraint to 'b', which specify that this
      register will be used as base, and r0 is not allowed to be used, avoiding
      this issue.
      
      Other than that, removing flt_2 register from the input operands, since it
      is not used by the inline assembly code at all.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarBreno Leitao <leitao@debian.org>
      Acked-by: default avatarSegher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ebadf5e
    • Masami Hiramatsu's avatar
      selftests/ftrace: Fix synthetic event test to delete event correctly · 2dc12479
      Masami Hiramatsu authored
      commit 0d0352d8 upstream.
      
      Fix the synthetic event test case to remove event correctly.
      If redirecting command to synthetic_event file without append
      mode, it cleans up all existing events and execute (parse) the
      command. This means "delete event" always fails to find the
      target event.
      
      Since previous synthetic event has a bug which doesn't return
      -ENOENT even if it fails to find the deleting event, this test
      passed. But fixing that bug, this test fails because this test
      itself has a bug.
      
      This fixes that bug by trying to delete event right after
      adding an event, and use append mode redirection ('>>') instead
      of normal redirection ('>').
      
      Link: http://lkml.kernel.org/r/154013452832.25576.2305459545429386517.stgit@devboxAcked-by: default avatarShuah Khan <shuah@kernel.org>
      Acked-by: default avatarTom Zanussi <zanussi@linux.intel.com>
      Tested-by: default avatarTom Zanussi <zanussi@linux.intel.com>
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: Rajvi Jingar <rajvi.jingar@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: f06eec4d ('selftests: ftrace: Add inter-event hist triggers testcases')
      Signed-off-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2dc12479
    • Johan Hovold's avatar
      soc/tegra: pmc: Fix child-node lookup · b4087c2a
      Johan Hovold authored
      commit 1dc6bd5e upstream.
      
      Fix child-node lookup during probe, which ended up searching the whole
      device tree depth-first starting at the parent rather than just matching
      on its children.
      
      To make things worse, the parent pmc node could end up being prematurely
      freed as of_find_node_by_name() drops a reference to its first argument.
      
      Fixes: 3568df3d ("soc: tegra: Add thermal reset (thermtrip) support to PMC")
      Cc: stable <stable@vger.kernel.org>     # 4.0
      Cc: Mikko Perttunen <mperttunen@nvidia.com>
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Reviewed-by: default avatarMikko Perttunen <mperttunen@nvidia.com>
      Signed-off-by: default avatarThierry Reding <treding@nvidia.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4087c2a
    • Bjorn Andersson's avatar
      soc: qcom: rmtfs-mem: Validate that scm is available · 7f9a150f
      Bjorn Andersson authored
      commit 137dc584 upstream.
      
      The scm device must be present in order for the rmtfs driver to
      configure memory permissions for the rmtfs memory region, so check that
      it is probed before continuing.
      
      Cc: stable@vger.kernel.org
      Fixes: fa65f804 ("soc: qcom: rmtfs-mem: Add support for assigning memory to remote")
      Signed-off-by: default avatarBjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: default avatarAndy Gross <andy.gross@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f9a150f
    • Thor Thayer's avatar
      arm64: dts: stratix10: Correct System Manager register size · 142261e7
      Thor Thayer authored
      commit 74121b9a upstream.
      
      Correct the register size of the System Manager node.
      
      Cc: stable@vger.kernel.org
      Fixes: 78cd6a9d ("arm64: dts: Add base stratix 10 dtsi")
      Signed-off-by: default avatarThor Thayer <thor.thayer@linux.intel.com>
      Signed-off-by: default avatarDinh Nguyen <dinguyen@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      142261e7
    • Thor Thayer's avatar
      ARM: dts: socfpga: Fix SDRAM node address for Arria10 · 9a209614
      Thor Thayer authored
      commit ce3bf934 upstream.
      
      The address in the SDRAM node was incorrect. Fix this to agree with the
      correct address and to match the reg definition block.
      
      Cc: stable@vger.kernel.org
      Fixes: 54b4a8f5("arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support")
      Signed-off-by: default avatarThor Thayer <thor.thayer@linux.intel.com>
      Signed-off-by: default avatarDinh Nguyen <dinguyen@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a209614
    • Nicolas Pitre's avatar
      Cramfs: fix abad comparison when wrap-arounds occur · 9d9da6fe
      Nicolas Pitre authored
      commit 672ca9dd upstream.
      
      It is possible for corrupted filesystem images to produce very large
      block offsets that may wrap when a length is added, and wrongly pass
      the buffer size test.
      Reported-by: default avatarAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: default avatarNicolas Pitre <nico@linaro.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d9da6fe
    • Colin Ian King's avatar
      rpmsg: smd: fix memory leak on channel create · 9632c033
      Colin Ian King authored
      commit 940c620d upstream.
      
      Currently a failed allocation of channel->name leads to an
      immediate return without freeing channel. Fix this by setting
      ret to -ENOMEM and jumping to an exit path that kfree's channel.
      
      Detected by CoverityScan, CID#1473692 ("Resource Leak")
      
      Fixes: 53e2822e ("rpmsg: Introduce Qualcomm SMD backend")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarBjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9632c033
    • Tri Vo's avatar
      arm64: lse: remove -fcall-used-x0 flag · 64537fda
      Tri Vo authored
      commit 2a6c7c36 upstream.
      
      x0 is not callee-saved in the PCS. So there is no need to specify
      -fcall-used-x0.
      
      Clang doesn't currently support -fcall-used flags. This patch will help
      building the kernel with clang.
      Tested-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarTri Vo <trong@android.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      64537fda
    • Hans Verkuil's avatar
      media: hdmi.h: rename ADOBE_RGB to OPRGB and ADOBE_YCC to OPYCC · 3f7987f8
      Hans Verkuil authored
      commit 463659a0 upstream.
      
      These names have been renamed in the CTA-861 standard due to trademark
      issues. Replace them here as well so they are in sync with the standard.
      Signed-off-by: default avatarHans Verkuil <hansverk@cisco.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f7987f8
    • Hans Verkuil's avatar
      media: replace ADOBERGB by OPRGB · b1452c51
      Hans Verkuil authored
      commit db034018 upstream.
      
      The CTA-861 standards have been updated to refer to opRGB instead
      of AdobeRGB. The official standard is in fact named opRGB, so
      switch to that.
      
      The two old defines referring to ADOBERGB in the public API are
      put under #ifndef __KERNEL__ and a comment mentions that they are
      deprecated.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1452c51
    • Hans Verkuil's avatar
      media: media colorspaces*.rst: rename AdobeRGB to opRGB · 29ba4b99
      Hans Verkuil authored
      commit a58c3797 upstream.
      
      Drop all Adobe references and use the official opRGB standard
      instead.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29ba4b99
    • Johan Hovold's avatar
      drm/mediatek: fix OF sibling-node lookup · 5501b812
      Johan Hovold authored
      commit ceff2f4d upstream.
      
      Use the new of_get_compatible_child() helper to lookup the sibling
      instead of using of_find_compatible_node(), which searches the entire
      tree from a given start node and thus can return an unrelated (i.e.
      non-sibling) node.
      
      This also addresses a potential use-after-free (e.g. after probe
      deferral) as the tree-wide helper drops a reference to its first
      argument (i.e. the parent device node).
      
      While at it, also fix the related cec-node reference leak.
      
      Fixes: 8f83f268 ("drm/mediatek: Add HDMI support")
      Cc: stable <stable@vger.kernel.org>     # 4.8
      Cc: Junzhi Zhao <junzhi.zhao@mediatek.com>
      Cc: Philipp Zabel <p.zabel@pengutronix.de>
      Cc: CK Hu <ck.hu@mediatek.com>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarRob Herring <robh@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5501b812
    • Hans Verkuil's avatar
      media: adv7842: when the EDID is cleared, unconfigure CEC as well · 6529dcce
      Hans Verkuil authored
      commit ab83203e upstream.
      
      When there is no EDID the CEC adapter should be unconfigured as
      well. So call cec_phys_addr_invalidate() when this happens.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Cc: <stable@vger.kernel.org>      # for v4.18 and up
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6529dcce
    • Hans Verkuil's avatar
      media: adv7604: when the EDID is cleared, unconfigure CEC as well · 3e383b31
      Hans Verkuil authored
      commit e7da8992 upstream.
      
      When there is no EDID the CEC adapter should be unconfigured as
      well. So call cec_phys_addr_invalidate() when this happens.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Cc: <stable@vger.kernel.org>      # for v4.18 and up
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e383b31
    • Mauro Carvalho Chehab's avatar
      media: em28xx: fix handler for vidioc_s_input() · 9a79eb12
      Mauro Carvalho Chehab authored
      commit 258c4304 upstream.
      
      The a->index is not the name of the internal amux entry,
      but, instead a value from zero to the maximum number
      of audio inputs.
      
      As the actual available inputs depend on each board, build
      it dynamically.
      
      This is broken for a really long time. On a quick check,
      since at least commit 195a4ef6 ("V4L/DVB (6585): Convert
      em28xx to video_ioctl2") this was not implemented right.
      
      Fixes: 195a4ef6 ("V4L/DVB (6585): Convert em28xx to video_ioctl2")
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a79eb12