1. 19 Apr, 2021 25 commits
    • Johannes Thumshirn's avatar
      btrfs: remove duplicated in_range() macro · cea62800
      Johannes Thumshirn authored
      The in_range() macro is defined twice in btrfs' source, once in ctree.h
      and once in misc.h.
      
      Remove the definition in ctree.h and include misc.h in the files depending
      on it.
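
      For reference, the generic helper that remains in misc.h is essentially the
      following (quoted from memory here, so treat it as a sketch rather than the
      authoritative definition):

        #define in_range(b, first, len)  ((b) >= (first) && (b) < (first) + (len))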
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cea62800
    • Filipe Manana's avatar
      btrfs: remove stale comment and logic from btrfs_inode_in_log() · 209ecbb8
      Filipe Manana authored
      Currently btrfs_inode_in_log() checks the list of modified extents of the
      inode, and has a comment mentioning why, as it used to be necessary to
      make sure if we did something like the following:
      
        mmap write range A
        mmap write range B
        msync range A (ranged fsync)
        msync range B (ranged fsync)
      
      we ended up with both ranges being logged.
      
      If we did not check it, then the second fsync would do nothing because
      btrfs_inode_in_log() would return true. This was added in 125c4cf9
      ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
      test case generic/325 from fstests exercises that scenario.
      
      However, as of commit 48778179 ("btrfs: make fast fsyncs wait only
      for writeback"), every ranged fsync is now turned into a full ranged fsync
      (operates on the range from 0 to LLONG_MAX), so it is now pointless to
      test the emptiness of the list of modified extents, and the comment is
      clearly outdated.
      
      So just remove the comment and list emptiness check, while also changing
      the function's return type to be a boolean instead of an integer.
      In case one day we get support for ranged fsyncs again, it will be easy
      to notice the check is necessary again, because it will make generic/325
      always fail.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      209ecbb8
    • Filipe Manana's avatar
      btrfs: fix race between marking inode needs to be logged and log syncing · bc0939fc
      Filipe Manana authored
      We have a race between marking that an inode needs to be logged, either
      at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and
      btrfs_sync_log(). The following steps describe how the race happens.
      
      1) We are at transaction N;
      
      2) Inode I was previously fsynced in the current transaction so it has:
      
          inode->logged_trans set to N;
      
      3) The inode's root currently has:
      
         root->log_transid set to 1
         root->last_log_commit set to 0
      
         Which means only one log transaction was committed so far, log
         transaction 0. When a log tree is created we set ->log_transid and
         ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
      
      4) One more range of pages is dirtied in inode I;
      
      5) Some task A starts an fsync against some other inode J (same root), and
         so it joins log transaction 1.
      
         Before task A calls btrfs_sync_log()...
      
      6) Task B starts an fsync against inode I, which currently has the full
         sync flag set, so it starts delalloc and waits for the ordered extent
         to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
      
      7) During ordered extent completion we have btrfs_update_inode() called
         against inode I, which in turn calls btrfs_set_inode_last_trans(),
         which does the following:
      
           spin_lock(&inode->lock);
           inode->last_trans = trans->transaction->transid;
           inode->last_sub_trans = inode->root->log_transid;
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         So ->last_trans is set to N and ->last_sub_trans set to 1.
         But before setting ->last_log_commit...
      
      8) Task A is at btrfs_sync_log():
      
         - it increments root->log_transid to 2
         - starts writeback for all log tree extent buffers
         - waits for the writeback to complete
         - writes the super blocks
         - updates root->last_log_commit to 1
      
         There are a lot of slow steps between updating root->log_transid and
         root->last_log_commit;
      
      9) The task doing the ordered extent completion, currently at
         btrfs_set_inode_last_trans(), then finally runs:
      
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         Which results in inode->last_log_commit being set to 1.
         The ordered extent completes;
      
      10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
          true because we have all the following conditions met:
      
          inode->logged_trans == N which matches fs_info->generation &&
          inode->last_sub_trans (1) <= inode->last_log_commit (1) &&
          inode->last_sub_trans (1) <= root->last_log_commit (1) &&
          list inode->extent_tree.modified_extents is empty
      
          And as a consequence we return without logging the inode, so the
          existing logged version of the inode does not point to the extent
          that was written after the previous fsync.
      
      It should be impossible in practice for one task to be able to make so
      much progress in btrfs_sync_log() while another task is at
      btrfs_set_inode_last_trans() right after it reads root->log_transid and
      before it reads root->last_log_commit. Even if kernel preemption is enabled
      we know the task at btrfs_set_inode_last_trans() can not be preempted
      because it is holding the inode's spinlock.
      
      However there is another place where we do the same without holding the
      spinlock, which is in the memory mapped write path at:
      
        vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
        {
           (...)
           BTRFS_I(inode)->last_trans = fs_info->generation;
           BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
           BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
           (...)
      
      So with preemption happening after setting ->last_sub_trans and before
      setting ->last_log_commit, it is less of a stretch to have another task
      make enough progress at btrfs_sync_log() such that the task doing the memory
      mapped write ends up with ->last_sub_trans and ->last_log_commit set to
      the same value. It is still a big stretch to get there, as the task doing
      btrfs_sync_log() has to start writeback, wait for its completion and write
      the super blocks.
      
      So fix this in two different ways:
      
      1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
         value of ->last_sub_trans minus 1;
      
      2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
         like we do for buffered and direct writes at btrfs_file_write_iter(),
         which is all we need to make sure multiple writes and fsyncs to an
         inode in the same transaction never result in an fsync missing that
         the inode changed and needs to be logged. Turn this into a helper
         function and use it both at btrfs_page_mkwrite() and at
         btrfs_file_write_iter() - this also fixes the problem that at
         btrfs_page_mkwrite() we were setting those fields without the
         protection of the inode's spinlock.
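
      A minimal sketch of the two changes above, assuming the helper ends up
      being called btrfs_set_inode_last_sub_trans() (the name and exact
      placement may differ in the final patch):

        static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
        {
                spin_lock(&inode->lock);
                inode->last_sub_trans = inode->root->log_transid;
                spin_unlock(&inode->lock);
        }

        static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
                                                      struct btrfs_inode *inode)
        {
                spin_lock(&inode->lock);
                inode->last_trans = trans->transaction->transid;
                inode->last_sub_trans = inode->root->log_transid;
                /* Never read root->last_log_commit here, it races with log syncing. */
                inode->last_log_commit = inode->last_sub_trans - 1;
                spin_unlock(&inode->lock);
        }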
      
      This is an extremely unlikely race to happen in practice.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc0939fc
    • Filipe Manana's avatar
      btrfs: fix race between memory mapped writes and fsync · 885f46d8
      Filipe Manana authored
      When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
      any new delalloc that might have been created before taking the lock and
      then wait either for the ordered extents to complete or just for the
      writeback to complete (depending on whether the full sync flag is set or
      not). We then start logging the inode and assume that while we are doing it
      no one else is touching the inode's file extent items (or adding new ones).
      
      That is generally true because all operations that modify an inode acquire
      the inode's lock first, including buffered and direct IO writes. However
      there is one exception: memory mapped writes, which do not and can not
      acquire the inode's lock.
      
      This can cause two types of issues: ending up logging file extent items
      with overlapping ranges, which is detected by the tree checker and will
      result in aborting the transaction when starting writeback for a log
      tree's extent buffers, or a silent corruption where we log a version of
      the file that never existed.
      
      Scenario 1 - logging overlapping extents
      
      The following steps explain how we can end up with file extent items with
      overlapping ranges in a log tree due to a race between an fsync and memory
      mapped writes:
      
      1) Task A starts an fsync on inode X, which has the full sync runtime flag
         set. First it starts by flushing all delalloc for the inode;
      
      2) Task A then locks the inode and flushes any other delalloc that might
         have been created after the previous flush and waits for all ordered
         extents to complete;
      
      3) In the inode's root we have the following leaf:
      
         Leaf N, generation == current transaction id:
      
         ---------------------------------------------------------
         | (...)  [ file extent item, offset 640K, length 128K ] |
         ---------------------------------------------------------
      
         The last file extent item in leaf N covers the file range from 640K to
         768K;
      
      4) Task B does a memory mapped write for the page corresponding to the
         file range from 764K to 768K;
      
      5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
         btrfs_search_forward() to search for leaves modified in the current
         transaction that contain items for the inode. It finds leaf N and copies
         all the inode items from that leaf into the log tree.
      
         Now the log tree has a copy of the last file extent item from leaf N.
      
         At the end of the while loop at copy_inode_items_to_log(), we have the
         minimum key set to:
      
         min_key.objectid = <inode X number>
         min_key.type = BTRFS_EXTENT_DATA_KEY
         min_key.offset = 640K
      
         Then we increment the key's offset by 1 so that the next call to
         btrfs_search_forward() leaves us at the first key greater than the key
         we just processed.
      
         But before btrfs_search_forward() is called again...
      
      6) Delalloc for the page at offset 764K, dirtied by task B, is started.
         It can be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) When the respective ordered extent completes, it trims the length of
         the existing file extent item for file offset 640K from 128K to 124K,
         and a new file extent item is added with a key offset of 764K and a
         length of 4K;
      
      8) Task A calls btrfs_search_forward(), which returns us a path pointing
         to the leaf (which can be leaf N or some other leaf) containing the new
         file extent item for file offset 764K.
      
         We end up copying this item to the log tree, which overlaps with the
         last copied file extent item, which covers the file range from 640K to
         768K.
      
         When writeback is triggered for the log tree's extent buffers, the issue
         will be detected by the tree checker which will dump a trace and an
         error message on dmesg/syslog. If the writeback is triggered when
         syncing the log, which it typically is, then we also end up aborting the
         current transaction.
      
      This is the same type of problem fixed in 0c713cba ("Btrfs: fix race
      between ranged fsync and writeback of adjacent ranges").
      
      Scenario 2 - logging a version of the file that never existed
      
      This scenario only happens when using the NO_HOLES feature and results in
      a silent corruption, in the sense that it is not detectable by 'btrfs check'
      or the tree checker:
      
      1) We have an inode I with a size of 1M and two file extent items, one
         covering an extent with disk_bytenr == X for the file range [0, 512K)
         and another one covering another extent with disk_bytenr == Y for the
         file range [512K, 1M);
      
      2) A hole is punched for the file range [512K, 1M);
      
      3) Task A starts an fsync of inode I, which has the full sync runtime flag
         set. It starts by flushing all existing delalloc, locks the inode (VFS
         lock), starts any new delalloc that might have been created before
         taking the lock and waits for all ordered extents to complete;
      
      4) Some other task does a memory mapped write for the page corresponding to
         the file range [640K, 644K) for example;
      
      5) Task A then logs all items of the inode with the call to
         copy_inode_items_to_log();
      
      6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
         be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) The ordered extent for the range [640K, 644K) completes and a file
         extent item for that range is added to the subvolume tree, pointing
         to a 4K extent with a disk_bytenr == Z;
      
      8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
         the subvolume tree. It finds two implicit holes:
      
         - one for the file range [512K, 640K)
         - one for the file range [644K, 1M)
      
         As a result we end up neither logging a hole for the range [640K, 644K)
         nor logging the file extent item with a disk_bytenr == Z.
         This means that if we have a power failure and replay the log tree we
         end up getting the following file extent layout:
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Y ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
         Which does not correspond to any layout the file ever had before
         the power failure. The only two valid layouts would be:
      
         [ disk_bytenr X ]    [   hole   ]
         0             512K  512K        1M
      
         and
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Z ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
      This can be fixed by serializing memory mapped writes with fsync, and there
      are two ways to do it:
      
      1) Make an fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
         in the inode's io tree. This prevents the race but also blocks any reads
         during the duration of the fsync, which has a negative impact for many
         common workloads;
      
      2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
         semaphore was recently added by Josef's patch set:
      
         btrfs: add a i_mmap_lock to our inode
         btrfs: cleanup inode_lock/inode_unlock uses
         btrfs: exclude mmaps while doing remap
         btrfs: exclude mmap from happening during all fallocate operations
      
         and is used to solve races between memory mapped writes and
         clone/dedupe/fallocate. This also makes us have the same behaviour we
         have regarding other writes (buffered and direct IO) and fsync - block
         them while the inode logging is in progress.
      
      This change uses the second approach due to the performance impact of the
      first one.
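
      A hedged sketch of where the second approach hooks into btrfs_sync_file();
      whether this is done with a raw down_write() or through the
      BTRFS_ILOCK_MMAP flag added earlier in the series is an implementation
      detail:

        /* In btrfs_sync_file(), after taking the VFS inode lock: */
        down_write(&BTRFS_I(inode)->i_mmap_lock);

        /*
         * ... flush delalloc, wait for ordered extents or writeback and log
         * the inode; btrfs_page_mkwrite() takes the semaphore for reading,
         * so no memory mapped write can dirty pages while logging runs ...
         */

        up_write(&BTRFS_I(inode)->i_mmap_lock);
        inode_unlock(inode);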
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      885f46d8
    • Josef Bacik's avatar
      btrfs: exclude mmap from happening during all fallocate operations · 8d9b4a16
      Josef Bacik authored
      There's a small window where a deadlock can happen between fallocate and
      mmap.  This is described in detail by Filipe:
      
      """
      When doing a fallocate operation we lock the inode, flush delalloc within
      the target range, wait for any ordered extents to complete and then lock
      the file range. Before we lock the range and after we flush delalloc,
      there is a time window where another task can come in and do a memory
      mapped write for a page within the fallocate range.
      
      This means that after fallocate locks the range, there can be a dirty page
      in the range. More often than not, this does not cause any problem.
      The exception is when we are low on available metadata space, because an
      fallocate operation needs to start a transaction while holding the file
      range locked, either through btrfs_prealloc_file_range() or through the
      call to btrfs_fallocate_update_isize(). If that's the case, we can end up
      in a deadlock. The following list of steps explains how that happens:
      
      1) A fallocate operation starts, locks the inode, flushes delalloc in the
         range and waits for ordered extents in the range to complete;
      
      2) Before the fallocate task locks the file range, another task does a
         memory mapped write for a page in the fallocate target range. This is
         possible since memory mapped writes do not (and can not) lock the
         inode;
      
      3) The fallocate task locks the file range. At this point there is one
         dirty page in the range (due to the memory mapped write);
      
      4) When the fallocate task attempts to start a transaction, it blocks when
         attempting to reserve metadata space, since we are low on available
         metadata space. Before blocking (wait on its reservation ticket), it
         starts the async reclaim task (if not running already);
      
      5) The async reclaim task is not able to release space through any other
         means, so it decides to flush delalloc for inodes with dirty pages.
         It finds that the inode used in the fallocate operation has a dirty
         page and therefore queues a job (fs_info->flush_workers workqueue) to
         flush delalloc for that inode and waits on that job to complete;
      
      6) The flush job blocks when attempting to lock the file range because
         it is currently locked by the fallocate task;
      
      7) The fallocate task keeps waiting for its metadata reservation, waiting
         for a wakeup on its reservation ticket. The async reclaim task is
         waiting on the flush job, which in turn is waiting for locking the file
         range that is currently locked by the fallocate task. So unless some
         other task is able to release enough metadata space, for example an
         ordered extent for some other inode completes, we end up in a deadlock
         between all these tasks.
      
      When this happens stack traces like the following show up in dmesg/syslog:
      
       INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
       Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        schedule+0x45/0xe0
        lock_extent_bits+0x1e6/0x2d0 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_invalidatepage+0x32c/0x390 [btrfs]
        ? __mod_memcg_state+0x8e/0x160
        __extent_writepage+0x2d4/0x400 [btrfs]
        extent_write_cache_pages+0x2b2/0x500 [btrfs]
        ? lock_release+0x20e/0x4c0
        ? trace_hardirqs_on+0x1b/0xf0
        extent_writepages+0x43/0x90 [btrfs]
        ? lock_acquire+0x1a3/0x490
        do_writepages+0x43/0xe0
        ? __filemap_fdatawrite_range+0xa4/0x100
        __filemap_fdatawrite_range+0xc5/0x100
        btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        btrfs_work_helper+0xf1/0x600 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x50/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
       INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
       Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? kvm_clock_read+0x14/0x30
        ? wait_for_completion+0x81/0x110
        schedule+0x45/0xe0
        schedule_timeout+0x30c/0x580
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? lock_acquire+0x1a3/0x490
        ? try_to_wake_up+0x7a/0xa20
        ? lock_release+0x20e/0x4c0
        ? lock_acquired+0x199/0x490
        ? wait_for_completion+0x81/0x110
        wait_for_completion+0xab/0x110
        start_delalloc_inodes+0x2af/0x390 [btrfs]
        btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        flush_space+0x24f/0x660 [btrfs]
        btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x20f/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
      (...)
      several tasks waiting for the inode lock held by the fallocate task below
      (...)
       RIP: 0033:0x7f61efe73fff
       Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
       RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
       RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
       RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
       RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
       R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
       R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
       task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        schedule+0x45/0xe0
        __reserve_bytes+0x4a4/0xb10 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        start_transaction+0x2d1/0x760 [btrfs]
        btrfs_replace_file_extents+0x120/0x930 [btrfs]
        ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
        btrfs_fallocate+0xdfb/0x1260 [btrfs]
        ? filename_lookup+0xf1/0x180
        vfs_fallocate+0x14f/0x440
        ioctl_preallocate+0x92/0xc0
        do_vfs_ioctl+0x66b/0x750
        ? __do_sys_newfstat+0x53/0x60
        __x64_sys_ioctl+0x62/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      """
      
      Fix this by disallowing mmaps from happening while we're doing any of
      the fallocate operations on this inode.
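
      A minimal sketch of the fix, assuming it reuses the BTRFS_ILOCK_MMAP flag
      introduced by "btrfs: add a i_mmap_lock to our inode" to take the mmap
      exclusion together with the VFS inode lock:

        /* In btrfs_fallocate(): */
        btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);

        /*
         * ... flush delalloc, wait for ordered extents, lock the file range
         * and start transactions for preallocation and i_size updates; no
         * page_mkwrite() can dirty a page in the range in the meanwhile ...
         */

        btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);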
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d9b4a16
    • Josef Bacik's avatar
      btrfs: exclude mmaps while doing remap · 8c99516a
      Josef Bacik authored
      Darrick reported a potential issue to me where we could allow mmap
      writes after validating a page range matched in the case of dedupe.
      Generally we rely on lock page -> lock extent with the ordered flush to
      protect us, but this is done after we check the pages because we use the
      generic helpers, so we could modify the page in between doing the check
      and locking the range.
      
      There also exists a deadlock, as described by Filipe:
      
      """
      When cloning a file range, we lock the inodes, flush any delalloc within
      the respective file ranges, wait for any ordered extents and then lock the
      file ranges in both inodes. This means that right after we flush delalloc
      and before we lock the file ranges, memory mapped writes can come in and
      dirty pages in the file ranges of the clone operation.
      
      Most of the time this is harmless and causes no problems. However, if we
      are low on available metadata space, we can later end up in a deadlock
      when starting a transaction to replace file extent items. This happens if
      when allocating metadata space for the transaction, we need to wait for
      the async reclaim thread to release space and the reclaim thread needs to
      flush delalloc for the inode that got the memory mapped write and has its
      range locked by the clone task.
      
      Basically what happens is the following:
      
      1) A clone operation locks inodes A and B, flushes delalloc for both
         inodes in the respective file ranges and waits for any ordered extents
         in those ranges to complete;
      
      2) Before the clone task locks the file ranges, another task does a
         memory mapped write (which does not lock the inode) for one of the
         inodes of the clone operation. So now we have a dirty page in one of
         the ranges used by the clone operation;
      
      3) The clone operation locks the file ranges for inodes A and B;
      
      4) Later, when iterating over the file extents of inode A, the clone
         task attempts to start a transaction. There's not enough available
         free metadata space, so the async reclaim task is started (if not
         running already) and we wait for someone to wake us up on our
         reservation ticket;
      
      5) The async reclaim task is not able to release space by any other
         means and decides to flush delalloc for the inode of the clone
         operation;
      
      6) The workqueue job used to flush the inode blocks when starting
         delalloc for the inode, since the file range is currently locked by
         the clone task;
      
      7) But the clone task is waiting on its reservation ticket and the async
         reclaim task is waiting on the flush job to complete, which can't
         progress since the clone task has the file range locked. So unless
         some other task is able to release space, for example an ordered
         extent for some other inode completes, we have a deadlock between all
         these tasks;
      
      When this happens stack traces like the following show up in dmesg/syslog:
      
       INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
       Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        schedule+0x45/0xe0
        lock_extent_bits+0x1e6/0x2d0 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_invalidatepage+0x32c/0x390 [btrfs]
        ? __mod_memcg_state+0x8e/0x160
        __extent_writepage+0x2d4/0x400 [btrfs]
        extent_write_cache_pages+0x2b2/0x500 [btrfs]
        ? lock_release+0x20e/0x4c0
        ? trace_hardirqs_on+0x1b/0xf0
        extent_writepages+0x43/0x90 [btrfs]
        ? lock_acquire+0x1a3/0x490
        do_writepages+0x43/0xe0
        ? __filemap_fdatawrite_range+0xa4/0x100
        __filemap_fdatawrite_range+0xc5/0x100
        btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        btrfs_work_helper+0xf1/0x600 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x50/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
       INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
       Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? kvm_clock_read+0x14/0x30
        ? wait_for_completion+0x81/0x110
        schedule+0x45/0xe0
        schedule_timeout+0x30c/0x580
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? lock_acquire+0x1a3/0x490
        ? try_to_wake_up+0x7a/0xa20
        ? lock_release+0x20e/0x4c0
        ? lock_acquired+0x199/0x490
        ? wait_for_completion+0x81/0x110
        wait_for_completion+0xab/0x110
        start_delalloc_inodes+0x2af/0x390 [btrfs]
        btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        flush_space+0x24f/0x660 [btrfs]
        btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x20f/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
      (...)
      several other tasks blocked on inode locks held by the clone task below
      (...)
       RIP: 0033:0x7f61efe73fff
       Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
       RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
       RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
       RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
       RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
       R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
       R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
       task: fdm-stress        state:D stack:    0 pid:2508234 ppid:2508153 flags:0x00004000
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        schedule+0x45/0xe0
        __reserve_bytes+0x4a4/0xb10 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        start_transaction+0x2d1/0x760 [btrfs]
        btrfs_replace_file_extents+0x120/0x930 [btrfs]
        ? lock_release+0x20e/0x4c0
        btrfs_clone+0x3e4/0x7e0 [btrfs]
        ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
        btrfs_clone_files+0xf6/0x150 [btrfs]
        btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        do_clone_file_range+0xd4/0x1f0
        vfs_clone_file_range+0x4d/0x230
        ? lock_release+0x20e/0x4c0
        ioctl_file_clone+0x8f/0xc0
        do_vfs_ioctl+0x342/0x750
        __x64_sys_ioctl+0x62/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      """
      
      Fix both of these issues by excluding mmaps from happening while we are doing
      any sort of remap, which prevents this race completely.
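
      As with the fallocate change, a hedged sketch of the idea; the variable
      names below are illustrative and the real patch may take the locks at a
      different point:

        /* In btrfs_remap_file_range(): take the mmap exclusion together with
         * the VFS inode lock(s) for the whole clone/dedupe operation. */
        btrfs_inode_lock(src_inode, BTRFS_ILOCK_MMAP);
        if (dst_inode != src_inode)
                btrfs_inode_lock(dst_inode, BTRFS_ILOCK_MMAP);

        /* ... validate pages, flush delalloc, wait for ordered extents,
         * lock the file ranges and replace the file extent items ... */

        if (dst_inode != src_inode)
                btrfs_inode_unlock(dst_inode, BTRFS_ILOCK_MMAP);
        btrfs_inode_unlock(src_inode, BTRFS_ILOCK_MMAP);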
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8c99516a
    • Josef Bacik's avatar
      btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers · 64708539
      Josef Bacik authored
      In a few places we intermix btrfs_inode_lock with an inode_unlock, and in
      some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
      
      None of these places are using this incorrectly, but as we adjust some
      of these callers it would be nice to keep everything consistent, so
      convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64708539
    • Josef Bacik's avatar
      btrfs: add a i_mmap_lock to our inode · 8318ba79
      Josef Bacik authored
      We need to be able to exclude page_mkwrite from happening concurrently
      with certain operations.  To facilitate this, add an i_mmap_lock to our
      inode, down_read() it in our mkwrite, and add a new ILOCK flag to
      indicate that we want to take the i_mmap_lock as well.  I used pahole to
      check the size of the btrfs_inode; the sizes are as follows:
      
      no lockdep:
      before: 1120 (3 per 4k page)
      after: 1160 (3 per 4k page)
      
      lockdep:
      before: 2072 (1 per 4k page)
      after: 2224 (1 per 4k page)
      
      We're slightly larger but it doesn't change how many objects we can fit
      per page.
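
      A hedged sketch of the pieces described above; the exact flag value and
      field placement are assumptions, not the final layout:

        struct btrfs_inode {
                /* ... existing fields ... */

                /*
                 * Taken shared by btrfs_page_mkwrite() and exclusive by
                 * operations that must keep page faults out (fallocate,
                 * remap, and later fsync).
                 */
                struct rw_semaphore i_mmap_lock;
        };

        /* In btrfs_page_mkwrite(): */
        down_read(&BTRFS_I(inode)->i_mmap_lock);
        /* ... reserve space, lock the extent range, dirty the page ... */
        up_read(&BTRFS_I(inode)->i_mmap_lock);

        /* New flag understood by btrfs_inode_lock()/btrfs_inode_unlock(): */
        #define BTRFS_ILOCK_MMAP        (1U << 2)       /* bit value is an assumption */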
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8318ba79
    • Goldwyn Rodrigues's avatar
      btrfs: remove mirror argument from btrfs_csum_verify_data() · 5e295768
      Goldwyn Rodrigues authored
      The parameter mirror is not used and does not make sense for checksum
      verification of the given bio.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5e295768
    • Goldwyn Rodrigues's avatar
      btrfs: remove force argument from run_delalloc_nocow() · 6e65ae76
      Goldwyn Rodrigues authored
      force_cow can be calculated from inode and does not need to be passed as
      an argument.
      
      This simplifies the run_delalloc_nocow() call from btrfs_run_delalloc_range().
      A new function, should_nocow(), checks if the range should be NOCOWed or
      not. The function returns true if either the BTRFS_INODE_NODATACOW or
      BTRFS_INODE_PREALLOC flag is set on the inode and the range is not a
      defrag extent.
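
      A hedged sketch of what should_nocow() boils down to, with the defrag
      check expressed via the EXTENT_DEFRAG bit (the exact test in the patch
      may differ):

        static bool should_nocow(struct btrfs_inode *inode, u64 start, u64 end)
        {
                /* Only NODATACOW or preallocated inodes may skip COW ... */
                if (!(inode->flags & (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
                        return false;
                /* ... and even then not for ranges queued for defrag. */
                if (inode->defrag_bytes &&
                    test_range_bit(&inode->io_tree, start, end, EXTENT_DEFRAG, 0, NULL))
                        return false;
                return true;
        }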
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6e65ae76
    • Nikolay Borisov's avatar
    • Jiapeng Chong's avatar
      btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned · 7000babd
      Jiapeng Chong authored
      Fix the following coccicheck warnings:
      
      ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function
      'dev_extent_hole_check_zoned' with return type bool.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7000babd
    • Filipe Manana's avatar
      btrfs: add btree read ahead for incremental send operations · 2ce73c63
      Filipe Manana authored
      Currently we do not do btree read ahead when doing an incremental send,
      however we know that we will read and process any node or leaf in the
      send root that has a generation greater than the generation of the parent
      root. So triggering read ahead for such nodes and leafs is beneficial
      for an incremental send.
      
      This change does that: it triggers read ahead for any node or leaf in the
      send root that has a generation greater than the generation of the
      parent root. As for the parent root, no readahead is triggered because
      knowing in advance which nodes/leaves are going to be read is not so
      linear and there's often a large time window between visiting nodes or
      leaves of the parent root. So I opted to leave out the parent root,
      and triggering read ahead for its nodes/leaves did not seem to make a
      significant difference.
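
      A hedged sketch of the idea as applied while advancing through the send
      root (the names below are illustrative, the real patch wires this into
      the tree walking code in send.c):

        /* Only nodes/leaves newer than the parent root will be processed,
         * so ask the btree search code to read ahead their siblings. */
        if (btrfs_header_generation(eb) >
            btrfs_root_generation(&parent_root->root_item))
                path->reada = READA_FORWARD;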
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of ram:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing incremental send..."
        start=$(date +%s)
        btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
        end=$(date +%s)
        echo
        echo "Incremental send took $((end - start)) seconds"
      
        umount $MNT
      
      Before this change, incremental send duration:
      
        with $initial_file_count == 200000:  51 seconds
        with $initial_file_count == 500000: 168 seconds
      
      After this change, incremental send duration:
      
        with $initial_file_count == 200000:   39 seconds (-26.7%)
        with $initial_file_count == 500000:  125 seconds (-29.4%)
      
      For $initial_file_count == 200000 there are 62600 nodes and leaves in the
      btree of the first snapshot, and 77759 nodes and leaves in the btree of
      the second snapshot. The root nodes were at level 2.
      
      While for $initial_file_count == 500000 there are 152476 nodes and leaves
      in the btree of the first snapshot, and 190511 nodes and leaves in the
      btree of the second snapshot. The root nodes were at level 2 as well.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2ce73c63
    • Filipe Manana's avatar
      btrfs: add btree read ahead for full send operations · 19358b15
      Filipe Manana authored
      When doing a full send we know that we are going to be reading every node
      and leaf of the send root, so we benefit from enabling read ahead for the
      btree.
      
      This change enables read ahead for full send operations only; incremental
      sends will have read ahead enabled in a different way by a separate patch.
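
      A minimal sketch of what enabling it amounts to, assuming the path
      returned by alloc_path_for_send() is the one used to walk the send root:

        path = alloc_path_for_send();
        if (!path)
                return -ENOMEM;
        /* A full send visits every node and leaf of the send root, so read
         * ahead the btree while walking it. */
        path->reada = READA_FORWARD;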
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing incremental send..."
        start=$(date +%s)
        btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
        end=$(date +%s)
        echo
        echo "Incremental send took $((end - start)) seconds"
      
        umount $MNT
      
      Before this change, full send duration:
      
        with $initial_file_count == 200000:  165 seconds
        with $initial_file_count == 500000:  407 seconds
      
      After this change, full send duration:
      
        with $initial_file_count == 200000:  149 seconds (-10.2%)
        with $initial_file_count == 500000:  353 seconds (-14.2%)
      
      For $initial_file_count == 200000 there are 62600 nodes and leaves in the
      btree of the first snapshot, while for $initial_file_count == 500000 there
      are 152476 nodes and leaves. The roots were at level 2.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      19358b15
    • Nikolay Borisov's avatar
      btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata · 98686ffc
      Nikolay Borisov authored
      btrfs_block_rsv_add() can only return -ENOSPC since it's called with the
      NO_FLUSH modifier, so simplify the logic in
      btrfs_delayed_inode_reserve_metadata() to exploit this invariant.
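
      A hedged sketch of the simplified flow, with the assertion matching the
      note added below by the maintainer:

        ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
                                  BTRFS_RESERVE_NO_FLUSH);
        /* NO_FLUSH never waits for space, so the only possible error is ENOSPC. */
        ASSERT(ret == 0 || ret == -ENOSPC);
        if (ret == -ENOSPC) {
                /* ... single fallback path instead of a chain of
                 * error-code checks ... */
        }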
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add assert and comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      98686ffc
    • Nikolay Borisov's avatar
      btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadata · 8e3c9d3c
      Nikolay Borisov authored
      It's only used by the tracepoint to obtain the inode number, but we already
      have the ino from btrfs_delayed_node::inode_id.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8e3c9d3c
    • Nikolay Borisov's avatar
      btrfs: simplify commit logic in try_flush_qgroup · ae396a3b
      Nikolay Borisov authored
      It's no longer expected to call this function with an open transaction,
      so all the workarounds concerning this can be removed. In fact it would
      be a bug to call this function with a transaction already held, so WARN
      in this case.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae396a3b
    • Anand Jain's avatar
      btrfs: scrub: drop a few function declarations · e5ce9886
      Anand Jain authored
      Drop function declarations at the beginning of the file scrub.c. These
      functions are defined before they are used in the same file and don't
      need forward declaration.
      
      No functional changes.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e5ce9886
    • Anand Jain's avatar
      btrfs: change return type to bool in btrfs_extent_readonly · f4639636
      Anand Jain authored
      btrfs_extent_readonly() checks if the block group is readonly, so the bool
      return type should be used.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4639636
    • Anand Jain's avatar
      btrfs: unexport btrfs_extent_readonly() and make it static · 05947ae1
      Anand Jain authored
      btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
      move it from extent-tree.c to inode.c and declare it as static.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      05947ae1
    • Nikolay Borisov's avatar
      btrfs: replace open coded while loop with proper construct · b6e9f16c
      Nikolay Borisov authored
      btrfs_inc_block_group_ro wants to ensure that the current transaction is
      not running dirty block groups; if it is, it waits and loops again.
      That logic is currently implemented using a goto label. Using a proper
      do {} while() construct instead doesn't hurt readability nor does it
      introduce excessive nesting, and it makes the relevant code stand out by
      being encompassed in the loop construct. No functional changes.
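
      A hedged sketch of the resulting shape of btrfs_inc_block_group_ro(),
      with error handling trimmed:

        bool dirty_bg_running;

        do {
                trans = btrfs_join_transaction(fs_info->extent_root);
                if (IS_ERR(trans))
                        return PTR_ERR(trans);

                dirty_bg_running = false;
                /*
                 * If the current transaction is still running dirty block
                 * groups, wait for it to commit and retry with a new one.
                 */
                if (test_bit(BTRFS_TRANS_DIRTY_BG_RUN,
                             &trans->transaction->flags)) {
                        u64 transid = trans->transid;

                        btrfs_end_transaction(trans);
                        ret = btrfs_wait_for_commit(fs_info, transid);
                        if (ret)
                                return ret;
                        dirty_bg_running = true;
                }
        } while (dirty_bg_running);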
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b6e9f16c
    • Nikolay Borisov's avatar
      btrfs: replace offset_in_entry with in_range · 20bbf20e
      Nikolay Borisov authored
      No point in duplicating the functionality, just use the generic helper
      that has the same semantics.
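
      For illustration, a typical call site changes roughly like this (field
      names follow btrfs_ordered_extent):

        /* Before: */
        offset_in_entry(ordered, file_offset)

        /* After: */
        in_range(file_offset, ordered->file_offset, ordered->num_bytes)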
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      20bbf20e
    • Nikolay Borisov's avatar
      cca5de97
    • Nikolay Borisov's avatar
    • Qu Wenruo's avatar
      btrfs: fix comment for btrfs ordered extent flag bits · 0b3dcd13
      Qu Wenruo authored
      There is a small error in the comment about the BTRFS_ORDERED_* flags, added
      in commit 3c198fe0 ("btrfs: rework the order of
      btrfs_ordered_extent::flags"), but the fixup did not get merged in time.
      
      The 4 types are for the ordered extent itself, not for direct io.
      Only 3 types support direct io: REGULAR/NOCOW/PREALLOC.
      
      Fix the comment to reflect that.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b3dcd13
  2. 18 Apr, 2021 6 commits
    • Linus Torvalds's avatar
      Linux 5.12-rc8 · bf05bf16
      Linus Torvalds authored
      bf05bf16
    • Linus Torvalds's avatar
      Merge tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 5ffe04cc
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "Another smaller set of fixes for three of the Arm platforms:
      
        TI OMAP:
      
           Fix swapped mmc device order also for omap3 that got changed with
           the recent PROBE_PREFER_ASYNCHRONOUS changes. While eventually the
           aliases should be board specific, all the mmc device instances are
           there in the SoC, and we do probe them by default so that PM
           runtime can idle the devices if left enabled from the bootloader.
      
        Qualcomm Snapdragon:
      
           This bypasses the recently introduced interconnect handling in
           the GENI (serial engine) driver when running off ACPI, as this
           causes the GENI probe to fail and the Lenovo Yoga C630 to boot
           without keyboard and touchpad.
      
        Allwinner:
      
           One 32kHz clock fix for the beelink gs1, a CD polarity fix for the
           SoPine, some MAINTAINERS maintenance, and a clk / reset switch to
           our headers"
      
      * tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        arm64: dts: allwinner: h6: beelink-gs1: Remove ext. 32 kHz osc reference
        MAINTAINERS: Match on allwinner keyword
        MAINTAINERS: Add our new mailing-list
        arm64: dts: allwinner: Fix SD card CD GPIO for SOPine systems
        arm64: dts: allwinner: h6: Switch to macros for RSB clock/reset indices
        ARM: OMAP2+: Fix uninitialized sr_inst
        ARM: dts: Fix swapped mmc order for omap3
        ARM: OMAP2+: Fix warning for omap_init_time_of()
        soc: qcom: geni: shield geni_icc_get() for ACPI boot
      5ffe04cc
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm · f5ce0466
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
      
       - Halve maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
      
       - Fix conversion for_each_membock() to for_each_mem_range()
      
       - Fix footbridge PCI mapping
      
       - Avoid uprobes hooking on thumb instructions
      
      * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
        ARM: 9071/1: uprobes: Don't hook on thumb instructions
        ARM: footbridge: fix PCI interrupt mapping
        ARM: 9069/1: NOMMU: Fix conversion for_each_membock() to for_each_mem_range()
        ARM: 9063/1: mm: reduce maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
      f5ce0466
    • Fredrik Strupe's avatar
      ARM: 9071/1: uprobes: Don't hook on thumb instructions · d2f7eca6
      Fredrik Strupe authored
      Since uprobes is not supported for thumb, check that the thumb bit is
      not set when matching the uprobes instruction hooks.
      
      The Arm UDF instructions used for uprobes triggering
      (UPROBE_SWBP_ARM_INSN and UPROBE_SS_ARM_INSN) coincidentally share the
      same encoding as a pair of unallocated 32-bit thumb instructions (not
      UDF) when the condition code is 0b1111 (0xf). This in effect makes it
      possible to trigger the uprobes functionality from thumb, and at that
      using two unallocated instructions which are not permanently undefined.
      Signed-off-by: Fredrik Strupe <fredrik@strupe.net>
      Cc: stable@vger.kernel.org
      Fixes: c7edc9e3 ("ARM: add uprobes support")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      d2f7eca6
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · c98ff1d0
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two fixes: the libsas fix is for a problem that occurs when trying to
        change the cache type of an ATA device and the libiscsi one is a
        regression fix from this merge window"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: libsas: Reset num_scatter if libata marks qc as NODATA
        scsi: iscsi: Fix iSCSI cls conn state
      c98ff1d0
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm · aba5970c
      Linus Torvalds authored
      Pull vmwgfx fixes from Dave Airlie:
       "This contains two regression fixes for vmwgfx, one due to a refactor
        which meant locks were being used before initialisation, and the other
        fixing up some warnings from the core when destroying pinned
        buffers.
      
        vmwgfx:
      
         - fixed unpinning before destruction
      
         - lockdep init reordering"
      
      * tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm:
        drm/vmwgfx: Make sure bo's are unpinned before putting them back
        drm/vmwgfx: Fix the lockdep breakage
        drm/vmwgfx: Make sure we unpin no longer needed buffers
      aba5970c
  3. 17 Apr, 2021 9 commits
    • Dave Airlie's avatar
      Merge tag 'vmwgfx-fixes-2021-04-14' of gitlab.freedesktop.org:zack/vmwgfx into drm-fixes · 796b556c
      Dave Airlie authored
      vmwgfx fixes for regressions in 5.12
      
      Here's a set of 3 patches fixing ugly regressions in the vmwgfx
      driver. We broke the lock initialization code and ended up using
      spinlocks before they were initialized, breaking lockdep.
      There was also some fallout from drm changes which made the core
      validate that unreferenced buffers have been unpinned. The vmwgfx
      pinning code predates a lot of the core drm and wasn't written to
      account for those semantics. Fortunately, the changes required to
      fix it are not too intrusive.
      The changes have been validated by our internal CI.
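      
      In other words, the core now complains if a buffer object still has
      a non-zero pin count when its last reference goes away. A toy
      refcount/pin model of that rule (the names are illustrative, not
      vmwgfx or TTM API), showing the teardown order the driver has to
      follow:
      
        #include <assert.h>
        #include <stdio.h>
        #include <stdlib.h>
        
        struct toy_bo {
            int refcount;
            int pin_count;
        };
        
        static struct toy_bo *bo_create(void)
        {
            struct toy_bo *bo = calloc(1, sizeof(*bo));
            if (bo)
                bo->refcount = 1;
            return bo;
        }
        
        static void bo_pin(struct toy_bo *bo)   { bo->pin_count++; }
        static void bo_unpin(struct toy_bo *bo) { assert(bo->pin_count > 0); bo->pin_count--; }
        
        static void bo_put(struct toy_bo *bo)
        {
            if (--bo->refcount)
                return;
            if (bo->pin_count)  /* the condition the core now warns about */
                fprintf(stderr, "WARN: freeing a bo that is still pinned (%d)\n",
                        bo->pin_count);
            free(bo);
        }
        
        int main(void)
        {
            struct toy_bo *bo = bo_create();
        
            if (!bo)
                return 1;
            bo_pin(bo);
            /* Correct teardown: drop the pin before dropping the last reference. */
            bo_unpin(bo);
            bo_put(bo);
            return 0;
        }
      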
      Signed-off-by: Zack Rusin <zackr@vmware.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      
      From: Zack Rusin <zackr@vmware.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/f7add0a2-162e-3bd2-b1be-344a94f2acbf@vmware.com
      796b556c
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 194cf482
      Linus Torvalds authored
      Pull i2c fix from Wolfram Sang:
       "One more driver bugfix for I2C"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: mv64xxx: Fix random system lock caused by runtime PM
      194cf482
    • Linus Torvalds's avatar
      readdir: make sure to verify directory entry for legacy interfaces too · 0c93ac69
      Linus Torvalds authored
      This does the directory entry name verification for the legacy
      "fillonedir" (and compat) interface that goes all the way back to the
      dark ages before we had a proper dirent, and the readdir() system call
      returned just a single entry at a time.
      
      Nobody should use this interface unless you still have binaries from
      1991, but let's do it right.
      
      This came up during discussions about unsafe_copy_to_user() and proper
      checking of all the inputs to it, as the networking layer is looking to
      use it in a few new places.  So let's make sure the _old_ users do it
      all right and proper, before we add new ones.
      
      See also commit 8a23eb80 ("Make filldir[64]() verify the directory
      entry filename is valid") which did the proper modern interfaces that
      people actually use. It had a note:
      
          Note that I didn't bother adding the checks to any legacy interfaces
          that nobody uses.
      
      which this now corrects.  Note that we really don't care about POSIX and
      the presence of '/' in a directory entry, but verify_dirent_name() also
      ends up doing the proper name length verification which is what the
      input checking discussion was about.
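      
      For reference, the check amounts to rejecting empty names, absurdly
      long names, and names containing '/'. A user-space sketch modeled on
      verify_dirent_name() in fs/readdir.c; the length limit and error
      code below are illustrative rather than the kernel's exact values:
      
        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        
        #define NAME_LIMIT 4096  /* stand-in for the kernel's upper bound */
        
        static int verify_dirent_name_model(const char *name, int len)
        {
            if (len <= 0 || len >= NAME_LIMIT)
                return -EIO;
            if (memchr(name, '/', len))  /* '/' can never appear in an entry name */
                return -EIO;
            return 0;
        }
        
        int main(void)
        {
            const char *samples[] = { "valid-name", "", "evil/name" };
        
            for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                printf("%-12s -> %d\n", samples[i],
                       verify_dirent_name_model(samples[i], (int)strlen(samples[i])));
            return 0;
        }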
      
      [ Another option would be to remove the support for this particular very
        old interface: any binaries that use it are likely a.out binaries, and
        they will no longer run anyway since we removed a.out binfmt support
        in commit eac61655 ("x86: Deprecate a.out support").
      
        But I'm not sure which came first: getdents() or ELF support, so let's
        pretend somebody might still have a working binary that uses the
        legacy readdir() case.. ]
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjbvzCAhAtvG0d81W5o0-KT5PPTHhfJ5ieDFq+bGtgOYg@mail.gmail.com/
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c93ac69
    • Linus Torvalds's avatar
      Merge tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 88a5af94
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes for 5.12-rc8, including fixes from netfilter, and
        bpf. BPF verifier changes stand out, otherwise things have slowed
        down.
      
        Current release - regressions:
      
         - gro: ensure frag0 meets IP header alignment
      
         - Revert "net: stmmac: re-init rx buffers when mac resume back"
      
         - ethernet: macb: fix the restore of cmp registers
      
        Previous releases - regressions:
      
         - ixgbe: Fix NULL pointer dereference in ethtool loopback test
      
         - ixgbe: fix unbalanced device enable/disable in suspend/resume
      
         - phy: marvell: fix detection of PHY on Topaz switches
      
         - make tcp_allowed_congestion_control readonly in non-init netns
      
         - xen-netback: Check for hotplug-status existence before watching
      
        Previous releases - always broken:
      
         - bpf: mitigate a speculative oob read of up to map value size by
           tightening the masking window
      
         - sctp: fix race condition in sctp_destroy_sock
      
         - sit, ip6_tunnel: Unregister catch-all devices
      
         - netfilter: nftables: clone set element expression template
      
         - netfilter: flowtable: fix NAT IPv6 offload mangling
      
         - net: geneve: check skb is large enough for IPv4/IPv6 header
      
         - netlink: don't call ->netlink_bind with table lock held"
      
      * tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
        netlink: don't call ->netlink_bind with table lock held
        MAINTAINERS: update my email
        bpf: Update selftests to reflect new error states
        bpf: Tighten speculative pointer arithmetic mask
        bpf: Move sanitize_val_alu out of op switch
        bpf: Refactor and streamline bounds check into helper
        bpf: Improve verifier error messages for users
        bpf: Rework ptr_limit into alu_limit and add common error path
        bpf: Ensure off_reg has no mixed signed bounds for all types
        bpf: Move off_reg into sanitize_ptr_alu
        bpf: Use correct permission flag for mixed signed bounds arithmetic
        ch_ktls: do not send snd_una update to TCB in middle
        ch_ktls: tcb close causes tls connection failure
        ch_ktls: fix device connection close
        ch_ktls: Fix kernel panic
        i40e: fix the panic when running bpf in xdpdrv mode
        net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
        net/mlx5e: Fix setting of RS FEC mode
        net/mlx5: Fix setting of devlink traps in switchdev mode
        Revert "net: stmmac: re-init rx buffers when mac resume back"
        ...
      88a5af94
    • Linus Torvalds's avatar
      Merge tag 'libnvdimm-fixes-for-5.12-rc8' of... · bdfd99e6
      Linus Torvalds authored
      Merge tag 'libnvdimm-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
      
      Pull libnvdimm fixes from Dan Williams:
       "The largest change is for a regression that landed during -rc1 for
        block-device read-only handling. Vaibhav found a new use for the
        ability (originally introduced by virtio_pmem) to call back to the
        platform to flush data, but also found an original bug in that
        implementation. Lastly, Arnd cleans up some compile warnings in dax.
      
        This has all appeared in -next with no reported issues.
      
        Summary:
      
         - Fix a regression of read-only handling in the pmem driver
      
         - Fix a compile warning
      
         - Fix support for platform cache flush commands on powerpc/papr"
      
      * tag 'libnvdimm-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC
        libnvdimm: Notify disk drivers to revalidate region read-only
        dax: avoid -Wempty-body warnings
      bdfd99e6
    • Linus Torvalds's avatar
      Merge tag 'cxl-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 7c226774
      Linus Torvalds authored
      Pull CXL memory class fixes from Dan Williams:
       "A collection of fixes for the CXL memory class driver introduced in
        this release cycle.
      
        The driver was primarily developed on a work-in-progress QEMU
        emulation of the interface and we have since found a couple places
        where it hid spec compliance bugs in the driver, or had a spec
        implementation bug itself.
      
        The biggest change here is replacing a percpu_ref with an rwsem to
        clean up a couple of bugs in the error unwind path during ioctl device
        init. Lastly there were some minor cleanups to not export the
        power-management sysfs-ABI for the ioctl device, use the proper sysfs
        helper for emitting values, and prevent subtle bugs as new
        administration commands are added to the supported list.
      
        The bulk of it has appeared in -next save for the top commit which was
        found today and validated on a fixed-up QEMU model.
      
        Summary:
      
         - Fix support for CXL memory devices with registers offset from the
           BAR base.
      
         - Fix the reporting of device capacity.
      
         - Fix the driver commands list definition to be disconnected from the
           UAPI command list.
      
         - Replace percpu_ref with rwsem to fix initialization error path.
      
         - Fix leaks in the driver initialization error path.
      
         - Drop the power/ directory from CXL device sysfs.
      
         - Use the recommended sysfs helper for attribute 'show'
           implementations"
      
      * tag 'cxl-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/mem: Fix memory device capacity probing
        cxl/mem: Fix register block offset calculation
        cxl/mem: Force array size of mem_commands[] to CXL_MEM_COMMAND_ID_MAX
        cxl/mem: Disable cxl device power management
        cxl/mem: Do not rely on device_add() side effects for dev_set_name() failures
        cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations
        cxl/mem: Use sysfs_emit() for attribute show routines
      7c226774
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · fdb5d6ca
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "12 patches.
      
        Subsystems affected by this patch series: mm (documentation, kasan,
        and pagemap), csky, ia64, gcov, and lib"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        lib: remove "expecting prototype" kernel-doc warnings
        gcov: clang: fix clang-11+ build
        mm: ptdump: fix build failure
        mm/mapping_dirty_helpers: guard hugepage pud's usage
        ia64: tools: remove duplicate definition of ia64_mf() on ia64
        ia64: tools: remove inclusion of ia64-specific version of errno.h header
        ia64: fix discontig.c section mismatches
        ia64: remove duplicate entries in generic_defconfig
        csky: change a Kconfig symbol name to fix e1000 build error
        kasan: remove redundant config option
        kasan: fix hwasan build for gcc
        mm: eliminate "expecting prototype" kernel-doc warnings
      fdb5d6ca
    • Dan Williams's avatar
      cxl/mem: Fix memory device capacity probing · fae8817a
      Dan Williams authored
      The CXL Identify Memory Device output payload emits capacity in 256MB
      units. The driver is treating the capacity field as bytes. This was
      missed because QEMU reports bytes when it should report bytes / 256MB.
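      
      The fix amounts to scaling the reported values before they are used
      as byte counts. A minimal sketch of the arithmetic (the field and
      constant names are illustrative, not the driver's):
      
        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>
        
        /* Identify Memory Device reports capacities in 256MB units. */
        #define CAPACITY_MULTIPLIER (256ULL * 1024 * 1024)
        
        int main(void)
        {
            uint64_t total_capacity_units = 16;  /* example value from the payload */
            uint64_t total_bytes = total_capacity_units * CAPACITY_MULTIPLIER;
        
            printf("total capacity: %" PRIu64 " bytes (%" PRIu64 " GiB)\n",
                   total_bytes, total_bytes >> 30);
            return 0;
        }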
      
      Fixes: 8adaf747 ("cxl/mem: Find device capabilities")
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Link: https://lore.kernel.org/r/161862021044.3259705.7008520073059739760.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      fae8817a
    • Florian Westphal's avatar
      netlink: don't call ->netlink_bind with table lock held · f2764bd4
      Florian Westphal authored
      When I added support to allow generic netlink multicast groups to be
      restricted to subscribers with CAP_NET_ADMIN I was unaware that a
      genl_bind implementation already existed in the past.
      
      It was reverted due to ABBA deadlock:
      
      1. ->netlink_bind gets called with the table lock held.
      2. genetlink bind callback is invoked, it grabs the genl lock.
      
      But when a new genl subsystem is (un)registered, these two locks are
      taken in reverse order.
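      
      A deliberately simplified pthread model of those two orderings; the
      lock and function names are illustrative stand-ins for the netlink
      table lock and the genl mutex:
      
        #include <pthread.h>
        #include <stdio.h>
        
        static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t genl_lock  = PTHREAD_MUTEX_INITIALIZER;
        
        /* Path 1: group join -> ->netlink_bind() -> genl bind callback. */
        static void bind_path(void)
        {
            pthread_mutex_lock(&table_lock);   /* A */
            pthread_mutex_lock(&genl_lock);    /* then B */
            puts("bind path: table_lock -> genl_lock");
            pthread_mutex_unlock(&genl_lock);
            pthread_mutex_unlock(&table_lock);
        }
        
        /* Path 2: genl subsystem (un)registration touching netlink groups. */
        static void register_path(void)
        {
            pthread_mutex_lock(&genl_lock);    /* B */
            pthread_mutex_lock(&table_lock);   /* then A */
            puts("register path: genl_lock -> table_lock");
            pthread_mutex_unlock(&table_lock);
            pthread_mutex_unlock(&genl_lock);
        }
        
        int main(void)
        {
            /* Run sequentially just to show the two orders; with real
             * concurrency each path can hold one lock while waiting for
             * the other, which is the ABBA deadlock.  Calling the bind
             * callback without table_lock held removes the A -> B order. */
            bind_path();
            register_path();
            return 0;
        }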
      
      One solution would be to revert again and add a comment in genl
      referring to commit 1e82a62f ("genetlink: remove genl_bind").
      
      This would need a second change in mptcp to not expose the raw token
      value anymore, e.g.  by hashing the token with a secret key so userspace
      can still associate subflow events with the correct mptcp connection.
      
      However, Paolo Abeni reminded me to double-check why the netlink table is
      locked in the first place.
      
      I can't find one.  netlink_bind() is already called without this lock
      when userspace joins a group via NETLINK_ADD_MEMBERSHIP setsockopt.
      Same holds for the netlink_unbind operation.
      
      Digging through the history, commit f7736080
      ("netlink: access nlk groups safely in netlink bind and getname")
      expanded the lock scope.
      
      Commit 3a20773b ("net: netlink: cap max groups which will be
      considered in netlink_bind()") later removed the nlk->ngroups
      access that the lock scope extension was all about.
      
      Reduce the lock scope again and always call ->netlink_bind without
      the table lock.
      
      The Fixes tag should be vs. the patch mentioned in the link below,
      but that one got squash-merged into the patch that came earlier in the
      series.
      
      Fixes: 4d54cc32 ("mptcp: avoid lock_fast usage in accept path")
      Link: https://lore.kernel.org/mptcp/20210213000001.379332-8-mathew.j.martineau@linux.intel.com/T/#u
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Cc: Sean Tranchetti <stranche@codeaurora.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f2764bd4