1. 28 Sep, 2020 2 commits
    • Jens Axboe's avatar
      io_uring: fix potential ABBA deadlock in ->show_fdinfo() · fad8e0de
      Jens Axboe authored
      syzbot reports a potential lock deadlock between the normal IO path and
      ->show_fdinfo():
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.9.0-rc6-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.2/19710 is trying to acquire lock:
      ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296
      
      but task is already holding lock:
      ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&ctx->uring_lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
             io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
             seq_show+0x4a8/0x700 fs/proc/fd.c:65
             seq_read+0x432/0x1070 fs/seq_file.c:208
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&p->lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:956 [inline]
             __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
             seq_read+0x61/0x1070 fs/seq_file.c:155
             pde_read fs/proc/inode.c:306 [inline]
             proc_reg_read+0x221/0x300 fs/proc/inode.c:318
             do_loop_readv_writev fs/read_write.c:734 [inline]
             do_loop_readv_writev fs/read_write.c:721 [inline]
             do_iter_read+0x48e/0x6e0 fs/read_write.c:955
             vfs_readv+0xe5/0x150 fs/read_write.c:1073
             kernel_readv fs/splice.c:355 [inline]
             default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
             do_splice_to+0x137/0x170 fs/splice.c:871
             splice_direct_to_actor+0x307/0x980 fs/splice.c:950
             do_splice_direct+0x1b3/0x280 fs/splice.c:1059
             do_sendfile+0x55f/0xd40 fs/read_write.c:1540
             __do_sys_sendfile64 fs/read_write.c:1601 [inline]
             __se_sys_sendfile64 fs/read_write.c:1587 [inline]
             __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (sb_writers#4){.+.+}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:2496 [inline]
             check_prevs_add kernel/locking/lockdep.c:2601 [inline]
             validate_chain kernel/locking/lockdep.c:3218 [inline]
             __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
             lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
             percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
             __sb_start_write+0x228/0x450 fs/super.c:1672
             io_write+0x6b5/0xb30 fs/io_uring.c:3296
             io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
             __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
             io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
             io_submit_sqe fs/io_uring.c:6324 [inline]
             io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
             __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
             do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      other info that might help us debug this:
      
      Chain exists of:
        sb_writers#4 --> &p->lock --> &ctx->uring_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ctx->uring_lock);
                                     lock(&p->lock);
                                     lock(&ctx->uring_lock);
        lock(sb_writers#4);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/19710:
       #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348
      
      stack backtrace:
      CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x198/0x1fd lib/dump_stack.c:118
       check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45e179
      Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
      RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
      RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
      R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c
      
      Fix this by just not diving into details if we fail to trylock the
      io_uring mutex. We know the ctx isn't going away during this operation,
      but we cannot safely iterate buffers/files/personalities if we don't
      hold the io_uring mutex.
      
      Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fad8e0de
    • Jens Axboe's avatar
      io_uring: always delete double poll wait entry on match · 8706e04e
      Jens Axboe authored
      syzbot reports a crash with tty polling, which is using the double poll
      handling:
      
      general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
      CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
      RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
      Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
      RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
      RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
      RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
      R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
      R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
      FS:  0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
       __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
       tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
       __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
       __tty_hangup drivers/tty/tty_io.c:575 [inline]
       tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
       pty_close+0x3f5/0x550 drivers/tty/pty.c:79
       tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
       __fput+0x285/0x920 fs/file_table.c:281
       task_work_run+0xdd/0x190 kernel/task_work.c:141
       tracehook_notify_resume include/linux/tracehook.h:188 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
       exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
       syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x401210
      
      which is due to a failure in removing the double poll wait entry if we
      hit a wakeup match. This can cause multiple invocations of the wakeup,
      which isn't safe.
      
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8706e04e
  2. 25 Sep, 2020 3 commits
  3. 21 Sep, 2020 5 commits
  4. 14 Sep, 2020 2 commits
  5. 13 Sep, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: grab any needed state during defer prep · 202700e1
      Jens Axboe authored
      Always grab work environment for deferred links. The assumption that we
      will be running it always from the task in question is false, as exiting
      tasks may mean that we're deferring this one to a thread helper. And at
      that point it's too late to grab the work environment.
      
      Fixes: debb85f4 ("io_uring: factor out grab_env() from defer_prep()")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      202700e1
  6. 05 Sep, 2020 3 commits
  7. 02 Sep, 2020 2 commits
  8. 01 Sep, 2020 1 commit
  9. 27 Aug, 2020 3 commits
  10. 26 Aug, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: make offset == -1 consistent with preadv2/pwritev2 · 0fef9483
      Jens Axboe authored
      The man page for io_uring generally claims were consistent with what
      preadv2 and pwritev2 accept, but turns out there's a slight discrepancy
      in how offset == -1 is handled for pipes/streams. preadv doesn't allow
      it, but preadv2 does. This currently causes io_uring to return -EINVAL
      if that is attempted, but we should allow that as documented.
      
      This change makes us consistent with preadv2/pwritev2 for just passing
      in a NULL ppos for streams if the offset is -1.
      
      Cc: stable@vger.kernel.org # v5.7+
      Reported-by: default avatarBenedikt Ames <wisp3rwind@posteo.eu>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0fef9483
  11. 25 Aug, 2020 4 commits
  12. 23 Aug, 2020 2 commits
  13. 20 Aug, 2020 3 commits
    • Pavel Begunkov's avatar
      io_uring: kill extra iovec=NULL in import_iovec() · 867a23ea
      Pavel Begunkov authored
      If io_import_iovec() returns an error, return iovec is undefined and
      must not be used, so don't set it to NULL when failing.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      867a23ea
    • Pavel Begunkov's avatar
      io_uring: comment on kfree(iovec) checks · f261c168
      Pavel Begunkov authored
      kfree() handles NULL pointers well, but io_{read,write}() checks it
      because of performance reasons. Leave a comment there for those who are
      tempted to patch it.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f261c168
    • Pavel Begunkov's avatar
      io_uring: fix racy req->flags modification · bb175342
      Pavel Begunkov authored
      Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
      io_cqring_overflow_flush() are racy, because they might be called
      asynchronously.
      
      REQ_F_OVERFLOW flag in only needed for files cancellation, so if it can
      be guaranteed that requests _currently_ marked inflight can't be
      overflown, the problem will be solved with removing the flag
      altogether.
      
      That's how the patch works, it removes inflight status of a request
      in io_cqring_fill_event() whenever it should be thrown into CQ-overflow
      list. That's Ok to do, because no opcode specific handling can be done
      after io_cqring_fill_event(), the same assumption as with "struct
      io_completion" patches.
      And it already have a good place for such cleanups, which is
      io_clean_op(). A nice side effect of this is removing this inflight
      check from the hot path.
      
      note on synchronisation: now __io_cqring_fill_event() may be taking two
      spinlocks simultaneously, completion_lock and inflight_lock. It's fine,
      because we never do that in reverse order, and CQ-overflow of inflight
      requests shouldn't happen often.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bb175342
  14. 19 Aug, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: use system_unbound_wq for ring exit work · fc666777
      Jens Axboe authored
      We currently use system_wq, which is unbounded in terms of number of
      workers. This means that if we're exiting tons of rings at the same
      time, then we'll briefly spawn tons of event kworkers just for a very
      short blocking time as the rings exit.
      
      Use system_unbound_wq instead, which has a sane cap on the concurrency
      level.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fc666777
  15. 18 Aug, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: cleanup io_import_iovec() of pre-mapped request · 8452fd0c
      Jens Axboe authored
      io_rw_prep_async() goes through a dance of clearing req->io, calling
      the iovec import, then re-setting req->io. Provide an internal helper
      that does the right thing without needing state tweaked to get there.
      
      This enables further cleanups in io_read, io_write, and
      io_resubmit_prep(), but that's left for another time.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8452fd0c
  16. 16 Aug, 2020 6 commits
    • Jens Axboe's avatar
      io_uring: get rid of kiocb_wait_page_queue_init() · 3b2a4439
      Jens Axboe authored
      The 5.9 merge moved this function io_uring, which means that we don't
      need to retain the generic nature of it. Clean up this part by removing
      redundant checks, and just inlining the small remainder in
      io_rw_should_retry().
      
      No functional changes in this patch.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3b2a4439
    • Jens Axboe's avatar
      io_uring: find and cancel head link async work on files exit · b711d4ea
      Jens Axboe authored
      Commit f254ac04 ("io_uring: enable lookup of links holding inflight files")
      only handled 2 out of the three head link cases we have, we also need to
      lookup and cancel work that is blocked in io-wq if that work has a link
      that's holding a reference to the files structure.
      
      Put the "cancel head links that hold this request pending" logic into
      io_attempt_cancel(), which will to through the motions of finding and
      canceling head links that hold the current inflight files stable request
      pending.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b711d4ea
    • Linus Torvalds's avatar
      Linux 5.9-rc1 · 9123e3a7
      Linus Torvalds authored
      9123e3a7
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block · 2cc3c4b3
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "A few differerent things in here.
      
        Seems like syzbot got some more io_uring bits wired up, and we got a
        handful of reports and the associated fixes are in here.
      
        General fixes too, and a lot of them marked for stable.
      
        Lastly, a bit of fallout from the async buffered reads, where we now
        more easily trigger short reads. Some applications don't really like
        that, so the io_read() code now handles short reads internally, and
        got a cleanup along the way so that it's now easier to read (and
        documented). We're now passing tests that failed before"
      
      * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
        io_uring: short circuit -EAGAIN for blocking read attempt
        io_uring: sanitize double poll handling
        io_uring: internally retry short reads
        io_uring: retain iov_iter state over io_read/io_write calls
        task_work: only grab task signal lock when needed
        io_uring: enable lookup of links holding inflight files
        io_uring: fail poll arm on queue proc failure
        io_uring: hold 'ctx' reference around task_work queue + execute
        fs: RWF_NOWAIT should imply IOCB_NOIO
        io_uring: defer file table grabbing request cleanup for locked requests
        io_uring: add missing REQ_F_COMP_LOCKED for nested requests
        io_uring: fix recursive completion locking on oveflow flush
        io_uring: use TWA_SIGNAL for task_work uncondtionally
        io_uring: account locked memory before potential error case
        io_uring: set ctx sq/cq entry count earlier
        io_uring: Fix NULL pointer dereference in loop_rw_iter()
        io_uring: add comments on how the async buffered read retry works
        io_uring: io_async_buf_func() need not test page bit
      2cc3c4b3
    • Mike Rapoport's avatar
      parisc: fix PMD pages allocation by restoring pmd_alloc_one() · 6f6aea7e
      Mike Rapoport authored
      Commit 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one()
      and pmd_free_one()") converted parisc to use generic version of
      pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for
      PMD.
      
      Restore the original version of pmd_alloc_one() for parisc, just use
      GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and
      memset.
      
      Fixes: 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()")
      Reported-by: default avatarMeelis Roos <mroos@linux.ee>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Tested-by: default avatarMeelis Roos <mroos@linux.ee>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.eeSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6f6aea7e
    • Linus Torvalds's avatar
      Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block · 4b6c093e
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes on the block side of things:
      
         - Discard granularity fix (Coly)
      
         - rnbd cleanups (Guoqing)
      
         - md error handling fix (Dan)
      
         - md sysfs fix (Junxiao)
      
         - Fix flush request accounting, which caused an IO slowdown for some
           configurations (Ming)
      
         - Properly propagate loop flag for partition scanning (Lennart)"
      
      * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
        block: fix double account of flush request's driver tag
        loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
        rnbd: no need to set bi_end_io in rnbd_bio_map_kern
        rnbd: remove rnbd_dev_submit_io
        md-cluster: Fix potential error pointer dereference in resize_bitmaps()
        block: check queue's limits.discard_granularity in __blkdev_issue_discard()
        md: get sysfs entry after redundancy attr group create
      4b6c093e