1. 10 Aug, 2021 5 commits
    • io_uring: fix ctx-exit io_rsrc_put_work() deadlock · 43597aac
      Pavel Begunkov authored
      __io_rsrc_put_work() might need ->uring_lock, so nobody should wait for
      rsrc nodes holding the mutex. However, that's exactly what
      io_ring_ctx_free() does with io_wait_rsrc_data().
      
      Split it into rsrc wait + dealloc, and move the first one out of the
      lock.
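The resulting ordering in io_ring_ctx_free() can be sketched roughly as follows (a sketch based on the message above, not the verbatim patch; helper and field names are assumptions):

```c
/* sketch: wait first, WITHOUT uring_lock, because the queued
 * __io_rsrc_put_work() may itself need that mutex to complete */
static void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
	/* 1) wait for outstanding rsrc nodes, lock not held */
	io_wait_rsrc_data(ctx->buf_data);
	io_wait_rsrc_data(ctx->file_data);

	/* 2) only the deallocation step runs under the lock */
	mutex_lock(&ctx->uring_lock);
	if (ctx->buf_data)
		__io_sqe_buffers_unregister(ctx);
	if (ctx->file_data)
		__io_sqe_files_unregister(ctx);
	mutex_unlock(&ctx->uring_lock);
}
```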
      
      Cc: stable@vger.kernel.org
      Fixes: b60c8dce ("io_uring: preparation for rsrc tagging")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: drop ctx->uring_lock before flushing work item · c018db4a
      Jens Axboe authored
      Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags
      from the regression suite:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc3-bluetea-test-00249-gc7d10223 #5 Tainted: G           OE
      ------------------------------------------------------
      kworker/2:4/2684 is trying to acquire lock:
      ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0
      
      but task is already holding lock:
      ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
             __flush_work+0x31b/0x490
             io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
             __do_sys_io_uring_register+0x45b/0x1060
             do_syscall_64+0x35/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #0 (&ctx->uring_lock){+.+.}-{3:3}:
             __lock_acquire+0x119a/0x1e10
             lock_acquire+0xc8/0x2f0
             __mutex_lock+0x86/0x740
             io_rsrc_put_work+0x13d/0x1a0
             process_one_work+0x236/0x530
             worker_thread+0x52/0x3b0
             kthread+0x135/0x160
             ret_from_fork+0x1f/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock((work_completion)(&(&ctx->rsrc_put_work)->work));
                                     lock(&ctx->uring_lock);
                                     lock((work_completion)(&(&ctx->rsrc_put_work)->work));
        lock(&ctx->uring_lock);
      
       *** DEADLOCK ***
      
      2 locks held by kworker/2:4/2684:
       #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
       #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
      
      stack backtrace:
      CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G           OE     5.14.0-rc3-bluetea-test-00249-gc7d10223 #5
      Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
      Workqueue: events io_rsrc_put_work
      Call Trace:
       dump_stack_lvl+0x6a/0x9a
       check_noncircular+0xfe/0x110
       __lock_acquire+0x119a/0x1e10
       lock_acquire+0xc8/0x2f0
       ? io_rsrc_put_work+0x13d/0x1a0
       __mutex_lock+0x86/0x740
       ? io_rsrc_put_work+0x13d/0x1a0
       ? io_rsrc_put_work+0x13d/0x1a0
       ? io_rsrc_put_work+0x13d/0x1a0
       ? process_one_work+0x1ce/0x530
       io_rsrc_put_work+0x13d/0x1a0
       process_one_work+0x236/0x530
       worker_thread+0x52/0x3b0
       ? process_one_work+0x530/0x530
       kthread+0x135/0x160
       ? set_kthread_struct+0x40/0x40
       ret_from_fork+0x1f/0x30
      
      which is due to holding the ctx->uring_lock while flushing existing
      pending work, even though the flushed work may itself need to grab the
      uring lock if we're using IOPOLL.
      
      Fix this by dropping the uring_lock a bit earlier as part of the flush.
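The shape of the fix, roughly (a sketch; the flush is assumed to sit in io_rsrc_ref_quiesce(), per the dependency chain in the splat above):

```c
/* sketch: don't hold uring_lock across the flush, since the flushed
 * io_rsrc_put_work() may need to take it under IOPOLL */
mutex_unlock(&ctx->uring_lock);
flush_delayed_work(&ctx->rsrc_put_work);
ret = wait_for_completion_interruptible(&data->done);
mutex_lock(&ctx->uring_lock);
```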
      
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/404
      Tested-by: Ammar Faizi <ammarfaizi2@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker() · 47cae0c7
      Hao Xu authored
      There may be cases like:
              A                                 B
      spin_lock(wqe->lock)
      nr_workers is 0
      nr_workers++
      spin_unlock(wqe->lock)
                                           spin_lock(wqe->lock)
                                     nr_workers is 1
                                           nr_workers++
                                           spin_unlock(wqe->lock)
      create_io_worker()
        acct->worker is 1
                                           create_io_worker()
                                             acct->worker is 1
      
      There should be one worker marked IO_WORKER_F_FIXED, but none is.
      Fix this by introducing a new argument to create_io_worker() to
      indicate whether it is the first worker.
      
      Fixes: 3d4e4fac ("io-wq: fix no lock protection of acct->nr_worker")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix bug of creating io-workers unconditionally · 49e7f0c7
      Hao Xu authored
      The former patch adding the check between nr_workers and max_workers
      has a bug that causes io-workers to be created unconditionally: the
      result of the check doesn't affect the call to create_io_worker().
      Fix it by introducing a boolean that gates the call.
      
      Fixes: 21698274 ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com
      [axboe: drop hunk that isn't strictly needed]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: rsrc ref lock needs to be IRQ safe · 4956b9ea
      Jens Axboe authored
      Nadav reports running into the below splat on re-enabling softirqs:
      
      WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0
      Modules linked in:
      CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161
      Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
      RIP: 0010:__local_bh_enable_ip+0xaa/0xe0
      Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65
      RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000
      RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac
      RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9
      R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae
      R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890
      FS:  00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0
      Call Trace:
       _raw_spin_unlock_bh+0x31/0x40
       io_rsrc_node_ref_zero+0x13e/0x190
       io_dismantle_req+0x215/0x220
       io_req_complete_post+0x1b8/0x720
       __io_complete_rw.isra.0+0x16b/0x1f0
       io_complete_rw+0x10/0x20
      
      where it's clear we end up calling the percpu count release directly
      from the completion path, as it's in atomic mode and we drop the last
      ref. For file/block IO, this can be from IRQ context already, and the
      softirq locking for rsrc isn't enough.
      
      Just make the lock fully IRQ safe, and ensure we correctly save state
      from the release path, as we don't know the full context there.
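The pattern of the fix, sketched: the softirq-only _bh lock/unlock pair becomes the fully IRQ-safe variant that saves and restores the prior interrupt state (the lock name is assumed from the traces above):

```c
/* sketch: the rsrc ref lock taken with IRQs disabled and prior state
 * saved, since the final percpu ref put may arrive from IRQ context */
unsigned long flags;

spin_lock_irqsave(&ctx->rsrc_ref_lock, flags);
/* ... move the node to the rsrc put list, schedule the put work ... */
spin_unlock_irqrestore(&ctx->rsrc_ref_lock, flags);
```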
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Tested-by: Nadav Amit <nadav.amit@gmail.com>
      Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 09 Aug, 2021 2 commits
    • io_uring: Use WRITE_ONCE() when writing to sq_flags · 20c0b380
      Nadav Amit authored
      The compiler should be forbidden from any strange optimization for async
      writes to user visible data-structures. Without proper protection, the
      compiler can cause write-tearing or invent writes that would confuse
      userspace.
      
      However, there are writes to sq_flags which are not protected by
      WRITE_ONCE(). Use WRITE_ONCE() for these writes.
      
      This is purely a theoretical issue. Presumably, any compiler is very
      unlikely to do such optimizations.
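The before/after shape of such a conversion (an illustrative sketch; the actual patch touches the writes to ctx->rings->sq_flags):

```c
/* before: a plain read-modify-write the compiler may tear or duplicate */
ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;

/* after: WRITE_ONCE() guarantees a single, untorn store */
WRITE_ONCE(ctx->rings->sq_flags,
	   ctx->rings->sq_flags | IORING_SQ_NEED_WAKEUP);
```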
      
      Fixes: 75b28aff ("io_uring: allocate the two rings together")
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Link: https://lore.kernel.org/r/20210808001342.964634-3-namit@vmware.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: clear TIF_NOTIFY_SIGNAL when running task work · ef98eb04
      Nadav Amit authored
      When using SQPOLL, the submission queue polling thread calls
      task_work_run() to run queued work. However, when work is added with
      TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
      set afterwards and is never cleared.
      
      Consequently, when the submission queue polling thread checks
      signal_pending(), it may always find a pending signal if
      task_work_add() was ever called before.
      
      The impact of this bug might be different on different kernel versions.
      It appears that on 5.14 it would only cause unnecessary calculation and
      prevent the polling thread from sleeping. On 5.13, where the bug was
      found, it stops the polling thread from finding newly submitted work.
      
      Instead of task_work_run(), use tracehook_notify_signal() that clears
      TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
      current->task_works to avoid a race in which task_works is cleared but
      the TIF_NOTIFY_SIGNAL is set.
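The resulting helper looks roughly like this (a sketch; io_run_task_work() is the plausible location in 5.14-era fs/io_uring.c, and the body is reconstructed from the description above):

```c
/* sketch: run task work through the path that also clears
 * TIF_NOTIFY_SIGNAL, and test the flag as well as task_works */
static inline bool io_run_task_work(void)
{
	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
		__set_current_state(TASK_RUNNING);
		tracehook_notify_signal();	/* runs work, clears flag */
		return true;
	}
	return false;
}
```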
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Link: https://lore.kernel.org/r/20210808001342.964634-2-namit@vmware.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 06 Aug, 2021 2 commits
    • io-wq: fix lack of acct->nr_workers < acct->max_workers judgement · 21698274
      Hao Xu authored
      This judgement should be made before we create an io-worker.
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: fix no lock protection of acct->nr_worker · 3d4e4fac
      Hao Xu authored
      There is an acct->nr_workers access without lock protection. Consider
      the case: two callers call io_wqe_wake_worker() in parallel on two
      CPUs, one from the original context and the other from an io-worker
      (by calling io_wqe_enqueue(wqe, linked)); this may cause nr_workers
      to become larger than max_workers.
      Fix it by adding a lock, and do nr_workers++ before
      create_io_worker(). There remains an edge case where the first caller
      fails to create an io-worker, but the second caller doesn't know it
      and then also quits creating an io-worker:
      
      say nr_worker = max_worker - 1
              cpu 0                        cpu 1
         io_wqe_wake_worker()          io_wqe_wake_worker()
            nr_worker < max_worker
            nr_worker++
            create_io_worker()         nr_worker == max_worker
               failed                  return
            return
      
      But the chance of this case is very slim.
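So the increment moves under wqe->lock, before the (unlocked) worker creation; sketched with assumed names:

```c
/* sketch of the locking pattern described above */
raw_spin_lock_irq(&wqe->lock);
if (acct->nr_workers < acct->max_workers)
	acct->nr_workers++;		/* reserve the slot while locked */
raw_spin_unlock_irq(&wqe->lock);

/* note: as first merged, this call was not gated by the check above;
 * the follow-up fix 49e7f0c7 adds the gating boolean */
create_io_worker(wq, wqe, acct->index);	/* may fail: the slim edge case */
```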
      
      Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      [axboe: fix unconditional create_io_worker() call]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 04 Aug, 2021 1 commit
  5. 28 Jul, 2021 2 commits
  6. 27 Jul, 2021 1 commit
  7. 26 Jul, 2021 2 commits
  8. 23 Jul, 2021 2 commits
  9. 22 Jul, 2021 1 commit
  10. 20 Jul, 2021 3 commits
    • io_uring: fix memleak in io_init_wq_offload() · 362a9e65
      Yang Yingliang authored
      I got memory leak report when doing fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff888107310a80 (size 96):
      comm "syz-executor.6", pid 4610, jiffies 4295140240 (age 20.135s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
      backtrace:
      [<000000001974933b>] kmalloc include/linux/slab.h:591 [inline]
      [<000000001974933b>] kzalloc include/linux/slab.h:721 [inline]
      [<000000001974933b>] io_init_wq_offload fs/io_uring.c:7920 [inline]
      [<000000001974933b>] io_uring_alloc_task_context+0x466/0x640 fs/io_uring.c:7955
      [<0000000039d0800d>] __io_uring_add_tctx_node+0x256/0x360 fs/io_uring.c:9016
      [<000000008482e78c>] io_uring_add_tctx_node fs/io_uring.c:9052 [inline]
      [<000000008482e78c>] __do_sys_io_uring_enter fs/io_uring.c:9354 [inline]
      [<000000008482e78c>] __se_sys_io_uring_enter fs/io_uring.c:9301 [inline]
      [<000000008482e78c>] __x64_sys_io_uring_enter+0xabc/0xc20 fs/io_uring.c:9301
      [<00000000b875f18f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      [<00000000b875f18f>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
      [<000000006b0a8484>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      CPU0                          CPU1
      io_uring_enter                io_uring_enter
      io_uring_add_tctx_node        io_uring_add_tctx_node
      __io_uring_add_tctx_node      __io_uring_add_tctx_node
      io_uring_alloc_task_context   io_uring_alloc_task_context
      io_init_wq_offload            io_init_wq_offload
      hash = kzalloc                hash = kzalloc
      ctx->hash_map = hash          ctx->hash_map = hash <- one of the hash is leaked
      
      When io_uring_enter() is called in parallel, one of the 'hash_map'
      allocations is leaked; add uring_lock to protect 'hash_map'.
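The fix takes uring_lock around the allocation and publication of ctx->hash_map, roughly (a sketch of io_init_wq_offload(); init details assumed):

```c
/* sketch: allocate and publish ctx->hash_map only under uring_lock,
 * so two racing io_uring_enter() callers can't each install one */
mutex_lock(&ctx->uring_lock);
hash = ctx->hash_map;
if (!hash) {
	hash = kzalloc(sizeof(*hash), GFP_KERNEL);
	if (!hash) {
		mutex_unlock(&ctx->uring_lock);
		return ERR_PTR(-ENOMEM);
	}
	refcount_set(&hash->refs, 1);
	init_waitqueue_head(&hash->wait);
	ctx->hash_map = hash;
}
mutex_unlock(&ctx->uring_lock);
```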
      
      Fixes: e941894e ("io-wq: make buffered file write hashed work map per-ctx")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210720083805.3030730-1-yangyingliang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove double poll entry on arm failure · 46fee9ab
      Pavel Begunkov authored
      __io_queue_proc() can enqueue both poll entries and still fail
      afterwards, so the callers trying to cancel it should also try to remove
      the second poll entry (if any).
      
      For example, it may leave the request alive, referencing an io_uring
      context but not accessible for cancellation:
      
      [  282.599913][ T1620] task:iou-sqp-23145   state:D stack:28720 pid:23155 ppid:  8844 flags:0x00004004
      [  282.609927][ T1620] Call Trace:
      [  282.613711][ T1620]  __schedule+0x93a/0x26f0
      [  282.634647][ T1620]  schedule+0xd3/0x270
      [  282.638874][ T1620]  io_uring_cancel_generic+0x54d/0x890
      [  282.660346][ T1620]  io_sq_thread+0xaac/0x1250
      [  282.696394][ T1620]  ret_from_fork+0x1f/0x30
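The cancellation path therefore needs to drop the second entry too, along these lines (helper and field names are assumptions, not quoted from the patch):

```c
/* sketch: when arming failed, remove the extra (double) poll entry as
 * well, so the request cannot linger unreachable by cancellation */
if (unlikely(ipt->error)) {
	io_poll_remove_double(req);	/* second entry, if it was queued */
	/* existing cleanup of the first entry follows */
}
```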
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Reported-and-tested-by: syzbot+ac957324022b7132accf@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: explicitly count entries for poll reqs · 68b11e8b
      Pavel Begunkov authored
      If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc()
      failed, but it goes on with a third waitqueue, it may succeed and
      overwrite the error status. Count the number of poll entries we added,
      so we can set pt->error to zero at the beginning and find out when the
      mentioned scenario happens.
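Conceptually the poll table gains a counter, sketched (field layout assumed from the description):

```c
struct io_poll_table {
	struct poll_table_struct pt;
	struct io_kiocb *req;
	int nr_entries;		/* entries added by __io_queue_proc() */
	int error;		/* starts at 0; set only on real failure */
};

/* in __io_queue_proc(), on each successful add: */
pt->nr_entries++;

/* arming succeeded iff pt->nr_entries > 0 && !pt->error, so a later
 * waitqueue can no longer mask an earlier allocation failure */
```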
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 11 Jul, 2021 13 commits
  12. 10 Jul, 2021 6 commits