1. 11 Jun, 2021 1 commit
  2. 08 Jun, 2021 2 commits
    • bcache: avoid oversized read request in cache missing code path · 41fe8d08
      Coly Li authored
      In the cache missing code path of cached device, if a proper location
      from the internal B+ tree is matched for a cache miss range, function
      cached_dev_cache_miss() will be called in cache_lookup_fn() in the
      following code block,
      [code block 1]
        unsigned int sectors = KEY_INODE(k) == s->iop.inode
                ? min_t(uint64_t, INT_MAX,
                        KEY_START(k) - bio->bi_iter.bi_sector)
                : INT_MAX;
        int ret = s->d->cache_miss(b, s, bio, sectors);
      
      Here s->d->cache_miss() is the callback function pointer initialized as
      cached_dev_cache_miss(); the last parameter 'sectors' is an important
      hint for calculating the size of the read request sent to the backing
      device for the missing cache data.
      
      The current calculation in the above code block may generate an
      oversized value of 'sectors', which may consequently trigger two
      different kernel panics by BUG() or BUG_ON(), as listed below,
      
      1) BUG_ON() inside bch_btree_insert_key(),
      [code block 2]
              BUG_ON(b->ops->is_extents && !KEY_SIZE(k));
      2) BUG() inside biovec_slab(),
      [code block 3]
              default:
                      BUG();
                      return NULL;
      
      All of the above panics originate from cached_dev_cache_miss(), caused
      by the oversized parameter 'sectors'.
      
      Inside cached_dev_cache_miss(), parameter 'sectors' is used to calculate
      the size of the data read from the backing device on a cache miss. This
      size is stored in s->insert_bio_sectors by the following line of code,
      [code block 4]
        s->insert_bio_sectors = min(sectors, bio_sectors(bio) + reada);
      
      Then the actual key inserted into the internal B+ tree is generated and
      stored in s->iop.replace_key by the following lines of code,
      [code block 5]
        s->iop.replace_key = KEY(s->iop.inode,
                         bio->bi_iter.bi_sector + s->insert_bio_sectors,
                         s->insert_bio_sectors);
      The oversized parameter 'sectors' may trigger panic 1) by BUG_ON() from
      the above code block.
      
      And the bio sent to the backing device for the missing data is allocated
      with a hint from s->insert_bio_sectors by the following lines of code,
      [code block 6]
        cache_bio = bio_alloc_bioset(GFP_NOWAIT,
                     DIV_ROUND_UP(s->insert_bio_sectors, PAGE_SECTORS),
                     &dc->disk.bio_split);
      The oversized parameter 'sectors' may trigger panic 2) by BUG() from the
      above code block.
      
      Now let me explain how the panics happen with the oversized 'sectors'.
      In code block 5, replace_key is generated by macro KEY(). From the
      definition of macro KEY(),
      [code block 7]
        #define KEY(inode, offset, size)                                  \
        ((struct bkey) {                                                  \
             .high = (1ULL << 63) | ((__u64) (size) << 20) | (inode),     \
             .low = (offset)                                              \
        })
      
      Here 'size' is a 16-bit field embedded in the 64-bit member 'high' of
      struct bkey. But in code block 1, "KEY_START(k) - bio->bi_iter.bi_sector"
      is very likely to be larger than (1<<16) - 1, which makes the bkey size
      calculation in code block 5 overflow. In one bug report the value of
      parameter 'sectors' is 131072 (= 1 << 17); the overflowed 'sectors'
      results in an overflowed s->insert_bio_sectors in code block 4, which
      then makes the size field of s->iop.replace_key 0 in code block 5. Then
      the 0-sized s->iop.replace_key is inserted into the internal B+ tree as
      the cache missing check key (a special key to detect and avoid a race
      between a normal write request and a cache missing read request) as,
      [code block 8]
        ret = bch_btree_insert_check_key(b, &s->op, &s->iop.replace_key);
      
      Then the 0-sized s->iop.replace_key as 3rd parameter triggers the bkey
      size check BUG_ON() in code block 2, and causes the kernel panic 1).
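
The truncation described above can be reproduced outside the kernel. The following is a minimal userspace sketch, not kernel code: the `model_` names are hypothetical, and it assumes only what the message states, namely that the size field is 16 bits wide and sits at bit 20 of 'high'.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the two 64-bit words of a bkey. */
struct model_bkey {
	uint64_t high;
	uint64_t low;
};

#define MODEL_SIZE_SHIFT 20	/* shift used by the KEY() macro quoted above */
#define MODEL_SIZE_BITS  16	/* width of the size field stated in the text */

/* Mirrors the quoted KEY() macro: 'size' is shifted in without any masking. */
static inline struct model_bkey model_key(uint64_t inode, uint64_t offset,
					  uint64_t size)
{
	struct model_bkey k = {
		.high = (1ULL << 63) | (size << MODEL_SIZE_SHIFT) | inode,
		.low = offset,
	};
	return k;
}

/* Reads back only the 16-bit size field, as a size accessor would. */
static inline uint64_t model_key_size(const struct model_bkey *k)
{
	return (k->high >> MODEL_SIZE_SHIFT) & ((1ULL << MODEL_SIZE_BITS) - 1);
}

/* Largest size that survives the round trip through the 16-bit field. */
static inline void model_key_selfcheck(void)
{
	struct model_bkey k = model_key(1, 0, (1u << MODEL_SIZE_BITS) - 1);

	assert(model_key_size(&k) == (1u << MODEL_SIZE_BITS) - 1);
}
```

With size = 131072 (1 << 17) the shifted value lands entirely above the 16-bit field, so the size read back is 0, which is the condition the BUG_ON() in code block 2 checks for.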
      
      The other kernel panic comes from code block 6, and is caused by an
      oversized bvecs number derived from s->insert_bio_sectors in code
      block 4,
              min(sectors, bio_sectors(bio) + reada)
      There are two possibilities for an oversized result,
      - bio_sectors(bio) is valid, but bio_sectors(bio) + reada is oversized.
      - sectors < bio_sectors(bio) + reada, but sectors is oversized.
      
      From a bug report the result of "DIV_ROUND_UP(s->insert_bio_sectors,
      PAGE_SECTORS)" from code block 6 can be 344, 282, 946, 342 and many
      other values which are larger than BIO_MAX_VECS (a.k.a. 256). When
      calling bio_alloc_bioset() with such a larger-than-256 value as the 2nd
      parameter, this value will eventually be passed to biovec_slab() as
      parameter 'nr_vecs' in the following code path,
         bio_alloc_bioset() ==> bvec_alloc() ==> biovec_slab()
      Because parameter 'nr_vecs' is larger than 256, the panic by BUG() in
      code block 3 is triggered inside biovec_slab().
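
The bvec arithmetic above can be checked with a small userspace sketch. It is an illustration only: the `MODEL_` and `model_` names are hypothetical, and the 4 KiB page size (8 sectors per page) is an assumption, since the message does not state the page size of the affected systems.

```c
#include <assert.h>

#define MODEL_SECTOR_SIZE  512u
#define MODEL_PAGE_SIZE    4096u	/* assumption: 4 KiB pages */
#define MODEL_PAGE_SECTORS (MODEL_PAGE_SIZE / MODEL_SECTOR_SIZE)
#define MODEL_BIO_MAX_VECS 256u	/* value quoted in the text above */

/* Same rounding as DIV_ROUND_UP(sectors, PAGE_SECTORS) in code block 6. */
static inline unsigned int model_nr_vecs(unsigned int sectors)
{
	assert(MODEL_PAGE_SECTORS > 0);
	return (sectors + MODEL_PAGE_SECTORS - 1) / MODEL_PAGE_SECTORS;
}

/* Non-zero when the request fits in the largest bvec allocation. */
static inline int model_nr_vecs_is_valid(unsigned int sectors)
{
	return model_nr_vecs(sectors) <= MODEL_BIO_MAX_VECS;
}
```

Under these assumptions any request above 2048 sectors (1 MiB) needs more than 256 bvecs; for example 2752 sectors gives 344, one of the values from the bug report.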
      
      From the above analysis, we know that the 4th parameter 'sectors' passed
      into cached_dev_cache_miss() may cause overflow in code blocks 5 and 6,
      and finally cause kernel panics in code blocks 2 and 3. And if the
      result of bio_sectors(bio) + reada exceeds the valid bvecs number, it
      may also trigger the kernel panic in code block 3 from code block 6.
      
      Now that the almost-useless readahead size for cache missing requests
      sent to the backing device has been removed, this patch can fix the
      oversized issue with a simpler method.
      - add a local variable size_limit, set to the minimum of the max bkey
        size and the max bio bvecs number.
      - set s->insert_bio_sectors to the minimum of size_limit, sectors, and
        the size of the bio in sectors.
      - replace sectors with s->insert_bio_sectors when calling
        bio_next_split.
      
      With the above method using size_limit, s->insert_bio_sectors will never
      result in an oversized replace_key size or bio bvecs number. And the
      split bio 'miss' from bio_next_split() will always match the size of
      'cache_bio', which is the current maximum bio size we can send to the
      backing device for fetching the cache missing data.
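
The clamping steps listed above can be sketched in userspace as follows. This is a model of the described method rather than the patch itself: the `model_` helpers are hypothetical, the 4 KiB page size is an assumption, and the bvec limit is expressed in sectors so that it can be compared with the bkey size limit.

```c
#include <assert.h>

#define MODEL_PAGE_SECTORS 8u			/* assumption: 4 KiB pages */
#define MODEL_BIO_MAX_VECS 256u			/* value quoted in the text */
#define MODEL_KEY_SIZE_MAX ((1u << 16) - 1)	/* 16-bit bkey size field */

static inline unsigned int model_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * size_limit is the smaller of the bvec capacity (in sectors) and the
 * largest bkey size; the result is then capped by both 'sectors' and the
 * size of the incoming bio.
 */
static inline unsigned int model_insert_bio_sectors(unsigned int sectors,
						    unsigned int bio_sectors)
{
	unsigned int size_limit = model_min(MODEL_BIO_MAX_VECS * MODEL_PAGE_SECTORS,
					    MODEL_KEY_SIZE_MAX);

	assert(size_limit <= MODEL_KEY_SIZE_MAX);
	return model_min(size_limit, model_min(sectors, bio_sectors));
}
```

With these assumed constants the result can never exceed 2048 sectors, so it always fits the 16-bit size field and never needs more than 256 bvecs, even when 'sectors' is 131072.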
      
      The current problematic code can be partially found since Linux
      v3.13-rc1, therefore all maintained stable kernels should try to apply
      this fix.
      Reported-by: Alexander Ullrich <ealex1979@gmail.com>
      Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
      Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
      Reported-by: Marco Rebhan <me@dblsaiko.net>
      Reported-by: Matthias Ferdinand <bcache@mfedv.net>
      Reported-by: Victor Westerhuis <victor@westerhu.is>
      Reported-by: Vojtech Pavlik <vojtech@suse.cz>
      Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
      Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Takashi Iwai <tiwai@suse.com>
      Link: https://lore.kernel.org/r/20210607125052.21277-3-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: remove bcache device self-defined readahead · 1616a4c2
      Coly Li authored
      For read cache misses, bcache defines a readahead size for the read I/O
      request sent to the backing device for the missing data. This readahead
      size is initialized to 0, and almost no one uses it, to avoid
      unnecessary read amplification on the backing device and write
      amplification on the cache device. Considering that upper-layer file
      system code already has readahead logic and works fine with the
      readahead_cache_policy sysfs file interface, we don't have to keep the
      bcache self-defined readahead anymore.
      
      This patch removes the bcache self-defined readahead for cache missing
      requests to the backing device, and the readahead sysfs file interfaces
      are removed as well.
      
      This prepares for the next patch, which fixes a potential kernel panic
      due to oversized requests with a simpler method.
      Reported-by: Alexander Ullrich <ealex1979@gmail.com>
      Reported-by: Diego Ercolani <diego.ercolani@gmail.com>
      Reported-by: Jan Szubiak <jan.szubiak@linuxpolska.pl>
      Reported-by: Marco Rebhan <me@dblsaiko.net>
      Reported-by: Matthias Ferdinand <bcache@mfedv.net>
      Reported-by: Victor Westerhuis <victor@westerhu.is>
      Reported-by: Vojtech Pavlik <vojtech@suse.cz>
      Reported-and-tested-by: Rolf Fokkens <rolf@rolffokkens.nl>
      Reported-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Takashi Iwai <tiwai@suse.com>
      Link: https://lore.kernel.org/r/20210607125052.21277-2-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 03 Jun, 2021 1 commit
    • Merge tag 'nvme-5.13-2021-06-03' of git://git.infradead.org/nvme into block-5.13 · e369edbb
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 5.13:
      
       - fix corruption in RDMA in-capsule SGLs (Sagi Grimberg)
       - nvme-loop reset fixes (Hannes Reinecke)
       - nvmet fix for freeing unallocated p2pmem (Max Gurtovoy)"
      
      * tag 'nvme-5.13-2021-06-03' of git://git.infradead.org/nvme:
        nvmet: fix freeing unallocated p2pmem
        nvme-loop: do not warn for deleted controllers during reset
        nvme-loop: check for NVME_LOOP_Q_LIVE in nvme_loop_destroy_admin_queue()
        nvme-loop: clear NVME_LOOP_Q_LIVE when nvme_loop_configure_admin_queue() fails
        nvme-loop: reset queue count to 1 in nvme_loop_destroy_io_queues()
        nvme-rdma: fix in-casule data send for chained sgls
  4. 02 Jun, 2021 5 commits
  5. 31 May, 2021 1 commit
  6. 27 May, 2021 1 commit
    • Merge tag 'nvme-5.13-2021-05-27' of git://git.infradead.org/nvme into block-5.13 · a4b58f17
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 5.13
      
       - fix a memory leak in nvme_cdev_add (Guoqing Jiang)
       - fix inline data size comparison in nvmet_tcp_queue_response (Hou Pu)
       - fix false keep-alive timeout when a controller is torn down
         (Sagi Grimberg)
       - fix a nvme-tcp Kconfig dependency (Sagi Grimberg)
       - short-circuit reconnect retries for FC (Hannes Reinecke)
       - decode host pathing error for connect (Hannes Reinecke)"
      
      * tag 'nvme-5.13-2021-05-27' of git://git.infradead.org/nvme:
        nvmet: fix false keep-alive timeout when a controller is torn down
        nvmet-tcp: fix inline data size comparison in nvmet_tcp_queue_response
        nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
        nvme-fabrics: decode host pathing error for connect
        nvme-fc: short-circuit reconnect retries
        nvme: fix potential memory leaks in nvme_cdev_add
  7. 26 May, 2021 5 commits
  8. 25 May, 2021 4 commits
  9. 20 May, 2021 3 commits
  10. 19 May, 2021 5 commits
    • nvme-fc: clear q_live at beginning of association teardown · a7d13914
      James Smart authored
      The __nvmf_check_ready() routine used to bounce all filesystem I/O if
      the controller state isn't LIVE.  However, a later patch changed the
      logic so that the rejection ends up being based on the Q live check.
      The FC transport has a slightly different sequence from rdma and tcp
      for shutting down queues/marking them non-live.  FC marks its queue
      non-live after aborting all I/Os and waiting for their termination,
      leaving a rather large window for filesystem I/O to continue to hit the
      transport.  Unfortunately this resulted in filesystem I/O or
      applications seeing I/O errors.
      
      Change the FC transport to mark the queues non-live at the first sign of
      teardown for the association (when I/O is initially terminated).
      
      Fixes: 73a53799 ("nvme-fabrics: allow to queue requests for live queues")
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Keith Busch authored
      A possible race condition exists where the request to send data that is
      enqueued from nvme_tcp_handle_r2t() will not be observed by
      nvme_tcp_send_all() if it happens to be running. The driver relies on
      io_work to send the enqueued request when it runs again, but the
      concurrently running nvme_tcp_send_all() may not have released the
      send_mutex at that time. If no future commands are enqueued to re-kick
      the io_work, the request will time out in the SEND_H2C state, resulting
      in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: fix possible use-after-completion · 825619b0
      Sagi Grimberg authored
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where
      the send_mutex protects the queue socket send activity), RX activity,
      and more specifically request completion, may run concurrently.
      
      This means we must guarantee that any request state related to its
      lifetime, such as the bytes sent, is not accessed once a completion
      may have arrived back (and been processed).
      
      The race may trigger when a request completion arrives, is processed
      _and_ the request is reused as a fresh new request, exactly in the
      (relatively short) window between the last data payload being sent and
      the request iov_iter being advanced.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data is sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and reused
         to transfer a new request (write or read)
      5. The new request is queued, and the driver resets the request parameters
         (nvme_tcp_setup_cmd_pdu).
      6. Now context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
      
      An alternative solution would have been to delay the request completion
      or state change waiting for reference counting on the TX path, but besides
      adding atomic operations to the hot-path, it may present challenges in
      multi-stage R2T scenarios where a r2t handler needs to be deferred to
      an async execution.
      Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: Anil Mishra <anil.mishra@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-loop: fix memory leak in nvme_loop_create_ctrl() · 03504e3b
      Wu Bo authored
      When creating loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl()
      fails, the loop ctrl should be freed before jumping to the "out" label.
      
      Fixes: 3a85a5de ("nvme-loop: add a NVMe loopback host driver")
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet: fix memory leak in nvmet_alloc_ctrl() · fec356a6
      Wu Bo authored
      When creating a ctrl in nvmet_alloc_ctrl(), if cntlid_min is larger
      than cntlid_max of the subsystem, the code jumps to the
      "out_free_changed_ns_list" label, but ctrl->sqs is not freed.
      Fix this by jumping to the "out_free_sqs" label.
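
The label ordering issue can be illustrated with a small userspace model of a goto-based unwind path. This is a sketch only: the `model_` types and helpers are hypothetical, the allocation order is an assumption inferred from the label names, and the error value -1 is a placeholder rather than the errno the kernel returns.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two allocations named in the message. */
struct model_ctrl {
	void *changed_ns_list;
	void *sqs;
};

static int model_live_allocs;	/* counts allocations not yet freed */

static inline void *model_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		model_live_allocs++;
	return p;
}

static inline void model_free(void *p)
{
	if (p) {
		model_live_allocs--;
		free(p);
	}
}

static inline int model_alloc_ctrl(struct model_ctrl *ctrl,
				   unsigned int cntlid_min,
				   unsigned int cntlid_max)
{
	ctrl->changed_ns_list = model_alloc(16);
	if (!ctrl->changed_ns_list)
		goto out;

	ctrl->sqs = model_alloc(16);
	if (!ctrl->sqs)
		goto out_free_changed_ns_list;

	/* Both buffers exist here, so the failure path must free both. */
	if (cntlid_min > cntlid_max)
		goto out_free_sqs;

	return 0;

out_free_sqs:
	model_free(ctrl->sqs);
out_free_changed_ns_list:
	model_free(ctrl->changed_ns_list);
out:
	return -1;
}
```

Jumping to the later label instead would skip the first `model_free()` call and leave one allocation live, which is the shape of the leak the message describes.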
      
      Fixes: 94a39d61 ("nvmet: make ctrl-id configurable")
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  11. 14 May, 2021 3 commits
  12. 13 May, 2021 2 commits
    • Merge tag 'nvme-5.13-2021-05-13' of git://git.infradead.org/nvme into block-5.13 · 6bdf2fbc
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fix for Linux 5.13
      
       - correct the check for using the inline bio in nvmet
         (Chaitanya Kulkarni)
       - demote unsupported command warnings (Chaitanya Kulkarni)
       - fix corruption due to double initializing ANA state (me, Hou Pu)
       - reset ns->file when open fails (Daniel Wagner)
       - fix a NULL deref when SEND is completed with error in nvmet-rdma
         (Michal Kalderon)"
      
      * tag 'nvme-5.13-2021-05-13' of git://git.infradead.org/nvme:
        nvmet: use new ana_log_size instead the old one
        nvmet: seset ns->file when open fails
        nvmet: demote fabrics cmd parse err msg to debug
        nvmet: use helper to remove the duplicate code
        nvmet: demote discovery cmd parse err msg to debug
        nvmet-rdma: Fix NULL deref when SEND is completed with error
        nvmet: fix inline bio check for passthru
        nvmet: fix inline bio check for bdev-ns
        nvme-multipath: fix double initialization of ANA state
    • nvmet: use new ana_log_size instead the old one · e181811b
      Hou Pu authored
      The new ana_log_size should be used instead of the old one.
      Otherwise a kernel NULL pointer dereference will happen, as shown below:
      
      [   38.957849][   T69] BUG: kernel NULL pointer dereference, address: 000000000000003c
      [   38.975550][   T69] #PF: supervisor write access in kernel mode
      [   38.975955][   T69] #PF: error_code(0x0002) - not-present page
      [   38.976905][   T69] PGD 0 P4D 0
      [   38.979388][   T69] Oops: 0002 [#1] SMP NOPTI
      [   38.980488][   T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54
      [   38.981254][   T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   38.982502][   T69] Workqueue: events nvme_loop_execute_work
      [   38.985219][   T69] RIP: 0010:memcpy_orig+0x68/0x10f
      [   38.986203][   T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2
      [   38.987677][   T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287
      [   38.987996][   T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010
      [   38.988327][   T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044
      [   38.988620][   T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000
      [   38.988991][   T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024
      [   38.989289][   T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024
      [   38.989845][   T69] FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
      [   38.990234][   T69] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   38.990490][   T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0
      [   38.991105][   T69] Call Trace:
      [   38.994157][   T69]  sg_copy_buffer+0xb8/0xf0
      [   38.995357][   T69]  nvmet_copy_to_sgl+0x48/0x6d
      [   38.995565][   T69]  nvmet_execute_get_log_page_ana+0xd4/0x1cb
      [   38.995792][   T69]  nvmet_execute_get_log_page+0xc9/0x146
      [   38.995992][   T69]  nvme_loop_execute_work+0x3e/0x44
      [   38.996181][   T69]  process_one_work+0x1c3/0x3c0
      [   38.996393][   T69]  worker_thread+0x44/0x3d0
      [   38.996600][   T69]  ? cancel_delayed_work+0x90/0x90
      [   38.996804][   T69]  kthread+0xf7/0x130
      [   38.996961][   T69]  ? kthread_create_worker_on_cpu+0x70/0x70
      [   38.997171][   T69]  ret_from_fork+0x22/0x30
      [   38.997705][   T69] Modules linked in:
      [   38.998741][   T69] CR2: 000000000000003c
      [   39.000104][   T69] ---[ end trace e719927b609d0fa0 ]---
      
      Fixes: 5e1f6899 ("nvme-multipath: fix double initialization of ANA state")
      Signed-off-by: Hou Pu <houpu.main@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  13. 12 May, 2021 6 commits
    • nvmet: seset ns->file when open fails · 85428bea
      Daniel Wagner authored
      Reset the ns->file value to NULL also in the error case in
      nvmet_file_ns_enable().
      
      The ns->file variable either points to a file object or contains the
      error code after the filp_open() call. This can lead to the following
      problem:
      
      When the user first sets up an invalid file backend and tries to enable
      the ns, it will fail. Then the user switches over to a bdev backend
      and successfully enables the ns. The first received I/O will crash the
      system because the I/O backend is chosen based on the ns->file value:
      
      static u16 nvmet_parse_io_cmd(struct nvmet_req *req)
      {
      	[...]
      
      	if (req->ns->file)
      		return nvmet_file_parse_io_cmd(req);
      
      	return nvmet_bdev_parse_io_cmd(req);
      }
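
Why a failed open leaves ns->file non-NULL can be shown with a userspace model of error-encoded pointers. This is a sketch, not the nvmet code: the `model_` helpers are hypothetical stand-ins for the kernel's error-pointer convention, and `opened` stands in for the filp_open() result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_ns {
	void *file;
};

/* Encode a negative errno in a pointer, as the kernel convention does. */
static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Error pointers occupy the top 4095 values of the address space. */
static inline int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-4095;
}

/* Store the open result; on failure, reset the field so it reads as unset. */
static inline int model_file_ns_enable(struct model_ns *ns, void *opened)
{
	assert(ns != NULL);
	ns->file = opened;
	if (model_is_err(ns->file)) {
		int err = (int)(intptr_t)ns->file;

		ns->file = NULL;	/* the reset the commit describes */
		return err;
	}
	return 0;
}

/* Same test as the "if (req->ns->file)" check quoted above. */
static inline int model_uses_file_backend(const struct model_ns *ns)
{
	return ns->file != NULL;
}
```

Without the reset, the encoded error would still be a non-NULL pointer, so the backend check would pick the file path for a namespace that has since been switched to a bdev backend.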
      Reported-by: Enzo Matsumiya <ematsumiya@suse.com>
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nbd: share nbd_put and return by goto put_nbd · bedf78c4
      Sun Ke authored
      Replace the following two statements by the statement “goto put_nbd;”
      
      	nbd_put(nbd);
      	return 0;
      Signed-off-by: Sun Ke <sunke32@huawei.com>
      Suggested-by: Markus Elfring <Markus.Elfring@web.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20210512114331.1233964-3-sunke32@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: Fix NULL pointer in flush_workqueue · 79ebe911
      Sun Ke authored
      Open /dev/nbdX first; config_refs will be 1 and
      the pointers in nbd_device are still NULL. Disconnecting
      /dev/nbdX then references a NULL recv_workq. The
      protection by config_refs in nbd_genl_disconnect is useless.
      
      [  656.366194] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [  656.368943] #PF: supervisor write access in kernel mode
      [  656.369844] #PF: error_code(0x0002) - not-present page
      [  656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0
      [  656.371693] Oops: 0002 [#1] SMP
      [  656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 5.11.0-rc5-00040-g76c057c8 #1
      [  656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      [  656.375904] RIP: 0010:mutex_lock+0x29/0x60
      [  656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00 <f0> 48 0f b1 55 d
      [  656.378934] RSP: 0018:ffffc900005eb9b0 EFLAGS: 00010246
      [  656.379350] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  656.379915] RDX: ffff888104cf2600 RSI: ffffffffaae8f452 RDI: 0000000000000020
      [  656.380473] RBP: 0000000000000020 R08: 0000000000000000 R09: ffff88813bd6b318
      [  656.381039] R10: 00000000000000c7 R11: fefefefefefefeff R12: ffff888102710b40
      [  656.381599] R13: ffffc900005eb9e0 R14: ffffffffb2930680 R15: ffff88810770ef00
      [  656.382166] FS:  00007fdf117ebb40(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
      [  656.382806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  656.383261] CR2: 0000000000000020 CR3: 0000000100c84000 CR4: 00000000000006e0
      [  656.383819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  656.384370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  656.384927] Call Trace:
      [  656.385111]  flush_workqueue+0x92/0x6c0
      [  656.385395]  nbd_disconnect_and_put+0x81/0xd0
      [  656.385716]  nbd_genl_disconnect+0x125/0x2a0
      [  656.386034]  genl_family_rcv_msg_doit.isra.0+0x102/0x1b0
      [  656.386422]  genl_rcv_msg+0xfc/0x2b0
      [  656.386685]  ? nbd_ioctl+0x490/0x490
      [  656.386954]  ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0
      [  656.387354]  netlink_rcv_skb+0x62/0x180
      [  656.387638]  genl_rcv+0x34/0x60
      [  656.387874]  netlink_unicast+0x26d/0x590
      [  656.388162]  netlink_sendmsg+0x398/0x6c0
      [  656.388451]  ? netlink_rcv_skb+0x180/0x180
      [  656.388750]  ____sys_sendmsg+0x1da/0x320
      [  656.389038]  ? ____sys_recvmsg+0x130/0x220
      [  656.389334]  ___sys_sendmsg+0x8e/0xf0
      [  656.389605]  ? ___sys_recvmsg+0xa2/0xf0
      [  656.389889]  ? handle_mm_fault+0x1671/0x21d0
      [  656.390201]  __sys_sendmsg+0x6d/0xe0
      [  656.390464]  __x64_sys_sendmsg+0x23/0x30
      [  656.390751]  do_syscall_64+0x45/0x70
      [  656.391017]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      To fix it, just add if (nbd->recv_workq) to nbd_disconnect_and_put().
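
The guard described above amounts to the following pattern, shown here as a userspace sketch. The `model_` types and functions are hypothetical stand-ins; only the `if (nbd->recv_workq)` check itself comes from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the workqueue and the nbd device. */
struct model_workqueue {
	int flushed;
};

struct model_nbd {
	struct model_workqueue *recv_workq;
};

/* Like the real flush, this dereferences its argument unconditionally. */
static inline void model_flush_workqueue(struct model_workqueue *wq)
{
	assert(wq != NULL);
	wq->flushed++;
}

/* Flush only when the receive workqueue was actually set up. */
static inline void model_disconnect_and_put(struct model_nbd *nbd)
{
	if (nbd->recv_workq)
		model_flush_workqueue(nbd->recv_workq);
}
```

A device that was opened but never configured has a NULL recv_workq, so the guarded call becomes a no-op instead of a NULL dereference.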
      
      Fixes: e9e006f5 ("nbd: fix max number of supported devs")
      Signed-off-by: Sun Ke <sunke32@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20210512114331.1233964-2-sunke32@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkdev.h: remove unused codes blk_account_rq · 190515f6
      Lin Feng authored
      The last users of blk_account_rq went away with commit a1ce35fa
      ("block: remove dead elevator code"); it now has no callers and can
      be safely removed.
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Link: https://lore.kernel.org/r/20210512100124.173769-1-linf@wangsu.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: avoid circular stable merges · 7ea96eef
      Paolo Valente authored
      BFQ may merge a new bfq_queue, stably, with the last bfq_queue
      created. In particular, BFQ first waits a little bit for some I/O to
      flow inside the new queue, say Q2, if this is needed to understand
      whether it is better or worse to merge Q2 with the last queue created,
      say Q1. This delayed stable merge is performed by assigning
      bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
      
      Yet, while waiting for some I/O to flow in Q2, a non-stable queue
      merge of Q2 with Q1 may happen, causing the bic previously associated
      with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
      Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
      be recycled as a non-shared bfq_queue. In that case, Q1 may then
      happen to undergo a stable merge with the bfq_queue pointed by
      bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
      Q1. So Q1 would be merged with itself.
      
      This commit fixes this error by intercepting this situation, and
      canceling the schedule of the stable merge.
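
The interception can be pictured with a small userspace sketch. It is an illustration under assumptions: the `model_` types and the function are hypothetical, and representing "cancel" as clearing the pointer is a guess, since the message does not show how the real code cancels the scheduled merge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a bfq_queue and its I/O context. */
struct model_bfqq {
	int id;
};

struct model_bic {
	struct model_bfqq *bfqq;		/* queue currently in use */
	struct model_bfqq *stable_merge_bfqq;	/* scheduled merge target */
};

/*
 * Returns the queue to merge 'bfqq' with, or NULL when no merge should
 * happen. A target equal to 'bfqq' itself would be a circular merge, so
 * the scheduled merge is dropped in that case.
 */
static inline struct model_bfqq *
model_stable_merge_target(struct model_bic *bic, struct model_bfqq *bfqq)
{
	struct model_bfqq *target = bic->stable_merge_bfqq;

	assert(bfqq != NULL);
	if (target == bfqq) {
		bic->stable_merge_bfqq = NULL;
		return NULL;
	}
	return target;
}
```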
      
      Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
      Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iocost: fix weight updates of inner active iocgs · e9f4eee9
      Tejun Heo authored
      When the weight of an active iocg is updated, weight_updated() is called
      which in turn calls __propagate_weights() to update the active and inuse
      weights so that the effective hierarchical weights are updated accordingly.
      
      The current implementation is incorrect for inner active nodes. For an
      active leaf iocg, inuse can be any value between 1 and active and the
      difference represents how much the iocg is donating. When weight is updated,
      as long as inuse is clamped between 1 and the new weight, we're alright and
      this is what __propagate_weights() currently implements.
      
      However, that's not how an active inner node's inuse is set. An inner node's
      inuse is solely determined by the ratio between the sums of inuse's and
      active's of its children - ie. they're results of propagating the leaves'
      active and inuse weights upwards. __propagate_weights() incorrectly applies
      the same clamping as for a leaf when an active inner node's weight is
      updated. Consider a hierarchy which looks like the following with saturating
      workloads in AA and BB.
      
           R
         /   \
        A     B
        |     |
       AA     BB
      
      1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
      
      2. echo 200 > A/io.weight
      
      3. __propagate_weights() updates A's active to 200 and leaves inuse at 100 as
         it's already between 1 and the new active, making A:active=200,
         A:inuse=100. As R's active_sum is updated along with A's active,
         A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
         hwi's remain unchanged at 0.5.
      
      4. The weight of A is now twice that of B but AA and BB still have the same
         hwi of 0.5 and thus are doing the same amount of IOs.
      
      Fix it by making __propagate_weights() always calculate the inuse of an
      active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
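
The arithmetic of the example hierarchy can be worked through with a userspace sketch. The `model_` helpers are hypothetical, and scaling the node's active weight by the children's inuse/active ratio (rounded up) is one reading of "based on the ratio", stated here as an assumption rather than as the exact kernel formula.

```c
#include <assert.h>
#include <stdint.h>

/* Leaf rule described in the message: clamp inuse between 1 and active. */
static inline uint32_t model_leaf_inuse(uint32_t inuse, uint32_t active)
{
	if (inuse < 1)
		return 1;
	return inuse > active ? active : inuse;
}

/*
 * Inner-node rule (assumed form): scale the node's active weight by the
 * ratio of its children's summed inuse to their summed active, rounding up.
 */
static inline uint32_t model_inner_inuse(uint32_t active,
					 uint32_t child_inuse_sum,
					 uint32_t child_active_sum)
{
	uint64_t num = (uint64_t)active * child_inuse_sum;

	assert(child_active_sum > 0);
	return (uint32_t)((num + child_active_sum - 1) / child_active_sum);
}
```

For node A in the example (new active of 200, children summing to inuse 100 and active 100), the leaf rule keeps inuse at 100, whereas the inner-node rule yields 200. With B unchanged at 100, A's share of inuse becomes 200/300 = 2/3, matching its hwa.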
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Schatzberg <dschatzberg@fb.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 11 May, 2021 1 commit